The article focuses on verifying the SCAR (Selected Completely At Random) assumption in positive-unlabeled (PU) learning. PU learning is a machine learning task in which the training data contains only positive labeled instances and unlabeled instances (a mixture of positives and negatives), and the goal is to train a binary classifier.
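To fix notation for the sketches below, here is a minimal way to simulate a PU dataset under SCAR in Python; the sample size, dimensionality, and propensity value are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, c = 2000, 0.3                                   # sample size and constant propensity (assumed)
y = rng.integers(0, 2, size=n)                     # latent true class, never observed in PU learning
X = rng.normal(loc=y[:, None], size=(n, 2))        # two features shifted by the true class
s = ((y == 1) & (rng.random(n) < c)).astype(int)   # SCAR labeling: each positive labeled with prob. c
# Observed data: (X, s); y stays hidden. s == 1 implies y == 1.
```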
The key highlights and insights are:
The SCAR assumption states that the propensity score function, which gives the probability of a positive observation being labeled, is constant. This is a simpler assumption than the more realistic Selected At Random (SAR) assumption, under which the propensity score may depend on the feature vector.
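In symbols, with $S = 1$ indicating that an observation is labeled and $Y$ its true class (notation assumed here; the paper's symbols may differ):

```latex
e(x) \;=\; P(S = 1 \mid Y = 1, X = x),
\qquad
\text{SCAR: } e(x) \equiv c,
\qquad
\text{SAR: } e(x) \text{ may vary with } x.
```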
The authors propose a two-step testing procedure to verify the SCAR assumption. In the first step, they estimate the set of positive observations among the unlabeled data. In the second step, they generate artificial labels conforming to the SCAR case, which allows them to mimic the distribution of the test statistic under the null hypothesis of SCAR.
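The following Python sketch conveys the flavor of such a two-step resampling test under stated assumptions; the positive-set estimator (a logistic classifier with a crude quantile threshold) and the resampling details are illustrative stand-ins, not the authors' exact procedure.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def scar_test(X, s, statistic, n_resamples=500, alpha=0.05):
    """Two-step SCAR test sketch.

    X : (n, d) feature matrix
    s : (n,) array, 1 = labeled (hence positive), 0 = unlabeled
    statistic : callable(X_labeled, X_positive) -> float, larger = more divergence
    """
    # Step 1: estimate which observations are positive, e.g. with a
    # probabilistic classifier trained to predict the labeling indicator s
    # (the quantile thresholding rule below is a crude illustrative choice).
    scores = LogisticRegression(max_iter=1000).fit(X, s).predict_proba(X)[:, 1]
    est_pos = (scores >= np.quantile(scores, 0.5)) | (s == 1)

    # Observed divergence between labeled and estimated-positive features.
    t_obs = statistic(X[s == 1], X[est_pos])

    # Step 2: generate artificial SCAR labels -- every estimated positive is
    # labeled with the same constant probability c -- to approximate the
    # distribution of the statistic under the null hypothesis.
    c = s.sum() / est_pos.sum()
    pos_idx = np.where(est_pos)[0]
    rng = np.random.default_rng(0)
    t_null = []
    for _ in range(n_resamples):
        s_art = np.zeros_like(s)
        s_art[pos_idx[rng.random(len(pos_idx)) < c]] = 1
        t_null.append(statistic(X[s_art == 1], X[est_pos]))

    p_value = (1 + np.sum(np.array(t_null) >= t_obs)) / (1 + n_resamples)
    return p_value < alpha, p_value
```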
The authors consider four different test statistics to measure the divergence between the feature distributions of labeled and positive observations: Kullback-Leibler (KL) divergence, KL divergence with covariance estimation (KLCOV), Kolmogorov-Smirnov (KS) statistic, and a classifier-based statistic (NB AUC).
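For intuition, two of these statistics might be computed as below; the per-feature KS aggregation by maximum and the in-sample AUC evaluation are simplifying assumptions of this sketch.

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_auc_score

def ks_statistic(X_labeled, X_positive):
    # Two-sample Kolmogorov-Smirnov distance per feature, aggregated by the
    # maximum over coordinates (the aggregation choice is an assumption here).
    return max(ks_2samp(X_labeled[:, j], X_positive[:, j]).statistic
               for j in range(X_labeled.shape[1]))

def nb_auc_statistic(X_labeled, X_positive):
    # Train a naive Bayes classifier to separate labeled from positive
    # observations. AUC near 0.5 means the two feature distributions look
    # alike (as expected under SCAR); a larger AUC signals a difference.
    X = np.vstack([X_labeled, X_positive])
    y = np.concatenate([np.ones(len(X_labeled)), np.zeros(len(X_positive))])
    proba = GaussianNB().fit(X, y).predict_proba(X)[:, 1]
    return roc_auc_score(y, proba)
```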
Theoretical results justify the method for estimating the set of positive observations and show that, if this set is estimated correctly, the type I error (the probability of rejecting the null hypothesis when it is true) can be controlled.
Experiments on both artificial and real-world datasets demonstrate that the proposed test successfully detects deviations from the SCAR scenario while controlling the type I error on most datasets. Among the tested statistics, KS and NB AUC are recommended because they control the type I error most reliably.
The proposed test can be used as a pre-processing step to decide which PU learning algorithm to apply, since SCAR-based algorithms are much simpler and computationally cheaper than SAR-based ones.
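Tying the sketches above together, such a pre-processing decision might look like this; fit_scar_pu_classifier and fit_sar_pu_classifier are hypothetical placeholders for whichever PU algorithms are available.

```python
# Hypothetical usage of the scar_test / ks_statistic sketches defined above.
reject_scar, p_value = scar_test(X, s, statistic=ks_statistic)
if reject_scar:
    # Evidence against SCAR: use a SAR-aware method that models e(x).
    model = fit_sar_pu_classifier(X, s)    # hypothetical helper
else:
    # SCAR not rejected: a simpler, faster SCAR-based method suffices.
    model = fit_scar_pu_classifier(X, s)   # hypothetical helper
```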