The resources are divided into several repositories.
- bocal-evaluation: Contains the scripts to run the experiments and to analyze the results. The package readme is a step-by-step guide to reproducing the experiments described in the companion paper. Running the benchmark is compute-intensive and takes many CPU hours, so we also provide the results for download (1.1 GB).
- OneClassActiveLearning.jl: A Julia package that implements various batch query strategies and the full active learning cycle.
- SVDD.jl: A Julia package for Support Vector Data Description.
For an overview and a benchmark on one-class active learning, visit the ocal project website.
The code is licensed under an MIT License and the result data under a Creative Commons Attribution 4.0 International License.
Active learning incorporates external feedback into the learning process of a classifier. When multiple annotators are available, or when classifier retraining is slow, it is useful to annotate batches of observations in parallel. Selecting a good batch is difficult because of trade-offs between classifier training, batch selection, annotation effort, and classification accuracy. In the one-class setting for outlier detection, batch selection has not received any attention so far, and queries are generally selected sequentially. In our article, we strive to find a sweet spot between the cost of batch-mode active learning and classification accuracy by formalizing batch selection as an optimization problem. We then propose and evaluate several batch strategies for the one-class setting.
Literature has identified three criteria for batch selection: informativeness, representativeness, and diversity. The figure compares the three criteria in the one-class setting. Here, the inliers are white circles, the outliers are gray squares, and the selected batch queries are red diamonds. The black line is the decision boundary of a one-class classifier. The batch selected with the representativeness criterion clumps up in a small region and has high similarity. With diversity, the selected queries are well spread, but some queries lie in sparse regions where feedback might only influence the classification of very few observations. Informativeness selects observations close to the decision boundary, which results in a batch that is already diverse and representative, without considering these criteria explicitly.
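To make the three criteria concrete, here is a minimal sketch of how each can be scored. The function names and the specific proxies — the margin |f(x)| for informativeness, the negative mean Euclidean distance for representativeness, and the minimum pairwise distance for diversity — are illustrative assumptions for this sketch, not the actual API of OneClassActiveLearning.jl.

```julia
# Illustrative sketch only; not the OneClassActiveLearning.jl API.
using LinearAlgebra, Statistics

# Informativeness: proximity to the decision boundary, i.e., small |f(x)|.
informativeness(decision_values) = -abs.(decision_values)

# Representativeness: how central an observation is, scored here as the
# negative mean Euclidean distance to all observations (columns of `data`).
function representativeness(data)
    n = size(data, 2)
    [-mean(norm(data[:, i] - data[:, j]) for j in 1:n) for i in 1:n]
end

# Diversity of a batch: the minimum pairwise distance between selected queries.
function diversity(data, batch)
    minimum(norm(data[:, i] - data[:, j]) for i in batch, j in batch if i < j)
end
```

A batch query strategy can then trade these scores off against each other, e.g., by a weighted sum over candidate batches.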
This figure shows the impact of the three batch selection criteria on the performance of the classifier. The ternary plot shows the median end quality for each criterion in isolation (the corners) and for their weighted combinations. In the one-class setting, ignoring the representativeness criterion yields the best results: any weighted combination of informativeness and diversity, and even each of them in isolation, achieves the best results, shown as the red dots on the bottom edge of the plot.
The sweet spot is to use the top-k observations ranked by informativeness, with batch sizes between 8 and 16 observations. In this case, computational cost drops by up to an order of magnitude compared to sequential queries, i.e., selecting one observation at a time, without sacrificing classification accuracy.
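The top-k recommendation can be sketched in a few lines of Julia. Here, `decision_values` is a hypothetical vector of classifier scores f(x) for the unlabeled pool; this is not the package's actual query-strategy interface.

```julia
# Illustrative sketch of top-k selection by informativeness;
# not the OneClassActiveLearning.jl API.

# Pick the k unlabeled observations closest to the decision boundary,
# i.e., those with the smallest margin |f(x)|.
function topk_informative(decision_values::AbstractVector, k::Integer)
    partialsortperm(abs.(decision_values), 1:k)
end

# Example: the two smallest margins are at indices 4 and 2.
topk_informative([0.9, -0.1, 0.5, 0.05, -0.7], 2)
```

With k between 8 and 16, this corresponds to the batch sizes recommended above.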
For further results and details, we refer to the companion paper.
We welcome contributions to the packages and bug reports on GitHub.
For questions and comments, please contact email@example.com