Finding the Sweet Spot: Batch Selection for One-Class Active Learning

Adrian Englhardt, Holger Trittenbach, Dennis Vetter, Klemens Böhm

This is the companion website for the manuscript

Adrian Englhardt, Holger Trittenbach, Dennis Vetter, Klemens Böhm, “Finding the Sweet Spot: Batch Selection for One-Class Active Learning”. In: Proceedings of the 2020 SIAM International Conference on Data Mining (SDM), May 7-9, Cincinnati, Ohio, USA. [PDF]

@inproceedings{englhardt2020bocal,
  title={Finding the Sweet Spot: Batch Selection for One-Class Active Learning},
  author={Englhardt, Adrian and Trittenbach, Holger and Vetter, Dennis and B{\"o}hm, Klemens},
  booktitle={Proceedings of the 2020 SIAM International Conference on Data Mining},
  year={2020},
  organization={SIAM}
}

This website provides the code and a description to reproduce the experiments and analyses. The description covers the full experimental pipeline, from preprocessing the raw data to generating the plots and tables shown in the paper.

Resources

The resources are divided into several repositories.

For an overview and a benchmark on one-class active learning, visit the ocal project website.

The code is licensed under an MIT License and the result data under a Creative Commons Attribution 4.0 International License.

Overview

Active learning incorporates external feedback into the learning process of a classifier. When multiple annotators are available, or classifier retraining is slow, it is useful to annotate batches of observations in parallel. Selecting a good batch is difficult because there are several trade-offs between classifier training, batch selection, annotation effort, and classification accuracy. In the one-class setting for outlier detection, batch selection has not received any attention so far, and queries are generally selected sequentially. In our article, we strive to find a sweet spot between the cost of batch-mode active learning and classification accuracy by formalizing batch selection as an optimization problem. We then propose and evaluate several batch selection strategies for the one-class setting.

Literature has identified three criteria for batch selection: informativeness, representativeness, and diversity. The figure compares the three criteria in the one-class setting. Inliers are white circles, outliers are gray squares, and the selected batch queries are red diamonds. The black line is the decision boundary of a one-class classifier. The batch selected with the representativeness criterion clumps up in a small region and has high similarity. With diversity, the selected queries are well spread, but some queries lie in sparse regions where feedback might only influence the classification of very few observations. Informativeness selects observations close to the decision boundary, which results in a batch that is already diverse and representative, without considering these criteria explicitly.
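To make the criteria concrete, the following Python sketch scores candidate observations with simple proxies: distance to the decision boundary for informativeness, a kernel density estimate for representativeness, and the minimum distance to the current batch for diversity. The function names and the proxies are illustrative assumptions, not the exact formulations from the paper.

import numpy as np
from scipy.stats import gaussian_kde

# Illustrative scoring proxies for the three batch-selection criteria.
# The exact formulations in the paper differ; this is only a sketch.

def informativeness(clf, X_cand):
    # Observations close to the decision boundary score highest.
    return -np.abs(clf.decision_function(X_cand))

def representativeness(X_all, X_cand):
    # Observations in dense regions score highest, approximated
    # here with a kernel density estimate over all observations.
    kde = gaussian_kde(X_all.T)
    return kde(X_cand.T)

def diversity(X_cand, batch):
    # Observations far from every current batch member score highest
    # (minimum Euclidean distance to the batch).
    if len(batch) == 0:
        return np.full(len(X_cand), np.inf)
    batch = np.asarray(batch)
    dists = np.linalg.norm(X_cand[:, None, :] - batch[None, :, :], axis=2)
    return dists.min(axis=1)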

Evaluation Example

This figure shows the impact of the three batch selection criteria on classifier performance. The ternary plot shows the median end quality for each criterion in isolation (the corners) and for their weighted combinations. In the one-class setting, ignoring the representativeness criterion yields the best results: any weighted combination of informativeness and diversity, and even each of them in isolation, achieves the best results (the red dots on the bottom edge of the plot).
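One plausible way to realize such weighted combinations, assuming min-max normalized scores combined linearly and a greedy batch construction, is the following sketch, which reuses the scoring functions above. The function select_batch and its weighting scheme are our illustration, not code from the paper.

def select_batch(clf, X_all, X_cand, k, w_inf, w_rep, w_div):
    # Greedily grow a batch of k queries; weights summing to one
    # correspond to one point in the ternary plot.
    def minmax(s):
        s = np.asarray(s, dtype=float)
        span = s.max() - s.min()
        return (s - s.min()) / span if span > 0 else np.zeros_like(s)

    inf_scores = minmax(informativeness(clf, X_cand))
    rep_scores = minmax(representativeness(X_all, X_cand))
    batch_idx = []
    for _ in range(k):
        if batch_idx:
            div_scores = minmax(diversity(X_cand, X_cand[batch_idx]))
        else:
            div_scores = np.zeros(len(X_cand))  # vacuous for an empty batch
        combined = w_inf * inf_scores + w_rep * rep_scores + w_div * div_scores
        combined[batch_idx] = -np.inf  # never query the same point twice
        batch_idx.append(int(np.argmax(combined)))
    return batch_idx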

The sweet spot is to use the top-k observations ranked by informativeness, with batch sizes between 8 and 16 observations. This reduces computational cost by up to an order of magnitude compared to sequential queries, i.e., selecting one observation at a time, without sacrificing classification accuracy.
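In terms of the sketches above, this sweet-spot strategy reduces to a plain top-k ranking by informativeness. A minimal, hypothetical example, using scikit-learn's OneClassSVM as a stand-in one-class classifier:

from sklearn.svm import OneClassSVM

def select_batch_topk(clf, X_cand, k=10):
    # Sweet-spot strategy: rank all candidates by informativeness once
    # and query the k highest-ranked observations (k around 8 to 16).
    scores = informativeness(clf, X_cand)
    return np.argsort(scores)[-k:][::-1]

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
clf = OneClassSVM(gamma="scale", nu=0.1).fit(X)
print(select_batch_topk(clf, X, k=8))  # indices of the 8 selected queries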

For further results and details, we refer to the companion paper.

Contact

We welcome contributions to the packages and bug reports on GitHub.

For questions and comments, please contact adrian.englhardt@kit.edu