Supplementary material concerning repeatability of
CMI: An Information-Theoretic Contrast Measure for Enhancing Subspace Cluster and Outlier Detection

by Nguyen H. V., Müller E., Vreeken J., Keller F., and Böhm K.
published in Proc. SIAM International Conference on Data Mining (SDM 2013), Austin, Texas, USA (2013)

This page provides additional information about our experiments and assists in reproducing the results of our CMI algorithm. It supplements our publication [download full text PDF], which was published at the SDM 2013 conference.
 

Computation of CMI Results


To enable repeatability of our results, we provide the CMI algorithm and all data sets used in our experiments. CMI is executed from the command line and accepts the following parameters:

Parameter               Meaning
-FILE_INPUT             name of input file
-FILE_SUB_OUTPUT        name of output file for subspaces
-FILE_OSCORES_OUTPUT    name of output file for outlier scores
-NUM_ROWS               number of records
-NUM_MEASURE_COLS       number of columns
-FIELD_DELIMITER        field delimiter of the input file
-NUM_NEIGHBORS          number of nearest neighbors
-USE_DUSO_SEED          set to 'true' to use CMIC
-MAX_NUM_SUBSPACES      top subspaces used
-ALPHA                  subsample's size
-NUM_SUBSAMPLING        number of subsamples
-NUM_SEEDS              number of clusters
-CANDIDATE_CUTOFF       size of beam
-MIN_PTS                used for clustering
-EPSILON                used for clustering
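
The parameter list above can be turned into a concrete call along the following lines. This is only a sketch: the page lists the parameter names but not the exact invocation syntax, so launching the jar with `java -jar` and passing each parameter as a `-NAME value` pair are assumptions, and the file names and data sizes are hypothetical placeholders.

```python
import shlex


def build_cmi_command(params, jar_path="CMI.jar"):
    """Assemble an argument list for launching CMI.

    Assumption: the jar is started with `java -jar` and every parameter
    is passed as a `-NAME value` pair. Verify this against the jar itself.
    """
    command = ["java", "-jar", jar_path]
    for name, value in params.items():
        command += ["-" + name, str(value)]
    return command


# Hypothetical example values. File names and data sizes are placeholders;
# mapping ALPHA, NUM_SEEDS, CANDIDATE_CUTOFF and MAX_NUM_SUBSPACES to the
# documented defaults (0.1, 10, 400, 100) is an assumption based on the
# parameter meanings listed above.
example_params = {
    "FILE_INPUT": "input.csv",
    "FILE_SUB_OUTPUT": "subspaces.txt",
    "FILE_OSCORES_OUTPUT": "outlier_scores.txt",
    "NUM_ROWS": 1000,
    "NUM_MEASURE_COLS": 20,
    "FIELD_DELIMITER": ";",
    "NUM_NEIGHBORS": 150,
    "MAX_NUM_SUBSPACES": 100,
    "ALPHA": 0.1,
    "NUM_SEEDS": 10,
    "CANDIDATE_CUTOFF": 400,
}

if __name__ == "__main__":
    # Print a shell-quoted command line that can be inspected before running.
    print(shlex.join(build_cmi_command(example_params)))
```

Building the command as a list and only joining it for display keeps values such as the field delimiter correctly quoted when the command is later passed to a process launcher.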
 

Default Parameter Settings

  • CMI:
    • Beam size M = 400 (set M = 32 for data sets with fewer than 10 dimensions)
    • Number of clusters Q = 10
    • Expected subsample size ε = 0.1
    • Number of subspaces = 100
  • LOF:
    • Number of nearest neighbors: 0.8 * number of outliers ≤ k ≤ 1.75 * number of outliers
  • DBSCAN:
    • minPts = 6 and 25 ≤ epsilon ≤ 35
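
The default rules above for the beam size and the LOF neighborhood can be written out as small helpers. This is an illustrative sketch, not part of the CMI distribution: the function names are hypothetical, and rounding the LOF bounds inward to whole numbers is an assumption made because k must be an integer.

```python
def default_beam_size(num_dimensions):
    """Beam size M: 32 for data sets with fewer than 10 dimensions, else 400."""
    return 32 if num_dimensions < 10 else 400


def lof_neighbor_range(num_outliers):
    """Integer values of k with 0.8 * outliers <= k <= 1.75 * outliers.

    Integer arithmetic is used (0.8 = 4/5, 1.75 = 7/4) to avoid
    floating-point rounding at the boundaries.
    """
    lower = -(-4 * num_outliers // 5)  # ceiling of 0.8 * num_outliers
    upper = 7 * num_outliers // 4      # floor of 1.75 * num_outliers
    return lower, upper


if __name__ == "__main__":
    print(default_beam_size(8))
    print(lof_neighbor_range(20))
```

For a data set with 20 known outliers, for example, the helper yields the range 16 ≤ k ≤ 35.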
 

Parameter Settings for Synthetic Data:

Number of nearest neighbors = 150

Parameter Settings for Real World Data:

Data Set        Number of Nearest Neighbors    Size of Beam    Number of Subspaces
Ann-thyroid*    100                            32              16
WBCD            30                             400             100
Diabetes        268                            32              16
Glass           15                             32              16
Ion             150                            400             100
Lymphography    10                             400             100
Madelon         38                             400             100
Segment         35                             400             100
Pendigits       30                             400             100
(*) 15 categorical attributes are removed
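
For scripted reruns, the per-data-set settings above can be kept in a small lookup table. The sketch below copies the values from the table; the helper name is hypothetical, and mapping the three columns to the parameters NUM_NEIGHBORS, CANDIDATE_CUTOFF, and MAX_NUM_SUBSPACES is an assumption based on the parameter meanings given earlier on this page.

```python
# (number of nearest neighbors, size of beam, number of subspaces)
REAL_WORLD_SETTINGS = {
    "Ann-thyroid": (100, 32, 16),  # 15 categorical attributes removed
    "WBCD": (30, 400, 100),
    "Diabetes": (268, 32, 16),
    "Glass": (15, 32, 16),
    "Ion": (150, 400, 100),
    "Lymphography": (10, 400, 100),
    "Madelon": (38, 400, 100),
    "Segment": (35, 400, 100),
    "Pendigits": (30, 400, 100),
}


def settings_for(data_set):
    """Return the table row as a dict keyed by (assumed) CMI parameter names."""
    neighbors, beam, subspaces = REAL_WORLD_SETTINGS[data_set]
    return {
        "NUM_NEIGHBORS": neighbors,
        "CANDIDATE_CUTOFF": beam,        # assumed to be the beam size
        "MAX_NUM_SUBSPACES": subspaces,  # assumed to be the subspace count
    }


if __name__ == "__main__":
    print(settings_for("Glass"))
```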
 

Download


The reviewer is expected to agree to confidentiality requirements with respect to non-disclosure of data on this website, as the reviewer does for any paper under review. Usage is limited to repeating and exploring the experimental results of this paper. Until this work has been published, no other use is allowed, in particular not for other publications. This website documents the experimental setup used in the evaluation described in our manuscript. Additional experimental data, setup, and software will be made available when the manuscript is published.

All benchmark data sets can be found in the following file: data.zip
It contains all synthetic data sets as well as the benchmark data used from the UCI Machine Learning Repository.

To reproduce our results for CMI, we provide the executables and source code of our method: CMI.jar

For evaluation of outlier results we provide an additional executable assisting in the calculation of AUC measures: run.jar (two input parameters: input file name and output file name)
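
As a cross-check for the AUC values, the area under the ROC curve can also be computed directly from outlier scores and ground-truth labels. The sketch below is not the implementation in run.jar, and it does not read that tool's file format (which is not documented here). It assumes that a higher score means "more outlying" and that labels are 1 for outliers and 0 for inliers, and it uses the standard rank-sum formulation with tied scores counted as half.

```python
def roc_auc(scores, labels):
    """AUC: probability that a random outlier scores above a random inlier.

    Assumptions: higher score = more outlying; labels are truthy for
    outliers and falsy for inliers. Tied scores receive average ranks.
    """
    pairs = sorted(zip(scores, labels), key=lambda pair: pair[0])
    n = len(pairs)
    n_pos = sum(1 for _, label in pairs if label)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one outlier and one inlier")

    rank_sum_pos = 0.0
    i = 0
    while i < n:
        # Find the block of tied scores starting at position i.
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        # The block covers 1-based ranks i+1 .. j; use their average.
        average_rank = (i + 1 + j) / 2.0
        positives_in_block = sum(1 for k in range(i, j) if pairs[k][1])
        rank_sum_pos += average_rank * positives_in_block
        i = j

    # Mann-Whitney U statistic of the outliers, normalized to [0, 1].
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


if __name__ == "__main__":
    print(roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```

In the small example, three of the four outlier/inlier pairs are ordered correctly, so the AUC is 0.75.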