HiCS: High Contrast Subspaces for Density-Based Outlier Ranking
Fabian Keller, Emmanuel Müller, Klemens Böhm
All datasets used in the following experiments can be downloaded here:
Experiments on Synthetic Data
- The MinPts setting for the underlying LOF algorithm was always set to 100 for experiments on synthetic data.
- All subspace outlier mining approaches were limited to the top 100 subspaces of the subspace-search result.
In summary, this leads to a total number of 21 * 42 = 882 experiments
(corresponding to a processing time of about 5 days, mainly due to RIS).
- In total, 21 synthetic datasets were generated: 3 datasets for each dimensionality in [10, 20, 30, 40, 50, 75, 100].
- We ran several algorithm configurations on all these datasets. The algorithm configurations were:
- LOF (1 configuration, same MinPts setting as all subspace approaches)
- PCALOF1 (1 configuration)
with a dimensionality reduction to
50% of the dataset dimensionality.
- PCALOF2 (1 configuration)
with a fixed dimensionality reduction
to exactly 10 dimensions.
- ENCLUS (27 configurations) with the following parameter combinations:
  - Number of bins per dimension ∈ [5, 7, 10]
  - ω ∈ [7.0, 8.0, 9.0]
  - ε ∈ [0.001, 0.01, 0.02]
- RIS (9 configurations) with the following parameter combinations:
  - ε ∈ [0.05, 0.1, 0.2]
  - MinPts value for core objects ∈ [10, 20, 30]
- RANDSUB (1 configuration)
using 100 random subspaces.
- HiCS (2 configurations): Only the type of statistical test was varied. According to the results of preliminary experiments, a specific search for parameter values is not necessary for HiCS. We used the default values:
  - Number of runs = 50
  - α = 0.1
  - Candidate cutoff = 400
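The experiment count stated above can be checked by enumerating the configuration grid. The following is a minimal sketch, not the scripts used for the paper: the parameter key names and the two statistical-test labels (`test_a`, `test_b`) are hypothetical placeholders, since this page does not name them.

```python
from itertools import product

# Dataset dimensionalities and repetitions as stated in the list above.
DIMENSIONALITIES = [10, 20, 30, 40, 50, 75, 100]
DATASETS_PER_DIMENSIONALITY = 3


def build_configurations():
    """Enumerate every algorithm configuration as a (name, params) pair.

    All dictionary keys are illustrative names, not real tool options.
    """
    configs = [
        ("LOF", {}),
        ("PCALOF1", {"target_dim_fraction": 0.5}),
        ("PCALOF2", {"target_dim": 10}),
        ("RANDSUB", {"num_subspaces": 100}),
    ]
    # ENCLUS: 3 x 3 x 3 = 27 parameter combinations.
    for bins, omega, eps in product([5, 7, 10], [7.0, 8.0, 9.0], [0.001, 0.01, 0.02]):
        configs.append(("ENCLUS", {"bins": bins, "omega": omega, "epsilon": eps}))
    # RIS: 3 x 3 = 9 parameter combinations.
    for eps, min_pts in product([0.05, 0.1, 0.2], [10, 20, 30]):
        configs.append(("RIS", {"epsilon": eps, "core_min_pts": min_pts}))
    # HiCS: only the statistical test varies; the labels are placeholders.
    for test in ("test_a", "test_b"):
        configs.append(
            ("HiCS", {"statistical_test": test, "runs": 50,
                      "alpha": 0.1, "candidate_cutoff": 400})
        )
    return configs


configurations = build_configurations()
num_datasets = len(DIMENSIONALITIES) * DATASETS_PER_DIMENSIONALITY
num_experiments = num_datasets * len(configurations)
print(len(configurations), num_datasets, num_experiments)  # prints: 42 21 882
```

The MinPts value of 100 and the top-100 subspace limit apply to every run, so they are left out of the per-configuration parameters here.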
Experiments on Real World Data
The following experiments were performed with the best configuration of each
algorithm from the experiments on synthetic data. We applied a
standardized preprocessing procedure to all datasets (rescaling all
attributes, removing categorical attributes and attributes that
show strong discretization effects). The arff-files included in the
downloadable zip-archive contain the results of these preprocessing
steps. Furthermore, we also stored our outlier definition in these files
(always in the last attribute, 0 = no outlier, 1 = outlier). The
following figures show the ROC plots of all experiments.
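For readers who want to reproduce such plots from the 0/1 labels in the last attribute, a ROC curve can be derived from any outlier ranking roughly as follows. This is a minimal sketch under our own assumptions: the function names are illustrative, higher scores are taken to mean "more outlying", and the scores and labels in the example are made up for illustration and are not results from the paper.

```python
def roc_points(scores, labels):
    """Return ROC points (false positive rate, true positive rate).

    scores: outlier scores, higher = more outlying (assumption).
    labels: 1 = outlier, 0 = no outlier, as in the arff-files.
    Objects with tied scores are processed together as one step.
    """
    num_pos = sum(labels)
    num_neg = len(labels) - num_pos
    if num_pos == 0 or num_neg == 0:
        raise ValueError("need at least one outlier and one non-outlier")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        j = i
        # Consume all objects sharing the current score.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / num_neg, tp / num_pos))
        i = j
    return points


def area_under_curve(points):
    """Trapezoidal area under a list of (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy example (made-up values): two outliers among five objects.
example_scores = [0.9, 0.8, 0.4, 0.3, 0.1]
example_labels = [1, 0, 1, 0, 0]
example_auc = area_under_curve(roc_points(example_scores, example_labels))
print(round(example_auc, 4))  # prints: 0.8333
```

In the toy example, 5 of the 6 outlier/non-outlier pairs are ranked correctly, which matches the area of 5/6.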
The reviewer is expected to agree to confidentiality requirements
with respect to non-disclosure of data on this website, as
the reviewer does for any paper under review. Usage is limited to
repeating and exploring the experimental results of this paper. Until
this work has been published, no other use is allowed, especially
not for other publications. This website documents the
experimental setup used in the evaluation described in our manuscript.
Additional experimental data, setup, and software
will be made available when the manuscript is published.
Public access to this website
After publication of this work, we encourage researchers in this
area to use the proposed algorithm as a competitor in their own
publications. Our implementation will then be available for
use. Thus, all algorithms, data sets, and parameter settings will be
available to the community.