September 1999
Important note: I have no other information about this topic. All information available to me is either below, or on a web page linked to this one.
For the results of the other KDD'99 contest, the knowledge discovery contest, see here.
The task for the classifier learning contest organized in conjunction with the KDD'99 conference was to learn a predictive model (i.e. a classifier) capable of distinguishing between legitimate and illegitimate connections in a computer network. Here is a detailed description of the task. The training and test data were generously made available by Prof. Sal Stolfo of Columbia University and Prof. Wenke Lee of North Carolina State University. (Update: The training and test datasets are now available in the UC Irvine KDD archive.)
In total 24 entries were submitted for the contest. There was a data quality issue with the labels of the test data, which fortunately was discovered by Ramesh Agarwal (IBM Fellow) and Mahesh Joshi (University of Minnesota Ph.D. candidate) before results were announced publicly. Ramesh Agarwal and Mahesh Joshi analyzed the issue in great detail, so we are confident that the corrected test labels are accurate. Other participants, including Itzhak Levin of LLSoft, Inc., also detected and analyzed the issue.
It is important to note that the data quality issue affected only the labels of the test examples. The training data was unaffected, as was the unlabeled test data. Therefore it was not necessary to ask participants to submit recomputed entries.
Each entry was scored against the corrected test data by a scoring awk script
using the published cost matrix (see below) and the true labels of the test
examples.
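For concreteness, here is a minimal Python sketch of the scoring computation (the contest itself used an awk script, not this code), assuming the true and predicted labels of the test examples are given as lists of category indices 0 through 4:

```python
# Category indices: 0 = normal, 1 = probe, 2 = DOS, 3 = U2R, 4 = R2L.
# COST[actual][predicted] is the published cost matrix reproduced later on this page.
COST = [
    [0, 1, 2, 2, 2],  # actual normal
    [1, 0, 2, 2, 2],  # actual probe
    [2, 1, 0, 2, 2],  # actual DOS
    [3, 2, 2, 0, 2],  # actual U2R
    [4, 2, 2, 2, 0],  # actual R2L
]

def mean_cost(true_labels, predicted_labels):
    """Score of one entry: the average cost per test example."""
    total = sum(COST[a][p] for a, p in zip(true_labels, predicted_labels))
    return total / len(true_labels)
```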
Second-place performance was achieved by Itzhak Levin from LLSoft, Inc. using the tool Kernel Miner. For details see here.
Third-place performance was achieved by Vladimir Miheev, Alexei Vopilov, and Ivan Shabalin of the company MP13 in Moscow, Russia. For details see here.
The difference in performance between the three best entries is only of
marginal statistical significance; see below for a discussion of this
question. Congratulations to all the winners and thanks to all the
participants!
The winning entry achieved the following confusion matrix (rows are actual categories, columns are predicted categories):
| actual \ predicted | 0 | 1 | 2 | 3 | 4 | %correct |
| 0 | 60262 | 243 | 78 | 4 | 6 | 99.5% |
| 1 | 511 | 3471 | 184 | 0 | 0 | 83.3% |
| 2 | 5299 | 1328 | 223226 | 0 | 0 | 97.1% |
| 3 | 168 | 20 | 0 | 30 | 10 | 13.2% |
| 4 | 14527 | 294 | 0 | 8 | 1360 | 8.4% |
| %correct | 74.6% | 64.8% | 99.9% | 71.4% | 98.8% | |
In the table above, the five categories are numbered as follows:
| 0 | normal |
| 1 | probe |
| 2 | denial of service (DOS) |
| 3 | user-to-root (U2R) |
| 4 | remote-to-local (R2L) |
Individual attack types were placed in the five categories using a categorization awk script.
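The categorization script is not reproduced here; the sketch below illustrates the idea with a handful of well-known attack types (the real mapping covers every attack type appearing in the datasets):

```python
# Illustrative (not exhaustive) mapping from individual attack types to the
# five categories used in the contest; the actual mapping was an awk script.
CATEGORY = {
    "normal":          0,  # normal
    "ipsweep":         1,  # probe
    "portsweep":       1,
    "satan":           1,
    "neptune":         2,  # denial of service
    "smurf":           2,
    "buffer_overflow": 3,  # user-to-root
    "rootkit":         3,
    "guess_passwd":    4,  # remote-to-local
    "warezclient":     4,
}

def categorize(attack_type):
    return CATEGORY[attack_type]
```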
The top-left entry in the confusion matrix shows that 60262 of the actual
"normal" test examples were predicted to be normal by this entry. The last
column indicates that in total 99.5% of the actual "normal" examples were
recognized correctly. The bottom row shows that 74.6% of test examples
said to be "normal" were indeed "normal" in reality.
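As a sketch, the row and column percentages can be computed from the confusion matrix as follows, using the winning entry's matrix above:

```python
# matrix[a][p] = number of test examples of actual category a predicted as category p.
winning = [
    [60262,  243,     78,  4,    6],
    [  511, 3471,    184,  0,    0],
    [ 5299, 1328, 223226,  0,    0],
    [  168,   20,      0, 30,   10],
    [14527,  294,      0,  8, 1360],
]

def detection_rates(matrix):
    """Last column: fraction of each actual category classified correctly."""
    return [row[a] / sum(row) for a, row in enumerate(matrix)]

def precisions(matrix):
    """Bottom row: fraction of each predicted category that was actually correct."""
    n = len(matrix)
    return [matrix[p][p] / sum(matrix[a][p] for a in range(n)) for p in range(n)]

# detection_rates(winning)[0] -> 0.9945...  (the 99.5% in the last column)
# precisions(winning)[0]      -> 0.7461...  (the 74.6% in the bottom row)
```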
The scores (mean cost per test example) of all 24 entries, in increasing order, were:
| 0.2331 | 0.2356 | 0.2367 | 0.2411 | 0.2414 | 0.2443 |
| 0.2474 | 0.2479 | 0.2523 | 0.2530 | 0.2531 | 0.2545 |
| 0.2552 | 0.2575 | 0.2588 | 0.2644 | 0.2684 | 0.2952 |
| 0.3344 | 0.3767 | 0.3854 | 0.3899 | 0.5053 | 0.9414 |
It is difficult to evaluate exactly the statistical significance of differences between entries. However it is important not to read too much into differences that may well be statistically insignificant, i.e. due to randomness in the choice of training and test examples.
Statistical significance can be evaluated roughly as follows. The mean score of the winning entry as measured on the particular test set used is 0.2331, with a measured standard deviation of 0.8334. Assuming that the mean is computed from N independent observations, its standard error is 0.8334/sqrt(N). The test dataset contains 311,029 examples, but these are not all independent. An upper bound on the number of independent test examples is the number of distinct test examples, which is 77,291. The standard error of the winning mean score is therefore at least 0.8334/sqrt(77291) = 0.0030.
If the threshold for statistical significance is taken to be two standard errors, then the winning entry is significantly superior to all others except the second and third best.
The first significant difference between entries with adjacent ranks is
between the 17th and 18th best entries. This difference is very large:
0.2952 - 0.2684 = 0.0268, which is about nine standard errors. One can
conclude that the best 17 entries all performed well, while the worst 7 entries
were definitely inferior.
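This back-of-the-envelope calculation can be transcribed directly into Python; every number below is taken from this page:

```python
import math

# Scores of the 24 entries, in increasing order (see the table above).
scores = [0.2331, 0.2356, 0.2367, 0.2411, 0.2414, 0.2443,
          0.2474, 0.2479, 0.2523, 0.2530, 0.2531, 0.2545,
          0.2552, 0.2575, 0.2588, 0.2644, 0.2684, 0.2952,
          0.3344, 0.3767, 0.3854, 0.3899, 0.5053, 0.9414]

sd = 0.8334          # measured standard deviation of the winning entry's cost
n_distinct = 77291   # upper bound on the number of independent test examples

se = sd / math.sqrt(n_distinct)   # standard error, about 0.0030
threshold = 2 * se                # two standard errors, about 0.0060

# Entries whose gap to the winner exceeds the threshold: only the 2nd and 3rd
# best entries (gaps 0.0025 and 0.0036) fall below it.
beaten_significantly = [s for s in scores[1:] if s - scores[0] > threshold]

# Gap between the 17th and 18th best entries, in units of standard error:
gap_in_se = (scores[17] - scores[16]) / se   # (0.2952 - 0.2684) / 0.0030, about 9
```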
For comparison, a simple 1-nearest-neighbor classifier achieved the following confusion matrix:
| actual \ predicted | 0 | 1 | 2 | 3 | 4 | %correct |
| 0 | 60322 | 212 | 57 | 1 | 1 | 99.6% |
| 1 | 697 | 3125 | 342 | 0 | 2 | 75.0% |
| 2 | 6144 | 76 | 223633 | 0 | 0 | 97.3% |
| 3 | 209 | 5 | 1 | 8 | 5 | 3.5% |
| 4 | 15785 | 308 | 1 | 0 | 95 | 0.6% |
| %correct | 72.5% | 83.9% | 99.8% | 88.9% | 92.2% | |
Only nine entries scored better than 1-nearest neighbor, of which only six
were statistically significantly better. Compared to 1-nearest neighbor,
the main achievement of the winning entry is to recognize correctly many more
"remote-to-local" attacks: 1360 compared to 95. This success is
impressive, since only 0.23% of training examples were in this category compared
to 5.20% of the test examples. The detailed task description did
point out that "It is important to note that the test data is not from the same
probability distribution as the training data".
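The exact preprocessing behind the 1-nearest-neighbor baseline is not documented here. The following is a hypothetical modern reconstruction using scikit-learn, in which the three symbolic features (protocol_type, service, flag) are one-hot encoded and the numeric features are standardized before applying Euclidean-distance 1-NN; these preprocessing choices are assumptions, not a description of how the baseline above was actually computed:

```python
# Hypothetical 1-nearest-neighbor baseline sketch; not the contest's actual baseline code.
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

SYMBOLIC = [1, 2, 3]  # protocol_type, service, flag among the 41 features
NUMERIC = [i for i in range(41) if i not in SYMBOLIC]

baseline = make_pipeline(
    ColumnTransformer([
        ("symbolic", OneHotEncoder(handle_unknown="ignore"), SYMBOLIC),
        ("numeric", StandardScaler(), NUMERIC),
    ]),
    KNeighborsClassifier(n_neighbors=1),
)
# baseline.fit(X_train, y_train)          # y: category indices 0-4
# predictions = baseline.predict(X_test)
```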
The published cost matrix used for scoring entries is:
| | normal | probe | DOS | U2R | R2L |
| normal | 0 | 1 | 2 | 2 | 2 |
| probe | 1 | 0 | 2 | 2 | 2 |
| DOS | 2 | 1 | 0 | 2 | 2 |
| U2R | 3 | 2 | 2 | 0 | 2 |
| R2L | 4 | 2 | 2 | 2 | 0 |
Here, as in the confusion matrices above, columns correspond to predicted categories, while rows correspond to actual categories. The cost matrix implies that the cost incurred by classifying all examples as "probe" is close to 1.0 whenever the categories U2R and R2L are rare. These two categories are in fact rare in the training dataset, and the mean cost incurred by classifying all examples as "probe" is 0.994 on the training dataset.
Some basic domain knowledge about network intrusions suggests that the U2R and R2L categories are intrinsically rare. The actual distributions of the five categories in the "10%" training dataset and in the test dataset are:
| category | training | test |
| 0 (normal) | 19.69% | 19.48% |
| 1 (probe) | 0.83% | 1.34% |
| 2 (DOS) | 79.24% | 73.90% |
| 3 (U2R) | 0.01% | 0.07% |
| 4 (R2L) | 0.23% | 5.20% |
Together, the U2R and R2L attacks constitute 5.27% of the test dataset, which is a substantial increase compared to the training dataset, but still a small fraction. The mean cost incurred by classifying all test examples as "denial of service" (DOS), the most common category, is 0.5220, which is better than the average cost achieved by the worst entry submitted.
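The two trivial-classifier costs quoted above follow directly from the cost matrix and the distributions in the table; here is a small sketch of the arithmetic:

```python
# Expected cost per example of always predicting a single category,
# using the published cost matrix and the distributions tabulated above.
COST = [
    [0, 1, 2, 2, 2],
    [1, 0, 2, 2, 2],
    [2, 1, 0, 2, 2],
    [3, 2, 2, 0, 2],
    [4, 2, 2, 2, 0],
]
TRAIN = [0.1969, 0.0083, 0.7924, 0.0001, 0.0023]  # categories 0-4
TEST  = [0.1948, 0.0134, 0.7390, 0.0007, 0.0520]

def trivial_cost(distribution, predicted):
    return sum(p * COST[actual][predicted] for actual, p in enumerate(distribution))

# trivial_cost(TRAIN, 1) -> 0.994   (always guess "probe" on the training data)
# trivial_cost(TEST, 2)  -> 0.522   (always guess "DOS" on the test data)
```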