A Survey on Data Classification with the KDD-Cup-99 Intrusion Detection Dataset

by Akira Imada (March 2006)


An essentially good paper to start with.

From a data visualization point of view

Some of these collections are synthetic, that is, they were designed a priori to stress prediction algorithms in predetermined ways.

Dependence of Performance on User Knowledge and Expertise. To illustrate the importance of factoring user knowledge and expertise into a benchmarking effort, we relate some of our experiences with the Information Exploration Shootout's [Grins97] first exercise, which involved the detection of intrusions into a computer network. The two major challenges for the participants were the complexity of the problem domain and the size of the database. Details of the internet protocol and internet operations are arcane, and to adequately address, let alone solve, the problem, an expert in the field is necessary. The central need for a domain expert is typically a common feature of real-world knowledge discovery problems. The skills that we found to be required in our approach were: 1) domain knowledge of computer network security; 2) experience with visualization software; and 3) statistical expertise.

The first task of the shootout was a large preprocessing activity. We grouped the individual packet records into natural clusters of communications sessions. The resulting reduction in size was substantial. For example, the baseline data set contains over 350,000 records, while the corresponding session-level data set contains approximately 16,000 records. We then analyzed the processed data sets predominantly with visualization techniques; for many of these analyses we used parallel coordinate plots and conditioning. Even in this step, the visualization is driven by domain knowledge. With this approach, several anomalies were identified, and these turned out to be network intrusions when interpreted by a system administrator aware of various network attacks.
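As an illustration of this kind of session-level exploration, here is a minimal sketch using pandas' built-in parallel-coordinates plot; the file name sessions.csv and the "label" column are hypothetical stand-ins for the shootout data, which is not reproduced here.

```python
# A minimal sketch of session-level visualization with parallel coordinates.
# Assumes a hypothetical CSV "sessions.csv" with a categorical "label" column
# and numeric per-session features.
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

sessions = pd.read_csv("sessions.csv")               # ~16,000 session records
parallel_coordinates(sessions, "label", alpha=0.3)   # one polyline per session
plt.show()                                           # outliers appear as stray polylines
```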

This experience showed that there are a number of aspects of the process that have to be evaluated; much depends on domain expertise and on the amount of data involved. Even with this large but not huge set, the visualization required a scaling down of the problem. Any benchmark testing methodology must consider these complex requirements. Testing must also include a good understanding of the perceptual issues involved, as discussed in the next section.

For example, one could measure the output of the task of finding the outliers in a set.

In Section 4, a basic set of measures, such as the ability to identify outliers and clusters, is applied in the comparative evaluations done for the prototype.

This invites a tradeoff in using synthetic vs. real data. Synthetic data is harder to construct, but the "correct" answers are known. Real data is easier to collect, but it is harder to evaluate performance, because it is nearly impossible to "know" the correct output to a task in any reasonably sized data set.

This survey analyzed ten different data sets (Simple Seven; Balloons; Contact Lenses; Shuttle O-rings; Monks problem; Iris Flower; Congress Voting; Liver Disorders; Cars; Wines) using five visualization techniques (Parallel Coordinates; Scatter Plot Matrices; Survey Plots; Circle Segments; and Radviz). All of the data sets except the Simple Seven set are from the UC Irvine Machine Learning Repository.


The Iris database is perhaps the most often used data set in pattern recognition, statistics, data analysis, and machine learning. The task is to predict the class of the flower from the 4 physical attribute measurements. There are 150 instances, 3 classes, and 4 numeric attributes. The data set is #46 from the UC collection. The dimensions are: class (Setosa, Versicolour, Virginica); sepal-length; sepal-width; petal-length; and petal-width. The cardinalities are 35 (sepal-length), 23 (sepal-width), 43 (petal-length), 22 (petal-width), 3 (class), and 150 (cases); the PoC (product of cardinalities) is equal to 342,688,500, and the log of the PoC equals 19.65. One class is linearly separable from the other two, but the other two are not linearly separable from each other.
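The quoted PoC figures can be checked directly: taking PoC as the product of the listed cardinalities reproduces both numbers. A small verification sketch (not from the survey itself):

```python
# Verify the PoC (product of cardinalities) figures quoted above for Iris:
# 35, 23, 43, 22 distinct attribute values, 3 classes, and 150 cases.
import math

cardinalities = [35, 23, 43, 22, 3, 150]
poc = math.prod(cardinalities)
print(poc)            # 342688500, matching the quoted PoC
print(math.log(poc))  # ~19.65, matching the quoted log of the PoC
```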

Local normalization means that each dimension is scaled, using its own maximum and minimum, to values between 0 and 1. Global normalization scales all values, using the overall maximum and minimum, to values between 0 and 1.
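A minimal sketch of the two normalizations, assuming a small numeric matrix with rows as records and columns as dimensions:

```python
# Contrast local (per-column) and global (whole-matrix) min-max normalization.
import numpy as np

X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [4.0, 200.0]])

# Local: each column scaled by its own min and max.
local = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Global: every value scaled by the overall min and max of the matrix.
global_ = (X - X.min()) / (X.max() - X.min())
```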

The authors claimed, "Most data mining approaches are based on machine-learning techniques, numerical analysis, or statistical modeling. They use human interaction and visualization only minimally. Such automatic methods can miss some important features of the data. Incorporating human perception into the data mining process through interactive visualization can help us better understand the complex behaviors of computer network systems. This article describes visual-analytics-based solutions and outlines a visual exploration process for log analysis."

Because visual data mining differs in essence from automated methods, visual methods can discover information that complements the knowledge found by more commonly used statistical approaches. This is particularly useful when users want to explore the data only to learn about it; that is, they don't know exactly what they need to discover from the data. Many real-world problems, including intrusion detection, fit into this category.

From a Dimension Reduction point of view

The goal is to reduce the dimensionality of the KDD Cup 1999 intrusion detection dataset. The dataset contains 41 features of TCP traffic. Each feature vector is labeled as an attack or a non-attack. There are 22 types of attacks in the dataset. We divide the dataset into a training subset (containing 25% of randomly selected representative samples from the original dataset) and reserve the rest for testing.

The authors assert that what they call NNPCA (Neural Network PCA) and NLCA (Non-linear Component Analysis) successfully find the first 19 and 12 principal components, respectively, from the original 41 features, more effectively than standard PCA, which also reduces the dimensionality to 19.
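For reference, a hedged sketch of the standard-PCA baseline (41 features down to 19 components) using scikit-learn; the random matrices are placeholders for the KDD features, and the authors' NNPCA/NLCA variants are not reproduced here.

```python
# Standard PCA reducing 41 features to 19 components (baseline only).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_train = rng.random((1000, 41))   # placeholder for the 41 KDD features
X_test = rng.random((200, 41))

pca = PCA(n_components=19)                   # keep the first 19 principal components
X_train_19 = pca.fit_transform(X_train)      # fit on training data only
X_test_19 = pca.transform(X_test)
print(pca.explained_variance_ratio_.sum())   # fraction of variance retained
```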

We present the performance of the Non-linear classifier (NC) and the decision tree classifier (DC) in terms of detection accuracy and false positive rate.

The detection accuracy of an IDS is the percentage of attack samples detected.
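In code, the two metrics used throughout this survey might look as follows; a minimal sketch assuming binary labels, 1 = attack and 0 = normal:

```python
# Detection accuracy and false positive rate over binary numpy label arrays.
import numpy as np

def detection_accuracy(y_true, y_pred):
    """Percentage of attack samples (label 1) that were detected."""
    attacks = (y_true == 1)
    return 100.0 * np.mean(y_pred[attacks] == 1)

def false_positive_rate(y_true, y_pred):
    """Percentage of normal samples (label 0) flagged as attacks."""
    normals = (y_true == 0)
    return 100.0 * np.mean(y_pred[normals] == 1)

y_true = np.array([1, 1, 1, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 1, 0])
print(detection_accuracy(y_true, y_pred))   # 66.7: two of three attacks caught
print(false_positive_rate(y_true, y_pred))  # 33.3: one of three normals flagged
```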

The detection accuracies and false positive rates of DC were observed on four different decision tree models for each dataset. The models were obtained by pruning the complete decision tree until it shrank to approximately 50% of its original size. This reduction in size was achieved by varying the complexity parameter. The average detection accuracy of the DT classifier is 99.6438% over all the datasets. Varying the complexity parameter (0.0 to 0.000735) for experiments on ORIGDATA, the size of the decision tree models varied from 118 to 41 nodes, with a minimum false positive rate of 0.2268 for the full decision tree model (with 118 nodes). For PCADATA, the size of the decision tree models varied from 113 to 41 nodes (by varying the complexity parameter from 0.0 to 0.000711), with a minimum false positive rate of 0.2609 for the decision tree model with 113 nodes. Similarly, for NNPCADATA, the size of the decision tree models varied from 190 to 75 nodes (by varying the complexity parameter from 0.0 to 0.0002), with a minimum false positive rate of 0.4922 for the decision tree with 190 nodes. For NLDATA, the size of the decision tree models varied from 260 to 87 nodes (by varying the complexity parameter from 0.0 to 0.000145), with a minimum false positive rate of 0.8227 for the decision tree with 260 nodes. Our experiments with the decision tree models on NLDATA showed that the average detection accuracy of five decision tree models was very high (99.589%) even after pruning the complete decision tree to one-third of its original size.
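A hedged sketch of pruning by a complexity parameter, using scikit-learn's cost-complexity pruning (ccp_alpha) as an analogue of the parameter varied above; the data here is a random placeholder, not the KDD sets, and the exact parameterization may differ from the authors'.

```python
# Vary a complexity parameter and observe the resulting tree sizes.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((2000, 41))          # placeholder features
y = rng.integers(0, 2, 2000)        # placeholder binary labels

for alpha in [0.0, 0.0002, 0.000735]:
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X, y)
    print(alpha, tree.tree_.node_count)   # larger alpha -> smaller pruned tree
```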

Exploiting C4.5

"In a real application, one can never be sure that a set of available labeled examples covers all possible attacks. If a new attack appears, examples of it may not have been seen in training data."

A detailed description of the available features and attack instances can be found in [14, 6].

The supervised algorithms in general exhibit excellent classification accuracy on data with known attacks. The best results have been achieved by the C4.5 algorithm, which attains a 95% true positive rate at a 1% false positive rate. The next two best algorithms are the MLP and the SVM, both non-linear, followed by the local k-Nearest Neighbor algorithm.

The accuracy of supervised algorithms deteriorates significantly if unknown attacks are present in the test data.

Our experiments demonstrate that the supervised learning methods significantly outperform the unsupervised ones if the test data contains no unknown attacks. Furthermore, among the supervised methods, the best performance is achieved by the non-linear methods, such as SVM, multi-layer perceptrons, and the rule-based methods. In the presence of unknown attacks in the test data, the performance of all supervised methods drops significantly, SVM being the most robust to the presence of unknown attacks.

The KDD dataset covers four major categories of attacks: Probing attacks (information gathering attacks), Denial-of-Service (DoS) attacks (denying legitimate requests to a system), User-to-Root (U2R) attacks (unauthorized access to local super-user or root), and Remote-to-Local (R2L) attacks (unauthorized local access from a remote machine).
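For illustration, a small (and deliberately partial) mapping from well-known KDD-99 attack names to these four categories:

```python
# A few representative KDD-99 attack names mapped to their categories.
# This is illustrative only, not the full list of attack types.
ATTACK_CATEGORY = {
    "ipsweep": "Probe", "portsweep": "Probe", "satan": "Probe", "nmap": "Probe",
    "neptune": "DoS", "smurf": "DoS", "back": "DoS", "teardrop": "DoS",
    "buffer_overflow": "U2R", "rootkit": "U2R", "loadmodule": "U2R",
    "guess_passwd": "R2L", "ftp_write": "R2L", "warezclient": "R2L",
}
```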

The KDD dataset is divided into labeled and unlabeled records. Each labeled record consists of 41 attributes (features) and one target value, which indicates the attack category name.

There are around 5 million (4,898,430) records in the labeled dataset, which was used for training all classifier models discussed in this paper. A second unlabeled dataset (311,029 records) is provided as testing data.

The total number of records in the original labeled training dataset is 972,780 for Normal, 41,102 for Probe, 3,883,370 for DoS, 52 for U2R, and 1,126 for R2L attack classes. After filtering out the duplicate records, there were a total of 812,813 records for Normal, 13,860 for Probe, 247,267 for DoS, 52 for U2R, and 999 for R2L attack classes.
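The duplicate filtering described above is straightforward with pandas; a minimal sketch, assuming the labeled set is loaded with the 41 features plus the label as the 42nd column:

```python
# Filter duplicate records from the labeled KDD training data.
import pandas as pd

df = pd.read_csv("kddcup.data", header=None)   # full labeled set, ~4.9M rows
df = df.drop_duplicates()                      # keep one copy of each record
print(df[41].value_counts())                   # per-class counts after filtering
```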

As a survey: (1) Agarwal and Joshi [4] proposed a two-stage general-to-specific framework for learning a rule-based model (PNrule) on data sets that have widely different class distributions in the training data. The PNrule technique was evaluated on the KDD testing data set, which contained many new R2L attacks not present in the KDD training dataset. The proposed model was able to detect 73.2% of probing attacks, 96.9% of denial of service attacks, 6.6% of U2R attacks, and 10.7% of attacks in the R2L attack category. False alarms were generated at a level of less than 10% for all attack categories except for U2R: an unacceptably high false alarm rate of 89.5% was reported for the U2R category.

(2) "Using Kernel Miner tool and the KDD data set [5], Levin created a set of locally optimal decision trees (called the decision forest) from which optimal subset of trees (called the sub-forest) was selected for predicting new cases. Levin used only 10% of the KDD training data randomly sampled from the entire training data set. Multi-class detection approach was used to detect different attack categories in the KDD data set. The final trees gave very high detection rates for all classes including the R2L in the entire training data set. The proposed classifier achieved 84.5% detection for probing, 97.5% detection for denial of service, 11.8% detection for U2R, and only 7.32% detection for R2L attack category in the KDD testing data set. False alarm rates of 21.6%, 73.1%, 36.4%, and 1.7% were achieved for probing, DoS, U2R and R2L attack categories, respectively."

(3) Ertoz and Steinbach [6] used the shared nearest neighbor (SNN) technique, which is particularly suited to finding clusters in data of different sizes, densities, and shapes, mainly when the data contains a large amount of noise and outliers. All attack records were selected from the KDD training and testing data sets, with a cap of 10,000 records per attack type: there are a total of 36 attack types from the 4 attack categories. Also, 10,000 records were randomly picked from each of the training and testing data sets. In total, around 97,000 records were selected from the entire KDD data set. After removing duplicate KDD records, the data set shrank to 45,000 records. This set was then used to train two clustering algorithms, K-means and the proposed SNN technique, which were compared in terms of detection rates. The K-means algorithm, with the number of clusters set to 300, could detect 91.88% of probing, 97.85% of DoS, 5.6% of U2R, and 77.04% of the R2L attack records. The proposed SNN technique was able to detect 73.43% of probing, 77.76% of DoS, 37.82% of U2R, and 68.15% of the R2L records; note that no discussion of false alarm rates was reported by the authors. The results suggested that SNN performed better than K-means for the U2R attack category.

(4) Yeung and Chow [7] proposed a novelty detection approach using non-parametric density estimation, based on Parzen-window estimators with Gaussian kernels, to build an intrusion detection system using normal data only. This novelty detection approach was employed to detect attack categories in the KDD data set. 30,000 randomly sampled normal records from the KDD training data set were used as the training data set to estimate the density of the model. Another 30,000 randomly sampled normal records (also from the KDD training data set) formed the threshold determination set, which had no overlap with the training data set. The technique could detect 99.17% of probing, 96.71% of DoS, 93.57% of U2R, and 31.17% of R2L attacks in the KDD testing dataset as intrusive patterns; the authors did not report any information on false alarm rates.

Then, proceeding to the method proposed in this paper: "Attributes in the KDD datasets had all forms -- continuous, discrete, and symbolic, with significantly varying resolution and ranges. Most pattern classification methods are not able to process data in such a format."
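One common way to address this is sketched below with scikit-learn; this is an assumption about the kind of preprocessing needed, not the authors' exact procedure: one-hot encode the symbolic attributes and min-max scale the numeric ones.

```python
# One-hot encode the symbolic KDD attributes (protocol_type, service, flag)
# and min-max scale the remaining numeric columns.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

symbolic = ["protocol_type", "service", "flag"]   # the standard KDD symbolic columns
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), symbolic)],
    remainder=MinMaxScaler(),                     # scale all remaining numeric columns
)
# X_numeric = preprocess.fit_transform(df_features)  # df_features: a DataFrame of the 41 attributes
```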

Fuzzy Rule approaches

Denial of service (DoS) is a class of attack where an attacker makes a computing or memory resource too busy or too full to handle legitimate requests, thus denying legitimate users access to a machine. A remote-to-user (R2L) attack -- unauthorized access from a remote machine -- is a class of attack where an attacker sends packets to a machine over a network, then exploits the machine's vulnerability to illegally gain local access as a user. User-to-root (U2Su) exploits -- unauthorized access to local super user (root) -- are a class of attacks where an attacker starts out with access to a normal user account on the system and is able to exploit a vulnerability to gain root access to the system. Probing -- surveillance and other probing -- is a class of attack where an attacker scans a network to gather information or find known vulnerabilities. An attacker with a map of machines and services that are available on a network can use the information to look for exploits.

This data set has five different classes, namely Normal, DoS, R2L, U2R, and Probes. The training and test sets comprise 5,092 and 6,890 records, respectively. Using all 41 variables could result in a big IDS model and substantial overhead for online detection. All the training data were scaled to (0, 1). The decision tree approach described before reduced the number of variables to 12 significant features.

Using 12 attributes, most of the classifiers performed very well except the fuzzy classifiers (FR1 and FR2). For detecting U2R attacks, FR2 gave the best accuracy. However, due to the tremendous reduction in the number of attributes (about 70% fewer), we are able to design a computationally efficient (lightweight) IDS. Since no single classifier could provide accurate results for all the classes, we propose to use a combination of different classifiers to detect different attacks. The D-SCIDS architectures using 41 attributes (heavyweight) and 12 attributes (lightweight) are depicted in Fig. 4. The proposed heavyweight model could detect with 100% accuracy, while the lightweight model could detect all the attacks with high accuracy (the lowest accuracy being 94.11% for the U2R class using FR1). It is to be noted that the FR1 classifier is preferred for the lightweight D-SCIDS (due to its fewer rules) even though it has slightly lower accuracy than FR2. In some classes the accuracy figures tend to be very small and may not be statistically significant, especially in view of the fact that the 5 classes of patterns differ tremendously in their sizes. For example, only 27 records were available for training the U2R class. More definitive conclusions can only be made after analyzing more comprehensive sets of network traffic.

Immuno-Fuzzy approaches

"This paper presents a new technique for generating a set of fuzzy rules that can characterize the non-self space (abnormal) using only self (normal) samples."

"We generated a reduced version of the 10% data set including only the numerical attributes, i.e., the categorical attributes were removed from the data set. Therefore, the reduced 10% data set is composed by thirty-three attributes. The attributes were normalized between 0 and 1 using the maximum and minimum values found. 80% of the normal samples were picked randomly and used as training data set, while the remaining 20% was used along with the abnormal samples as a testing set. Five fuzzy sets were defined for the 33 attributes."

The authors summarize the results as follows. Evolving Fuzzy Rule Detectors (EFR), non-fuzzy Evolving Rule Detectors (ERD), and Parallel Hill Climbing of Fuzzy Rule Detectors (PHC) were compared (to reduce the time complexity of the ERD algorithm, a randomly sampled 1% of the normal data set was used as the training data set). The detection rate and false alarm rate were (98.30 ± 0.001)% and 2.0% for EFR, and (93.09 ± 0.209)% and 2.0% for PHC, respectively.

Artificial Immune System

K-nearest-neighbor

The authors wrote in the abstract "Thus, with a system that seeks only to define and categorize normalcy, there is the potential to detect new types of network attacks without any prior knowledge of their existence."

Neural Network Approach

Presented "an intrusion detection system based on Self-Organizing Maps and Resilient Propagation Neural Network for visualizing and classifying intrusion and normal patterns. 3-layer network,consisting of 70 neurons in first hidden layer,14 neurons in second hidden layer and 6 neurons in the output layer,resulting to a 41-70-14-6 feed-forward neural network. Each of the hidden nodes and the output node applied a Sigmoid transfer function to the various connection weights.

We select the normal dataset and the Neptune, Portsweep, Satan, buffer_overflow, and guess_passwd attack datasets to train and test our IDS prototype: intrusion detection through data reduction, classification, clustering of the unlabeled system log data, and the process of identifying intruders.

We get 29,313 training patterns from the 10% training set and 13,112 test patterns from the test set, which contains attack patterns that are not present in the training data. The problem is therefore more realistic.

The training of the neural networks was conducted using the feed-forward back-propagation algorithm with scaled conjugate gradient descent for learning. The network was set to train until the desired mean square error of 0.001 was met or 1500 epochs were reached.

The test results indicate that, in total, 99.6% of the actual normal examples were recognized correctly. For DoS and Probing intrusions (neptune, satan, portsweep), we obtain an average detection rate of 96.6% and a false positive rate of 0.049%.

Because all buffer_overflow and guess_passwd attacks fail to be classified by the BP network, we only obtain an average detection rate of 64.9% and a false positive rate of 26.7% for all five kinds of attacks. However, these two kinds of attacks can be accurately detected by a rule-based detector. An expert system can detect the R2L and U2R intrusions more accurately than a neural network.

For example, using the detection rules ... the buffer_overflow will be detected. Thus, taking advantage of the performance of each module for various attacks, the model improves the detection rate for all intrusions. The data in the experiment were acquired from the 1998 DARPA intrusion detection evaluation program. They set up an environment to acquire raw TCP/IP dump data for a local-area network (LAN) simulating a typical U.S. Air Force LAN. More than 200 instances of 58 attack types were launched against victim UNIX and Windows NT hosts in three weeks of training data and two weeks of test data. For each TCP/IP connection, 41 various quantitative and qualitative features were extracted.

Each connection is labeled as either normal, or as an attack, with exactly one specific attack type.

Each connection record consists of about 100 bytes. Attacks fall into four main categories: DOS (denial of service); R2L (unauthorized access from a remote machine); U2R (unauthorized access to local super user (root) privileges); and Probing (surveillance and other probing).

Multi-layer, feed-forward networks are used. In the study we use a 3-layer network, consisting of 70 neurons in the first hidden layer, 14 neurons in the second hidden layer, and 6 neurons in the output layer, resulting in a 70-14-6 feed-forward neural network.

The training of the neural networks was conducted using the feed-forward back-propagation algorithm with scaled conjugate gradient descent for learning. The network was set to train until the desired mean square error of 0.001 was met or 1500 epochs were reached. During the training process, the performance reached 0.00157434 at 1500 epochs.

99.6% of the actual normal test examples were predicted to be normal by this entry ... 73.3% of test examples said to be normal were indeed normal in reality ... the average detection rate is 64.9%; the false positive rate is 26.7% (1 - 73.3%).

According to the table, we can see that the BP network cannot detect the buffer_overflow and guess_passwd attacks.

The percentages correctly predicted for (normal, neptune, satan, portsweep, buffer_overflow, guess_passwd) were 73.3%, 99.2%, 94.6%, 94.2%, 0%, and 0%, respectively.

With a hybrid of BP and C4.5, on the other hand, the correct detection rate is 85.01%, with a false positive rate of 19.7% (1 - 80.3%); furthermore, the detection rate is 72.7% for buffer_overflow and 48.2% for guess_passwd attacks. The authors noted, "Detected outliers are candidates for aberrant data. ... Fraud detection is a classic example where attention focuses on the outliers because these are more likely to represent cases of fraud. ... Detected outliers can indicate individuals ... that have behaviour outside the range of what is considered normal." We doubt it.

Evolutionary Computations

The authors wrote in the abstract: "This work investigates the performance of XCS on the 1999 KDD Cup intrusion detection dataset, a real-world dataset of approximately five million records, more than 40 fields, and multiple classes with non-uniform distribution."

The percentages of each category in the training and testing data are: NORMAL 75.6111% (training) and 19.4815% (testing); DoS 1.2893% and 73.9008%; PROBE 23.0017% and 1.3394%; U2R 0.0048% and 0.0733%; R2L 0.0929% and 5.2050%.

The mean per-class prediction accuracy of conventional XCS over 30 runs was: NORMAL 57.2%; DOS 0.00%; PROBE 0.00%; U2R 0.02%; R2L 6.37%. The XCS variant improved by the authors shows: NORMAL 95.7%; DOS 49.1%; PROBE 93.0%; U2R 8.50%; R2L 3.90%.

How to handle huge databases

The abstract notes: "The recently proposed Polynomial MPMC Cascade (PMC) algorithm is a nonparametric classifier for high-dimensional non-linear binary classification, ... the algorithm has linear-time complexity with respect to both training-set size and dimensionality of the problem. ... Experimental results are given on datasets ranging in size between 170,000 and 4.8 million examples. ... We empirically verify the linear time dependence of the algorithm, and explore how the stability of the classifier is influenced by sample size. ... The techniques discussed in this paper should allow nonlinear binary classifiers to be efficiently learned from tens of millions of training data, with feature selection being an added byproduct."

KDD Cup 1999. The KDDCUP 1999 dataset is a very large dataset, with 5.1 million examples, and includes a wide variety of network intrusions simulated in a military network environment. The task is to construct an intrusion detector, a predictive model capable of distinguishing between various attacks, like probes or privilege elevation, and normal sessions. To create a binary classification, we only distinguish between good and bad examples. The training set contains 4.8 million examples, and the test set contains 311,000 examples. Nominal attributes in the data are converted to binary attributes by assigning one extra dimension per possible attribute value. Additionally, we summarized attributes that were indicative of only one class in the training set into a single attribute. The final data set had 75 features. Note that the test data is not from the same probability distribution as the training data: to make the task more realistic, specific attack types not in the training data were included.

Other Methods

The simulated attacks fell into one of the following four categories: DOS (denial of service); R2L (unauthorized access from a remote machine, e.g., password guessing); U2R (unauthorized access to superuser or root functions); and Probing (surveillance and other probing for vulnerabilities). There were a total of 24 attack types.

In the case of the system call data, each of the algorithms performed perfectly. What this means is that, at a certain threshold, there was at least one process trace from each of the attacks identified as being malicious, without any false positives.

For the network connections, the data is not nearly as regular as the system call traces. From the experiments, it was found that there were some types of attacks that the system was able to detect well and other types that it was not able to detect. This is reasonable, because some of the attacks lay in the same region of the feature space as the normal data.

The problem of unsupervised anomaly detection is significantly harder because we do not have access to the labels or a guaranteed clean training set. For our experiment, we have used the KDD Cup 1999 intrusion detection data set prepared by Lee et al. [9]. The data set contains 41 features representing selected measurements of normal and intrusive TCP sessions. Each labeled TCP session is either normal or a member of one of the 22 attack classes in the dataset. The first 5 features are selected, namely (1) Src_Bytes, (2) Dst_Bytes, (3) Duration, (4) Is_Host_Login, and (5) Is_Guest_Login, based on the results of Gopi et al. [8], which are in turn based on the Scree Test and the Critical Eigenvalue test. The objective of selecting a subset of all the features is to assess the robustness of the system. ([8] Kuchimanchi, G., Phoha, V. V., Balagani, K. S., and Gaddam, S. R. (2004) "Dimension Reduction Using Feature Extraction Methods for Real-time Misuse Detection Systems." In Workshop on Information Assurance, United States Military Academy, West Point, NY.)

The best result was 79% accuracy; i.e., given an unknown sequence of TCP sessions, this anomaly detection system could determine whether it is normal or anomalous with 79% accuracy. The remaining 21% is accounted for by the false negative rate (i.e., classifying an anomaly as a normal TCP session) and the false positive rate (i.e., classifying a normal TCP session as an attack).