A Survey on Data Classification with KDD-Cup-99 Intrusion Detection Dataset
by Akira Imada (March 2006)
Essentially a good paper to start with.
From a data visualization point of view
Some of these collections are synthetic, that is,
they were designed a priori to stress prediction algorithms in predetermined ways.
Dependence of Performance on User Knowledge and Expertise
To illustrate the importance of factoring in user knowledge and expertise into a
benchmarking effort, we relate some of our experiences with the Information
Exploration Shootout's [Grins97] first exercise, which involved the detection of
intrusions into a computer network. The two major challenges for the participants
were the complexity of the problem domain and the size of the database. Details of
the internet protocol and internet operations are arcane, and to adequately address,
let alone solve, the problem, an expert in the field is necessary. The central need for a
domain expert is typically a common feature of real-world knowledge discovery
problems. The skills that we found to be required in our approach were: 1) domain
knowledge of computer network security; 2) experience with visualization software;
and 3) statistical expertise.
The first task of the shootout was a large preprocessing activity. We grouped the
individual packet records into natural clusters of communications sessions. The
resulting reduction in size was substantial. For example, the baseline data set contains
over 350,000 records. The corresponding session-level data set contains
approximately 16,000 records. We then analyzed the processed data sets
predominantly with visualization techniques. For many of the analyses we used parallel coordinate
plots and conditioning. Even in this step, the visualization is driven by domain
knowledge. With this approach, several anomalies were identified, and these turned
out to be network intrusions when interpreted by a system administrator aware of
various network attacks.
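As an aside, the session-grouping step described here can be sketched in a few
lines. The following is a minimal illustration in Python, assuming hypothetical
packet records with addresses, ports, and timestamps; it is not the shootout's
actual preprocessing pipeline, and the column names are our own invention.

    # Group individual packet records into communication sessions and
    # summarize each session, roughly as described in the text.
    import pandas as pd

    # Hypothetical packet records; the shootout used raw packet headers.
    packets = pd.DataFrame({
        "src": ["10.0.0.1", "10.0.0.1", "10.0.0.2"],
        "dst": ["10.0.0.9", "10.0.0.9", "10.0.0.9"],
        "src_port": [40001, 40001, 40002],
        "dst_port": [80, 80, 23],
        "ts": [0.00, 0.05, 1.20],
        "bytes": [60, 1500, 60],
    })

    # One session per connection 4-tuple, with a few summary statistics;
    # this is the kind of reduction that shrank over 350,000 packet records
    # to roughly 16,000 session records.
    sessions = (packets
                .groupby(["src", "dst", "src_port", "dst_port"])
                .agg(start=("ts", "min"),
                     duration=("ts", lambda t: t.max() - t.min()),
                     n_packets=("ts", "size"),
                     total_bytes=("bytes", "sum"))
                .reset_index())
    print(sessions)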
This experience showed that there are a number of aspects of the process that have to
be evaluated; much depends on domain expertise and on the amount of data involved.
Even with this large but not huge set, the visualization required a scaling down of the
problem. Any benchmark testing methodology must consider these complex
requirements. Testing must also include a good understanding of the perceptual issues
involved, as discussed in the next section.
For example, one could measure the output of the task of finding the outliers in a set.
In Section 4, a basic set of measures, such as the ability to identify outliers and clusters, has been applied in the comparative evaluations done for the prototype.
This survey analyzed
ten different data sets (Simple Seven; Balloons; Contact Lenses; Shuttle O-rings;
Monks Problem; Iris Flower; Congress Voting; Liver Disorders; Cars; Wines)
using five visualization techniques (Parallel Coordinates; Scatter Plot Matrices;
Survey Plots; Circle Segments; and Radviz).
All of the data sets, except the Simple Seven set, are from the UC Irvine Machine
Learning Repository.
This invites a tradeoff in using synthetic vs. real data.
Synthetic data is harder to construct, but the "correct" answers are known.
Real data is easier to collect, but it is harder to evaluate performance,
because it is nearly impossible to know the correct output to a task
in any reasonably sized data set.
The Iris Database is perhaps the most often used data set in pattern recognition,
statistics, data analysis, and machine learning. The task is to predict the class of the
flower based on the 4 physical attribute measurements. There are 150 instances, 3
classes, and 4 numeric attributes. The data set is #46 from the UC collection. The
dimensions are: class (Setosa, Versicolour, Virginica); sepal-length; sepal-width;
petal-length; and petal-width. The cardinality is 35, 23, 43, 22, 3, and 150 (cases)
respectively, the PoC is equal to 342688500, and the log of PoC equals 19.65. One
class is linearly separable from the other two, but the other two are not linearly
separable from each other.
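The quoted PoC figure is simply the product of the listed cardinalities
(including the 150 cases), which is easy to verify:

    # Verify the PoC (product of cardinalities) and its natural log for Iris.
    import math

    cardinalities = [35, 23, 43, 22, 3, 150]   # as listed in the text
    poc = math.prod(cardinalities)
    print(poc)             # 342688500, the quoted PoC
    print(math.log(poc))   # ~19.65, the quoted log of PoC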
Local normalization means that each dimension is scaled independently, using its
own maximum and minimum, to values between 0 and 1. Global normalization scales
all values using the overall maximum and minimum to values between 0 and 1.
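A minimal sketch of the two normalization schemes in NumPy (the function
names are ours, not the survey's):

    import numpy as np

    def normalize_local(X):
        # Scale each dimension (column) by its own min and max into [0, 1].
        mn, mx = X.min(axis=0), X.max(axis=0)
        return (X - mn) / (mx - mn)

    def normalize_global(X):
        # Scale all values by the overall min and max into [0, 1].
        mn, mx = X.min(), X.max()
        return (X - mn) / (mx - mn)

    X = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
    print(normalize_local(X))    # every column spans [0, 1]
    print(normalize_global(X))   # only the overall extremes hit 0 and 1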
The authors claimed, "Most data mining approaches are based on machine-learning techniques, numerical analysis, or statistical modeling. They use human interaction and visualization only minimally. Such automatic methods can miss some important features of the data. Incorporating human perception into the data mining process through interactive visualization can help us better understand the complex behaviors of computer network systems. This article describes visual-analytics-based solutions and outlines a visual exploration process for log analysis.
Because visual data mining differs in essence from automated methods, visual methods can discover information that complements the knowledge found by more commonly used statistical approaches. This is particularly useful when users want to explore the data only to learn about it, that is, they don't know exactly what they need to discover from the data. Many real-world problems, including intrusion detection, fit into this category."
From a Dimension Reduction point of view
The goal is to reduce the dimensionality of the KDD Cup 1999 intrusion detection dataset.
The dataset contains 41 features of TCP traffic.
Each feature vector is labeled as an attack or a non-attack.
There are 22 types of attacks in the dataset.
We divide the dataset into a training subset
(containing 25% of randomly selected representative samples from the original dataset)
and reserve the rest for testing.
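The 25%/75% split described above can be reproduced in spirit with
scikit-learn's stratified splitting; the stand-in arrays below are
placeholders for the actual 41 KDD features and labels:

    import numpy as np
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 41))      # stand-in for the 41 features
    y = rng.integers(0, 2, size=1000)    # stand-in attack / non-attack labels

    X_train, X_test, y_train, y_test = train_test_split(
        X, y,
        train_size=0.25,    # 25% for training, the rest reserved for testing
        stratify=y,         # keep the class proportions representative
        random_state=0)
    print(X_train.shape, X_test.shape)   # (250, 41) (750, 41)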
The authors assert that what they call NNPCA (Neural Network PCA) and NLCA (Non-linear
Component Analysis) successfully find the first 19 and 12 principal components,
respectively, from the original 41 features, more effectively than standard PCA,
which also reduces the dimension to 19.
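For reference, the standard-PCA baseline that NNPCA and NLCA are compared
against looks roughly like this (this sketch covers only the ordinary PCA
step, not the authors' neural or non-linear variants):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 41))    # stand-in for the 41 KDD features

    # Standardize first so no single feature dominates the components.
    X_std = StandardScaler().fit_transform(X)
    pca = PCA(n_components=19).fit(X_std)
    X_reduced = pca.transform(X_std)
    print(X_reduced.shape)                       # (500, 19)
    print(pca.explained_variance_ratio_.sum())   # variance kept by 19 PCs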
We present the performance of the Non-linear classifier (NC) and the decision tree classifier (DC) in terms of detection accuracy and false positive rate.
The detection accuracy of an IDS is the percentage of attack samples detected.
The detection accuracies and false positive rates of DC were observed on four different decision tree models for each dataset. The models were obtained by pruning the complete decision tree until it shrank to approximately 50% of its original size; this reduction in size was achieved by varying the complexity parameter. The average detection accuracy of the DT classifier is 99.6438% over all the datasets. Varying the complexity parameter (0.0 to 0.000735) for experiments on ORIGDATA, the size of the decision tree models varied from 118 to 41 nodes, with a minimum false positive rate of 0.2268 for the full decision tree model (with 118 nodes). For PCADATA, the size of the decision tree models varied from 113 to 41 nodes (by varying the complexity parameter from 0.0 to 0.000711), with a minimum false positive rate of 0.2609 for the decision tree model with 113 nodes. Similarly, for NNPCADATA the size of the decision tree models varied from 190 to 75 nodes (by varying the complexity parameter from 0.0 to 0.0002), with a minimum false positive rate of 0.4922 for the decision tree with 190 nodes. For NLDATA, the size of the decision tree models varied from 260 to 87 nodes (by varying the complexity parameter from 0.0 to 0.000145), with a minimum false positive rate of 0.8227 for the decision tree with 260 nodes. Our experiments with the decision tree models on NLDATA showed that the average detection accuracy on five decision tree models was very high (99.589%) even after pruning the complete decision tree to one-third of its original size.
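The complexity-parameter sweep has a close analogue in scikit-learn's
cost-complexity pruning, where ccp_alpha plays the role of the complexity
parameter varied above; a sketch on synthetic stand-in data:

    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=2000, n_features=41, random_state=0)

    # Larger alpha prunes harder, shrinking the tree as in the experiments.
    for alpha in [0.0, 0.0002, 0.000735]:
        tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X, y)
        print(f"alpha={alpha:.6f}  nodes={tree.tree_.node_count}")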
Exploiting C4.5
"In a real application, one can never be sure that a set of available labeled
examples covers all possible attacks. If a new attack appears, examples of it
may not have been seen in training data."
A detailed description of the available features and attack
instances can be found in [14, 6].
The supervised algorithms in general exhibit excellent classification accuracy
on the data with known attacks. The best results have been achieved by the
C4.5 algorithm, which attains a 95% true-positive rate at a 1% false-positive rate.
The next two best algorithms are the MLP and the SVM, both non-linear,
followed by the local k-Nearest Neighbor algorithm.
The accuracy of supervised algorithms deteriorates significantly if unknown
attacks are present in the test data.
Our experiments demonstrate that the supervised learning methods significantly
outperform the unsupervised ones if the test data contains no unknown attacks.
Furthermore, among the supervised methods, the best performance is achieved
by the non-linear methods, such as SVM, multi-layer perceptrons, and the rule-
based methods. In the presence of unknown attacks in the test data, the performance
of all supervised methods drops significantly, SVM being the most robust
to the presence of unknown attacks.
KDD dataset covers four major categories of attacks:
Probing attacks (information gathering attacks),
Denial-of-Service (DoS) attacks (deny legitimate requests to a system),
user-to-root (U2R) attacks (unauthorized access to local super-user or root), and
remote-to-local (R2L) attacks (unauthorized local access from a remote machine).
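In code, this categorization is usually just a lookup table from individual
attack names to the four categories; a partial sketch with a few well-known
KDD attack types per category (not the complete list):

    # Partial mapping from KDD attack names to the four major categories.
    ATTACK_CATEGORY = {
        "ipsweep": "probe", "portsweep": "probe", "nmap": "probe",
        "satan": "probe",
        "neptune": "dos", "smurf": "dos", "back": "dos", "teardrop": "dos",
        "buffer_overflow": "u2r", "rootkit": "u2r", "loadmodule": "u2r",
        "perl": "u2r",
        "guess_passwd": "r2l", "ftp_write": "r2l", "imap": "r2l",
        "warezclient": "r2l",
    }

    def category(label):
        # Map a record's target value to its category; 'normal' stays normal.
        return "normal" if label == "normal" else ATTACK_CATEGORY.get(label, "unknown")

    print(category("smurf"))   # dos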
The KDD dataset is divided into labeled and unlabeled records.
Each labeled record consists of 41 attributes (features) and one target value;
the target value indicates the attack category name.
There are around 5 million (4,898,430) records in the labeled dataset, which
was used for training all classifier models discussed in this paper.
A second unlabeled dataset (311,029 records) is provided as testing data.
The total number of records in the original
labeled training dataset is 972,780 for Normal, 41,102 for
Probe, 3,883,370 for DoS, 52 for U2R, and 1,126 for R2L
attack classes. After filtering out the duplicate records,
there were a total of 812,813 records for Normal, 13,860
for Probe, 247,267 for DoS, 52 for U2R, and 999 for R2L
attack classes.
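The duplicate filtering reported above is essentially a one-liner with
pandas; a sketch with a tiny stand-in frame in place of the full
~4.9-million-record set:

    import pandas as pd

    # Stand-in rows; in practice this is the full labeled KDD set.
    df = pd.DataFrame({"duration": [0, 0, 0],
                       "src_bytes": [181, 181, 239],
                       "label": ["normal.", "normal.", "normal."]})

    deduped = df.drop_duplicates()
    print(len(df), "->", len(deduped))        # 3 -> 2
    print(deduped.groupby("label").size())    # per-class counts, as in the text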
As a survey: (1)
Agarwal and Joshi [4] proposed a two-stage general-to-
specific framework for learning a rule-based model
(PNrule) to learn classifier models on a data set that has
widely different class distributions in the training data.
The PNrule technique was evaluated on the KDD testing
data set, which contained many new R2L attacks not
present in the KDD training dataset. The proposed model
was able to detect 73.2% of probing attacks, 96.9% of
denial of service attacks, 6.6% of U2R attacks, and 10.7%
of attacks in R2L attack category. False alarms were
generated at a level of less than 10% for all attack
categories except for U2R: an unacceptably high level of
89.5% false alarm rate was reported for the U2R category.
(2)
"Using Kernel Miner tool and the KDD data set [5],
Levin created a set of locally optimal decision trees
(called the decision forest) from which optimal subset of
trees (called the sub-forest) was selected for predicting
new cases. Levin used only 10% of the KDD training
data randomly sampled from the entire training data set.
Multi-class detection approach was used to detect
different attack categories in the KDD data set. The final
trees gave very high detection rates for all classes
including the R2L in the entire training data set. The
proposed classifier achieved 84.5% detection for probing,
97.5% detection for denial of service, 11.8% detection for
U2R, and only 7.32% detection for R2L attack category in
the KDD testing data set. False alarm rates of 21.6%,
73.1%, 36.4%, and 1.7% were achieved for probing, DoS,
U2R and R2L attack categories, respectively."
(3)
Ertoz and Steinbach [6] used the shared nearest neighbor
(SNN) technique, which is particularly suited for finding
clusters of different sizes, densities, and shapes,
mainly when the data contains a large amount of noise and
outliers. All attack records were selected from the KDD
training and testing data sets with a cap of 10,000 records
from each attack type: there are a total of 36 attack types
from 4 attack categories. Also 10,000 records were
randomly picked from both the training and the testing
data sets. In total, around 97,000 records were selected
from the entire KDD data set. After removing duplicate
KDD records, the data set size reduced to 45,000 records.
This set was then used to train two clustering algorithms:
K-means and the proposed SNN technique, which were
compared in terms of the detection rates achieved. The K-means
algorithm, with the number of clusters set to 300, could
detect 91.88% of probing, 97.85% of DoS, 5.6% of U2R,
and 77.04% of the R2L attack records. The proposed
SNN technique was able to detect 73.43% of probing,
77.76% of DoS, 37.82% of U2R, and 68.15% of the R2L
records; no discussion of false alarm rates was
reported by the authors. Results suggested that
SNN performed better than K-means for U2R attack
category.
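One plausible reading of the K-means side of this comparison is the classic
cluster-then-label scheme: cluster the records, assign each cluster the
majority class of its members, and score the predictions against the labels.
A sketch under that assumption, on synthetic stand-in data rather than the
actual KDD sample:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X = rng.normal(size=(3000, 41))      # stand-in for KDD feature vectors
    y = rng.integers(0, 2, size=3000)    # stand-in normal/attack labels

    km = KMeans(n_clusters=300, n_init=1, random_state=0).fit(X)

    # Majority vote inside each cluster decides its predicted class.
    cluster_label = np.zeros(300, dtype=int)
    for c in range(300):
        members = y[km.labels_ == c]
        if members.size:
            cluster_label[c] = np.bincount(members).argmax()

    y_pred = cluster_label[km.labels_]
    print("detection rate on the clustered set:", (y_pred == y).mean())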
(4)
Yeung and Chow [7] proposed a novelty detection
approach using non-parametric density estimation based
on Parzen-window estimators with Gaussian kernels to
build an intrusion detection system using normal data
only. This novelty detection approach was employed to
detect attack categories in the KDD data set. 30,000
randomly sampled normal records from the KDD training
data set were used as the training data set to estimate the
density of the model. Another 30,000 randomly sampled
normal records (also from the KDD training data set)
formed the threshold determination set, which had no
overlap with the training data set. The technique could
detect 99.17% of probing, 96.71% of DoS, 93.57% of
U2R, and 31.17% of R2L attacks in the KDD testing
dataset as intrusive patterns; the authors did not report any
information on false alarm rates.
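The Parzen-window scheme translates naturally to a Gaussian kernel density
estimate fit on normal records, with the detection threshold calibrated on
the separate normal set; a minimal sketch with synthetic data (the bandwidth
and the 95% quantile are our assumptions, not values from the paper):

    import numpy as np
    from sklearn.neighbors import KernelDensity

    rng = np.random.default_rng(0)
    train_normal = rng.normal(0, 1, size=(3000, 5))    # density estimation set
    thresh_normal = rng.normal(0, 1, size=(3000, 5))   # threshold set
    test = np.vstack([rng.normal(0, 1, size=(100, 5)),   # normal test records
                      rng.normal(4, 1, size=(100, 5))])  # "attack" records

    kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(train_normal)

    # Threshold chosen so 95% of held-out normal records score above it.
    threshold = np.quantile(kde.score_samples(thresh_normal), 0.05)
    is_attack = kde.score_samples(test) < threshold
    print("flagged as intrusive:", is_attack.sum(), "of", len(test))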
Then proceeding to the method proposed in this paper:
"Attributes in the KDD datasets had all forms -
continuous, discrete, and symbolic, with significantly
varying resolution and ranges. Most pattern classification
methods are not able to process data in such a format."
Fuzzy Rule Approaches
Denial of service (DoS) is a class of attack where an attacker makes a computing or memory resource too busy or too full to handle legitimate requests, thus denying legitimate users access to a machine;
A remote to user (R2L) attack -- unauthorized access from a remote machine -- is a class of attack where an attacker sends packets to a machine over a network, then exploits the machine's vulnerability to illegally gain local access as a user;
User to root (U2Su) -- unauthorized access to local super user (root) -- exploits are a class of attacks where an attacker starts out
with access to a normal user account on the system and is able to exploit a
vulnerability to gain root access to the system;
Probing -- surveillance and other probing -- is a class of attack where an attacker scans a network to gather
information or find known vulnerabilities. An attacker with a map of the machines and
services that are available on a network can use the information to look for exploits.
This data set has five different classes, namely Normal, DoS, R2L, U2R, and Probes. The training and test sets comprise 5092 and 6890 records, respectively. Using all 41 variables could result in a big IDS model, which would impose substantial overhead for online detection. All the training data were scaled to (0, 1). The decision tree approach described before reduced the number of variables to 12 significant features.
Using 12 attributes, most of the classifiers performed very well except the fuzzy classifiers (FR1 and FR2). For detecting U2R attacks, FR2 gave the best accuracy.
However, due to the tremendous reduction in the number of attributes (about 70% fewer), we are able to design a computationally efficient (lightweight) IDS.
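One common way to realize such a decision-tree feature reduction is to rank
the 41 inputs by a fitted tree's importance scores and keep the top 12; a
sketch under that assumption, on synthetic stand-in data:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=2000, n_features=41,
                               n_informative=12, random_state=0)

    tree = DecisionTreeClassifier(random_state=0).fit(X, y)
    top12 = np.argsort(tree.feature_importances_)[::-1][:12]
    print("selected feature indices:", sorted(top12))

    X_light = X[:, top12]   # 12-feature input for the lightweight models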
Since a particular classifier could not provide accurate results for all the classes,
we propose to use a combination of different classifiers to detect different attacks.
The D-SCIDS architecture using 41 attributes (heavyweight) and 12 attributes (lightweight)
are depicted in Fig.4.
The proposed heavyweight model could detect with 100% accuracy, while the lightweight model could detect all the attacks with high accuracy (the lowest accuracy being 94.11% for the U2R class using FR1).
It is to be noted that the FR1 classifier is preferred for the lightweight D-SCIDS (due to its smaller number of rules) even though it has slightly lower accuracy compared to FR2.
In some classes the accuracy figures tend to be very small and may not be
statistically significant, especially in view of the fact that the 5 classes of
patterns differ tremendously in their sizes. For example, only 27 data sets were
available for training the U2R class. More definitive conclusions can only be
made after analyzing more comprehensive sets of network traffic.
An Immuno-Fuzzy Approach
"This paper presents a new technique for generating a set of fuzzy rules
that can characterize the non-self space (abnormal) using only self (normal) samples."
"We generated a reduced version of the 10% data set including only the numerical attributes,
i.e., the categorical attributes were removed from the data set.
Therefore, the reduced 10% data set is composed of thirty-three attributes.
The attributes were normalized between 0 and 1 using the maximum and minimum values found.
80% of the normal samples were picked randomly and used as training data set,
while the remaining 20% was used along with the abnormal samples as a testing set.
Five fuzzy sets were defined for the 33 attributes."
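The paper does not specify the membership functions, but a typical choice
for five fuzzy sets over a normalized attribute in [0, 1] is evenly spaced
triangles; a sketch under that assumption:

    import numpy as np

    CENTERS = np.linspace(0.0, 1.0, 5)   # very-low .. very-high
    WIDTH = 0.25                         # half-base of each triangle

    def memberships(x):
        # Degree of membership of x in each of the five fuzzy sets.
        return np.clip(1.0 - np.abs(x - CENTERS) / WIDTH, 0.0, 1.0)

    print(memberships(0.30))   # partial membership in 'low' and 'medium'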
The authors summarize the results as follows:
Evolving Fuzzy Rule Detectors (EFR), non-fuzzy Evolving Rule Detectors (ERD),
and Parallel Hill Climbing of Fuzzy Rule Detectors (PHC) were compared.
(For reducing the time complexity of the ERD algorithm, 1% of the normal
data set, randomly sampled, was used as the training data set.)
The results (detection rate and false alarm rate) of EFR and PHC were
(98.30 ± 0.001)% at 2.0% and (93.09 ± 0.209)% at 2.0%, respectively.
Artificial Immune System
K-Nearest-Neighbor
The authors wrote in the abstract "Thus, with a system that seeks only
to define and categorize normalcy, there is the potential to detect new
types of network attacks without any prior knowledge of their existence."
Neural Network Approach
Presented "an intrusion detection system based on Self-Organizing Maps
and Resilient Propagation Neural Network for visualizing and classifying
intrusion and normal patterns.
A 3-layer network, consisting of 70 neurons in the first hidden layer, 14 neurons in the second hidden layer, and 6 neurons in the output layer, resulting in a 41-70-14-6 feed-forward neural network. Each of the hidden nodes and the output nodes applied a sigmoid transfer function to the various connection weights.
We select the normal dataset and the Neptune, Portsweep, Satan,
buffer_overflow, and guess_passwd attack datasets to train and test our IDS
prototype, which performs intrusion detection through data reduction,
classification, clustering of the unlabeled system log data,
and the process of identifying intruders.
We get 29,313 training patterns from the 10% training set and 13,112 test patterns from the test set, which contains attack patterns that are not present in the training data; the problem is therefore more realistic.
The training of the neural networks was conducted using the feed-forward back-propagation algorithm with scaled conjugate gradient descent for learning. The network was set to train until the desired mean square error of 0.001 was met or 1500 epochs were reached.
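A rough modern stand-in for this network is sketched below with
scikit-learn's MLPClassifier. Note the caveats: scikit-learn does not
provide scaled conjugate gradient, so its default solver stands in for the
paper's trainer, and the data here is synthetic:

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(2000, 41))      # stand-in for the 41 features
    y = rng.integers(0, 6, size=2000)    # 6 outputs: normal + 5 attack types

    net = MLPClassifier(hidden_layer_sizes=(70, 14),  # the 41-70-14-6 shape
                        activation="logistic",        # sigmoid transfer
                        max_iter=1500,                # cap akin to 1500 epochs
                        tol=1e-3,                     # stop near target error
                        random_state=0).fit(X, y)
    print("training accuracy:", net.score(X, y))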
The test results indicate that in total 99.6% of the actual normal examples were recognized correctly. For the DoS and Probing intrusions (neptune, satan, portsweep), we obtain an average detection rate of 96.6% and a false positive rate of 0.049%.
Because all buffer_overflow and guess_passwd attacks fail to be classified by the BP network, we obtain an average detection rate of only 64.9% and a false positive rate of 26.7% over all five kinds of attacks. However, these two kinds of attacks can be accurately detected by a rule-based detector; an expert system can detect the R2L and U2R intrusions more accurately than a neural network.
For example, using the detection rules ... the buffer_overflow will be detected. Thus, taking advantage of the performance of each module for the various attacks, the model improves the detection rate for all intrusions.
The data in the experiment was acquired from the 1998 DARPA intrusion detection evaluation program. They set up an environment to acquire raw TCP/IP dump data for a local-area network (LAN)
simulating a typical U.S. Air Force LAN. More than 200 instances of 58 attack types were launched against victim UNIX and Windows NT hosts in three weeks of training data and two weeks of test data. For each TCP/IP connection, 41 quantitative and qualitative features were extracted.
Each connection is labeled as either normal, or as an attack, with exactly one specific attack type.
Each connection record consists of about 100 bytes.
Attacks fall into four main categories:
DOS: denial of service
R2L: unauthorized access from a remote machine
U2R: unauthorized access to local super user (root) privileges
Probing: surveillance and other probing
During the training process, the performance reached 0.00157434 at 1500 epochs.
99.6% of the actual normal test examples were predicted to be normal by this entry ... 73.3% of test examples said to be normal were indeed normal in reality ... the average detection rate was 64.9%; the false positive rate is 26.7% (1 - 73.3%).
According to the table, we can see that the BP network cannot detect the buffer_overflow and guess_passwd attacks.
The percentages correctly predicted for (normal, neptune, satan, portsweep,
buffer_overflow, guess_passwd) were (73.3, 99.2, 94.6, 94.2, 0, 0), respectively.
With a hybrid of BP and C4.5, on the other hand, the correct detection rate
was 85.01% and the false positive rate 19.7% (1 - 80.3%); furthermore, the
detection rate was 72.7% for buffer_overflow
and 48.2% for guess_passwd attacks.
The authors noted, "Detected outliers are candidates for aberrant data.
... Fraud detection is a classic example where attention focuses on the outliers
because these are more likely to represent cases of fraud.
... Detected outliers can indicate individuals ... that
have behaviour outside the range of what is considered normal." We doubt it.
Evolutionary Computations
The authors wrote in the abstract: "This work investigates the performance of XCS on the 1999 KDD Cup intrusion detection dataset, a real-world dataset of approximately five million records, more than 40 fields, and multiple classes with non-uniform distribution."
The percentage of each category in the training data and in the testing data
(NORMAL, DoS, PROBE, U2R, and R2L) is:
NORMAL: 75.6111 (training), 19.4815 (testing)
DoS: 1.2893 (training), 73.9008 (testing)
PROBE: 23.0017 (training), 1.3394 (testing)
U2R: 0.0048 (training), 0.0733 (testing)
R2L: 0.0929 (training), 5.2050 (testing)
The mean prediction accuracy per class of conventional XCS over 30 runs was:
NORMAL 57.2%; DoS 0.00%; PROBE 0.00%; U2R 0.02%; R2L 6.37%,
while the XCS improved by the authors shows:
NORMAL 95.7%; DoS 49.1%; PROBE 93.0%; U2R 8.50%; R2L 3.90%.
How to Handle a Huge Database
The abstract states, "
The recently proposed Polynomial MPMC Cascade (PMC)
algorithm is a nonparametric classifier for high-dimensional non-linear
binary classification,
...
the algorithm has linear-time complexity with respect to both training-set size
and dimensionality of the problem.
...
Experimental results are given on datasets ranging in sizes between 170,000 and
4.8 million examples.
...
We empirically verify the linear time dependence of the algorithm,
and explore how the stability of the classifier is influenced by sample size.
...
The techniques discussed in this
paper should allow nonlinear binary classifiers to be efficiently learned
from tens of millions of training data, with feature selection being an
added byproduct."
KDD Cup 1999. The KDDCUP 1999 dataset is a very large dataset, with 5.1
million examples, and it includes a wide variety of network intrusions simulated in
a military network environment.
The task is to construct an intrusion detector, a predictive model capable of
distinguishing between various attacks like probes or privilege elevation and good
normal sessions. To create a binary classification, we only distinguish between
good and bad examples.
The training set contains 4.8 million examples, and the test set contains
311,000 examples. Nominal attributes in the data are converted to binary
attributes by assigning one extra dimension per possible attribute value.
Additionally, we summarized attributes that were indicative of only one class
in the training set into a single attribute. The final data set had 75 features. Note that the
test data is not from the same probability distribution as the training data: to
make the task more realistic, specific attack types not in the training data were
included.
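The nominal-to-binary conversion described here is ordinary one-hot encoding
plus label binarization; a sketch with a toy frame (the column and label
names follow the usual KDD conventions, but the rows are made up):

    import pandas as pd

    df = pd.DataFrame({
        "protocol_type": ["tcp", "udp", "icmp", "tcp"],  # nominal attribute
        "src_bytes": [181, 105, 520, 239],
        "label": ["normal.", "smurf.", "neptune.", "normal."],
    })

    # One extra binary column per possible nominal value.
    X = pd.get_dummies(df.drop(columns="label"), columns=["protocol_type"])
    # Collapse the many attack labels into a single good/bad target.
    y = (df["label"] != "normal.").astype(int)
    print(X.columns.tolist())
    print(y.tolist())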
Other Methods
The simulated attacks fell into one of the following four categories:
DOS - Denial of Service,
R2L - Unauthorized access from a remote machine (e.g. password guessing),
U2R - unauthorized access to superuser or root functions, and
Probing - surveillance and other probing for vulnerabilities
There were a total of 24 attack types.
In the case of the system call data, each of the algorithms performed
perfectly. What this means is that, at a certain threshold, there was at least one
process trace from each of the attacks identified as being malicious, without any
false positives.
For the network connections, the data is not nearly as regular as the system
call traces. From the experiments, it was found that there were some types
of attacks that the system was able to detect well and other types of attacks
that it was not able to detect. This is reasonable because some of the attacks,
in the feature space used, were in the same region as the normal data.
The problem of unsupervised anomaly detection
is significantly harder because we have access neither to the labels
nor to a guaranteed clean training set.
For our experiment, we have used the KDD Cup 1999
intrusion detection data set prepared by Lee et al. [9].
The data set contains 41 features representing selected
measurements of normal and intrusive TCP sessions.
Each labeled TCP session is either normal or a
member of one of the 22 attack classes in the dataset.
The first 5 features are selected, namely (1) Src_Bytes,
(2) Dst_Bytes, (3) Duration, (4) Is_Host_Login, and (5)
Is_Guest_Login, based on the results of Gopi et al. [8],
which are based on the Scree Test and Critical
Eigenvalue test. The objective of selecting a subset
out of all the features is to assess the robustness of our
system.
([8] Kuchimanchi, G., Phoha, V. V., Balagani, K. S., and Gaddam, S. R. (2004)
"Dimension Reduction Using Feature Extraction Methods for Real-time Misuse
Detection Systems." In Workshop on Information Assurance, United States
Military Academy, West Point, NY.)
The best result was 79% accuracy; i.e., given an unknown
sequence of TCP sessions, this anomaly detection
system could accurately verify whether it was normal or
anomalous with 79% accuracy. The remaining
21% is accounted for by the false positive rate (i.e.,
classifying an anomaly as a normal TCP session) and the
false negative rate (i.e., classifying a normal TCP session
as an attack type).