Z. Ji and D. Dasgupta (2004) "Augmented Negative Selection Algorithm with Variable-Coverage Detectors." Proceedings of the Congress on Evolutionary Computation.

Ayara, M., J. Timmis, R. de Lemos, L. N. de Castro, and R. Duncan (2002) "Negative Selection: How to Generate Detectors." Proceedings of the 1st International Conference on Artificial Immune Systems (ICARIS), pp. 89-98.

6.1.1 Generating self data

As stated earlier, the 8-bit data used were randomly generated. The pseudorandom number generator of the Java 2 Platform (Standard Edition, version 1.3) API was used to generate integers between 0 and 255, which were then converted to 8-bit binary strings. The experiments required self sets of different sizes, so a separate file was created for each self-set population size.

6.1.2 Setting the matching threshold

The affinity between these binary strings (for the self set, detector set, and test data) was determined using the r-contiguous bits matching rule. The optimal value of the matching threshold r had to be found by varying r from 1 to l (the string length), in order to obtain the combined correct and incorrect classification rates of the detectors generated with each threshold. The correct classification value is the sum of the true positive rate (non-self correctly detected) and the true negative rate (self correctly not detected), while the incorrect classification value is the sum of the false positive rate (self incorrectly detected) and the false negative rate (non-self not detected). Both values are used to select an appropriate value of r. This differs from the approach used by (Kim and Bentley 2001) as well as the method suggested in (D'haeseleer, Forrest et al. 1996). In (Kim and Bentley 2001), the value of r was determined from the equations in (Forrest, Perelson et al. 1994), which yielded poor matching thresholds for the corresponding data, while (D'haeseleer, Forrest et al. 1996) proposed an approach based on a greedy algorithm. Both approaches show that there is no hard-and-fast rule for setting this parameter; rather, various values can be tested in order to select the optimal one. Accordingly, detectors were generated for each candidate value of r and the resulting classification rates were compared; a sketch of this kind of sweep is given below.
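To make 6.1.1 and 6.1.2 concrete, here is a minimal Java sketch of the same pipeline: random 8-bit self strings, an r-contiguous bits matcher, negative-selection generation of detectors, and a sweep of r from 1 to l. It is an illustration, not the authors' exact procedure: the self-set size, candidate budget, detector-set limit, and random seed are arbitrary placeholders, counts stand in for rates, and it targets a modern JDK rather than the Java 1.3 API named above.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

public class RContiguousSweep {
    static final int L = 8; // bit-string length l

    // Integer in [0, 255] converted to a zero-padded 8-bit binary string.
    static String randomBits(Random rng) {
        return String.format("%8s", Integer.toBinaryString(rng.nextInt(256)))
                .replace(' ', '0');
    }

    // 6.1.1: the self set is a collection of random 8-bit strings.
    static Set<String> generateSelf(int size, Random rng) {
        Set<String> self = new HashSet<>();
        while (self.size() < size) self.add(randomBits(rng));
        return self;
    }

    // r-contiguous bits rule: two strings match iff they agree in at
    // least r contiguous bit positions.
    static boolean match(String a, String b, int r) {
        int run = 0;
        for (int i = 0; i < L; i++) {
            run = (a.charAt(i) == b.charAt(i)) ? run + 1 : 0;
            if (run >= r) return true;
        }
        return false;
    }

    static boolean matchesAny(Iterable<String> strings, String s, int r) {
        for (String x : strings) if (match(x, s, r)) return true;
        return false;
    }

    public static void main(String[] args) {
        Random rng = new Random(1);               // placeholder seed
        Set<String> self = generateSelf(50, rng); // placeholder self-set size

        for (int r = 1; r <= L; r++) {            // 6.1.2: sweep r from 1 to l
            // Negative selection: keep random candidates that match no self string.
            List<String> detectors = new ArrayList<>();
            for (int t = 0; t < 2000 && detectors.size() < 30; t++) {
                String cand = randomBits(rng);
                if (!matchesAny(self, cand, r)) detectors.add(cand);
            }
            // Classify all 256 possible strings; ground truth: non-self = not in self.
            int tp = 0, tn = 0, fp = 0, fn = 0;
            for (int v = 0; v < 256; v++) {
                String s = String.format("%8s", Integer.toBinaryString(v))
                        .replace(' ', '0');
                boolean detected = matchesAny(detectors, s, r);
                if (detected) { if (self.contains(s)) fp++; else tp++; }
                else          { if (self.contains(s)) tn++; else fn++; }
            }
            System.out.printf("r=%d detectors=%d correct=%d incorrect=%d%n",
                    r, detectors.size(), tp + tn, fp + fn);
        }
    }
}

Printing the combined correct (TP+TN) and incorrect (FP+FN) counts per threshold mirrors the selection criterion described above: the value of r with the best trade-off is chosen empirically.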
J. Gomez, F. Gonzalez, and D. Dasgupta (2003) "An Immuno-Fuzzy Approach to Anomaly Detection." Proceedings of the 12th IEEE International Conference on Fuzzy Systems, Vol. 2, pp. 1219-1224.

KDD Cup 99

This data set is a version of the 1998 DARPA intrusion detection evaluation data set prepared and managed by MIT Lincoln Labs. Experiments were conducted with the ten percent subset that is available at the University of California, Irvine (UCI) Machine Learning Repository. Each record of the 10% data set is composed of forty-two attributes that characterize network traffic behavior (thirty-four of them numerical), and the 10% subset contains a very large number of records (492021).

1) Experimental settings: We generated a reduced version of the 10% data set including only the numerical attributes, i.e., the categorical attributes were removed. The reduced 10% data set is therefore composed of thirty-three attributes. The attributes were normalized between 0 and 1 using the maximum and minimum values found. 80% of the normal samples were picked randomly and used as the training data set, while the remaining 20% were used along with the abnormal samples as the testing set. Five fuzzy sets were defined for each of the 33 attributes. To reduce the time complexity of the ERD algorithm, 1% of the normal data set (randomly selected) was used as the training data set.

C. Darpa 99

This data set is also obtained from the MIT Lincoln Lab [28]. It represents both normal and abnormal information collected in a test network where simulated attacks were performed. The data set is composed of network traffic data (tcpdump, inside and outside network traffic), audit data (BSM), and file system data. We used the outside tcpdump network data for a specific computer (hostname: marx) and then applied the tool tcpstat to obtain traffic statistics. The first week's data (attack free) was used for training, and the second week's data (which includes some attacks) for testing. We considered only the network attacks in our experiments.

1) Experimental Settings: Three parameters were selected (bytes per second, packets per second, and ICMP packets per second) to detect some specific types of attacks. These parameters were sampled each minute (using tcpstat) and normalized. Because each parameter can be seen as a time-series function, the features were extracted using a sliding overlapping window of size n = 3. Two sets of 9-dimensional feature vectors were therefore generated: one as the training data set and the other as the testing data set. Ten fuzzy sets were defined for each extracted feature. The windowing step is sketched below.
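A minimal Java sketch of that windowing step, under the stated setup: three parallel per-minute series (bytes/s, packets/s, ICMP packets/s), already normalized, combined with an overlapping window of n = 3 consecutive samples into 9-dimensional vectors. The concatenation order of the three series within a vector is not specified in the excerpt, so the ordering here is an assumption.

import java.util.ArrayList;
import java.util.List;

public class SlidingWindowFeatures {
    // Build 9-dimensional feature vectors from three parallel time series
    // using an overlapping window of size n = 3 (stride 1).
    static List<double[]> windowFeatures(double[] bytesPerSec,
                                         double[] pktsPerSec,
                                         double[] icmpPerSec) {
        int n = 3;
        List<double[]> features = new ArrayList<>();
        for (int t = 0; t + n <= bytesPerSec.length; t++) {
            double[] v = new double[3 * n];
            for (int k = 0; k < n; k++) {
                // Assumed ordering: all samples of one series, then the next.
                v[k]         = bytesPerSec[t + k];
                v[n + k]     = pktsPerSec[t + k];
                v[2 * n + k] = icmpPerSec[t + k];
            }
            features.add(v);
        }
        return features;
    }

    public static void main(String[] args) {
        // Toy per-minute values, already normalized to [0, 1].
        double[] bps  = {0.10, 0.12, 0.11, 0.50, 0.52};
        double[] pps  = {0.20, 0.21, 0.19, 0.60, 0.61};
        double[] icmp = {0.00, 0.01, 0.00, 0.30, 0.29};
        for (double[] v : windowFeatures(bps, pps, icmp)) {
            System.out.println(java.util.Arrays.toString(v));
        }
    }
}

Because the window overlaps (stride 1), a series of m minutes yields m - n + 1 vectors, so consecutive vectors share two of their three time steps per parameter.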
The authors cite [28]: MIT Lincoln Labs, 1999 DARPA Intrusion Detection Evaluation, http://www.ll.mit.edu/IST/ideval/index.html, 1999. The page asserts:

------

Intrusion detection systems monitor network state looking for unauthorized usage, denial of service, and anomalous behavior. Such systems have never been formally evaluated ... until now. The Information Systems Technology Group (IST) of MIT Lincoln Laboratory, under Defense Advanced Research Projects Agency (DARPA ITO) and Air Force Research Laboratory (AFRL/SNHS) sponsorship, has collected and distributed the first standard corpora for evaluation of computer network intrusion detection systems. We have also coordinated, with the Air Force Research Laboratory, the first formal, repeatable, and statistically significant evaluations of intrusion detection systems. Such evaluation efforts have been carried out in 1998 and 1999. These evaluations measure probability of detection and probability of false alarm for each system under test. These evaluations are contributing significantly to the intrusion detection research field by providing direction for research efforts and an objective calibration of the current technical state of the art. They are of interest to all researchers working on the general problem of workstation and network intrusion detection. The evaluation is designed to be simple, to focus on core technology issues, and to encourage the widest possible participation by eliminating security and privacy concerns, and by providing data types that are used commonly by the majority of intrusion detection systems.

Downloads: Off-line data sets are available to provide researchers with extensive examples of attacks and background traffic. Two data sets are the result of the DARPA Intrusion Detection Evaluations:

1998 DARPA Intrusion Detection Evaluation Data Sets
1999 DARPA Intrusion Detection Evaluation Data Sets

Three additional data sets are the result of experiments run in 2000 to address specific scenarios:

2000 DARPA Intrusion Detection Scenario Specific Data Sets

=====

http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html

Abstract: This is the data set used for the Third International Knowledge Discovery and Data Mining Tools Competition, which was held in conjunction with KDD-99, the Fifth International Conference on Knowledge Discovery and Data Mining. The competition task was to build a network intrusion detector, a predictive model capable of distinguishing between "bad" connections, called intrusions or attacks, and "good" normal connections. This database contains a standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment.

Class labels: back, buffer_overflow, ftp_write, guess_passwd, imap, ipsweep, land, loadmodule, multihop, neptune, nmap, normal, perl, phf, pod, portsweep, rootkit, satan, smurf, spy, teardrop, warezclient, warezmaster.

Attributes:
duration: continuous.
protocol_type: symbolic.
service: symbolic.
flag: symbolic.
src_bytes: continuous.
dst_bytes: continuous.
land: symbolic.
wrong_fragment: continuous.
urgent: continuous.
hot: continuous.
num_failed_logins: continuous.
logged_in: symbolic.
num_compromised: continuous.
root_shell: continuous.
su_attempted: continuous.
num_root: continuous.
num_file_creations: continuous.
num_shells: continuous.
num_access_files: continuous.
num_outbound_cmds: continuous.
is_host_login: symbolic.
is_guest_login: symbolic.
count: continuous.
srv_count: continuous.
serror_rate: continuous.
srv_serror_rate: continuous.
rerror_rate: continuous.
srv_rerror_rate: continuous.
same_srv_rate: continuous.
diff_srv_rate: continuous.
srv_diff_host_rate: continuous.
dst_host_count: continuous.
dst_host_srv_count: continuous.
dst_host_same_srv_rate: continuous.
dst_host_diff_srv_rate: continuous.
dst_host_same_src_port_rate: continuous.
dst_host_srv_diff_host_rate: continuous.
dst_host_serror_rate: continuous.
dst_host_srv_serror_rate: continuous.
dst_host_rerror_rate: continuous.
dst_host_srv_rerror_rate: continuous.

---

training_attack_types (a list of intrusion types and their attack categories):

back dos
buffer_overflow u2r
ftp_write r2l
guess_passwd r2l
imap r2l
ipsweep probe
land dos
loadmodule u2r
multihop r2l
neptune dos
nmap probe
perl u2r
phf r2l
pod dos
portsweep probe
rootkit u2r
satan probe
smurf dos
spy r2l
teardrop dos
warezclient r2l
warezmaster r2l

The first data record:

0,tcp,http,SF,181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,9,9,1.00,0.00,0.11,0.00,0.00,0.00,0.00,0.00,normal.
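The record layout above is enough to sketch the preprocessing described in the KDD Cup 99 experimental settings: drop the symbolic attributes and min-max normalize the rest. The symbolic positions below are read off the attribute list; note that removing those seven leaves 34 continuous attributes, whereas the excerpt reports a reduced set of 33, so the authors apparently dropped one further attribute that the excerpt does not identify. A minimal Java sketch, with the sample record above hard-coded:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class KddPreprocess {
    // 0-based positions of the symbolic attributes in a 41-feature record
    // (the 42nd field is the class label), per the attribute list above:
    // protocol_type, service, flag, land, logged_in, is_host_login, is_guest_login.
    static final Set<Integer> SYMBOLIC =
            new HashSet<>(Arrays.asList(1, 2, 3, 6, 11, 20, 21));

    // Keep only the continuous attributes of one CSV record.
    static double[] continuousFeatures(String record) {
        String[] f = record.split(",");
        double[] out = new double[f.length - 1 - SYMBOLIC.size()];
        int j = 0;
        for (int i = 0; i < f.length - 1; i++) {   // skip the trailing label
            if (!SYMBOLIC.contains(i)) out[j++] = Double.parseDouble(f[i]);
        }
        return out;
    }

    // Min-max normalization to [0, 1], per attribute, over the whole set.
    static void normalize(List<double[]> rows) {
        int d = rows.get(0).length;
        for (int a = 0; a < d; a++) {
            double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
            for (double[] r : rows) { min = Math.min(min, r[a]); max = Math.max(max, r[a]); }
            double range = max - min;
            for (double[] r : rows) r[a] = range == 0 ? 0.0 : (r[a] - min) / range;
        }
    }

    public static void main(String[] args) {
        String rec = "0,tcp,http,SF,181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,"
                   + "8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,9,9,1.00,0.00,0.11,"
                   + "0.00,0.00,0.00,0.00,0.00,normal.";
        List<double[]> rows = new ArrayList<>();
        rows.add(continuousFeatures(rec));   // 34 continuous values
        // Normalization is only meaningful over many records; with a single
        // record every attribute collapses to 0. The call shows the mechanics.
        normalize(rows);
        System.out.println(Arrays.toString(rows.get(0)));
    }
}

In a real run, every record of the 10% data set would be parsed into rows before normalize is called, so each attribute's minimum and maximum are taken over the full set, as the experimental settings describe.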