A Survey on Data Classification with the IRIS Flower Data Set
by Akira Imada (March 2006)
From the context of artificial immune systems (AIS).
Recent work has investigated the combination of clonal selection with negative selection in the AIS. Detectors (in the form of classification rules) are evolved using the clonal selection algorithm: parent detectors compete in groups, with only the rule that matches a non-self antigen having its fitness increased. The fittest parents are randomly picked for reproduction, and if child detectors match any 'self' antigens, they are removed (negative selection), with parents generating new children. The combination of these two processes results in the evolution of a population of detectors that are clustered into niches, and that can together distinguish between self and non-self data. For full details, refer to (Kim and Bentley, 2001).
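The negative selection step described above (child detectors that match any 'self' antigen are removed) can be sketched as follows. This is only an illustration, not the implementation of Kim and Bentley: the interval-based detector representation, the names `matches` and `negative_selection`, and the toy antigen values are all assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical representation: one (low, high) interval per attribute.
Detector = List[Tuple[float, float]]
Antigen = List[float]


def matches(detector: Detector, antigen: Antigen) -> bool:
    """A detector matches when every attribute lies inside its interval."""
    return all(lo <= x <= hi for (lo, hi), x in zip(detector, antigen))


def negative_selection(children: List[Detector],
                       self_set: List[Antigen]) -> List[Detector]:
    """Keep only the child detectors that match no 'self' antigen."""
    return [d for d in children
            if not any(matches(d, s) for s in self_set)]


# Toy 'self' antigens with two attributes each (made-up values).
self_set = [[5.1, 3.5], [4.9, 3.0]]
children = [
    [(4.0, 5.5), (2.5, 4.0)],  # covers the self antigens, so it is removed
    [(6.0, 8.0), (2.0, 4.0)],  # covers no self antigen, so it survives
]
survivors = negative_selection(children, self_set)
```

In the scheme described above, the parents of removed children would then generate new children; that loop is omitted here.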
The iris data set (50 examples each of setosa, virginica and versicolour).
A tenfold cross-validation method was employed to prepare
training sets for the AIS to evolve and test sets to evaluate
detection of previously unseen non-self patterns. A detector
population size of 300 was used. Each experiment was run
for a maximum of 50 generations.
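A k-fold split of the 150 iris examples can be sketched with the standard library alone. This is a minimal sketch, not the procedure used in the experiment: the function name `k_fold_indices`, the shuffling seed, and the round-robin assignment of examples to folds are assumptions, and the original study may have stratified the folds by class.

```python
import random
from typing import List, Tuple


def k_fold_indices(n: int, k: int,
                   seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return k (train, test) index pairs covering range(n)."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin into k folds.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        splits.append((train, test))
    return splits


# Tenfold split of 150 examples: 135 for training, 15 for testing per fold.
splits = k_fold_indices(n=150, k=10)
```

Each example appears in exactly one test set, so every example is used once as previously unseen data.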
(TP, FP) = (99.8%, 1.2%), (95%, 5%) and (95.6%, 1%) for
Setosa, Versicolor, and Virginica, respectively.
... these results illustrate the benefits of exploiting
the natural capabilities of evolution. The combination of
immune processes produced niches of distributed detectors,
which together discovered patterns in data that
distinguished 'self' from 'non-self'.
An IDS usually comprises two main components: an anomaly detector and a misuse detector (Mukherjee et al., 1994). The anomaly detector establishes profiles of the normal activities of users, systems, system resources, network traffic and/or services, and detects intrusions by identifying significant deviations from the normal behaviour patterns observed in those profiles.
The misuse detector defines suspicious misuse signatures based on known system vulnerabilities and a security policy. This component checks whether these misuse signatures are present in the audit trails.
This is perhaps because
the given problem of iris data is relatively easy and
thus the minimum sample size is good enough to show a
good detection rate. In other words, fairly general
detectors can detect all existing non-self antigen patterns
in the iris data set.
K-means
The accuracy (AC) is the proportion of the total number of correct classifications,... AC is not sufficient to evaluate the classifier's performance when the number of instances of one class is overwhelmingly greater than the other [18]. For example, suppose there are 10 000 instances, 9 990 of which are negative and 10 of which are positive. If all of them are classified as negative, the accuracy is 99.9% even though all of the positive instances are misclassified. For binary classifiers, the true positive rate (TP) and the false positive rate (FP) are also used to reinforce the accuracy [18]. True
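The imbalance example above can be checked with a few lines of arithmetic. This is a sketch under assumptions: the function name `rates` is hypothetical, and the true positive rate and false positive rate follow the usual confusion-matrix definitions, which the excerpt itself does not spell out.

```python
def rates(tp: int, fn: int, fp: int, tn: int):
    """Return (accuracy, true positive rate, false positive rate)."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    tp_rate = tp / (tp + fn)  # share of positives that were detected
    fp_rate = fp / (fp + tn)  # share of negatives flagged as positive
    return accuracy, tp_rate, fp_rate


# 10 positives and 9 990 negatives, everything classified as negative:
# no true positives, 10 false negatives, no false positives.
accuracy, tp_rate, fp_rate = rates(tp=0, fn=10, fp=0, tn=9990)
```

Here `accuracy` comes out at 0.999 while `tp_rate` is 0.0, which is the gap between the two measures that the example is meant to expose.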
Two-fold cross-validation is used for evaluating the classification methods. The
Iris data are divided into two halves. One half is used for training and the other
half for testing.
This survey analyzed
ten different data sets (Simple Seven; Balloons; Contact Lenses; Shuttle O-rings;
Monks problem; Iris Flower; Congress Voting; Liver Disorders; Cars; Wines)
using five visualization techniques (Parallel Coordinates; Scatter Plot Matrices;
Survey Plots; Circle Segments; and Radviz).
All of the data sets, except the Simple Seven set, are from the UC Irvine Machine
Learning Repository.
This invites a tradeoff in using synthetic vs. real data.
Synthetic data is harder to construct, but the "correct" answers are known.
Real data is easier to collect, but it is harder to evaluate performance,
because it is nearly impossible to know the correct output to a task
in any reasonably sized data set.
The Iris Database is perhaps the most often used data set in pattern recognition,
statistics, data analysis, and machine learning. The task is to predict the class of the
flower based on the 4 physical attribute measurements. There are 150 instances, 3
classes, and 4 numeric attributes. The data set is #46 from the UC collection. The
dimensions are: sepal-length; sepal-width; petal-length; petal-width; and class
(Setosa, Versicolour, Virginica). The cardinalities are 35, 23, 43, 22, 3, and 150
(cases), respectively; the PoC is equal to 342688500, and the log of PoC equals 19.65.
One class is linearly separable from the other two, but the other two are not linearly
separable from each other.
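The PoC figure can be reproduced from the cardinalities quoted above. The sketch assumes that PoC denotes the product of the listed cardinalities (including the number of cases) and that the logarithm is the natural one; the text states neither, but both assumptions match the quoted numbers. `math.prod` requires Python 3.8 or later.

```python
import math

# Cardinalities as quoted in the text, with the number of cases last.
cardinalities = {
    "sepal-length": 35,
    "sepal-width": 23,
    "petal-length": 43,
    "petal-width": 22,
    "class": 3,
    "cases": 150,
}

poc = math.prod(cardinalities.values())  # product of cardinalities
log_poc = math.log(poc)                  # natural logarithm

print(poc, round(log_poc, 2))  # 342688500 19.65
```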