A Survey on Data Classification with the IRIS flower data set

by Akira Imada (March 2006)


From the context of Artificial Immune Systems.

Recent work has investigated the combination of clonal selection with negative selection in the AIS. Detectors (in the form of classification rules) are evolved using the clonal selection algorithm: parent detectors compete in groups, with only the rule that matches a non-self antigen having its fitness increased. The fittest parents are randomly picked for reproduction, and if child detectors match any 'self' antigens, they are removed (negative selection), with parents generating new children. The combination of these two processes results in the evolution of a population of detectors that are clustered into niches, and that can together distinguish between self and non-self data. For full details, refer to (Kim and Bentley, 2001).
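The loop described above can be sketched in a few lines of Python. This is only a toy illustration, not the implementation of Kim and Bentley: the 2-D antigen data, the interval-based rule encoding, the Gaussian mutation, and all parameter values (population size, group size, number of tries) are assumptions made for the example.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical toy antigens (not the iris data): 2-D points.
self_set = [(rng.uniform(0, 1), rng.uniform(0, 1)) for _ in range(30)]
nonself_set = [(rng.uniform(2, 3), rng.uniform(2, 3)) for _ in range(30)]


def random_detector():
    """A detector is a rule: one (low, high) interval per attribute."""
    return tuple(tuple(sorted((rng.uniform(0, 3), rng.uniform(0, 3))))
                 for _ in range(2))


def matches(detector, point):
    return all(lo <= x <= hi for (lo, hi), x in zip(detector, point))


def matches_self(detector):
    return any(matches(detector, s) for s in self_set)


def new_self_free_detector():
    """Draw random detectors until one matches no self antigen."""
    while True:
        detector = random_detector()
        if not matches_self(detector):
            return detector


def mutate(detector):
    """Jitter each interval bound with Gaussian noise (assumed operator)."""
    return tuple(tuple(sorted((lo + rng.gauss(0, 0.2), hi + rng.gauss(0, 0.2))))
                 for lo, hi in detector)


def evolve(pop_size=20, generations=30, group_size=4, max_tries=50):
    population = [new_self_free_detector() for _ in range(pop_size)]
    for _ in range(generations):
        fitness = [0] * pop_size
        # Clonal selection: a random group of parents competes on each
        # non-self antigen; one matching member has its fitness increased.
        for antigen in nonself_set:
            group = rng.sample(range(pop_size), group_size)
            winners = [i for i in group if matches(population[i], antigen)]
            if winners:
                fitness[rng.choice(winners)] += 1
        # The fitter half survives and is picked at random for reproduction.
        ranked = sorted(range(pop_size), key=lambda i: fitness[i], reverse=True)
        parents = [population[i] for i in ranked[:pop_size // 2]]
        children = []
        while len(parents) + len(children) < pop_size:
            parent = rng.choice(parents)
            child = parent  # fallback if every mutated child is rejected
            for _ in range(max_tries):
                candidate = mutate(parent)
                # Negative selection: drop children that match a self antigen.
                if not matches_self(candidate):
                    child = candidate
                    break
            children.append(child)
        population = parents + children
    return population


detectors = evolve()
detection_rate = sum(any(matches(d, a) for d in detectors)
                     for a in nonself_set) / len(nonself_set)
```

Because every detector is checked against the self set before it enters the population, no surviving detector matches a self antigen; how much of the non-self set is covered depends on the random seed and the assumed parameters.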

The experiments used the iris data set (50 examples each of setosa, versicolour and virginica). A tenfold cross-validation method was employed to prepare training sets for the AIS to evolve and test sets to evaluate detection of previously unseen non-self patterns. A detector population size of 300 was used. Each experiment was run for a maximum of 50 generations.
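A tenfold split of the 150 examples can be sketched as follows. The excerpt does not say whether the folds were stratified by class; stratification, the function name stratified_folds, and the seed are assumptions made for this illustration.

```python
import random


def stratified_folds(labels, k=10, seed=0):
    """Split example indices into k folds with equal class counts per fold."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin into the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


# 50 examples per class, as in the iris data set.
labels = ["setosa"] * 50 + ["versicolour"] * 50 + ["virginica"] * 50
folds = stratified_folds(labels)

splits = []
for test_idx in folds:
    held_out = set(test_idx)
    train_idx = [i for i in range(len(labels)) if i not in held_out]
    splits.append((train_idx, test_idx))  # train on 135, test on 15
```

Each of the ten rounds trains on nine folds and tests on the remaining one, so every example is used for testing exactly once.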

(TP, FP) = (99.8%, 1.2%), (95%, 5%) and (95.6%, 1%) for Setosa, Versicolor, and Virginica, respectively.

... these results illustrate the benefits of exploiting the natural capabilities of evolution. The combination of immune processes produced niches of distributed detectors, which together discovered patterns in data that distinguished 'self' from 'non-self'.

An IDS is usually comprised of two main components: an anomaly detector and a misuse detector (Mukherjee et al., 1994). The anomaly detector establishes the profiles of normal activities of users, systems, system resources, network traffic and/or services and detects intrusions by identifying significant deviations from the normal behaviour patterns observed from profiles. The misuse detector defines suspicious misuse signatures based on known system vulnerabilities and a security policy. This component probes whether these misuse signatures are present or not in the auditing trails.

... This is perhaps because the given problem of iris data is relatively easy and thus the minimum sample size is good enough to show a good detection rate. In other words, fairly general detectors can detect all existing non-self antigen patterns in the iris data set.

K-means

The accuracy (AC) is the proportion of the total number of classifications that are correct ... AC is not sufficient to evaluate the classifier's performance when the number of instances of one class is overwhelmingly greater than the other [18]. For example, there are 10 000 instances, 9 990 of which are negative and 10 of which are positive. If all of them are classified as negative, the accuracy is 99.9% even though all of the positive instances are misclassified. For binary classifiers, the true positive rate (TP) and the false positive rate (FP) are also used to supplement the accuracy [18]. True ...

Two-fold cross-validation is used for evaluating the classification methods. The Iris data are divided into two halves. One half is used for training and the other half for testing.

This survey analyzed ten different data sets (Simple Seven; Balloons; Contact Lenses; Shuttle O-rings; Monks problem; Iris Flower; Congress Voting; Liver Disorders; Cars; Wines) using five visualization techniques (Parallel Coordinates; Scatter Plot Matrices; Survey Plots; Circle Segments; and Radviz). All of the data sets, except the Simple Seven set, are from the UC Irvine Machine Learning Repository.
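The imbalanced example above (10 000 instances, 10 of them positive, all classified as negative) can be checked with a short calculation. The excerpt breaks off before defining TP and FP, so the usual definitions are assumed here: TP rate = TP / (TP + FN) and FP rate = FP / (FP + TN). The function name rates is chosen only for this sketch.

```python
def rates(tp, fp, tn, fn):
    """Return (accuracy, true positive rate, false positive rate)."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Assumed standard definitions; guard against empty classes.
    tp_rate = tp / (tp + fn) if (tp + fn) else 0.0
    fp_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return accuracy, tp_rate, fp_rate


# Every instance is labelled negative: the 9 990 negatives are correct
# (true negatives) and the 10 positives are missed (false negatives).
accuracy, tp_rate, fp_rate = rates(tp=0, fp=0, tn=9990, fn=10)
```

The accuracy comes out at 0.999 while the true positive rate is 0, which is why accuracy alone says little about the minority class.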

This invites a tradeoff in using synthetic vs. real data. Synthetic data is harder to construct, but the "correct" answers are known. Real data is easier to collect, but it is harder to evaluate performance, because it is nearly impossible to know the correct output to a task in any reasonably sized data set.

The Iris Database is perhaps the most often used data set in pattern recognition, statistics, data analysis, and machine learning. The task is to predict the class of the flower based on the 4 physical attribute measurements. There are 150 instances, 3 classes, and 4 numeric attributes. The data set is #46 from the UC collection. The dimensions are: sepal-length; sepal-width; petal-length; petal-width; and class (Setosa, Versicolour, Virginica). The cardinality is 35, 23, 43, 22, 3, and 150 (cases), respectively; the PoC (the product of these six values) is equal to 342688500, and the natural log of PoC equals 19.65. One class is linearly separable from the other two, but the other two are not linearly separable from each other.
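The quoted PoC figure can be reproduced by multiplying the listed cardinalities, including the 150 cases, and taking the natural logarithm. Reading "PoC" as this product and "log" as the natural log is inferred from the arithmetic, not stated in the excerpt; the dictionary keys below are labels chosen for this sketch.

```python
import math

# Number of distinct values per dimension, as quoted in the text,
# plus the number of cases.
cardinalities = {
    "sepal-length": 35,
    "sepal-width": 23,
    "petal-length": 43,
    "petal-width": 22,
    "class": 3,
    "cases": 150,
}

poc = math.prod(cardinalities.values())  # product of all six values
log_poc = math.log(poc)                  # natural logarithm
```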