A Survey on Database Search as Needles
- Database Search as Needles
-
K. Crammer and G. Chechik (2004)
"A Needle in a Haystack: Local One-Class Optimization."
(8 pages)
- "This paper addresses the problem of finding a small and coherent subset of points
in a given data."
- "This problem, sometimes referred to as one-class or set covering, requires to find
a small-radius ball that covers as many data points as possible."
- "Most previous approaches to this problem focus on identifying and discarding a possible
set of outliers. In this paper we adopt an opposite approach which directly aims to find a
small set of coherently structured regions, by using a loss function that focuses on local
properties of the data."
- "The goal of unsupervised learning is to extract concise descriptions
of data given empirical samples. However, one
is often only interested in modeling small parts of the data
and ignore its remaining irrelevant parts. This is for example
the case when the data consists of small groups of
coherently structured data points, and the rest of the data
are irrelevant or mere noise."
- "It is a common scenario in
a variety of applications from detecting regions of interest
in images to finding subsets of co-expressed genes in
genome-wide experiments."
- "A related problem is demonstrated by the task of information
retrieval by search engines. In this application, given
a query, a handful of documents is sought out of billions
of potential web pages."
- ""In these two general scenarios, one looks for a small but
coherent subsets of data items, which can be achieved by
finding a small-radius ball that covers as many data points
as possible. This problem is known as one-class classification
(Tax & Duin, 1999), and has two opposite approaches
for its solution: Most previous approaches formulate the
task as a problem of outlier and novelty detection, in which
most of the samples are identified as relevant. Here we take
the opposite approach and try to identify a small subset of
relevant samples, rather than keep all but few outliers. As
we will see below, this "needle in a haystack" approach requires
a different formulation of the detection problem."
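The "small-radius ball" objective can be illustrated with a brute-force sketch; this is not the paper's optimization method, only a toy of my own in which the candidate centers are the data points themselves and the radius is fixed.

    import numpy as np

    def best_ball(points, radius):
        """Brute-force illustration of the one-class objective: among balls of a
        fixed radius centered at the data points (a simplifying assumption, not
        Crammer & Chechik's algorithm), return the center covering the most points."""
        pts = np.asarray(points, dtype=float)
        d2 = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
        coverage = (d2 <= radius ** 2).sum(axis=1)               # points inside each candidate ball
        best = int(coverage.argmax())
        return pts[best], int(coverage[best])

    # toy usage: a tight cluster (the "needle") buried in scattered noise (the "hay")
    rng = np.random.default_rng(0)
    data = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.uniform(-5, 5, (200, 2))])
    center, covered = best_ball(data, radius=0.5)
    print(center, covered)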
-
M. V. Joshi, R. C. Agarwal, V. Kumar (????)
"Mining Needles in a Haystack: Classifying Rare Classes via Two-Pase Rule Induction"
(12 pages)
- "Learning models to classify rarely occurring target classes is an important problem
with applications in network intrusion detection, fraud detection,
or deviation detection in general."
-
"One of the important problems in data mining, commonly known as deviation detection,
is that of modeling rarely occurring phenomena in the data."
-
"Critical observation of these two problems leads us to the focal point of this paper,
which is to analyze the effectiveness of our recently proposed two-phase rule-induction
approach [1], especially in the context of building models for rare classes."
[1] R. C. Agarwal and M. V. Joshi (2001)
"PNrule: A new framework for learning classifier models in data mining
(A case-study in network intrusion detection)"
Proceedings of the 1st SIAM Conference on Data Mining
-
"The traditional evaluation metric of accuracy is not adequate when the target-class is rare.
If the class is very rare, say 0.5%, then predicting everything to be of non-target-class
can also achieve very high accuracy level of 99.5%"
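A quick check of the arithmetic behind that claim (my own toy numbers, chosen to match the quoted 0.5% rarity): the degenerate model that predicts non-target everywhere is right 99.5% of the time yet never finds a single needle.

    total = 100_000
    positives = 500                   # 0.5% target class
    negatives = total - positives

    # "predict everything as non-target": every negative is correct, every positive is missed
    accuracy = negatives / total      # 0.995
    recall_on_target = 0 / positives  # 0.0
    print(accuracy, recall_on_target)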
-
"We use one such metric, called F-measure, widely used by the information retrieval
community [14]."
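For reference, the F-measure they point to is the harmonic mean of precision and recall (the standard IR definition; the beta-weighted variant is included for completeness, and the numbers below are only the toy figures from the previous note).

    def f_measure(tp, fp, fn, beta=1.0):
        """Standard IR F-measure: beta=1 gives the harmonic mean of precision and recall."""
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision == 0.0 and recall == 0.0:
            return 0.0
        b2 = beta ** 2
        return (1 + b2) * precision * recall / (b2 * precision + recall)

    # the "predict nothing as target" model from the previous note scores 0 here,
    # even though its accuracy is 99.5%
    print(f_measure(tp=0, fp=0, fn=500))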
-
We start our comparative study with a simple model for generating <>, ...
-
G. M. Weiss
"Mining with Rarity: A Unifying Framework"
(13 pages)
- "This article
discusses the role that rare classes and rare cases play in data mining."
- "Examples abound and include ......and detecting oil spills from satellite images [34]."
- "this approach
identifies regions likely to contain needles in the first
phase and then learns to discard the strands of hay within these
regions in the second phase."
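The quoted two-phase idea can be sketched roughly as follows; this is my paraphrase of the description above, not the actual PNrule code, and learn_rules / apply_rules are hypothetical helpers supplied by the caller.

    def two_phase_fit(X, y, learn_rules, apply_rules):
        """Sketch of the two-phase scheme. X: NumPy feature array, y: NumPy boolean
        array (True = needle). learn_rules(X, y) returns a high-recall rule set for
        the True class; apply_rules(rules, X) returns a boolean coverage mask."""
        # Phase 1: find regions likely to contain needles (favor recall on positives).
        p_rules = learn_rules(X, y)
        covered = apply_rules(p_rules, X)
        # Phase 2: within those regions only, learn rules that single out the
        # remaining hay (false positives) so it can be discarded.
        n_rules = learn_rules(X[covered], ~y[covered])
        return p_rules, n_rules

    def two_phase_predict(p_rules, n_rules, X, apply_rules):
        covered = apply_rules(p_rules, X)
        hay = apply_rules(n_rules, X)
        return covered & ~hay  # predicted needles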
-
A. Whitaker, R. S. Cox, and S. D. Gribble
"Configuration Debugging as Search: Finding the Needle in the Haystack"
(14 pages)
-
"The goal of this work is to reduce the burden on
human experts by partially automating problem diagnosis.
In particular, we analyze the applicability of search
techniques for diagnosing configuration errors."
"Our insight is that although computers cannot compete with
human intuition, they are very effective at exploring a
large configuration space. Our diagnosis tool, which we
call Chronus, uses search to identify the specific time in
the past when a system transitioned from a working to
a non-working state, ..."
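If the working-or-broken test is monotone over time, i.e. once the system breaks it stays broken (an assumption on my part; the quote only says Chronus uses search), finding the transition point reduces to a binary search over saved system states.

    def find_transition(states, still_works):
        """Binary search over chronologically ordered system snapshots, assuming
        states[0] works, states[-1] does not, and the failure persists once
        introduced. An illustration of search-based diagnosis, not the Chronus
        implementation. Returns the index of the first non-working state."""
        lo, hi = 0, len(states) - 1          # states[lo] works, states[hi] fails
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if still_works(states[mid]):
                lo = mid
            else:
                hi = mid
        return hi

    # toy usage: the configuration "broke" at snapshot 13
    snapshots = list(range(20))
    print(find_transition(snapshots, lambda s: s < 13))  # -> 13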
-
M. Kubat, R. C. Holte, and S. Matwin
"Machine Learning for the Detection of Oil Spills in Satellite Radar Images"
(24 pages)
-
"The full image consists of 8,000x8,000 pixels, with
each pixel representing a square of 30x30m; the fragment shown here is approxi-
mately 70 50 kilometers. The oil slick is the prominent elongated dark region in
the upper right of the picture. The dark regions in the middle of the picture and the
lower left are lookalikes, most probably wind slicks (winds with speeds exceeding
10m/sec decrease the re
ectance of the radar, hence the aected area looks darker
in a radar image).""
-
"The first critical feature is the scarcity of data. Although satellites are continu-
ally producing images, most of these images contain no oil spills"
-
"In addition to the genuine infrequency of oil spills and the
limited time of the expert, the data available was restricted by financial consider-
ations: images cost hundreds, sometimes thousands of dollars each. We currently
have 9 carefully selected images containing a total of 41 oil slicks. While many
applications work with large amounts of available data (Catlett, 1991), our domain
application is certainly not unique in its data scarcity. For example, in the drug
activity application reported by Dietterich et al. (1997) the two datasets contain
47 and 39 positive examples respectively."
"The second critical feature of the oil spill domain can be called an imbalanced
training set: there are very many more negative examples (lookalikes) than posi-
tive examples (oil slicks). Against the 41 positive examples we have 896 negative
examples; the majority class thus comprises almost 96% of the data. Highly im-
balanced training sets occur in applications where the classifier is to detect a rare
but important event, such as fraudulent telephone calls (Fawcett & Provost, 1996),
unreliable telecommunications customers (Ezawa et al., 1996), failures or delays
in a manufacturing process (Riddle et al., 1994), or rare diagnoses (e.g., the thy-
roid diseases in the UCI repository (Murphy & Aha, 1994)). Extremely imbalanced
classes also arise in information retrieval and filtering tasks: in the domain studied
by Lewis & Catlett (1994), only 0.2% (1 in 500) examples are positive. In the
high-energy physics learning problem reported by Clearwater & Stern (1991), only
1 example in a million is positive."
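Checking the quoted figure (simple arithmetic, my own verification): 896 negatives against 41 positives means the majority class makes up 896 of 937 examples.

    positives, negatives = 41, 896
    print(negatives / (positives + negatives))  # ~0.956, i.e. the majority class is almost 96%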
"The third critical feature is
that examples are naturally grouped in batches: ...
Whenever data
is collected in batches, there is a possibility that the batches systematically differ
from one another, or that there is a much greater similarity of examples within
a batch than between batches. ...
In our domain, for example, the exact parameter
settings of the radar imaging system or low level image processing are necessarily
the same for examples within a batch but could be dierent for dierent batches.
Clearly, in our case, the classier will be learned from one set of images, and it will
be applied on images that were not part of this set. This fact should be taken into
account in the evaluation of the system
"
"The nal critical feature relates to the performance task. The classier will be used
as a lter | it will decide which images to present to a human. This requirement is
quite pervasive in real-world applications. Fraud detection, credit scoring, targeted
marketing, evaluation of EEG signals | all of these domains require that a human
expert be able to decide how many \suspicious" cases to pursue. The system must
provide the user with a convenient means of varying \specicity" (higher specicity
means fewer false alarms at the cost of increased risk of missing a genuine oil spill)."
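In practice that "convenient means of varying specificity" usually comes down to a decision threshold on the classifier's suspicion score; sweeping it trades false alarms against missed spills. A minimal sketch with made-up scores (my illustration, not the paper's mechanism):

    import numpy as np

    def alarms_at_threshold(scores, labels, threshold):
        """Given suspicion scores and true labels (1 = oil spill), report how many
        cases would be flagged for the operator in error and how many true spills
        would be missed. Raising the threshold raises specificity: fewer false
        alarms, more missed spills."""
        scores, labels = np.asarray(scores), np.asarray(labels)
        flagged = scores >= threshold
        false_alarms = int((flagged & (labels == 0)).sum())
        missed_spills = int((~flagged & (labels == 1)).sum())
        return false_alarms, missed_spills

    # toy sweep: ~4% positives with noisy scores
    rng = np.random.default_rng(1)
    y = rng.random(1000) < 0.04
    s = np.where(y, rng.normal(0.7, 0.2, 1000), rng.normal(0.3, 0.2, 1000))
    for t in (0.3, 0.5, 0.7):
        print(t, alarms_at_threshold(s, y, t))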