Survey on Effective Search for A-Needle-in-a-Haystack
Since 27 July 2006
BY Akira Imada
Those papers below are by exploring google with key-words "pdf effective-search needle"
Needle search in a real sence of it
-
Lov K. Grover
"Quantum Mechanics helps in searching for a needle in a haystack."
(4 pages).
The search problem is
this: there is an unsorted database containing N items out
of which just one item satisfies a given condition -- that
one item has to be retrieved. Once an item is examined,
it is possible to tell whether or not it satisfies the condition
in one step. However, there does not exist any sorting on
the database that would aid its selection. The most efficient
classical algorithm for this is to examine the items in
the database one by one.
If an item satisfies the required
condition, stop; if it does not, keep track of this item so
that it is not examined again. It is easily seen that this
algorithm will need to examine an average of 0.5N items
before finding the desired item.
In a system in which the states are described by n bits (it has N=2^n
possible states), we can perform the transformation M on
each bit independently in sequence thus changing the state
of the system.
"Problem (Planted Motif Search (PMS)): Input are t sequences of length n each. Input also are two integers l and d. The problem is to find a motif (i.e., a sequence) M of length l. It is given that each input sequence contains a variant of M. The variants of interest are sequences that are at a hamming distance of d from M.
"Problem 1: A pattern is a string of symbols (also called residues) and ?'s. A "?" refers to a wild card character. A pattern cannot begin or end with ?. AB?D, EB??DS?R, etc. are examples of patterns. The length of a pattern is the number of characters in it (including the wildcard characters). This problem takes as input a database DB of sequences. The goal is to identify all the patterns of length at most P (with anywhere from 0 to integer not exceeding P/2 wild card characters). In particular, the output should be all the patterns together with a count of how many times each pattern occurs. Optionally a threshold value for the number of occurrences could be supplied."
(In Conclusion Section they wrote: )
"We have presented the design and evaluation of an online system for identifying important BGP routing changes in an IP network. Using the concise r-vector data structure to capture BGP routing changes, we identified five categories of BGP routing disruptions that vary in the severity of the impact on the traffic."
(The above suggests that the problem is BGP routing changes in an IP network. And the result was)
"Applying the tool to eight weeks of routing and traffic data from a tier-1 ISP network, we identified several ways for operators to improve the routing stability of the network."
(They claimed that)
"... our tool surprisingly discovered a large number of updates from persistently flapping prefixes and identified three causes. Meanwhile, we found that hot-potato routing changes and eBGP session resets were responsible for many of the large routing disruptions... Finally, we plan to explore automated techniques for responding to disruptions by reconfiguring the routing protocols to improve network performance."
(Then to know the motivation or what is BGP = Border Gatway Protocol, let's see the Abstract, which reads: )
"The performance of a backbone network is vulnerable to interdomain routing changes that affect how traffic travels to destinations in other Autonomous Systems (ASes). Despite having poor visibility into these routing changes, operators often need to react quickly by tuning the network configuration to alleviate congestion or by notifying other ASes about serious reachability problems. Fortunately, operators can improve their visibility by monitoring the Border Gateway Protocol (BGP) decisions of the routers at the periphery of their AS. However, the volume of measurement data is very large and extracting the important information is challenging. In this paper, we present the design and evaluation of an online system that converts millions of BGP update messages a day into a few dozen actionable reports about significant routing disruptions."
"In this paper, we address the challenge of analyzing a large volume of BGP update messages from multiple routers in real time to produce a small number of meaningful alerts for the operators."
(This is guessed to be a-needle-in-a-haystack)
(But how?)
"We define a category called internal disruption that pinpoints the events caused by internal topology changes."
"... our system significantly reduces the volume of data and produces only a few dozen large routing disruptions from millions of BGP updates per day from the periphery of the network."
"We define an event as a sequence of BGP updates for the same prefix from any border router where the inter-arrival time is less than a predefined event timeout."
(In short)
"we transform raw BGP update messages into routing events."
By applying our RouteTracker module to eight weeks
of measurement data, we generated reports for about 23
prefixes per day, on average,
Since each of the n elements in the r-vector can have five different types of changes, routing events could fall into 5^n different categories, which would be extremely unwieldy for generating reports for network operators. Instead, we classify the events based on the severity of the impact on the traffic, leading to five disjoint categories:
(1) Distant/transient disruption -- if each element of the r-vector has ;
(2) Internal disruption -- if the change of each of the elements in its r-vector is either of type or of type , with at least one element undergoing an ;
(3) Single external disruption -- if only one r-vector element has a change of type , , or ;
(4) Multiple external disruptions -- if multiple r-vector elements have a change of type , , or , and the r-vector includes at least one eBGP-learned route before and after the event; and
(5) Loss/gain of reachability -- <> if every r-vector element with an external route experiences a and <> if initially no eBGP-learned routes exist and at least one r-vector element experiences a .
"By clustering events across time for the same prefix, we identify destination prefixes that have unstable routes. By clustering events of the same type across prefixes, we group events that appear to have a common cause.
What is r-vector
hot-potate routing changes
eBGP session resets
To start with
For finding errors in very large state spaces, our experiments show that a genetic search using simple heuristics can significantly outperform random and systematic searches.
When a problem is computationally too hard to solve using an exact and complete algorithm, it is common in computer science to explore the use of heuristics in order to find approximate solutions to the problem, or to converge faster towards some solutions.
To start with
Langdon et al. [20] shows that for these parameters the
search space for a tree is very large and the problem is
essentially a needle-in-a-haystack problem. Hence other
search mechanisms like random search and exhaustive
search would take inordinate time [20]. GP has been shown
to perform well under such conditions.
Sceduling
... Monte Carlo methods. These methods, first introduced by von Neumann and Ulam,
An approach based at least in part on random searches was considered a promising place to begin our investiga-tions. Intuitive justification for this decision lies largely in the realization that many highly effective algorithms, such as genetic ones, admit randomness as an urgently needed es-cape hatch from local (but non-global) optima (Gen and Cheng 2000).
Figure 3 shows that the Monte Carlo Search outperforms other algorithms, while the local
search seems to have the worst performance of all. (It is observed that if the computer run
time is our constraint, Modified Monte Carlo Search out-
performs Local search. Otherwise, in a long run, the Local
Search may obtain a better schedule than the Modified
Monte Carlo Search does.)
the Modified Monte Carlo Search performs almost equivalently to the Monte Carlo Search, if not better.