ICNNAI-2006 Call for Participation

Survey on Effective Search for A-Needle-in-a-Haystack

Since 27 July 2006
BY Akira Imada

Those papers below are by exploring google with key-words "pdf effective-search needle"

Needle search in a real sence of it

Lov K. Grover "Quantum Mechanics helps in searching for a needle in a haystack." (4 pages).

Lov K. Grover "Quantum Mechanics helps in searching for a needle in a haystack." (4 pages). The search problem is this: there is an unsorted database containing N items out of which just one item satisfies a given condition -- that one item has to be retrieved. Once an item is examined, it is possible to tell whether or not it satisfies the condition in one step. However, there does not exist any sorting on the database that would aid its selection. The most efficient classical algorithm for this is to examine the items in the database one by one. If an item satisfies the required condition, stop; if it does not, keep track of this item so that it is not examined again. It is easily seen that this algorithm will need to examine an average of 0.5N items before finding the desired item.

In a system in which the states are described by n bits (it has N=2^n possible states), we can perform the transformation M on each bit independently in sequence thus changing the state of the system.

S. Rajasekaran, S. Balla, and C. Huang (2004) "Exact Algorithms for Planted Motif Challenge Problems." (11 pages).

"Problem (Planted Motif Search (PMS)): Input are t sequences of length n each. Input also are two integers l and d. The problem is to find a motif (i.e., a sequence) M of length l. It is given that each input sequence contains a variant of M. The variants of interest are sequences that are at a hamming distance of d from M.

S. Rajasekaran, S. Balla, C. Huang, V. Thapar, M Gryk, M. Machijewski, and M. Schiller "Exact Algorithms for Motif Search." (11 pages).

"Problem 1: A pattern is a string of symbols (also called residues) and ?'s. A "?" refers to a wild card character. A pattern cannot begin or end with ?. AB?D, EB??DS?R, etc. are examples of patterns. The length of a pattern is the number of characters in it (including the wildcard characters). This problem takes as input a database DB of sequences. The goal is to identify all the patterns of length at most P (with anywhere from 0 to integer not exceeding P/2 wild card characters). In particular, the output should be all the patterns together with a count of how many times each pattern occurs. Optionally a threshold value for the number of occurrences could be supplied."

J. Wu, Z. M. Mao, J. Rexford, and J. Wang "Finding a Needle in a Haystack: Pinpointing Significant BGP Routing Changes in an IP Network." (14 pages).

(In Conclusion Section they wrote: ) "We have presented the design and evaluation of an online system for identifying important BGP routing changes in an IP network. Using the concise r-vector data structure to capture BGP routing changes, we identified five categories of BGP routing disruptions that vary in the severity of the impact on the traffic."

(The above suggests that the problem is BGP routing changes in an IP network. And the result was)

"Applying the tool to eight weeks of routing and traffic data from a tier-1 ISP network, we identified several ways for operators to improve the routing stability of the network."

(They claimed that) "... our tool surprisingly discovered a large number of updates from persistently flapping prefixes and identified three causes. Meanwhile, we found that hot-potato routing changes and eBGP session resets were responsible for many of the large routing disruptions... Finally, we plan to explore automated techniques for responding to disruptions by reconfiguring the routing protocols to improve network performance."

(Then to know the motivation or what is BGP = Border Gatway Protocol, let's see the Abstract, which reads: ) "The performance of a backbone network is vulnerable to interdomain routing changes that affect how traffic travels to destinations in other Autonomous Systems (ASes). Despite having poor visibility into these routing changes, operators often need to react quickly by tuning the network configuration to alleviate congestion or by notifying other ASes about serious reachability problems. Fortunately, operators can improve their visibility by monitoring the Border Gateway Protocol (BGP) decisions of the routers at the periphery of their AS. However, the volume of measurement data is very large and extracting the important information is challenging. In this paper, we present the design and evaluation of an online system that converts millions of BGP update messages a day into a few dozen actionable reports about significant routing disruptions."

"In this paper, we address the challenge of analyzing a large volume of BGP update messages from multiple routers in real time to produce a small number of meaningful alerts for the operators." (This is guessed to be a-needle-in-a-haystack)

(But how?)
"We define a category called internal disruption that pinpoints the events caused by internal topology changes."
"... our system significantly reduces the volume of data and produces only a few dozen large routing disruptions from millions of BGP updates per day from the periphery of the network."
"We define an event as a sequence of BGP updates for the same prefix from any border router where the inter-arrival time is less than a predefined event timeout."
(In short) "we transform raw BGP update messages into routing events."
By applying our RouteTracker module to eight weeks of measurement data, we generated reports for about 23 prefixes per day, on average,

Since each of the n elements in the r-vector can have five different types of changes, routing events could fall into 5^n different categories, which would be extremely unwieldy for generating reports for network operators. Instead, we classify the events based on the severity of the impact on the traffic, leading to five disjoint categories: (1) Distant/transient disruption -- if each element of the r-vector has ; (2) Internal disruption -- if the change of each of the elements in its r-vector is either of type or of type , with at least one element undergoing an ; (3) Single external disruption -- if only one r-vector element has a change of type , , or ; (4) Multiple external disruptions -- if multiple r-vector elements have a change of type , , or , and the r-vector includes at least one eBGP-learned route before and after the event; and (5) Loss/gain of reachability -- <> if every r-vector element with an external route experiences a and <> if initially no eBGP-learned routes exist and at least one r-vector element experiences a .

"By clustering events across time for the same prefix, we identify destination prefixes that have unstable routes. By clustering events of the same type across prefixes, we group events that appear to have a common cause.

What is r-vector hot-potate routing changes eBGP session resets

To start with

P. Godefroid and S. Khurshid (2002) "Exploring Very Large State Spaces using Genetic algorithms." TACAS 2002,LNCS 2280, pp. 266--280. (15 pages).

For finding errors in very large state spaces, our experiments show that a genetic search using simple heuristics can significantly outperform random and systematic searches.

When a problem is computationally too hard to solve using an exact and complete algorithm, it is common in computer science to explore the use of heuristics in order to find approximate solutions to the problem, or to converge faster towards some solutions.

To start with

W. Fan, Ming Luo, L. Wang, W. Xi, and E. A. Fox "Tuning Before Feedback: Combining Ranking Discovery and Blind Feedback for Robust Retrieval." (15 pages).

Langdon et al. [20] shows that for these parameters the search space for a tree is very large and the problem is essentially a needle-in-a-haystack problem. Hence other search mechanisms like random search and exhaustive search would take inordinate time [20]. GP has been shown to perform well under such conditions.

Sceduling

S. Chantaravarapan, A Gunal and E.J.Williams (2004) "On Using Monte Carlo Methods for Scheduling." Proceedings of the 2004 Winter Simulation Conference. pp 1870- (6 pages).

... Monte Carlo methods. These methods, first introduced by von Neumann and Ulam,

An approach based at least in part on random searches was considered a promising place to begin our investiga-tions. Intuitive justification for this decision lies largely in the realization that many highly effective algorithms, such as genetic ones, admit randomness as an urgently needed es-cape hatch from local (but non-global) optima (Gen and Cheng 2000).

Figure 3 shows that the Monte Carlo Search outperforms other algorithms, while the local search seems to have the worst performance of all. (It is observed that if the computer run time is our constraint, Modified Monte Carlo Search out- performs Local search. Otherwise, in a long run, the Local Search may obtain a better schedule than the Modified Monte Carlo Search does.)

the Modified Monte Carlo Search performs almost equivalently to the Monte Carlo Search, if not better.