Information Retrieval is one of
those disciplines that are pursued by a number of communities: it exists as an
important subject both within computing science and information science. In its
pursuit a number of topics come together: logic, probability, linguistics,
artificial intelligence and cognitive science. They all converge for the purpose
of designing and building a large-scale system that will store, manipulate,
retrieve, and display electronic information of any kind. Thus it is concerned
with objects that may be made up of text, audio, image, video, and graphics that
are stored in such a way that they are available for interaction by man, woman,
or machine. Its study brings together theory, experiment and practice: theory to
specify models for retrieval, experiment to test and evaluate the performance of
systems, and practice to take note of, and learn from, existing manual or
automatic retrieval systems.
The Information Retrieval
module consists of the following topics:
- Information retrieval models
- Implementation issues
- Natural language processing
- Cognitive models
- Evaluation and experimental design
- Structured data and hypermedia
- Logical models
Imagine that all the
recorded information in the world had been stored electronically, and that it
was possible to search it. If you were interested in reading your favourite
bit from Alice in Wonderland, or wanted to listen to a part of the
second act of Tosca, or were keen to see the farewell scene from
Casablanca, then all you would have to do would be to find it and
enjoy it. And here, in a nutshell, is the central problem of Information Retrieval
(IR): how do you find it? Do you start reading from the beginning of the text, do
you scan the entire movie, do you skip through the audio track, or is there a tool
that will let you specify approximately what you are looking for and
will take you directly to the item of information sought?
The traditional
response to this problem has been to index the information manually; an expert
(usually a trained librarian) would assign index terms or keywords to individual
items of information. A searcher would then be required to use the words from
this vocabulary of index terms to express their need, or what it is they are
looking for. With this approach of indexing and searching, it makes little
difference whether the items of information are stored electronically or kept in
large physical stores such as libraries. The main advantage in using computers
would be speed. But of course most users do not want to express their request
for information in a controlled vocabulary, nor is the manual indexing of
all the recorded information produced feasible any more. Think of all the
picture and audio archives alone which have been built in the last few decades!
Hence the quest for indexing information automatically and for allowing users to
express their need in the language and media of their choice. In this way
information can be accumulated and stored electronically, and presented so that
a user can find an item of interest without much prior knowledge of the mode of
representation. Ultimately we would want to represent an item in our store by
the information it contains.
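As a rough illustration of what indexing information automatically can mean for text, the following minimal sketch builds an inverted index straight from the words of each item and lets a searcher query in their own words rather than in a controlled vocabulary. It is only a sketch under stated assumptions, not one of the models covered in the module: the function names (`tokenize`, `build_index`, `search`) and the three one-line sample texts are invented for this example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(documents):
    """Map each word to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the sorted ids of documents that contain every query word."""
    words = tokenize(query)
    if not words:
        return []
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return sorted(result)


# Hypothetical sample items: short made-up descriptions, not real excerpts.
documents = {
    "alice": "Alice was beginning to get very tired of sitting by her sister on the bank",
    "tosca": "Tosca sings of love and art in the second act",
    "casablanca": "Rick says farewell at the airport in the final scene",
}

index = build_index(documents)
print(search(index, "second act"))
print(search(index, "farewell scene"))
```

Here no expert assigns index terms: every word that occurs in an item becomes an index term. Exact word matching of this kind is the crudest possible approach, and it says nothing about audio, image or video items.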
In order that we can
build new IR systems that will operate on large-scale data and will make use of
the highly interactive technology that we now have, researchers have proposed
various models for the IR problem. A very simple analogy to a haystack will give
you some idea of the early models that were proposed. Many search problems in IR
can be likened to looking for a needle in a haystack. If you were given the
problem of finding a needle in a haystack you would very quickly find a way to
do it. To begin with you would think about the problem in the following way:
- the thing I am looking for is rare, so a random search is likely to be
useless
- its characteristics are very different from those of the thing I do not want (hay)
- I can exploit this difference if I know more about it: I could use a
magnet or burn down the stack.
This analogy highlights some of the
features of information retrieval: looking for a rare event whose
characteristics are uncertain but can be learnt through interaction and
feedback. It is an interesting thought experiment to imagine how a Martian
ignorant of Earth technology and science might solve this search problem. Users
confronted with large information systems are often in a position similar to
that of a Martian!
There are a number of formal models that have been invented to solve the IR
problem whether following the haystack analogy, the classical library analogy,
or any other. These models will be presented during the course; they are
grounded in computational logic, probability and linguistics. Over the years
such models have been evaluated extensively and thereby have given rise to a
sophisticated experimental methodology which will be an important part of the
course. Finally, in the last decade the subject has been much affected by
new technology, especially the World Wide Web, WAIS, various hypermedia systems,
and extensive networking. Some of the course will be devoted to
exploring the relationships between these recent technologies and IR.