Information Retrieval

Information Retrieval is one of those disciplines that is pursued by a number of communities: it is an important subject within both computing science and information science. A number of fields come together in its pursuit: logic, probability, linguistics, artificial intelligence and cognitive science. They all converge for the purpose of designing and building large-scale systems that will store, manipulate, retrieve, and display electronic information of any kind. It is thus concerned with objects that may be made up of text, audio, image, video, and graphics, stored in such a way that they are available for interaction by man, woman, or machine. Its study brings together theory, experiment and practice: theory to specify models for retrieval, experiment to test and evaluate the performance of systems, and practice to take note of, and to consider, existing manual or automatic retrieval systems.



Topics in this module

The Information Retrieval module consists of the following topics:


The Problem

Imagine that all the recorded information in the world had been stored electronically, and that it was possible to search it. If you were interested in reading your favourite bit from Alice in Wonderland, or wanted to listen to a part from the second act of Tosca, or were keen to see the farewell scene from Casablanca, then all you would have to do would be to find it, and enjoy it. And here, in a nutshell, is the central problem of Information Retrieval (IR): how to find it. Do you start reading from the beginning of the text, do you scan the entire movie, do you skip down the audio track, or is there a tool that will enable you to specify approximately what you are looking for and that will take you directly to the item of information sought?


The Solution

The traditional response to this problem has been to index the information manually: an expert (usually a trained librarian) would assign index terms or keywords to individual items of information. A searcher would then be required to use the words from this vocabulary of index terms to express their need, that is, what they are looking for. When indexing and searching are done this way, it makes little difference whether the items of information are stored electronically or kept in large physical stores such as libraries. The main advantage of using computers would be speed. But of course most users do not want to express their request for information in a controlled vocabulary, nor is the manual indexing of all the recorded information being produced feasible any more. Think of all the picture and audio archives alone which have been built in the last few decades! Hence the quest for indexing information automatically and for allowing users to express their need in the language and media of their choice. In this way information can be accumulated and stored electronically, and presented so that a user can find an item of interest without much prior knowledge of the mode of representation. Ultimately we would want to represent an item in our store by the information it contains.
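The manual approach described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the catalogue, its index terms, and the function names are hypothetical and are not taken from any particular library system. It shows an index built from manually assigned terms, and what happens when a searcher uses words outside the controlled vocabulary.

```python
from collections import defaultdict

# Hypothetical catalogue: each item maps to the index terms a human
# indexer assigned to it. The terms are invented for illustration.
MANUAL_INDEX = {
    "Alice in Wonderland": {"fiction", "children", "fantasy"},
    "Tosca": {"opera", "music", "tragedy"},
    "Casablanca": {"film", "romance", "war"},
}


def build_inverted_index(catalogue):
    """Map each index term to the set of items it was assigned to."""
    inverted = defaultdict(set)
    for item, terms in catalogue.items():
        for term in terms:
            inverted[term].add(item)
    return dict(inverted)


def search(inverted, query_terms):
    """Return the items indexed under every query term (Boolean AND)."""
    postings = [inverted.get(term, set()) for term in query_terms]
    if not postings:
        return set()
    return set.intersection(*postings)


INVERTED = build_inverted_index(MANUAL_INDEX)

# A query phrased in the controlled vocabulary finds the item.
print(sorted(search(INVERTED, ["opera"])))

# A query in the user's own words matches no index term, so nothing is found.
print(sorted(search(INVERTED, ["farewell", "scene"])))
```

The second query returns nothing even though the store contains a farewell scene, which is the limitation of a controlled vocabulary that motivates automatic indexing.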


Information Retrieval Systems

So that we can build new IR systems that operate on large-scale data and make use of the highly interactive technology we now have, researchers have proposed various models for the IR problem. A very simple analogy to a haystack will give you some idea of the early models that were proposed. Many search problems in IR can be likened to looking for a needle in a haystack. If you were given the problem of finding a needle in a haystack you would very quickly find a way to do it. To begin with you would think about the problem in the following way. This analogy highlights some of the features of information retrieval: looking for a rare event whose characteristics are uncertain but can be learnt through interaction and feedback. It is an interesting thought experiment to imagine how a Martian ignorant of Earth technology and science might solve this search problem. Users confronted with large information systems are often in a position similar to that of a Martian!

A number of formal models have been invented to solve the IR problem, whether they follow the haystack analogy, the classical library analogy, or any other. These models will be presented during the course; they are grounded in computational logic, probability and linguistics. Over the years such models have been evaluated extensively, and this has given rise to a sophisticated experimental methodology which will be an important part of the course. Finally, in the last decade the subject has been much affected by technology, especially the World Wide Web, WAIS, various hypermedia systems, extensive networking, and so forth. Some of the course will be devoted to exploring the relationships between these recent technologies and IR.

