How (and why) to find a needle in a haystack
The proper functioning of organisms depends on a complex game of
hide-and-seek conducted inside the cell. Biologists are now beginning to draw on
information theory to develop a better understanding of the rules of this
game
SUPPOSE you really had to find a needle in a haystack. How would
you do it? And how big would the needle have to be for you to be in with a
chance of finding it? These may seem like silly questions. But a version of this
problem occurs at every moment within the cells of organisms. Here, "you" are a
protein molecule with the vital job of switching genes on and off; the
"haystack" is all of the DNA in the cell; and the "needle"
is a particular fragment of DNA, often no longer than five
or six genetic letters (out of, in the case of humans, roughly 3 billion) that
the protein must find before it can do its job.
Although scientists know a lot about
individual genes and what they do, they know much less about the broader
co-ordination of activities within the cell. As a first step, they are eager to
find out how protein molecules conduct their vital search. The cells of any
organism will not work properly unless a few hundred such molecules are able
successfully to find their "binding sites" on the DNA. If
even one were to fail, this might (depending on the gene it is in charge of)
totally disrupt the cell's activities. Other genes would not be switched on at
the right times, and still others might never get switched off.
So how does the protein organise its search?
Again, take the haystack. Assume it is an ideal (ie, not a realistic) haystack,
one in which it is as easy to check the bottom and the middle as it is to check
the sides. In that case one way to search would be to select a spot randomly,
inspect it, and if it is needle-free, to step back, select another spot and so
on. Eventually, you would find the needle, but it would probably have taken you
a lot of time.
A better way would be to look
around for a while in the vicinity of each selected spot. You would then waste
less time and energy stepping to and from the haystack. And if (this would be a
truly unrealistic haystack) the haystack were one-dimensional, so that you could
only move forwards and backwards along it, your chance of success would actually
be quite high, even if the direction of each step was random.
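The difference between the two strategies can be made concrete with a toy simulation. The sketch below is not drawn from any of the experiments described here; it assumes a circular, one-dimensional haystack of `n_sites` positions, and the names `hops_to_find`, `mean_hops` and `slide_steps` are invented for illustration. It counts how many times a searcher must land on the haystack before finding the needle, with and without a random stroll after each landing.

```python
import random


def hops_to_find(n_sites, slide_steps, rng):
    """Count the landings needed to find one needle on a circular 1-D haystack.

    After each landing the searcher takes `slide_steps` random steps,
    forwards or backwards, checking every position it passes through.
    With slide_steps=0 this reduces to pure random spot-checking.
    """
    needle = rng.randrange(n_sites)
    hops = 0
    while True:
        hops += 1
        pos = rng.randrange(n_sites)  # land on a randomly chosen spot
        if pos == needle:
            return hops
        for _ in range(slide_steps):  # look around in the vicinity
            pos = (pos + rng.choice((-1, 1))) % n_sites
            if pos == needle:
                return hops


def mean_hops(n_sites, slide_steps, trials=200, seed=1):
    """Average number of landings over repeated searches (seeded for repeatability)."""
    rng = random.Random(seed)
    total = sum(hops_to_find(n_sites, slide_steps, rng) for _ in range(trials))
    return total / trials


if __name__ == "__main__":
    print("spot-checking only:", mean_hops(500, 0))
    print("with local sliding:", mean_hops(500, 50))
```

Comparing `mean_hops(500, 0)` with `mean_hops(500, 50)` should show that looking around after each landing cuts the number of landings considerably, though the exact figures depend on the assumed haystack size and stroll length.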
Although there is some evidence that protein
molecules do indeed search DNA as if it were a
one-dimensional haystack, the absence of good techniques for observing
individual biological molecules in action has made this hard to verify.
Recently, however, Carlos Bustamante, a biophysicist at the Howard Hughes
Medical Institute at the University of Oregon, in Eugene, and his colleagues
there and at the University of California, Santa Barbara, have found a way to
observe proteins as they search for their binding sites.
To do this, the researchers adapted a device
known as an atomic-force microscope. Like a record player, this has a stylus; if
a molecule is passed underneath its tip, the stylus bobs up and down over the
lumps and bumps of the atoms. A laser is bounced off the tip of the stylus to
amplify the minute ups and downs so that the shape of the molecule can be made
into an image, and even into a moving image.
Using such a device to look at active
biological molecules is difficult, however. The tip of the stylus sticks to the
molecules, and it is hard to look at the molecules when they are in liquid (their
natural state) because it is hard to get them to stay put. Dr Bustamante and his
colleagues got around the first problem by coating the tip with carbon. They got
around the second by putting the molecules of interest (a protein called RNA polymerase and a fragment of DNA) into a
solution and placing them on top of a perfectly flat crystal. The molecules
settle on the crystal and can then be spied on.
RNA polymerase plays a
big part in the process of transcription, the first step in the switching on of
a gene, when the information carried in the DNA is copied
into a molecule called RNA. In order to start this process,
the polymerase must first find a binding site known as a "promoter", a fragment
of DNA in front of the gene.
To see how the polymerase seeks out the
promoter, Dr Bustamante and his colleagues played a cruel trick. The sequence of
DNA they selected did not actually contain a promoter. When
they came to watch the film produced by their special microscope, they saw the
polymerase land on the DNA, and then slide up and down
along it, jostled randomly in either direction by the thermal energy of the
solution. From time to time, it would detach itself, and then settle somewhere
else and start hunting again, alas in vain.
But what does the promoter have to be like for
the polymerase to find it? That, to return briefly to the farmyard, depends on
both the size of the haystack (that is, on the size of the genome, ie, how much
DNA there is to sort through) and on the number of needles
(that is, on how many binding sites for one particular type of protein there
are). For a genome of a given size, a binding site will have to be much more
conspicuous if it is alone than if it is one of many. It will, in other words,
need to contain more information.
In this
context, it is necessary to note, "information" has a special, technical
meaning: it is a measure of a decrease in uncertainty. The sending of a fax
message provides a useful analogy: before a fax reaches its destination there is
maximum uncertainty, which decreases with the arrival of each legible letter.
Although information theory (the analysis of signals and noise) was first
developed in the 1940s, it has only recently been applied to biology. Thomas
Schneider, a biologist at the National Cancer Institute, in Frederick, Maryland,
has been applying it to binding sites in DNA.
DNA is a particularly apt
material for information theorists to sink their teeth into. The molecule
carries a signal encoded in an alphabet of four genetic "bases", A, T, C and G. At any given position along the molecule, one of these letters
must be present. But which?
An information
theorist would answer thus: if each letter is equally likely to occur, the
uncertainty is complete and a searcher does not know which of the letters to
expect. But if the same letter always appears in a given position in the
molecule, there is no uncertainty.
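The two extremes can be put into numbers with Shannon's standard measure of uncertainty, counted in bits. The sketch below is only an illustration; the function name `uncertainty_bits` is invented here, and the probabilities are assumed to be given in the order A, T, C, G.

```python
import math


def uncertainty_bits(probabilities):
    """Shannon uncertainty H = -sum(p * log2(p)), in bits.

    Letters with probability zero contribute nothing to the sum.
    """
    return sum(-p * math.log2(p) for p in probabilities if p > 0)


if __name__ == "__main__":
    # All four letters equally likely: uncertainty is at its maximum.
    print(uncertainty_bits([0.25, 0.25, 0.25, 0.25]))
    # The same letter always appears: no uncertainty at all.
    print(uncertainty_bits([1.0, 0.0, 0.0, 0.0]))
```

With four letters the uncertainty can never exceed two bits, so two bits is also the most information a single position can carry.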
A
researcher trying to find out how much information a protein needs in order to
recognise a binding site amidst all the noise and jostling of a busy cell can
try to answer in two different ways. One way is to predict exactly how much
information the site has to contain in order for a protein to detect it. This
depends only on how frequent the binding sites are in the genome, that is, on
the number of sites relative to the size of the genome. The
second way is to examine the site and discover what it contains.
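The first kind of prediction can be sketched in a few lines. The version below assumes that a site is equally likely to lie anywhere in the genome, so that picking out one of `n_sites` sites among `genome_size` possible positions takes log2(genome_size / n_sites) bits. The function name `required_information_bits` and the example site counts are assumptions made for illustration; only the figure of roughly 3 billion letters comes from the text.

```python
import math


def required_information_bits(genome_size, n_sites):
    """Bits needed to single out n_sites positions among genome_size candidates.

    Assumes every position is an equally likely place for a site.
    """
    if n_sites <= 0 or genome_size < n_sites:
        raise ValueError("need 0 < n_sites <= genome_size")
    return math.log2(genome_size / n_sites)


if __name__ == "__main__":
    human_genome = 3_000_000_000  # roughly 3 billion letters
    for sites in (1, 1_000, 1_000_000):  # hypothetical numbers of sites
        bits = required_information_bits(human_genome, sites)
        print(f"{sites:>9} sites: {bits:.1f} bits")
```

The more sites there are, the fewer bits each one needs, which is the sense in which a lone site must be more conspicuous than one of many.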
This second kind of analysis is done by lining
up lots of known binding sites for a particular protein, comparing them position
by position, and so finding out which letter is most likely to occur at which
position, and how probable it is that a different letter may sometimes crop up
instead. This tells you how much information is present at each position:
variable positions contain little information while constant positions contain
lots. To find out how much information is contained in an average binding site,
just add up the information for each position.
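The procedure just described can be sketched as follows. The four aligned sites are made up for illustration, the function names are invented here, and the sketch leaves out the statistical correction that a real analysis of so few sequences would need. Each position's information is taken as the two-bit maximum minus the uncertainty observed at that position.

```python
import math
from collections import Counter


def position_information(column):
    """Information at one aligned position: 2 bits minus the observed uncertainty."""
    counts = Counter(column)
    n = len(column)
    uncertainty = sum(-(c / n) * math.log2(c / n) for c in counts.values())
    return 2.0 - uncertainty


def site_information(aligned_sites):
    """Per-position information for a list of equal-length, aligned sites."""
    return [position_information(column) for column in zip(*aligned_sites)]


if __name__ == "__main__":
    # Hypothetical binding sites, already lined up position by position.
    sites = ["TATAAT", "TATGAT", "TACAAT", "TATACT"]
    per_position = site_information(sites)
    for index, bits in enumerate(per_position, start=1):
        print(f"position {index}: {bits:.2f} bits")
    print(f"average site: {sum(per_position):.2f} bits")
```

Positions where every site agrees score the full two bits, while positions where the letters vary score less.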
But since some variation is permitted at most
positions within a binding site, the total number of different
possible versions of a site is often large. And the trouble is, neither of the
above approaches tells you much about any particular binding site that a protein
may be looking for. For instance, is the binding site just a rare variant, or
does it contain mutations that have reduced the information so much that the
protein can no longer recognise it?
Recently,
Dr Schneider has addressed this problem, too. Using some mathematical wizardry,
he has taken the average information of a binding site, and worked backwards to
evaluate any individual example. This procedure turns out to have surprising
power. It means that Dr Schneider can tell whether or not the signal from any
particular binding site is too noisy to be useful; in other words, whether it
carries a damaging mutation. It allows him to design novel binding sites that will
contain enough information. And it allows Dr Schneider to pretend he is a
protein looking for a binding site. Using his technique, he can wander along
known lengths of DNA (on his computer, not in solution). In
doing so, he has already discovered several binding sites that no one knew were
there.
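One way to score an individual sequence, and to wander along a stretch of DNA on a computer, is sketched below. It should be read as an illustration rather than as Dr Schneider's own procedure: it assumes each letter at each position is given a weight of 2 + log2(frequency), it adds a small `pseudocount` (an assumption of this sketch) so that letters never seen at a position still get a finite weight, and the example sites, the DNA string and the zero threshold are all made up.

```python
import math

BASES = "ACGT"


def weight_matrix(aligned_sites, pseudocount=0.5):
    """Per-position weights, 2 + log2(frequency), from aligned example sites.

    The pseudocount is an assumption of this sketch, used to avoid log2(0).
    """
    n = len(aligned_sites)
    total = n + len(BASES) * pseudocount
    matrix = []
    for column in zip(*aligned_sites):
        matrix.append({
            base: 2.0 + math.log2((column.count(base) + pseudocount) / total)
            for base in BASES
        })
    return matrix


def individual_information(sequence, matrix):
    """Score one candidate site by adding up the weights of its letters."""
    return sum(weights[base] for base, weights in zip(sequence, matrix))


def scan(dna, matrix, threshold=0.0):
    """Slide along the DNA and report (start, score) for windows above threshold."""
    width = len(matrix)
    hits = []
    for start in range(len(dna) - width + 1):
        score = individual_information(dna[start:start + width], matrix)
        if score > threshold:
            hits.append((start, score))
    return hits


if __name__ == "__main__":
    sites = ["TATAAT", "TATGAT", "TACAAT", "TATACT"]  # hypothetical sites
    matrix = weight_matrix(sites)
    dna = "GCGCTATAATGCGC"  # hypothetical stretch of DNA
    for start, score in scan(dna, matrix):
        print(f"candidate site at offset {start}: {score:.2f} bits")
```

If the pseudocount is set aside, the average score of the example sites themselves equals the summed per-position information of the alignment, which is one sense in which an average can be worked backwards into individual scores. In this sketch a candidate scoring at or below zero is treated as too noisy to count as a site.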
© copyright 1997 The Economist Newspaper Limited. All Rights Reserved
This article was published in The Economist April
5th-11th 1997, British version: p. 105-107, American version: p. 73-75, Asian
version: p. 79-81.
The article is © copyright The Economist, London, April 5th 1997.
Permission was granted to post this article at the National
Cancer Institute (USA).
This page has been slightly modified from the original html to remove or correct broken links.

Schneider Lab.
origin: 1997 October 28
updated: 2002 Feb 13