VICKY CHOI |
1. Heterogeneous Approaches to Protein Libraries |
Vicky Choi will be involved in a project which I am heading, along with
Craig Nevill-Manning of Rutgers Computer Science, in collaboration with
the members of the PDB team at Rutgers. |
The Protein Data Bank (PDB), which is administered
by Rutgers and its partners, is the global repository of known protein
structures. For each protein in the PDB, one knows the three-dimensional
coordinates of all the carbon atoms in the protein (and sometimes all the
other atoms as well) and the sequence of the protein. Additional
information is available from other sources: sequences in GenBank
with good BLAST scores, and meta-data from MEDLINE (a database of biomedical journal
abstracts), which might include, e.g., articles which mention the protein
and, most importantly, the article which discusses the structure determination.
The PDB is an important resource for the molecular biology and structural
chemistry communities, but search capabilities are limited and somewhat
slow. Speeding up searches based on standard criteria, or allowing
more general kinds of searches, for example, those based on heterogeneous
data sources, is necessary for the PDB to reach its full potential within
the research community. We are working on a variety of related search
problems within the PDB. Examples include: |
Proximity searching, by geometry, by sequence, by literature.
Clustering proteins by geometry, by sequence, by literature.
Using literature to disambiguate and augment evidence from proximity searches.
Automatic annotation of clusters (derived, e.g., from sequence or structure considerations) using literature. |
Below, we describe some of these problems in greater detail. |
Searching by Three Dimensional Similarity. Searching by
sequence and searching by 3D structure (henceforth abbreviated simply
as sequence versus structure) have different roles, though they may be
used for similar purposes, that is, to get clues to the function of a poorly
characterized protein. Since it is easy to find the sequence of a
protein, but much more difficult (and sometimes next to impossible) to
find the structure of a protein, researchers have quite reasonably focused
on the sequence searching problem. |
However, the mapping from sequence to structure
is highly degenerate, and it is well known that some proteins can have
significant structural similarity without sharing much sequence similarity.
This difference is especially important with evolutionarily distant proteins.
| One of the striking facts about structural similarity
searching is that there is no "gold standard" distance metric which measures
how related protein structures are. In the case of sequences, if
one has the time to perform pairwise alignments, then one gets an excellent
idea of which proteins in a database are related to the query sequence.
Alignments may not give the perfect sequence-based measure of similarity,
but they have been extremely useful in practice. Fast database
searching algorithms such as BLAST seek to approximate the results achieved
by an exhaustive alignment search while saving substantial time. |
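To make the "slow but thorough" baseline concrete, the following is a minimal sketch of an exhaustive search by pairwise global alignment (Needleman-Wunsch). It is an illustration only, not project code: the function names and the match, mismatch and gap scores are assumptions chosen for simplicity, and real protein alignments normally use a substitution matrix such as BLOSUM62 rather than a flat match/mismatch score.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score of the best global alignment of strings a and b (Needleman-Wunsch).

    The scoring parameters are illustrative assumptions, not values from the text.
    """
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def exhaustive_search(query, database):
    """Align the query against every entry and return (score, name) pairs, best first."""
    scored = [(global_alignment_score(query, seq), name) for name, seq in database.items()]
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    db = {"a": "GATTACA", "b": "GATTTCA", "c": "CCCC"}
    for score, name in exhaustive_search("GATTACA", db):
        print(name, score)
```

Each comparison costs time proportional to the product of the two sequence lengths, and the search repeats it for every database entry. This is the cost that heuristic tools such as BLAST are designed to avoid.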
There is no slow-but-true structure-based distance
measure. Many such functions have been suggested, and for some of
them, detailed comparisons have been performed to determine which is more
effective at teasing out distant relationships not apparent at the sequence
level, but no clear winner has emerged. |
These studies are somewhat hampered by the slow
run-times of these structure comparison techniques. Thus, the data
sets on which they are run tend to be small. Ideally, each comparison measure
would have a fast but effective approximate searching algorithm, analogous to
the speedup that BLAST provides for sequence alignments, and then a more comprehensive
set of method comparisons could be easily performed. However, this
is too much to hope for. A huge amount of work went into making BLAST
fast and sensitive, and it is simply unreasonable to repeat this work for
every proposed structure comparison method. |
Thus, we consider the problem of providing a tool
which speeds up proximity searching for an arbitrary metric. This
method, which is under development but has already shown promising results,
performs approximate searches, just as BLAST does, and it is substantially
faster than an exhaustive search, though its relative speedup is not as
great as that of BLAST. Thus, our method, which we call SparseMap, yields
a tradeoff between speed and generality. |
Many open problems remain. In particular,
SparseMap proceeds by embedding the underlying metric into a low-dimensional
Euclidean space, and then using Euclidean methods for the actual similarity
search. Is it possible to use the embedding generated by SparseMap
to generate meaningful clusters of the data? Also, how can heterogeneous
data sources be combined to provide useful composite measures of similarity
between proteins? |
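The general idea of embedding an arbitrary metric into Euclidean space and then searching there can be sketched as follows. This is not SparseMap itself, whose details are not given here. It is a generic landmark (Lipschitz-style) embedding in which each item is mapped to its vector of distances to a few reference items. The function names, the random choice of landmarks and the `seed` parameter are all assumptions made for illustration.

```python
import math
import random


def landmark_embedding(items, dist, num_landmarks, seed=0):
    """Map each item to the vector of its distances to randomly chosen landmarks.

    `dist` may be any metric; it is evaluated len(items) * num_landmarks times,
    rather than once for every pair of items.
    """
    rng = random.Random(seed)
    landmarks = rng.sample(items, num_landmarks)
    return [[dist(x, lm) for lm in landmarks] for x in items]


def euclidean(u, v):
    """Ordinary Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest(query_vec, vectors):
    """Index of the embedded point closest to query_vec."""
    return min(range(len(vectors)), key=lambda i: euclidean(query_vec, vectors[i]))


if __name__ == "__main__":
    # Toy metric: absolute difference on numbers stands in for a costly structure metric.
    items = [0, 1, 2, 10, 11]
    vectors = landmark_embedding(items, lambda x, y: abs(x - y), num_landmarks=2)
    print(vectors)
```

By the triangle inequality, each coordinate of two embedded items differs by at most their original distance, so items that are close under the original metric stay close in the embedding. The converse need not hold, which is one reason such searches are approximate.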
Literature Searching and Protein Similarity. Bioinformatics
has heretofore been mainly concerned with structure and sequence.
We propose that a third data type, literature, is an untapped resource
for automatically explicating biological relationships. The explosion
of biological data has occurred at a time when the Internet is ubiquitous
in the research community. Due to this, and the foresight of institutions
such as the National Center for Biotechnology Information, there is rich
connectivity between the various biological data types: sequence-sequence,
structure-structure, sequence-literature, etc. |
However, research in bioinformatics has overlooked
literature for the purposes of doing computations, perhaps because it was
not intended for computational use. Research in information retrieval
(IR), by contrast, has demonstrated that useful computations can be done on textual data.
For example, it is a surprising but convenient fact that treating a document
as a "bag of words," without any particular word order, still allows documents
to be compared in a useful way in order, for example, to perform information
retrieval. |
Given a pair of documents, we can compute a distance
between them in the following way. Each document is represented as
a vector, where each component is the frequency of a lexicon item in the
document. Each vector is scaled to have length one, and the similarity
of two documents is computed as the inner product of the vectors, that is, the cosine
of the angle between the vectors. A distance can be derived from this
similarity, for example as one minus the cosine. Given such a distance, we can do
things such as clustering and finding representative documents (a small number
of the documents that are far from each other, that is, cluster centroids).
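The bag-of-words comparison just described can be sketched in a few lines. This is a minimal illustration under stated assumptions: the whitespace tokenizer, the lowercasing and the function names are ours, and a practical system would use a proper lexicon, stop-word handling and term weighting.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Count word frequencies, ignoring word order (naive whitespace tokenizer)."""
    return Counter(text.lower().split())


def cosine_similarity(doc_a, doc_b):
    """Cosine of the angle between the word-frequency vectors of two documents."""
    a, b = bag_of_words(doc_a), bag_of_words(doc_b)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def cosine_distance(doc_a, doc_b):
    """One simple way to turn the similarity into a distance: one minus the cosine."""
    return 1.0 - cosine_similarity(doc_a, doc_b)


if __name__ == "__main__":
    print(cosine_similarity("kinase binds atp", "the kinase hydrolyzes atp"))
```

Dividing the inner product by the two vector lengths is equivalent to scaling each vector to length one first, so the result is the cosine of the angle between them.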
| Many such basic IR techniques have not yet been
exploited for protein database searching. The first question, however,
is just how much data there is connecting sequences with literature.
Such connections are explicitly encoded in GenBank. The richness
of these links is often due to the fact that in the life sciences literature,
acceptance of a paper for publication is often contingent on submission
of any sequence described in the paper to GenBank. For example, 79%
of primate sequences in GenBank have links to the literature, and each literature entry is linked
to four sequences on average. In MEDLINE, the literature database,
a maximum of thirty sequence links are recorded, but the links can be inferred
by starting from the sequence database. |
The literature entries associated with structural
entries usually describe the experimental procedure used to infer the structure.
As with sequence data, deposition of the structure with the Protein Data
Bank (PDB) is required for publication. In addition to describing
the structure, the authors often make observations about how the structure
clarifies the mechanism of the protein's enzymatic activity. There
is only one publication associated with the structure determination, but
there are also often other relevant publications included as remarks.
Unfortunately, these are not explicitly indexed with MEDLINE IDs, but it
should be trivial to identify the correct MEDLINE entry. |
We aim to capitalize on the rich links between biological
data types for many purposes, such as to annotate new sequences and to
annotate groups of sequences. Here, we discuss one: to disambiguate
protein comparison. |
Many sequence comparison operations are straightforward.
Where there is high sequence identity between two protein sequences, it
is easy to infer that the proteins have similar function. However,
there are many comparisons where similarity based on sequence alone is
borderline. In such cases, we propose using additional evidence from
literature and structure to disambiguate the relationship between the proteins.
For example, if two sequences are only marginally similar, but the literature
associated with each sequence discusses the same issues, then it may be
possible to increase the confidence in the proteins' relationship. |
An appropriate framework for integrating these
separate sources of evidence is a probabilistic one. We can recast
the similarity of two proteins as the probability that they are related,
i.e., that we can reject the null hypothesis that the sequences are alignable by random
chance. Similarly, it should be possible to cast the literature similarity
as a probability of relatedness. We can then calculate the joint probability
of sequence and literature relationships, and contrast it against the joint
probability of the null hypotheses. Similarly, evidence may be drawn
from structural relationships. Given two sequences drawn at random,
it is statistically unlikely that both have known structures. However,
if one has a structure, the other sequence can be threaded into that structure.
This example illustrates what we hope to be an appropriate broad approach
to integrating protein data. |
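One simple way to contrast the joint probability of relatedness with the joint probability of the null hypotheses is sketched below. It rests on a strong assumption that is ours rather than a result of the project: the evidence sources are treated as independent, so joint probabilities are plain products. The function name and the example values are hypothetical.

```python
def combined_probability(p_related):
    """Combine per-source probabilities of relatedness into a single probability.

    Assumes the sources are independent (an illustrative simplification), so the
    joint probability that every source indicates a true relationship is the
    product of the p values, and the joint probability of the null hypotheses is
    the product of the (1 - p) values. The result is the first product's share
    of the two.
    """
    related = 1.0
    chance = 1.0
    for p in p_related:
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
        related *= p
        chance *= 1.0 - p
    total = related + chance
    if total == 0.0:
        raise ValueError("sources are certain and contradict each other")
    return related / total


if __name__ == "__main__":
    # Hypothetical values: borderline sequence evidence plus supportive literature.
    print(combined_probability([0.6, 0.7]))
```

Under this assumption, two borderline but agreeing sources yield a combined probability higher than either alone, which matches the intuition behind using literature to disambiguate marginal sequence similarity. Real evidence sources are unlikely to be fully independent, so a deployed method would need to model their correlation.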
1.1 Vicky Choi's role in the research project |
Vicky is a very strong algorithmicist. At this point in our project,
we have far more algorithmic problems than we have answers. Vicky
will have a central role in algorithm design, implementation, testing and
deployment. The obvious and most effective route for deploying our tools
is within the PDB itself. I expect Vicky to be a primary liaison with
the PDB team in the Chemistry Department. She will have ample opportunity
to see her work through all four stages. |