Program in Mathematics and Molecular Biology


VICKY CHOI

1. Heterogeneous Approaches to Protein Libraries

Vicky Choi will be involved in a project which I am heading, along with Craig Nevill-Manning of Rutgers Computer Science, in collaboration with the members of the PDB team at Rutgers.

The Protein Data Bank (PDB), which is administered by Rutgers and its partners, is the global repository of known protein structures. For each protein in the PDB, one knows the three-dimensional coordinates of all the carbon atoms in the protein (and sometimes all the other atoms as well) and the sequence of the protein. Additional information is available from other sources: sequences in GenBank with good BLAST scores, and meta-data from MEDLINE (a database of biomedical journal abstracts), which might include, e.g., articles which mention the protein and, most importantly, the article which discusses the structure determination. The PDB is an important resource for the molecular biology and structural chemistry communities, but its search capabilities are limited and somewhat slow. Speeding up searches based on standard criteria, or allowing more general kinds of searches, for example, those based on heterogeneous data sources, is necessary for the PDB to reach its full potential within the research community. We are working on a variety of related search problems within the PDB. Examples include:

  • Proximity searching, by geometry, by sequence, by literature.

  • Clustering proteins by geometry, by sequence, by literature.

  • Using literature to disambiguate and augment evidence from proximity searches.

  • Automatic annotation of clusters derived, e.g., by sequence or structure considerations, using literature.

Below, we describe some of these problems in greater detail.

Searching by Three Dimensional Similarity. Searching by sequence and searching by 3D structure (henceforth abbreviated simply as sequence versus structure) have different roles, though they may be used for similar purposes, that is, to get clues to the function of a poorly characterized protein. Since it is easy to find the sequence of a protein, but much more difficult (and sometimes next to impossible) to find the structure of a protein, researchers have quite reasonably focused on the sequence searching problem.

However, the mapping from sequence to structure is quite degenerate, and it is well known that some proteins can have significant structural similarity without sharing much sequence similarity. This difference is especially important with evolutionarily distant proteins.

One of the striking facts about structural similarity searching is that there is no "gold standard" distance metric which measures how related protein structures are. In the case of sequences, if one has the time to perform pairwise alignments, then one gets an excellent idea of which proteins in a database are related to the query sequence. Alignments may not give the perfect sequence-based measure of similarity, but they have been extremely useful in practice. Fast database searching algorithms such as BLAST seek to approximate the results achieved by an exhaustive alignment search while saving substantially on the time.

There is no slow-but-true structure-based distance measure. Many such functions have been suggested, and for some of them, detailed comparisons have been performed to determine which is more effective at teasing out distant relationships not apparent at the sequence level, but no clear winner has emerged.

These studies are somewhat hampered by the slow run-times of these structure comparison techniques. Thus, the data sets on which they are run tend to be small. Ideally, if each comparison measure had a fast but effective approximate searching algorithm, analogous to the speedup that BLAST provides for sequence alignments, then a more comprehensive set of method comparisons could be easily performed. However, this is too much to hope for. A huge amount of work went into making BLAST fast and sensitive, and it is simply unreasonable to repeat this work for every proposed structure comparison method.

Thus, we consider the problem of providing a tool which speeds up proximity searching for an arbitrary metric. This method, which is under development but has already shown promising results, performs approximate searches, just as BLAST does, and it is substantially faster than an exhaustive search, though its relative speedup is not as great as that of BLAST. Thus, our method, which we call SparseMap, yields a tradeoff between speed and generality.

Many open problems remain. In particular, SparseMap proceeds by embedding the underlying metric into a low-dimensional Euclidean space, and then using Euclidean methods for the actual similarity search. Is it possible to use the embedding generated by SparseMap to generate meaningful clusters of the data? Also, how can heterogeneous data sources be combined to provide useful composite measures of similarity between proteins?
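To make the idea of embedding an arbitrary metric into a low-dimensional Euclidean space concrete, here is a minimal sketch. It uses a generic reference-set embedding (sometimes called a Lipschitz embedding), chosen purely for illustration; it is not claimed to be SparseMap's actual construction. The function names, the number of coordinates, the reference-set size, and the toy distance are all assumptions.

```python
import random


def reference_set_embedding(items, distance, num_coords=4, set_size=2, seed=0):
    """Map each item to a vector; coordinate j is the distance from the item
    to the nearest member of the j-th randomly chosen reference set.
    (Illustrative only; parameters are hypothetical.)"""
    rng = random.Random(seed)
    ref_sets = [rng.sample(items, set_size) for _ in range(num_coords)]
    return [[min(distance(x, r) for r in refs) for refs in ref_sets]
            for x in items]


def euclidean(u, v):
    """Ordinary Euclidean distance between two embedded vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


def nearest_neighbors(query, vectors, k=3):
    """Indices of the k items closest to item `query` in the embedded space."""
    candidates = (i for i in range(len(vectors)) if i != query)
    ranked = sorted(candidates, key=lambda i: euclidean(vectors[query], vectors[i]))
    return ranked[:k]


def toy_distance(a, b):
    """Stand-in for an expensive structure-comparison metric."""
    return abs(a - b)


points = [0.0, 1.0, 1.5, 7.0, 7.2, 20.0]
vectors = reference_set_embedding(points, toy_distance)
print(nearest_neighbors(0, vectors, k=2))
```

In this sketch the expensive metric is evaluated only against the reference sets when an item is embedded; the similarity search itself then runs on the cheap Euclidean vectors. For any true metric, each coordinate changes by at most the original distance between two items, which is one reason such embeddings tend to preserve neighborhoods approximately.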

Literature Searching and Protein Similarity. Bioinformatics has heretofore been mainly concerned with structure and sequence. We propose that a third data type, literature, is an untapped resource for automatically explicating biological relationships. The explosion of biological data has occurred at a time when the Internet is ubiquitous in the research community. Due to this, and the foresight of institutions such as the National Center for Biotechnology Information, there is rich connectivity between the various biological data types: sequence-sequence, structure-structure, sequence-literature, etc.

However, research in bioinformatics has overlooked literature for the purposes of doing computations, perhaps because it was not intended for computational use. Research in information retrieval (IR) has demonstrated that useful computations can be done on textual data. For instance, it is a surprising but convenient fact that treating a document as a "bag of words," without any particular word order, still allows documents to be compared in a useful way, for example to perform information retrieval.

Given a pair of documents, we can compute a distance between them in the following way. Each document is represented as a vector, where each component is the frequency of a lexicon item in the document. Each vector is scaled to have length one, and the similarity of two documents is computed as the inner product of the vectors, that is, the cosine of the angle between the vectors. Given such a distance, we can do things such as clustering and finding representative documents: a small number of the documents that are far from each other, that is, cluster centroids.
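The document comparison just described can be sketched in a few lines. This is a minimal illustration, assuming a simple lowercase alphanumeric tokenizer and raw term counts as the frequencies; the function names and the example abstracts are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Represent a document by its term frequencies, ignoring word order.
    (The tokenizer is an assumption: lowercase alphanumeric runs.)"""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Inner product of the two frequency vectors after each is scaled to
    length one, i.e. the cosine of the angle between them."""
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    dot = sum(count * b[term] for term, count in a.items())
    return dot / (norm_a * norm_b)


# Hypothetical abstracts, for illustration only.
doc1 = bag_of_words("crystal structure of a serine protease")
doc2 = bag_of_words("serine protease crystal structure")
doc3 = bag_of_words("zinc finger binding domain")

print(cosine_similarity(doc1, doc2))  # high: most words are shared
print(cosine_similarity(doc1, doc3))  # 0.0: no words in common
```

A similarity of 1 means the two documents have identical word proportions, and 0 means they share no words; one common way to turn this into a distance is to use one minus the similarity.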

Many such basic IR techniques have not yet been exploited for protein database searching. The first question, however, is just how much data there is connecting sequences with literature. Such connections are explicitly encoded in GenBank. The richness of these links is often due to the fact that in the life sciences literature, acceptance of a paper for publication is often contingent on submission of any sequence described in the paper to GenBank. For example, 79% of primate sequences in GenBank have links to the literature, and each literature entry is linked to four sequences on average. In MEDLINE, the literature database, a maximum of thirty sequence links are recorded, but the links can be inferred by starting from the sequence database.

The literature entries associated with structural entries usually describe the experimental procedure used to infer the structure. As with sequence data, deposition of the structure with the Protein Data Bank (PDB) is required for publication. In addition to describing the structure, the authors often make observations about how the structure clarifies the mechanism of the protein's enzymatic activity. There is only one publication associated with the structure determination, but there are also often other relevant publications included as remarks. Unfortunately, these are not explicitly indexed with MEDLINE IDs, but it should be trivial to identify the correct MEDLINE entry.

We aim to capitalize on the rich links between biological data types for many purposes, such as to annotate new sequences and to annotate groups of sequences. Here, we discuss one: to disambiguate protein comparison.

Many sequence comparison operations are straightforward. Where there is high sequence identity between two protein sequences, it is easy to infer that the proteins have similar function. However, there are many comparisons where similarity based on sequence alone is borderline. In such cases, we propose using additional evidence from literature and structure to disambiguate the relationship between the proteins. For example, if two sequences are only marginally similar, but the literature associated with each sequence discusses the same issues, then it may be possible to increase the confidence in the proteins' relationship.

An appropriate framework for integrating these separate sources of evidence is a probabilistic one. We can recast the similarity of two proteins as the probability that they are related, i.e., as the confidence with which we can reject the null hypothesis that the sequences are alignable by random chance. Similarly, it should be possible to cast the literature similarity as a probability of relatedness. We can then calculate the joint probability of sequence and structure relationships, and contrast it against the joint probability of the null hypotheses. Similarly, evidence may be drawn from structural relationships. Given two sequences drawn at random, it is statistically unlikely that both have known structures. However, if one has a structure, the other sequence can be threaded into that structure. This example illustrates what we hope to be an appropriate broad approach to integrating protein data.
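One standard way to combine separate significance values is Fisher's method, sketched below. This is a swapped-in textbook technique, not a method the project is stated to use, and it assumes the individual tests are independent, which may not hold for sequence and literature evidence about the same pair of proteins. The function name and the example p-values are hypothetical.

```python
import math


def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the joint null hypothesis. For even
    degrees of freedom the tail probability has the closed form
    exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!.
    """
    if not p_values or any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    half_x = -sum(math.log(p) for p in p_values)
    term, total = 1.0, 1.0
    for i in range(1, len(p_values)):
        term *= half_x / i  # builds (X/2)^i / i! incrementally
        total += term
    return math.exp(-half_x) * total


# Hypothetical values: two borderline results, one from sequence
# alignment and one from literature similarity.
p_sequence = 0.05
p_literature = 0.05
print(fisher_combined_p([p_sequence, p_literature]))
```

Under the independence assumption, two individually borderline results yield a smaller combined p-value than either alone, which is the kind of increase in confidence described above.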

1.1 Vicky Choi's role in the research project

Vicky is a very strong algorithmicist. At this point in our project, we have far more algorithmic problems than we have answers. Vicky will have a central role in algorithm design, implementation, testing and deployment. The obvious and most effective route for deploying our tools is within the PDB itself. I expect Vicky to be a primary liaison with the PDB team in the Chemistry Department. She will have ample opportunity to see her work through all four stages.

 