Methods
The data presented in this site is based on extensive pre-computation.
The foundation for the classification of proteins is a calculation of
'all-against-all' similarity for the set of proteins. The similarity measure is based
on the Expect value ("E-score") of the BLAST®
algorithm. This similary measure induces a
weighted directed graph which has the protein sequences as its vertices.
Clustering Algorithm
The clustering algorithm is implemented in two phases. In the first
phase, edges representing low similarity are removed from the graph. The
second phase is iterative in nature, and involves merging clusters
based on their mutual distances. At first, each protein constitutes a
cluster of its own (a "leaf"), and in each clustering step two clusters which
have the highest similarity (smallest distance) are merged into one "parent" cluster. The way
in which similarity of clusters is defined is the differentiating
factor for the clustering algorithms.
A new advanced clustering algorithm was implemented from 2008 for ProtoNet. For more details see:
Yaniv Loewenstein, Elon Portugaly, Menachem Fromer, and Michal Linial.
Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space.
Bioinformatics, 24(13):i41-49, 2008.
The clustering source code is available here:
MC-UPGMA source code
Multiple Clusterings
ProtoNet supports three different clustering methods, each with its
own way of defining similarity between clusters:
-
- Harmonic classification uses the harmonic average of
protein distances.
- Geometric classification uses the geometric average of
protein distances.
- Arithmetic classification uses the arithmetic average of
protein distances.
The similarity between clusters is used for the crucial step: the repeated merging of clusters.
Hanging of TrEMBL proteins (applied in versions prior to SWP 41.21 and Tr24.8)
The clustering algorithm described above was only run on the Swiss-Prot proteins of UniProt. The
TrEMBL proteins, a computer-annotated supplement of Swiss-Prot that contains all the translations
of EMBL nucleotide sequence entries not yet integrated in Swiss-Prot, were not used to build
the "skeleton" ProtoNet tree because they have not yet been manually curated.
The TrEMBL proteins were integrated (starting from Swiss-Prot 41.21) into the "skeleton" ProtoNet
tree, built only from Swiss-Prot, in the following way:
Each such TrEMBL protein was integrated into the ProtoNet tree independently and without
consideration of previously added TrEMBL proteins, so that the order of their addition would not
affect the process. For each TrEMBL protein, the most "similar" Swiss-Prot protein is found (using
the similarity measure based on the BLAST algorithm, as described above). The TrEMBL protein is
initally attached ("hung") to the leaf of this Swiss-Prot protein in the ProtoNet tree. Then, the TrEMBL
protein iteratively ascends in the path from this leaf to the root, as long as its addition to the parent
cluster (of its cluster in the previous iteration) improves the merging score of the parent cluster. For
example, if Arithmetic classification is being used, then this condition requires that the arithmetic
average of protein E-score distances in the cluster will decrease by the addition of this TrEMBL
protein; i.e. the proteins in the clusters will be, on average, more similar with this protein present than
without it. The protein is then "hung" in the ProtoNet forest at the location of its final iteration.
In versions ProtoNet 5.1 and ProtoNet 6.0, pre-calculated all vs all BLAST results of UniRef50
are computed. Thus, the hanging of TrEMBL protein protocol is not applicable.
Forest Condensation
The clustering process outlined above results in a group of «trees»
of clusters (a "forest"). Following this process,
the trees in the forest are "trimmed" by automatic parametric removal
of clusters, so as to reduce
the complexity of the trees.
For example, the "simplified mode" of this Web site uses a ProtoLevel-based criterion: each
clusters lifetime is defined as
the difference in its parents and its own ProtoLevel.
Only clusters with a large enough lifetime remain after the condensation.
On the other hand, in advanced mode, the user may choose from various condensation
parameters and
their respective thresholds, yielding
different forests of «trees».
Announcements
Related publications
- N. Kaplan, O. Sasson, U. Inbar, M. Friedlich, M. Fromer,
H. Fleischer, E. Portugaly, N. Linial, M. Linial
ProtoNet 4.0: A hierarchical classification of one million protein sequences.
Nucleic Acids Research, 2005, Vol. 33, Database issue
PDF file [222KB]
- N. Kaplan, M. Friedlich, M. Fromer, M. Linial
A functional hierarchical organization of the protein sequence space.
BMC Bioinformatics 2004, 5:196
PDF file [281KB], Article URL
-
O. Sasson, H. Fleischer, E. Portugaly, Y. Bilu, N. Linial and M. Linial.
ProtoNet: Navigating the Hierarchical Clustering of the Protein Space.
PostScript file [800KB]
-
Sasson O, Linial N, Linial M.
The metric space of proteins-comparative study of clustering algorithms.
Bioinformatics. 2002 Jul; 18 Suppl 1:S14-21.
Accepted to ISMB '02. [submitted]
PostScript file [207KB], PDF file [396KB]
|