2. Similarity: basic concepts


The concept of similarity is deeply rooted in human language and psychology and it is a complex and ambiguous notion. This is in great part true for the similarity of chemical and biological structures even if the structures are represented in a formal mathematical way.

There are two basic concepts behind the practical computation of similarity of sequences and structures . One of them is the mathematical concept of metrics or distances functions (this is related to similarity scores and edit distances). The other is the concept of motifs which is a more biological or chemical notion. Closely similar sequences can be compared in terms of distance measures (global similarity) while those sharing only a few common domains are better be compared in terms of common motifs.

A trivial example. Let's say we have a number of cars of different types, several pieces of each of them and we want to make a small computer program that will tell if they are similar or different. We can randomly choose almost any physical parameter (weight, size etc.) and use it without problems. for example, we can say that cars that have the same weight (i.e. have a weight difference below a certain threshold) belong to the same group. In most cases this will work perfectly. What happens if we now take another class of objects, say motorbikes, and try to compare them with cars? Simple physical concepts may not be quite useful. For example some motorbikes have larger engines than some cars. However there are important similarities too: Motorbikes and cars both have engines and wheels, i.e. they have identical substructures. For example, the statement "Cars and bikes are similar because they both have wheels" makes sense. In addition, we can make numerical comparisons between wheels.

In comparing proteins, we face these two very problems. Members of the same protein family are like cars: they all have the same components. Members of a superfamily only have a number of common domains. The solutions we use are also similar. If we compare homologous proteins (that are evolutionarily related and perform the same function, we can use a numerical similarity measure to compare and classify them. We can do the same with protein domains (i.e. we can compare and classify wheels among themselves quite well). However, classifying widely divergent proteins that have different types of domains may not be as straightforward. This situation is also reminescent on comparing languages: Closely related languages (like members of the indoeuropean family) can be easily compared and their similarities/distances will be found the same using almost whatever measure. Distantly related languages however can not be so easily classified; depending on what measure we take we can find Japanese closer to the Finns or to the Chinese.

The takehome message is that we first use some handles (features, motifs) to determine if two objects belong to the same equivalence class, and if they do then we go ahead with numeric comparison. The features we use very often are domains which are identified from local similarities. Once identified, we can subject the domains of the same type to global comparison (e.g. phylogenetic analysis) as we often do with entire sequences.



2.1. Distances, similarity scores.


The basic concept is a simple geometric distance of two points, a and b, often referred to as a Euclidean distance. Their distance in the plane is:


This distance is positive and has the trivial properties that

1. Distance is positive D(a,b)l0, distance from oneself is zero, Daa=0.

2. Distance is the same in both directions, Dab= Dba

3. Triangular inequality, one side of a triangle is smaller than the sum of the two others, i.e Dab+ DbclDac

The concept of distance was generalized into the concept of metrics which is defined by these three properties, plus

4. D preserves the topology of the space. Here "space" means a collection of points or objects between which we want to calculate D.

What does generalization mean?

- One way to generalize the mathematical formula of the distance is to allow for more dimensions. In fact one can calculate a distance in space, with 3 dimensions, but one can calculate a distance between any two vectors with much more parameters (i.e. in an "n-dimensional space). This is a trivial but important fact since in chemistry and biology one frequently characterizes an object (a compound, for instance) with an arbitrary number of parameters. E.g. the amino acid composition of a protein is a vector of 20 parameters and proteins can be very conveniently compared by the distances of their amino acid composition vectors.

- One may want to weight the parameters in the distance function. Staying with the example of amino acid compositions, one may want to give a higher weight to some "important" amino acids. In the example of the Euclidean distance (equation 1) this would mean that the coordinate y (amino acid y) will be multiplied by a factor greater than 1, because, for some reason, we consider it more important than coordinate x (amino aci d x).

- The distance in the plane uses square and squareroot functions. Even this can be made general, as suggested by the famous physicist Minkovwski (and later used by Einstein in his relativity theory)


The Euclidean distance is Minkowski distance with n=2. There are a great many distance functions used in chemistry and biology, and most of them are related to the Minkowski distances. One interesting distance is the so called city-block or Manhattan distance with n=1. This corresponds to the distance between points on a map of a city built in a checker-board like manner:


This distance is the sum of parametric differences. All of these can naturally be weighted (if we "prefer" some of the parameters for some reasons).



Since all structures can be transformed into vectors by listing some of their numeric properties in a given serial order, one can easily calculate distances between them. Even a long string of characters can be turned into a vector of its character composition. In addition, we can calculate the composition in terms of two-letter words, e.g. one can talk about the dinucleoitde composition of DNA. Here we also have to mention the problem of alphabet size,.since the number of parameters involved will depend both on the wordsize (dimers, trimers tetramers) and on the alphabet size. In the case of DNA we also have to think about strand symmetry (AC=GT)

Wordsize Protein DNA DNA with ds symetry
1 20 4 2
2 400 16 10
3 8000 64 32
4 160000 256 136

One thing is noteworthy: If we have too many parameters, oftentimes the sequence we want to describe will not contain all the words (e.g. it is quite unlikely that of the 160000 possible tetrapeptides many will be missing in an actual protein, or even from the current database of known sequences). In other terms, many of the vector coordinates may become zeroes if we choose too many parameters. Simple amino acid vectors composition can be used quite well for classifying proteins (Cornish-Bowden et al). ds -Trinucleotide distances were used on the other hand to classify genes (Karle et al). One can conclude that - like in every situation - there is an optimal level of complexity for every problem.



2.2. String distances


One can construct distance functions between character strings (i.e. protein and DNA sequences) that - unlike character composition - take the sequence of the characters into consideration. The simples of these is the edit distance, the number of insertions, deletions and replacements that can be used to turn one word into the other. The best way to do this is to "align" the two character strings on top of each other, like:

. . | |


The solid line marks identities while the point marks the non-identities which correspond to replacements. If the words are more complex, it is easier to see that there are more ways to align them, and there is a great deaf possibilities to include insertions/deletions (sometimes these are called gaps or indels):


L A R R Y B I R D     or     L A R R Y B I R D
. . |           |              . . |         |
W O R ----------D              W O R --------D

An edit distance refers to the best of these alignments which is created by the minimum number of replacement and insertion/deletion operations. We have to make arbitrary decisions how we define the distnace since a string can have multiple copies of a word:

      . . | |           | | | |
      W O R D           W O R D

It is a good idea to use some kind of a numeric quality index for the edit distances, for example we can use these in computer programs that look for the best alignment. We can do this very easily, we have to assign some numeric cost value to insertion/deletion and replacement. There is a problem, however. We have to assign more or less arbitrary cost parameters both for gaps and for replacements, so there is no guarantee that we have an optimal solution.

The replacement cost parameters for amino acids are the well known Dayhoff matrices, BLOSUM matrices that will be mentioned later on in this course. These rely on a statistical evaluation of a large number of alignment.

The gap parameters are even more arbitrary - they are different from program to program. One can also define gaps in a length dependent manner, i.e. the cost of introducing two separate gaps may be different than in the cost of introducing two consecutive gaps.

One important feature of biological sequence analysis is that here we use similarity scores rather than distances. The similarity score S calculated between two sequences are adversely proportional to distance, i.e. it will be maximal for identical sequences (and not zero, like a distance function). In mathematical terms S~1/D or S~1/D.

The simplest form of a similarity score looks like this:





Which means, for a given alignment we sum the values of replacements (in this respect identities are a special kind of replacement by the same character) and we subtract a penalty for gaps. The negative sign for this penalty expresses the common sense that a gap is a bad thing to have, i.e. the more the gaps, the worse is the "quality" of the alignment.

The collection of replacement weights is called a replacement matrix, such as the Dayhoff matrix for amino acid replacements to be discussed later in this course. We mention that all kind of heuristic weighting schemes can be used, for example the weight of some structurally conserved amino acids (e.g. cysteines, tryptophans) can be set higher in order to stress their importance in structures. In this way one gets a modified score value that stresses structural similarities in a rudimentary way.



2.3. Approximate string matching; finding optimal alignments


Similarity scores and distances are the main tools for finding optimal alignments between two biological sequences. There are algorithms that are based on maximizing similarity and others that minimize distance. An "exhaustive search" would then take all possible alignments, determine the similarity score for them and then take the one with the highest value.


One trivial problem of the alignments is that most of the mathematically possible alignments make no sense, e.g.

T H E - - - - B I R D I S T H E W O R D
      W O R D                          


These nonsense alignments should be eliminated since they only consume computer time. Many of the improvements that are possible are based either on intuitive tricks, or on a statistical knowledge on the problem. This latter is is especially important in biological applications since biological sequences are not random, so one can make good probability estimates how frequently certain strings are expected to occur.

The currently used algorithms have been developed over many years. They are used in commercial applications such as spell checkers in word processors, dictionaries etc. They were discovered and further developed almost independently in several application fields, including biology, where they constitute the basis of database searching. (Database searching - as will be discussed during this course - simply means to calculate a similarity score between a query sequence and all items in a database, and then list the best scoring entries of the database). As this is the central topic of this course and the programs used will be presented in later lectures, here we only show a brief comparison of the algorithms in terms of their running time requirements.

Comparison of algorithms for comparing protein and DNA sequences
Algorithm      Value       Scoring       Gap         Time       Ref.
             Calculated    matrix       penalty  growth rate
Needleman-     global     arbitrary   penalty/gap    O(l2)   Needleman & 
Wunsch      similarity                    q                 Wunsch, 1970

Sellers       (global)     unity    penalty/residue  O(l2)   Sellers, 1974
              distance                    rk

Smith-         local      Sij<0.0      affine       O(l2)    Smith &
Waterman     similariry                 q + rk               Waterman, 1981
                                                             Gotoh, 1982

FASTA     approx. local   Sij<0.0    limited gap    O(l2)/K  Lipman &
            similarity                   size                Pearson, 1985
                                                             Pearson &
                                                             Lipman, 1988

BLASTP       maximum      Sij<0.0    (-) multiple   O(l2)/K  Altschul et al.,
          segment score                segments              1990

CLUSTAL   multiple         arbitrary  versatile     n2O(l2)  Higgins et al
          alignment               (secondary structure       1994
          (global)                    dependent)


The time requirement, O(l2) refers to the length of the input sequence, and can be better written as O(l*D), where l is the length of the input query sequence and D is the length of the database (i.e. the other input). It is conspicuous that the time growth rate couldnot be improved. The drammatic speed differences between the programs all result from shortening the basic running time of the algorithm (we can picture this as a constant of propotionality which is very small for some of the programs, especially FASTA and BLAST). This can be achieved either by clever programming, or, more importantly, by preprocessing of the input (the query and the database) into a fast-to-read format, like a hash table. However, the most important "trick" is that the fast algorithms are not exhaustive but use statistical considerations to pre-screen the alignments. This takes care of the problem of testing many non-sense alignments. And since biological sequences have no "real extremes" among themselves, statistics can be considered quire reliable.

Alphabet size matters only the amount of space required for the algorithms, it does not change the order of the time (though it obviously changes the constant). The alphabet size effect is sequence length independent, and can be made modest on modern computers.

Many multiple alingment programs are based on pairwise alignments so their theorethical growth rate is O(l2). However, for n sequences there are n(n-1)/2 pairwise alignments, so the CPU time grows quite steeply with the number of sequences.



2.4 Motifs as elements of similarity


Until now we talked about a numerical description of similarity: the similarity score or distance function.

Now we switch to another way of describing similarity. Previously we mentioned that all chemical structures including protein and DNA sequences can be considered as assemblies of substructures with relationships between them. Continuing this line of argument we can tentatively say that two structures can be called similar if they have common substructures. Of course not any substructures, for example we would not consider two protein sequences similar if both of them contain alanine. But we certainly would start thinking about similarity if they have a common domain, say an EGF (epidermal growth factor like domain). In facts, domains or modules of multidomain proteins were the origin of defining sequence motifs. One can tentatively say that two protein sequences are distantly related if they share a local homology domain, especially if the same type of homology has been found among other proteins. In other terms here we speak about local similarities that may not include the entire sequence, only parts of those. A trivial metaphor (mentioned earlier) is the similarity of the car and the bicycle: they both have wheels, so they are in a way similar. The wheels can be compared among themselves in terms of numerical indices like weight, size etc., but once we decided that the wheel is there (local similarity region identified), we actually do not care for the actual differences that exist between bicycle wheels and car wheels. On the other hand, if we tried to cluster objects according their weight and size (i.e. would perform a global similarity search), then cars and bicycles would never end up in the same cluster.

This is the reason why we have to deal with local similarities separately: These similarities can not be always simply detected by calculating a "over-all similarity score" for an entire sequence. Namely, the contribution of a local similarity to the over-all similarity score can be quite small, and can be easily blurred by the spurious similarities randomly found between sequences. For this reason, detection of local similarities developed into a separate art with its own tools. The most important achievement of this development is our knowledge on protein modules or domains which are small, structurally conserved protein segments that are found in various protein families.



When we speak about motifs, domain etc., we speak about two different kind of problems:

i) One is to find out if a sequence contains any of the known domain type. This can be best done by comparing the sequence with one or several of the existing domain collections. Most current domain collections contain a consensus representation of certain domain types. These consensus representations can be given in terms of a consensus sequence, a regular expression, a hidden Markov model, a "profile" etc. All these are common descriptions of a group of sequences. For example, for the EGF domain we have over one thousand known examples, and we can develop one common mathematical description for these. We can use this "consensus representation" to look for new EGF modules, which means one single comparison for the whole group. Naturally, somebody has to develop and maintain these consensus representations, which is not an easy task.

ii) The other, quite different task is to find out if a new sequence contains any new kind of domain. In this case we do a database search, try to find a local similarity region, and then, using our biological insights, try to establish if the newly found similarity region meets the criteria of our biological knowledge on modules (e.g. it is found in well defined group of protein families, one can build "good consensus representation" which will fish out only members of the similarity group etc.) In other terms, we have to ascertain the bological significance of the local similarity.

Summarizing, we accept a local similarity as important either if i) it belongs to a well established group of sequences (known protein modules) or if ii) we ourselves can prove that this similarity is biologically significant e.g. it defines a new type of protein module. As one can see, this apect of sequence similarity is not strictly mathematical but involves a great deal of biological knowledge.

Finally we mention that BLAST, the most frequently used database search program returns a collection of individual local similarities, which - as we will show later during this course - can be easily checked for domain similarities.