Biological significance, mathematical significance

    Similarity groups: patterns, motifs, signatures

      A group of sequences sharing a particular alignment motif is called a "similarity group". The regions of similarity usually include only a part of a sequence. By similarity group we mean a collection of these similarity regions, as schematically shown in the figure below. Collections of similarity groups can be compiled manually, as in the SBASE protein domain sequence library, or automatically, as in PRODOM.

      Sequences of a similarity group can be subjected to multiple alignment in order to find all conserved residues, i.e. the complete pattern. These can be represented as regular expressions or in dedicated mathematical forms.

      The (complete) pattern can then be used to scan the database and to locate all proteins that contain it. For this purpose we need a quantitative measure. There are various measures; for example, yes-or-no measures are used with simple pattern descriptions (see below). More refined methods use a similarity score of the same kind as those described for alignments. Such a similarity score can be subjected to statistics, so one can describe quantitatively whether a similarity group differs from the rest of the database. Here we can use Student's t-test to quantify the difference. This is very useful when one builds a pattern and wants to test whether an improved pattern is better than the previous version. For simple applications it is sufficient to set a threshold value in advance: if comparing a pattern with a sequence produces a similarity score above the threshold, the sequence is said to contain the pattern.
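      The separation measure and the threshold rule can be sketched in a few lines of Python. This is only an illustration: the scores are invented, the function names student_t and contains_pattern are hypothetical, and the pooled-variance form of the two-sample t statistic is assumed.

```python
from math import sqrt
from statistics import mean, variance

def student_t(group_scores, background_scores):
    """Two-sample Student t statistic with pooled variance (assumed form)."""
    n1, n2 = len(group_scores), len(background_scores)
    m1, m2 = mean(group_scores), mean(background_scores)
    v1, v2 = variance(group_scores), variance(background_scores)
    pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    return (m1 - m2) / sqrt(pooled * (1 / n1 + 1 / n2))

def contains_pattern(score, threshold):
    """Yes-or-no decision: the score must lie above the preset threshold."""
    return score > threshold

# Invented similarity scores, for illustration only.
group = [52, 48, 50]            # sequences of the similarity group
background = [10, 12, 8, 10]    # rest of the database

t_value = student_t(group, background)
print(f"t = {t_value:.2f}")
print(contains_pattern(48, threshold=30))
```

      When two versions of a pattern are scored against the same sequences, the version giving the larger t value separates the group from the rest of the database more clearly.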

      A pattern is "diagnostic" if it can be used to locate all sequences from which it was derived. In less fortunate cases there will be a number of sequences that contain the pattern even though they do not (or, for biological reasons, cannot) belong to the original group. These are called false positives. Conversely, true members of the group that are missed by the pattern are called false negatives. The distribution of the scores gives these terms a very clear meaning, and also explains what the statistical significance of a pattern means. (As in the previous example, the t value is a measure of the separation of the "positive" and "negative" groups.)
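      The bookkeeping behind these terms can be sketched as follows. The sequence names, scores and threshold are invented, and evaluate_pattern is a hypothetical helper, not part of any existing package.

```python
def evaluate_pattern(scores, members, threshold):
    """Count hits and misses of a pattern at a given score threshold.

    scores  -- dict mapping sequence name to similarity score
    members -- set of names that truly belong to the similarity group
    """
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for name, score in scores.items():
        hit = score > threshold
        if hit and name in members:
            counts["TP"] += 1      # true member, found
        elif hit:
            counts["FP"] += 1      # not a member, but contains the pattern
        elif name in members:
            counts["FN"] += 1      # true member, missed
        else:
            counts["TN"] += 1      # not a member, correctly rejected
    return counts

# Invented example data.
scores = {"seqA": 50, "seqB": 47, "seqC": 22, "seqD": 35, "seqE": 9}
members = {"seqA", "seqB", "seqC"}

counts = evaluate_pattern(scores, members, threshold=30)
sensitivity = counts["TP"] / (counts["TP"] + counts["FN"])
specificity = counts["TN"] / (counts["TN"] + counts["FP"])
print(counts, sensitivity, specificity)
```

      Here seqC is a false negative and seqD a false positive. Moving the threshold trades one kind of error for the other, which is why the distribution of the scores matters.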

      Many similarity groups are so well conserved that one does not need the full-length pattern in order to identify their members without false positives and false negatives. In these cases it is enough to use a small conserved part of the pattern; such short diagnostic patterns are called "signatures" (Amos Bairoch's expression). In most cases one can use a simple yes-or-no test to see whether the signature is found within a sequence. The use of signatures carries a danger, however: new members of the similarity group may not share them (even though the homology is complete in other parts of the sequence), so they can be missed. Patterns, motifs and signatures are names designating consensus representations of a similarity group. The property that is most important for us here is that these consensus representations make it possible to retrieve, ideally, all members of a similarity group. There are simple statistical measures for calculating the accuracy of a pattern in database searching.
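      A yes-or-no signature test can be sketched with Python's re module. The signature and the three sequences below are toy examples chosen for illustration, and has_signature is a hypothetical name.

```python
import re

# Toy signature: A or G, any four residues, then G, K and S or T.
SIGNATURE = re.compile(r"[AG].{4}GK[ST]")

def has_signature(sequence):
    """Return True if the signature occurs anywhere in the sequence."""
    return SIGNATURE.search(sequence) is not None

# Invented sequences, for illustration only.
member = "MSTGAPGSGKSTLLA"      # carries the signature
unrelated = "MKLVVAGDEHTRLLA"   # does not carry it
variant = "MSTGAPGSGRSTLLA"     # same as member, but K replaced by R

for seq in (member, unrelated, variant):
    print(seq, has_signature(seq))
```

      The third sequence shows the danger mentioned above: it differs from the first by a single residue inside the signature, so the yes-or-no test misses it even though the rest of the sequence is identical.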

      The basic step of building a pattern is multiple alignment. Standard programs, like CLUSTAL or the PILEUP program of the GCG package, are very useful starting points for identifying the "core" of an alignment. One can usually improve the patterns by intuition, adding residues on both sides of the core. More refined programs (like the PROFILE program of Gribskov, and its more recent versions) can be used in the same way. An elementary example of pattern building is given here. Once familiar with the basics and having a good knowledge of the group of proteins, one can start building patterns. Some basic rules are summarized in - they describe how to select some of the usual parameters of database and profile searches, such as search matrices and gap penalties.
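      How a conserved core turns into a simple pattern can be sketched as below. The sketch assumes that the core has already been cut out of a multiple alignment as an ungapped block of equal-length segments; the segments are invented, and core_to_pattern and its max_alternatives parameter are hypothetical. It is not the method used by CLUSTAL, PILEUP or PROFILE.

```python
import re

def core_to_pattern(aligned_core, max_alternatives=2):
    """Turn an ungapped alignment block into a regular expression.

    Each column becomes a single residue if it is fully conserved,
    a bracketed set if it has few alternatives, or '.' otherwise.
    """
    parts = []
    for column in zip(*aligned_core):
        residues = sorted(set(column))
        if len(residues) == 1:
            parts.append(residues[0])
        elif len(residues) <= max_alternatives:
            parts.append("[" + "".join(residues) + "]")
        else:
            parts.append(".")
    return "".join(parts)

# Invented core segments of three aligned sequences.
core = ["GKSTL", "GKTAL", "GKSVL"]

pattern = core_to_pattern(core)
print(pattern)
print(all(re.fullmatch(pattern, segment) for segment in core))
```

      A pattern obtained this way matches every sequence it was derived from by construction. Whether it is also free of false positives has to be checked by scanning the database.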

      To summarize, we can define a similarity group as a group of sequences that carry a similar sequence pattern and that potentially have a similar biological function. The task of the researcher is to establish whether this group is biologically significant; if it is, a new pattern is born.

    Patterns as representations of similarity groups

      Patterns or motifs are generalized representations of a similarity group. They incorporate "all common features" of the sequence group. Naturally, "all common features" can mean an endless variety of things, which is why there are in fact many forms for representing patterns.

      The simplest form is the regular expression, a well known mathematical term. For example, the pattern