1. Algorithms

Informally, an algorithm is any well-defined computational procedure that takes some value, or set of values, as input and produces some value, or set of values, as output. An algorithm is thus a sequence of computational steps that transform the input into the output.

We can also view an algorithm as a tool for solving a well-specified computational problem. The statement of the problem specifies in general terms the desired input/output relationship. The algorithm describes a specific computational procedure for achieving that input/output relationship. A good algorithm is like a sharp knife-it does exactly what it is supposed to do with a minimum amount of applied effort. Using the wrong algorithm to solve a problem is like trying to cut a steak with a screwdriver: you may eventually get a digestible result, but you will expend considerably more effort than necessary, and the result is unlikely to be aesthetically pleasing. Algorithms devised to solve the same problem often differ dramatically in their efficiency. These differences can be much more significant than the difference between a personal computer and a supercomputer.

1.l. Analyzing algorithms

Analyzing an algorithm has come to mean predicting the resources that the algorithm requires. Occasionally, resources such as memory, communication bandwith, or logic gates are of primary concern, but most often it is computational time and memory requirements that we want to measure. Generally, by analyzing several candidate algorithms for a problem, a most efficient one can be easily identified. Such analysis may indicate more than one viable candidate, but several inferior algorithms are usually discarded in the process.

Before we can analyze an algorithm, we must have a model of the implementation technology that will be used, including a model for the resources of that technology and their costs. Generally one assumes a simple idealized computer, a generic one-processor, random-access machine (RAM) model of computation. In the RAM model, instructions are executed one after another, with no concurrent operations.

Analyzing even a simple algorithm can be a challenge. The mathematical tools required may include discrete combinatorics, elementary probability theory, algebraic dexterity, and the ability to identify the most significant terms in a formula. Because the behavior of an algorithm may be different for each possible input, we need a means for summarizing that behaviour in simple, easily understood formulas.

1.2. Input data

1.2.1. Structures, sequences

Biocomputing algorithms are quite varied in scope and computational tools. The majority of algorithms presented in this course are concerned with sequences. Sequences are a special kind of structure representations.

But what is a structure? Protein structures, gene structures, share a common architecture: they are composed of elements (substructures) and connections (relationships) between them. In a 3D structure of a molecule we have the atoms as elements and the chemical bonds as connections between them, plus the 3D coordinates. In fact this is a rich representation. We can eliminate some of the information and we get more abstract and concise representations. E.g. omitting the 3D coordinates we get atoms and connectivities between them. This is a structural formula (a graph). Omitting the information of the chemical bonds, the structure will collapse to a handful of atoms that we can count piece by piece, to get a composition (a vector). Naturally, atoms are not the only substructures we can use and there is a great variety of larger structural units we can use. Suffice to think about a schematic representation of a protein structure in terms of helices and beta sheets.

Now, an amino acid sequence is a special representation where the substructures we use are the 20 naturally occurring amino acids and the only type of connection or relationship we consider is sequential vicinity. As a result a protein sequence is a character string written with a 20 letter alphabet. DNA sequences are character strings of a 4 letter alphabet.

Other structural representations used in biology use different substructures. For example, protein structures are often depicted as a simple graph of helices, beta sheets and turns. Or, sometimes, we refer to a particular motif as a four helix bundle, i.e. we describe it in terms of composition.

So we have two metaphors for sequences: chemical structures and texts. When to use these? Sequence analysis, such as database searching, can be much better understood with the text metaphor. This is valid to all problems where we keep the amino acid residue representation as standard. If we need more versatile descriptions, the chemical structure metaphor is more useful. For example, the structure of multidomain proteins can be described as a series of complex elements, such as signal peptides, immunoglobulin-domains etc. The chemical structure analogy is especially conspicuous when dealing with the 3D structure of proteins composed of a few a-helices and b-sheets in a special arrangement.

1.2.2. Databases

Sequence databanks are collections of DNA or protein sequences/sequence motifs. The data are organized into records (entries).Current databases are simple series of sequence entries, i.e. are produced in the form of "flat files", not ready for practical use. For the use of the analysis programs that perform e.g. data retrieval and similarity search, the flat files are reorganized into special internal formats that best correspond to the use. The major program packages have their own internal databank format, not seen by the users.

The machinery of computer databanks was developed first for industrial/financial applications (payroll management, banking). Biological sequence databanks are quite simple as compared to these.

Sequence records or "entries" contain data on one single molecule (protein, DNA fragment). The record consists of fields that have internal structures (beginning of field, end of field, subfields). The sequence field contains the sequence data as a character string. The ID (identification) field is a (more-or-less) mnemonic identification code that sometimes changes, the AC (accession number) field is an unequivocal registration number. These record identifiers are useful when retrieving a record.

In addition to identifiers and sequence, the records contain an optional number of additional fields, generally summarized as "annotations". Annotations may contain simple information such as bibliographic references, functional descriptions, cross-references to other databases etc.. Special annotation fields contain structural information organized into a "feature table". This is a special representation of the sequence (in addition to the sequence field), in which sequence segments are associated with descriptors, e.g.. "from position 20 to position 30: signal peptide".

In a more formal way, a database record is a collection of data-structures, and there is no canonical definition what it should contain. We can attempt to describe some of the general features of sequence databases as they are today (1998) but even this small field changes very fast. In any case, a sequence record must contain one canonical structure representation, the sequence, and this is usually given in a character string format. In addition there is a large variety of data given, like a) the name and various identifiers of the sequence b) the organism the sequence is derived of c) the function of the sequence. d) bibliographic references e) cross references to records in other databases (e.g. to 3D data), f) A feature table, which is a collection of sequence segments of with some function. a) to d) are textual information given in a human-like language. In contrast, the feature table is a structure representation, but instead of residues, we now have functional or structural domains, i.e. complex substructures (as compared to simple amino acids).

When we speak about sequence analysis algorithms, the overwhelming majority deals with the sequence (character strings) only and the evaluation of the annotation part is left to the user.

1.3. Input size

The best notion for input size depends on the problem being studied. For many problems, such as sorting or computing discrete Fourier transforms, the most natural and general measure is the number of items in the input. Specifically, sequence processing algorithms in terms of sequence length and alphabet size. Some algorithms may run very well for short DNA sequences but less well for large protein sequences.

The linearity of the sequences is a very important aspect. Many algorithms would run well for character strings but would take impracticably long times for more complex structures such as graphs. For instance, if the input to an algorithm is a graph, the input size can be described by the numbers of vertices and edges in the graph. We shall indicate which input size measure is being used with each problem we study.

Many biological algorithms operate on two kinds of input: a sequence input, and a database. In a first approximation, the sequence database can be considered as a very long string which is concatenated from each sequence entry within the database. So, a database can be characterized by the total number of amino acids or nucleotides.

1.4. Running time

The running time of an algorithm on a particular input is the number of primitive operations or "steps" executed. It is convenient to define the notion of step so that it is as machine-independent as possible. For the moment, let us adopt the following view. A constant amount of time is required to execute each line of our algorithm-code. One line may take a different amount of time than another line, but we shall assume that each execution of the ith line takes time ci, where ci is a constant. This viewpoint is in keeping with the RAM model, and it also reflects how the algorithm code would be implemented on most actual computers.

Clearly, the above notion of running time is symbolic, and can only serve for the comparison of algorithms rather than to estimate real time necessary to run a program. Such "real computing times" may depend much more on input/output operations and the actual architecture of the machine. But if these factors are comparable for two algorithms, then the number of steps is the indicator we can use best.

1.5. Memory requirements

Memory requirements can be best defined as the number of values (numbers, strings etc.) we need to store during the calculations. This can be measured in terms of megabytes of disk space or RAM space necessary. Typically, a program may use a set of computed values that we can pre-compute and store for later use. Or, we can compute the values "on the fly" in each case the program runs. If computation of a value takes less time than reading it from the disk, we may want to compute it afresh every time. So there is a trade off between storage space and running time.

1.6. Worst-case and average-case analysis

Running time and memory requirements can vary a great deal depending on the quality of the algorithm as well as on the input. It is very important to find out how an algorithm will behave in the average case and how it will fare with strange cases. To these problems we refer as average-case and worst case scenarios. The best case has not occurred yet...

• The worst-case running time of an algorithm is an upper bound on the running time for any input. Knowing it gives us a guarantee that the algorithm will never take any longer. We need not make some educated guess about the running time and hope that it never gets much worse.

• For some algorithms, the worst case occurs fairly often. For example, in searching a database for a particular piece of information, the searching algorithm's worst case will often occur when the information is not present in the database. In some searching applications, searches for absent information may be frequent. Therefore, for many algorithms The "average case" is often roughly as bad as the worst case.

• In some particular cases, we shall be interested in the average-case or expected running time of an algorithm. One problem with performing an average-case analysis, however, is that it may not be apparent what constitutes an "average" input for a particular problem. Often, we shall assume that all inputs of a given size are equally likely.

Biological sequence analysis deals with natural protein and gene sequences. As computational objects, these sequences are fairly uniform, i.e. strange sequences, like several hundred identical amino acids in a row, do not occur. This is why for many purposes these can be considered as random character strings of a given composition. For example one can generate random strings whose composition corresponds to the composition of the database. In other words we can base our average case analysis on an input of a protein sequence of 100 or 500 residues which we generate randomly.

In some cases, however, random sequences are not good models of natural sequences. In particular, both proteins and nucleic acid sequences contain sequences of biased composition. These are sometimes referred to as "low complexity regions" (to be discussed later). We have to remember that some algorithms do not perform well of the biased sequence parts, for example sequence similarity searches are often mislead if a sequence contains long runs of identical residues. In many cases however the compositional bias does not influence the running time estimate.

1.7. Order of growth

Once we know running time and memory requirements for a given input, the next step is to find out how these change if we have a different (larger) input size. In terms of sequences, we are particularly interested how running time is dependent on sequence length. In cases of algorithms operating on a database, we also have to know how running time depends on database size. Usually the size of the alphabet (20 or 4) is also an important factor.

Generally we express order of growth as a proportionality. Simply put, we now ignore our original estimate of running time, we are only concerned with the fact how the increase depends on sequence length. For example, some algorithms have running times that grow linearly with the length l of a sequence O(l) (pronounced as "order of l"). Some have running time estimates O(l2) (pronounced as "theta of l-squared"). Obviously, algorithms of linear growth rate dependence are better than O(l2). We shall use the O-notation informally. However, this is a very important notion since some algorithms which require running times proportional to higher powers of sequence length may give impracticably long running times for sequences that are only slightly longer than the "average sequence" we used to test an algorithm.

We usually consider one algorithm to be more efficient than another if its worst-case running time has a lower order of growth. This evaluation may be in error for small inputs, but for large enough inputs a O(l2) algorithm, for example, will run more quickly in the worst case than a O(l3) algorithm.

Here we have to mention a trivial fact. Linear inputs like character strings of sequences have shallow growth rates in terms of computer times. In contrast, higher order representations, like graphs of molecular topology or atomic detail 3D structures, often require higher analysis times and their size increases running time in a much more dramatic way. So there are many algorithms that can run on sequences, while very often the same type of analysis on a 3D structure would take frighteningly long times. For example, finding a sequence segment in a large sequence is quite trivial. However, finding a particular 3D arrangement in a protein structure is much more time-consuming.

1.8 Examples

1.8.1. Sorting numbers

Let's take one example from computer science, the problem of sorting n numbers. One of the simplest algorithms, "Insertion-sort" solves the problem in a way we sort cards. To sort the cards in your hand you extract a card, shift the remaining cards, and then insert the extracted card in the correct place. Shell-sort and Quicksort are improved variants of

Insertion-sort. The algorithms differ in their length as well as in the estimated running times:

 Comparison of Sorting Methods Method Statements Average-case time Worst-case time Insertion sort 9 O(n2) O(n2) Shell sort 17 O(n1.25) O(n1.5) Quicksort 21 O(n lg n) O(n2)

1.8.2 Substring matching: Finding a part in a large sequence

A large sequence is a given series of characters (proteins have a 20 letters alphabet, DNA only has a 4 letters alphabet). Very often we want to find a sequence motif, say ATGC, in a large sequence. Let suppose that the large sequence has 1000 residues. We can design various algorithms that will perform the job. Let's say that the larger sequence is l residue long (l =1000) and the small part we want to find is a residue long (w=4). Let's call the large string L and the small string W.

A naive algorithm may work as follows:

a) Identify the first residue of the whole string (A)

b) Find the first A in the long sequence. If there is no A, all we have done was to read the whole string of (l -w+1) residues and the procedure is finished.

c) If we found a residue A at position i, let's stop there and see if residue i+1 corresponds to position 2 of the small string (i.e. C) . If yes, then check if residue i+2 corresponds to residue 3 of the small string and so on.

From this we can estimate the number of reading and comparison operation necessary to find all strings. Namely, if the long string does not have any substring corresponding to the small one, then we had to perform l reading+comparison operations, i.e. performance of the algorithm will be proportional to O(l). This is a best case. Now let us take a worst case, e.g. the large string has exactly n pieces of A-s and all of them correspond to an exact copy of the W string. In this case we will have to read and compare l times, again, and so on as an A is found we start a procedure in which we make a maximum of w-1 reading and comparison operations. This is a very crude approach of course but we know that we will need l operations plus n times w-1 comparisons. l +n(w-1) is a maximum. Since n is smaller than l, we can say that the worst case scenario is less than l +l (w-1)=l w. So our algorithm has an upper bound of O(l w), which means that it is a linear function of the sequence length. However if both l and w are very large, this still can mean a lot of steps. Now, the number of comparisons to find out if two letters are equal is maimum equal to the size of the alphabet, S. This comes in as a multiplyer, so the growth rate can be written as SO(l w).

A more educated scenario is the Boyer-Moore (BM) algorithm. Same as the naive algorithm, BM includes a first step identification of the first character but then uses two new functions (compute last-occurence and good-suffix) which makes it the most efficient string-matching algorithm than the others. Without details we mention that this algoritm has a worst case running time O([l-w+1]w+ S) which is better than SO(l w).

1.8.3 Calculating a numeric property of a long sequence

Numeric properties are often calculated from sequence data. In most cases we have a small table that assigns a numeric value to each residue (character) in the sequence. A typical example is a hydrophobicity profile. A hydrophobicity value is an experimental value assigned to each of the 20 amino acids, which means we have a small look-up table of 20 items.

Taking a long sequence of l residues, we can convert it into a hydrophobicity profile by exactly l reading and look-up operations. The result is a numeric profile, a series of l numbers taken from our hydrophobicity table. Let's denote each value in the profile as pi

Very often one has to average the property (hydrophobicity) within a window w. For example, the Kyte-Doolittle plot uses a window size of 21. Given the numeric profile of length l, a naive algorithm would take a pi at each character position in the sequence and would start an averaging process of stepw at each position. This would amount to (l- w+1)w averaging processes. Namely, we only have l -w+1 positions to think about, and at each position we make w operations, each of the consisting of calculating the average of w numbers, dividing the sum by w. Since in this case we have to calculate each value, i.e. we can not introduce further shortcuts, the growth rate of parametric algorithms is, O(l -w+1)w~ O(l w). A "clever" i.e. more intuitive algorithm would use the following consideration. If the sum at position i is known then the sum at position i+1 can be calculated not with w but with exactly two operations: Subtract pi and add p(i+w). For this to be implemented one has to calculate the full sum at position 1 (w operations) then perform 2(l -w) operations which is, roughly ~2l. If w is not very small, this will be much faster than the naive algorithm.

How does this algorithm depend on the alphabet size (protein or DNA). This dependence is hidden in the look-up procedure. If we have to find a value in a table of 20 it takes longer than to find it in a table of 4. So this will be an independent variable, and we can approximately say that if the alphabet is size S, then the naive algorithm will have a running time O(Sw[l -w+1]) and the "clever algorithm will have a running time O(S[2l -w]). The latter will be more favourable than the naive algorithm in all practical cases.