Table I. Checklist when applying motif and profile searches.
Is the multiple alignment correct?
A correct alignment is a prerequisite for any search method: misalignments and frameshifts seriously degrade the signal.
Is the given set of sequences representative?
Some programs lack effective downweighting schemes for close relatives in the learning set: it is sometimes wise to omit very redundant sequences and select a representative set of divergent sequences.
Are the borderlines correctly defined or does enlargement/reduction of the studied segment make more sense?
Arbitrarily truncated segments will have a weaker signal whereas artificially enlarged ones add noise.
Am I using an appropriate method for my sequence family?
For highly gapped alignments, block searches are not advisable; profiles spanning full proteins are sometimes less sensitive than restriction to a few short conserved motifs, when there is no other strong conservation.
Is the chosen amino acid substitution matrix appropriate?
Depending on the family, another matrix might lead to clearer/different results. For example, "soft" matrices tend to be inappropriate for short motifs.
Are gaps considered appropriately?
Large insertions might occur between more conserved regions; small extracellular domains with cores mainly of disulfide bridges have more freedom for insertions/deletions than, for example, enzymes.
For profile searches, have I optimised the gap penalties on trial runs?
Appropriate penalties vary with the divergence of the query set. If set too strong, gaps cannot be crossed. If set too weak, the query profile spreads out over false positives, giving them higher scores.
Are apparently essential positions (e.g. required for catalysis) set to be required in the pattern/profile?
The weight/penalty for such positions is often not high enough in the available programs, and the additional knowledge should be included manually, e.g. by stronger weights/penalties.
Have the appropriate databases been searched?
Many programs are unable to search in DNA databases or their six-frame translations, which usually harbour additional hits; some network servers might offer out-of-date databases.
Do I interpret the output correctly?
People are often not very familiar with parameters and scoring systems, but one needs to be sure about the resulting scores (e.g. normalised Z-scores in PROFILESEARCH are misleading when searching for small domains in larger proteins as they upweight small sequences).
Do I apply my knowledge about putative target sequences appropriately?
Searching with a core metabolic enzyme suggests "downweighting" hits with extracellular proteins, as the biological context is different; but one has to be very careful as all kinds of exceptions exist and proteins with unrelated functions can indeed be homologous.
Have I carefully searched databases with putative novel members before inclusion into the alignment for the next iteration?
If the putative novel member belongs to a family which is well-characterised and distinct, or which has different conserved regions, it is likely a false positive.
Did I check the reciprocity of detections?
If a profile of family A identifies family B as similar, does a profile of family B find family A? Caution is needed as in some cases profiles of two artificially aligned families might identify both families before the noise.
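The reciprocity check in the last item can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `reciprocity_report` and the `hits` mapping (family name to the set of other families its profile detected above some chosen cutoff) are hypothetical and do not correspond to the output of any particular search program. As the checklist warns, a reciprocal pair is still only a candidate relationship, not proof of homology.

```python
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]


def reciprocity_report(hits: Dict[str, Set[str]]) -> Tuple[List[Pair], List[Pair]]:
    """Split cross-family detections into reciprocal and one-way pairs.

    hits[a] is assumed to hold the families detected by the profile of
    family a (hypothetical input format, filled in by the user).
    """
    reciprocal = set()
    one_way = []
    for source, targets in hits.items():
        for target in targets:
            if target == source:
                continue  # a profile finding its own family is not informative here
            if source in hits.get(target, set()):
                # store each reciprocal pair once, in alphabetical order
                reciprocal.add(tuple(sorted((source, target))))
            else:
                one_way.append((source, target))
    return sorted(reciprocal), sorted(one_way)


# Toy input: profile A finds B and C, profile B finds A, profile C finds nothing.
hits = {"A": {"B", "C"}, "B": {"A"}, "C": set()}
reciprocal, one_way = reciprocity_report(hits)
print("reciprocal:", reciprocal)
print("one-way:", one_way)
```

One-way pairs such as A finding C deserve the closest scrutiny, whereas reciprocal pairs should still be checked against the alignment itself, since two artificially aligned families can detect each other.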