
Exercises 1 through 5 are practical tool-focused exercises using Clustal and T-COFFEE to create (approximate) multiple alignments and distance-based phylogenetic trees. Exercises 6 and 7 are `theoretical' exercises designed to show that exact optimal multiple alignment is hopeless except for fairly short and highly similar sequences. Exercise 8 is again a practical exercise, introducing the PSI-Blast tool.
If you are interested in phylogeny, I recommend doing some exercises with Phylip. Use the Phylip server at the Pasteur Institute. These exercises were designed by Henrik Christensen, Veterinary Microbiology, KVL.
However, it may be more convenient to download and install it on your local computer, so that's what you will do in this exercise. Clustal is available for a number of platforms from ftp://ftp-igbmc.u-strasbg.fr/pub/ClustalX/ at the Université Louis Pasteur, Strasbourg. Here are some instructions for Microsoft Windows:
Use Trees | Draw NJ Tree from Clustal to create a phylogenetic tree from the multiple alignment (by Neighbour-Joining), and save the tree to a file globin.ph.
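Neighbour-Joining works from a matrix of pairwise distances between the aligned sequences. The following is a minimal sketch of one simple distance, the fraction of differing residues at positions where neither sequence has a gap. The function names and the toy alignment are illustrative assumptions; this is not Clustal's own code, which also offers options such as a correction for multiple substitutions.

```python
def p_distance(a, b, gap="-"):
    """Fraction of differing residues over columns where neither row has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(alignment):
    """Symmetric matrix of pairwise p-distances for a list of aligned rows."""
    return [[p_distance(a, b) for b in alignment] for a in alignment]


# Toy alignment (made-up rows, not the globin data).
toy = ["AC-T", "AG-T", "ACGT"]
matrix = distance_matrix(toy)
```

A tree-building method such as Neighbour-Joining would then take `matrix` as its input.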
Also try to `bootstrap' the tree to see how well supported the various branching points are. Choose a different random seed than the one provided.
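Bootstrapping resamples the columns of the alignment with replacement, rebuilds the tree from each resampled alignment, and counts how often each branching point reappears. The resampling step can be sketched as follows; the function name and the toy rows are assumptions for illustration, and this is not Clustal's implementation.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one bootstrap replicate: columns drawn with replacement."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    # Draw as many column indices as the alignment has columns.
    cols = [rng.randrange(length) for _ in range(length)]
    return ["".join(row[c] for c in cols) for row in alignment]


# Toy alignment (made-up rows); the seed plays the role of Clustal's random seed.
toy = ["VLSPADK", "VHLTPEE", "GLSDGEW"]
replicate = bootstrap_alignment(toy, random.Random(42))
```

Changing the seed gives a different set of replicates, which is why the bootstrap values may vary slightly between runs.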
To view the tree you need to open the tree file from the program C:\Clustal\njplotWIN95. From this program you can also convert the tree to Postscript for printout by choosing File | Save Plot and naming the file globintree.eps or similar. Unfortunately, MS Word does not understand the Postscript files produced by njplotWIN95. Instead use TreeView; see below.
Take a look at the Alignment | Alignment Parameters menu. You'll see that the Gonnet substitution matrices are used by default, that one can adjust gap penalties etc.
Start TreeView and use it to open globin.ph or globin.phb from the C:\Clustal folder. Now you can edit the tree, change its layout, fonts, etc. Then you can print it, or use File | Save as graphic to save it as an Enhanced Metafile, a format suitable for including in MS Word.
To make TreeView display the bootstrap values created by Clustal (and saved in a file such as globin.phb), you must choose Trees | Output Format Options | Bootstrap Labels on | Nodes in Clustal before doing Trees | Bootstrap N-J Tree in Clustal. Then in TreeView, select Tree | Show internal edge labels.
The sequences of 12 beta-lactamases can be found in file bcII-all.fasta. Download them and align them in Clustal. Choose Alignment | Output Format Options | Output Order | Input to prevent Clustal from rearranging the sequences.
There are a few sites at which an amino acid is conserved in all twelve sequences. How many? (Look at the Column Score Profile, or look for a star above the alignment).
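Counting fully conserved sites can also be done programmatically once the alignment is available as equal-length rows. Here is a small sketch; the function name and the toy rows are assumptions, not the beta-lactamase data.

```python
def conserved_columns(alignment, gap="-"):
    """Return 0-based indices of columns where every row has the same residue."""
    return [
        i
        for i, col in enumerate(zip(*alignment))
        if len(set(col)) == 1 and col[0] != gap
    ]


# Toy alignment (made-up rows).
toy = ["HAHG-D", "HTHADD", "HSHG-D"]
sites = conserved_columns(toy)
```

The length of `sites` should match the number of stars Clustal shows above the alignment.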
The twelve proteins fall in three families, B1, B2, and B3. The B1 family consists of the six first proteins (in the original file order), the B2 family consists of the next two proteins, and the B3 family consists of the last four proteins.
Check that this is also what Clustal inferred, by saving and then viewing the Neighbour-Joining tree.
Print out the alignment produced by Clustal and compare it to the alignment shown in the paper by Galleni et al (available only on paper). Reordering the sequences to fit the Galleni paper will make this task much easier.
(Thanks to Moreno Galleni, University of Liège, Belgium, and Lasse Hemmingsen, KVL. See Galleni, Moreno; Lamotte-Brasseur, Josette; Rossolini, Gian Maria; Spencer, Jim; Dideberg, Otto; Frère, Jean-Marie: Standard Numbering Scheme for Class B beta-Lactamases (guest commentary). Antimicrobial Agents and Chemotherapy 2001, volume 45, issue 3, 660-663.)
The file carboxypep.fasta contains 18 sequences for carboxypeptidases from humans, cows, rats, pigs, etc. Download the file and align the sequences using Clustal.
How well conserved are the sequences? Are there any sequences that seem to be outliers (more distantly related)? Does the Neighbour-Joining tree support your notion of outlier? Try removing the outliers using the Edit menu in Clustal, and then redo the alignment.
(Thanks to Morten Bjerrum, Department of Chemistry, KVL).
The file chloroperoxidase.fasta contains eight sequences for proteins that are chloroperoxidases, or related to chloroperoxidases. (Thanks to former bioinformatics student Line Albertsen and others).
Use Clustal to find a multiple alignment for these sequences. Because some sequences are short and others very long, Clustal fails to find the common motif in the sequences. For instance, look at the multiple alignment at positions 509ff. Here apparently there is a PAYPSGHAT motif in three of the sequences (Curvularia, Drechslera, Embellisia). Then look at position 604ff. Here the highly similar PSYPSGHAT motif appears, this time in the other sequences Ascophyllum, Fucus and Deinococcus, and to a lesser extent in Nostoc and Corallina.
The positions indicated above are for the downloadable version of Clustal. The Clustal server at EBI produces a slightly different alignment, perhaps because it runs Clustal 1.82 instead of 1.83, but the resulting alignment contains the same mistake.
Now use T-COFFEE instead of Clustal to make a multiple alignment of the same eight sequences. You can run T-COFFEE at http://igs-server.cnrs-mrs.fr/Tcoffee/. Inspect the HTML or PDF version of the alignment. You'll see that T-COFFEE finds the P[AS]YPSGHAT motif without problems. (A somewhat bizarre feature of T-COFFEE is that high-similarity regions in the multiple alignment are marked yellow/red in the HTML format, and blue/green in the PDF format).
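The notation P[AS]YPSGHAT reads directly as a regular expression, so you can check where the motif occurs in each unaligned sequence. A minimal sketch follows; the two test strings are made up and are not taken from chloroperoxidase.fasta.

```python
import re

# The motif from the text: P, then A or S, then YPSGHAT.
MOTIF = re.compile(r"P[AS]YPSGHAT")


def find_motif(sequence):
    """Return (1-based start position, matched text) for each motif occurrence."""
    return [(m.start() + 1, m.group()) for m in MOTIF.finditer(sequence)]


# Made-up sequences for illustration only.
hits_a = find_motif("MKKPAYPSGHATLL")
hits_b = find_motif("GGPSYPSGHATAA")
```

For aligned sequences, remove the gap characters first, for example with `sequence.replace("-", "")`.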
If you want, you can download T-COFFEE and install it on your local computer. See http://igs-server.cnrs-mrs.fr/~cnotred/Projects_home_page/t_coffee_home_page.html.
For your convenience I've compiled an executable for MS Windows; it requires a DLL called cygwin1.dll. Download both of them to the directory C:\windows\.
Here are three sample sequences in FASTA format:
>Sequence 1
VLSPADKTNVKAAWGKVGAHAGEYGAEALE
RMFLSFPTTKTYFPHFDLSHGSAQVKGHGK
KVADALTNAVAHVDDMPNALSALSDLHAHK
LRVDPVNFKLLSHCLLVTLAAHLPAEFTPA
VHASLDKFLASVSTVLTSKYR
>Sequence 2
VHLTPEEKSAVTALWGKVNVDEVGGEALGR
LLVVYPWTQRFFESFGDLSTPDAVMGNPKV
KAHGKKVLGAFSDGLAHLDNLKGTFATLSE
LHCDKLHVDPENFRLLGNVLVCVLAHHFGK
EFTPPVQAAYQKVVAGVANALAHKYH
>Sequence 3
GLSDGEWQLVLNVWGKVEADIPGHGQEVLI
RLFKGHPETLEKFDKFKHLKSEDEMKASED
LKKHGATVLTALGGILKKKGHHEAEIKPLA
QSHATKHKIPVKYLEFISECIIQVLQSKHP
GDFGADAQGAMNKALELFRKDMASNYKELG
FQG
Download the file SampleData.fasta containing these sample sequences to directory C:\temp, say.
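In FASTA format each record starts with a header line beginning with `>`, followed by one or more lines of sequence that belong together. Reading such a file can be sketched as below; the function name is an assumption, and the embedded text is only a short excerpt of the sample sequences.

```python
def parse_fasta(text):
    """Map each FASTA header (without '>') to its sequence joined into one string."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


# Excerpt of the sample data: two lines of Sequence 1, one line of Sequence 2.
excerpt = (
    ">Sequence 1\n"
    "VLSPADKTNVKAAWGKVGAHAGEYGAEALE\n"
    "RMFLSFPTTKTYFPHFDLSHGSAQVKGHGK\n"
    ">Sequence 2\n"
    "VHLTPEEKSAVTALWGKVNVDEVGGEALGR\n"
)
records = parse_fasta(excerpt)
```

To read the downloaded file instead, pass the file's contents, for example `parse_fasta(open(r"C:\temp\SampleData.fasta").read())`.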
Now you should be able to run MSA inside a Command Prompt or MS-DOS Prompt (usually found under Start | Programs | Accessories or similar):
C:
cd \temp
msa SampleData.fasta
To capture the output, redirect it to a file (e.g. result.txt), using
msa SampleData.fasta > result.txt
Then look at the file C:\temp\result.txt using an editor (e.g. Wordpad).
Using a text editor (Wordpad), select the six closely related B1 family sequences from the beta-lactamase file above, and save to a new file bcII-B1.fasta. Save in text format, not in Word format, otherwise your file will be incomprehensible to MSA. Run MSA on the file. (Note that the friendly Microsoft programmers decided that a text file saved by Wordpad should have the extension .txt, even if you have already given it the extension .fasta). This alignment should take between 2 and 10 seconds depending on your computer.
Save the alignment to a file. Align the same six sequences using Clustal, and compare the alignment found by Clustal (which is not necessarily optimal) with that found by MSA (which is optimal). Edit using Wordpad and print it out in a tiny fixed-width font (such as Courier) if that makes the comparison easier.
This exercise serves to demonstrate that optimal multiple alignment may work for a limited number of closely related sequences, but is not feasible in general. This is true not only for the MSA program, but for all attempts at optimal multiple alignment. Therefore researchers developed Clustal and T-COFFEE and similar approximate and non-optimal but fast programs.
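A rough calculation shows why. Exact alignment of k sequences by plain dynamic programming fills a k-dimensional table with one cell per combination of prefix lengths, and each cell looks at up to 2^k - 1 neighbouring cells. The sketch below counts these under that assumption; the sequence lengths are hypothetical, and programs such as MSA prune part of the table, so this is an upper bound rather than a measurement.

```python
def dp_cells(lengths):
    """Number of cells in the full dynamic-programming table."""
    cells = 1
    for n in lengths:
        cells *= n + 1  # prefixes of length 0..n
    return cells


def predecessors_per_cell(k):
    """Each cell is reached from every non-empty subset of the k sequences advancing."""
    return 2**k - 1


# Hypothetical example: sequences of 150 residues each.
for k in (2, 3, 6, 12):
    work = dp_cells([150] * k) * predecessors_per_cell(k)
    print(f"{k:2d} sequences: about {work:.2e} cell updates")
```

The count grows by a factor of roughly 300 for every additional sequence of this length, which quickly exceeds what any computer can handle.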
How many new significant database hits do you get in the second iteration? How many carboxypeptidases are there among them? How many new significant hits do you get in the third iteration?
Note that in window (B) you can tick a box to determine which hits should be used to build the next iteration's position-specific score matrix. Thus you can ensure biologically more meaningful results by leaving out irrelevant hits (if you know what you are doing).
Note also that PSI-Blast can give you the position-specific score matrix (after the second iteration and later); choose PSSM on the Format page. You can save the matrix on your computer and later feed it back into PSI-Blast (under Options, in the PSSM text area).
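To get a feeling for what a position-specific score matrix contains, here is a deliberately simplified sketch: for each alignment column it turns residue counts into log-odds scores. The uniform background frequencies, the fixed pseudocount and the toy alignment are all assumptions made for illustration; PSI-Blast's actual procedure is more elaborate (for instance, it weights the sequences).

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one dict per column mapping residue -> log2-odds score."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for col in zip(*alignment):
        counts = Counter(c for c in col if c in AMINO_ACIDS)  # gaps are ignored
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append(
            {
                a: math.log2(((counts[a] + pseudocount) / total) / background)
                for a in AMINO_ACIDS
            }
        )
    return pssm


def score(pssm, segment):
    """Sum the column scores of a segment as long as the matrix."""
    return sum(col[a] for col, a in zip(pssm, segment))


# Toy alignment (made-up rows).
pssm = build_pssm(["PAYPS", "PSYPS", "PAYPS"])
```

A residue seen often in a column gets a positive score there and an unseen one a negative score, so a segment resembling the alignment scores higher than an unrelated one.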
Use the mouse to slide (drag) sequences left and right, thus inserting (or closing) gaps at the residue you point at. You'll find it pretty tiresome to try to do better than Clustal.
Jalview can also be downloaded from http://www.jalview.org/ if you want to install it locally.
Then go back to the Prosite entry page and click on ScanProsite. Take one of the sequences and paste it into the `Scan a protein for Prosite matches' field, check the `Exclude patterns with a high probability of occurrence' checkbox, and start the scan. Hopefully you will get exactly the two Zinc carboxypeptidase patterns. You will also get a Graphical summary showing the two matching sequence fragments. Move the red vertical rulers so that they exactly enclose a fragment, and then press Zoom. Then you'll get the exact matching subsequences.
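Prosite patterns use a small notation of their own: elements are separated by `-`, `x` matches any residue, `[AS]` lists allowed residues, `{P}` lists forbidden ones, and `(2,4)` gives a repeat range. The sketch below translates that notation into a regular expression. The function name is an assumption, the example patterns are made up (the first is just the chloroperoxidase motif from above written Prosite-style, not an actual Prosite entry), and rarer features of the notation are not handled.

```python
import re


def prosite_to_regex(pattern):
    """Translate a Prosite-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = repeat = ""
        if element.startswith("<"):  # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):  # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        if element.endswith(")"):  # repeat count such as x(2,4)
            element, _, count = element[:-1].partition("(")
            repeat = "{" + count + "}"
        if element == "x":
            core = "."
        elif element.startswith("{"):
            core = "[^" + element[1:-1] + "]"
        else:
            core = element  # a single residue or a class like [AS]
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


motif_regex = prosite_to_regex("P-[AS]-Y-P-S-G-H-A-T")
match = re.search(motif_regex, "AAPSYPSGHATAA")  # made-up sequence
```

ScanProsite does this kind of matching for you against all patterns in the database.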
Instead of just Prosite you may search Interpro which integrates Prosite, Pfam, Prints, and several other protein motif databases. Use the server at http://www.ebi.ac.uk/interpro/: click Sequence Search, then enter the sequence and your email address. In addition to the two Prosite patterns you now get hits from Pfam and Prints.
KVL PhD Bioinformatics 2004. Peter Sestoft (sestoft@dina.kvl.dk) 2001-06-22, 2004-02-05