ClustalW Multiple Sequence Alignments

Multiple Sequence Alignments

CLUSTALW tutorial

Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, positions-specific gap penalties and weight matrix choice. Nucleic Acids Research, 22:4673-4680. CLUSTAL W version 1.4 dated September 23, 1994: Clustal W is a general purpose multiple alignment program for DNA or proteins. Clustalw is produced by Julie D. Thompson, Toby Gibson of European Molecular Biology Laboratory, Germany and Desmond Higgins of European Bioinformatics Institute, Cambridge, UK. Algorithmic

Access to the last documentation of Clustalw 1.06 Multiple alignments are carried out in 3 stages: 1. All pairs of sequences are aligned separately (pairwise alignments) in order to calculate a distance matrix giving the divergence of each pair of sequences; 2. A guide tree is constructed from the distance matrix ; 3. The sequences are progressively aligned according to the hiearchy in the guide tree. Clustalw program is a major update and rewrite of clustalv program. Improvements over clustalv :

Reads GDE, MSF and Clustal format alignments as input ;
Individual weights are assigned to each sequence in a partial alignment in order to downweight near-duplicate sequences and upweight the most divergent ones ;
Amino acid substitution matrices are varied at different alignment stages according to the divergence of the sequences to be aligned ;
Residue specific gap penalties and locally reduced gap penalties in hydrophilic regions encourage new gaps in potential loop regions rather than regular secondary structure ;
Positions in early alignments where gaps have been opened receive locally reduced gap penalties to encourage the opening up of new gaps at these positions ;
The similary scores between pairs of sequences are clustered according to the Neighbor-Joining (NJ) method and no more according to the UPGMA one.
The New Hampshire (nested parentheses) tree format is used as default for all trees (compatible with the PHYLIP package).
Enhanced facilities for addition of new sequences to old alignments.

Input data file In this tutorial, it is assumed that the user has access to the GCG package and the SwissProt protein sequence database. To extract the sequences, one needs to create a text file (using an editor e.g. emacs under UNIX) with the names of the sequences on each line. One can then use the tofasta command of the GCG package to extract these sequences from the database and put them into a single file in FASTA format. This format can be used as input to CLUSTALV or CLUSTALW programs. Note : Sequences submitted to CLUSTALW should all be in 1 file. Six formats are accepted: NBRF/PIR, EMBL/SwissProt, Pearson (Fasta), GDE, Clustal, GCG/MSF. Among these the fasta format is by far the simplest one, if the orginal sequences are raw data. Readseq is a very useful program to convert sequence formats. In the gcg package there are some sequence exchange programs which allow conversion of sequences to or from the gcg format. The sequence data examples used in this tutorial are drawn from Desmond Higgins course. Data set 1 : Cysteine (Thiol) proteases Create a text file called "cath.list" with the following lines: sw:aleu_horvu sw:cath_human sw:cath_rat sw:catl_human sw:catl_rat sw:cys1_dicdi sw:papa_carpa This is just a list of SwissProt sequence entry names. One needs to extract them into a single file in FASTA format; this is done by using the tofasta command as follows: % tofasta @cath.list The program will ask to suggest a name for the output file; call it "cath.pep". Helix-turn-Helix DNA binding/repressor proteins Repeat the above exercise and create a file called "repr.list" which should contain the following lines: sw:dica_ecoli sw:immf_bpph1 sw:rpc_bpph1 sw:rpc_bpp2 sw:rpc2_bpp22 Then extract the sequences as before with the tofasta command: % tofasta @repr.list and call the output file "repr.pep". This file is shown below. reper.pep should look like this: >dica_ecoli P06966 escherichia coli. repressor protein of division inhibition gene dicb. 7/89 METKNLTIGERIRYRRKNLKHTQRSLAKALKISHVSVSQWERGDSEPTGKNLFALSKVLQ CSPTWILFGDEDKQPTPPVEKPVALSPKELELLELFNALPESEQDTQLAEMRARVKNFNK LFEELLKARQRTNKR >immf_bpph1 P13772 bacteriophage phi-105. immf control region 10 kd protein. 4/90 LDGKKLGALIKDKRKEKHLKQTEMAKALGMSRTYLSDIENGRYLPSTKTLSRIAILINLD LNVLKMTEIQVVEEGGYDRAAGTCRRQAL >rpc_bpph1 P06153 bacteriophage phi-105. immunity repressor protein. 8/92 MTVGQRIKAIRKERKLTQVQLAEKANLSRSYLADIERDRYNPSLSTLEAVAGALGIQVSA IVGEETLIKEEQAEYNSKEEKDIAKRMEEIRKDLEKSDGLSFSGEPMSQEAVESLMEAME HIVRQTQRINKKYTPKKYRNDDQE >rpc_bpp2 P04132 bacteriophage p2. repressor protein c. 5/92 MSNTISEKIVLMRKSEYLSRQQLADLTGVPYGTLSYYESGRSTPPTDVMMNILQTPQFTK YTLWFMTNQIAPEFGQIAPALAHFGQNETTSPHSGQKTG >rpc2_bpp22 P03035 bacteriophage p22, and bacteriophage p21 (bacteriophage 21). repressor protein c2. 6/94 MNTQLMGERIRARRKKLKIRQAALGKMVGVSNVAISQWERSETEPNGENLLALSKALQCS PDYLLKGDLSQTNVAYHSRHEPRGSYPLISWVSAGQWMEAVEPYHKRAIENWHDTTVDCS EDSFWLDVQGDSMTAPAGLSIPEGMIILVDPEVEPRNGKLVVAKLEGENEATFKKLVMDA GRKFLKPLNPQYPMIEINGNCKIIGVVVDAKLANLP Note that if the considered sequences are raw files, the easyest way to submit them to clustalw, is to present them in 1 file and in this format. The exact rules for sequence input are in the ClustalV.doc (not yet ClustalW.doc) documentation file;

One can use UPPER or lower case and the sequences can be DNA or PROTEIN;
No ambiguity codes are allowed; a symbol is either a valid amino acid or nucleotide or will be considered unknown;
All spaces, digits etc. will be ignored except for the hyphen character "-" which is used to represent gaps;
Each sequence name begins with an angle bracket ">";
The text on the line that starts with the ">" will be ignored; it is treated as free text;
the sequence is on the following lines and the next sequence starts with a ">" etc.

Running Clustalw To start up CLUSTALW just type clustalw (may be preceeded by the pathname) : % clustalw The following main menu will apear. ************************************************************** ******** CLUSTAL W(1.4) Multiple Sequence Alignments ******** ************************************************************** 1. Sequence Input From Disc 2. Multiple Alignments 3. Profile Alignments 4. Phylogenetic trees S. Execute a system command H. HELP X. EXIT (leave program) Your choice: 1 Type "1", press return and the program will ask for the name of a sequence file; enter "repr.pep" or "cath.pep". The program will list the names of the sequences, their lengths and its guess as to whether proteins or dna are being aligned. Sequences should all be in 1 file. 6 formats accepted: NBRF/PIR, EMBL/SwissProt, Pearson (Fasta), GDE, Clustal, GCG/MSF. Enter the name of the sequence file: repr.pep Sequence format is Pearson Sequences assumed to be PROTEIN Sequence 1: dica_ecoli 135 aa Sequence 2: immf_bpph1 89 aa Sequence 3: rpc_bpph1 144 aa Sequence 4: rpc_bpp2_P 99 aa Sequence 5: rpc2_bpp22 216 aa ************************************************************** ******** CLUSTAL W(1.4) Multiple Sequence Alignments ******** ************************************************************** 1. Sequence Input From Disc 2. Multiple Alignments 3. Profile Alignments 4. Phylogenetic trees S. Execute a system command H. HELP X. EXIT (leave program) Your choice: 2 To do a multiple alignment, enter "2" and the following menu is displyed: ****** MULTIPLE ALIGNMENT MENU ****** 1. Do complete multiple alignment now (Slow/Accurate) 2. Produce guide tree file only 3. Do alignment using old guide tree file 4. Toggle Slow/Fast pairwise alignments = SLOW 5. Pairwise alignment parameters 6. Multiple alignment parameters 7. Reset gaps between alignments? = ON 8. Toggle screen display = ON 9. Output format options S. Execute a system command H. HELP or press [RETURN] to go back to main menu Your choice: 5 This menu contains some new features with regard to the clustlv program (4, 7, 8). Option 5 shows the default parameters. ********* PAIRWISE ALIGNMENT PARAMETERS ********* Slow/Accurate alignments: 1. Gap Open Penalty :10.00 2. Gap Extension Penalty :0.10 3. Protein weight matrix :BLOSUM30 Fast/Approximate alignments: 4. Gap penalty :3 5. K-tuple (word) size :1 6. No. of top diagonals :5 7. Window size :5 8. Toggle Slow/Fast pairwise alignments = SLOW H. HELP Enter number (or [RETURN] to exit): ****** MULTIPLE ALIGNMENT MENU ****** 1. Do complete multiple alignment now (Slow/Accurate) 2. Produce guide tree file only 3. Do alignment using old guide tree file 4. Toggle Slow/Fast pairwise alignments = SLOW 5. Pairwise alignment parameters 6. Multiple alignment parameters 7. Reset gaps between alignments? = ON 8. Toggle screen display = ON 9. Output format options S. Execute a system command H. HELP or press [RETURN] to go back to main menu Your choice: 6 Hit return to see the default multiple alignement parameters. ****** MULTIPLE ALIGNMENT PARAMETERS ****** 1. Gap Opening Penalty :10.00 2. Gap Extension Penalty :0.05 3. Delay divergent sequences :40 % 4. Toggle Transitions (DNA) :Weighted 5. Protein weight matrix :BLOSUM series 6. Use negative matrix :OFF 7. Protein Gap Parameters H. HELP Enter number (or [RETURN] to exit): ****** MULTIPLE ALIGNMENT MENU ****** 1. Do complete multiple alignment now (Slow/Accurate) 2. Produce guide tree file only 3. Do alignment using old guide tree file 4. Toggle Slow/Fast pairwise alignments = SLOW 5. Pairwise alignment parameters 6. Multiple alignment parameters 7. Reset gaps between alignments? = ON 8. Toggle screen display = ON 9. Output format options S. Execute a system command H. HELP or press [RETURN] to go back to main menu Your choice: 9 Hit 2 to get the output of aligned sequences NBRF/PIR format, 3 in gcg/msf format, 4 in the PHYLIP interleaved format, or hit 5 to get the output in the GDE format. ********* Format of Alignment Output ********* 1. Toggle CLUSTAL format output = ON 2. Toggle NBRF/PIR format output = OFF 3. Toggle GCG/MSF format output = OFF 4. Toggle PHYLIP format output = OFF 5. Toggle GDE format output = OFF 6. Toggle GDE output case = LOWER 7. Toggle output order = INPUT FILE 8. Create alignment output file(s) now? 9. Toggle parameter output = OFF H. HELP Enter number (or [RETURN] to exit): 9 If we want to get the output in all proposed formats, after hitting 2 then 3 then 4 then 5, each followed by return, the format of alignment menu looks like the following : ********* Format of Alignment Output ********* 1. Toggle CLUSTAL format output = ON 2. Toggle NBRF/PIR format output = ON 3. Toggle GCG/MSF format output = ON 4. Toggle PHYLIP format output = ON 5. Toggle GDE format output = ON 6. Toggle GDE output case = LOWER 7. Toggle output order = INPUT FILE 8. Create alignment output file(s) now? 9. Toggle parameter output = ON H. HELP Enter number (or [RETURN] to exit): Hit return to get the multiple alignment menu : ****** MULTIPLE ALIGNMENT MENU ****** 1. Do complete multiple alignment now (Slow/Accurate) 2. Produce guide tree file only 3. Do alignment using old guide tree file 4. Toggle Slow/Fast pairwise alignments = SLOW 5. Pairwise alignment parameters 6. Multiple alignment parameters 7. Reset gaps between alignments? = ON 8. Toggle screen display = ON 9. Output format options S. Execute a system command H. HELP or press [RETURN] to go back to main menu Your choice: 8 ****** MULTIPLE ALIGNMENT MENU ****** 1. Do complete multiple alignment now (Slow/Accurate) 2. Produce guide tree file only 3. Do alignment using old guide tree file 4. Toggle Slow/Fast pairwise alignments = SLOW 5. Pairwise alignment parameters 6. Multiple alignment parameters 7. Reset gaps between alignments? = ON 8. Toggle screen display = OFF 9. Output format options S. Execute a system command H. HELP or press [RETURN] to go back to main menu Your choice: 1 To do a complete multiple alignment using default parameters, just type "1" and hit return . The user will be prompted for 2 file names (defaults are offered); one of these will be the same as the input file but with the extension ".aln". This is the alignment output file. The second one has the extension ".dnd" and this one will contain a description of a "guide tree" used to guide the multiple alignment. CLUSTAL W(1.4) multiple sequence alignment Enter a name for the CLUSTAL output file [repr.aln]: Enter a name for the NBRF/PIR output file [repr.pir]: Enter a name for the GCG output file [repr.msf]: Enter a name for the PHYLIP output file [repr.phy]: Enter a name for the GDE output file [repr.gde]: Enter name for GUIDE TREE file [repr.dnd]: Enter a name for the parameter output file [repr.par]: The alignment algorithm is as follows:

First all pairs of sequence are aligned using a fast approximate alignment method. A similarity score (percent identity) is calculated from each alignment between every pair of sequences.
These scores are used to calculate a dendogram i.e a tree which describes the approximate relationships of the sequences to each other.
Finally, the sequences are aligned in larger and larger groups according to the branching order in the tree.

During the alignment process, the program will print the scores of each fast 2 sequence alignments and then the scores of the progressive alignments as the multiple alignment is built up. The format of the ".dnd" file is described in the CLUSTALV documentation. Start of Pairwise alignments Aligning... Sequences (1:2) Aligned. Score: 17 Sequences (1:3) Aligned. Score: 21 Sequences (1:4) Aligned. Score: 16 Sequences (1:5) Aligned. Score: 27 Sequences (2:3) Aligned. Score: 29 Sequences (2:4) Aligned. Score: 20 Sequences (2:5) Aligned. Score: 14 Sequences (3:4) Aligned. Score: 12 Sequences (3:5) Aligned. Score: 11 Sequences (4:5) Aligned. Score: 12 Guide tree file created: [repr.dnd] Start of Multiple Alignment There are 4 groups Aligning... Group 1: Delayed Group 2: Delayed Group 3: Delayed Group 4: Delayed Sequence:2 Score:0 Sequence:3 Score:777 Sequence:1 Score:804 Sequence:5 Score:928 Sequence:4 Score:714 Alignment Score 1510 Consensus length = 218 CLUSTAL-Alignment file created [repr.aln] The alignment output file "repr.aln" looks like this: CLUSTAL W(1.4) multiple sequence alignment DNA binding domain -------------------- dica_ecoli METKNLTIGERIRYRRKNLKHTQRSLAKALKISHVSVSQWERGDSEPTGKNLFALSKVLQ immf_bpph1 --LDGKKLGALIKDKRKEKHLKQTEMAKALGMSRTYLSDIENGRYLPSTKTLSRIAILIN rpc_bpph1 -----MTVGQRIKAIRKERKLTQVQLAEKANLSRSYLADIERDRYNPSLSTLEAVAGALG rpc_bpp2_P ---MSNTISEKIVLMRKSEYLSRQQLADLTGVPYGTLSYYESGRSTPPTDVMMNILQTPQ rpc2_bpp22 --MNTQLMGERIRARRKKLKIRQAALGKMVGVSNVAISQWERSETEPNGENLLALSKALQ . * ** . . . .. * * . . dica_ecoli CSPTWILFGDE------------------------DKQPTPPVEKPVALSPKE-----LE immf_bpph1 -----LDLNVL------------------------KMTEIQVVEE--------------- rpc_bpph1 -----IQVSAI------------------------VGEETLIKEEQAEYNSKEEKDIAKR rpc_bpp2_P -----FTKYTL------------------------WFMTNQIAPE--------------- rpc2_bpp22 CSPDYLLKGDLSQTNVAYHSRHEPRGSYPLISWVSAGQWMEAVEPYHKRAIENWHDTTVD dica_ecoli LLE-------------LFNALPESEQDTQLAEMRAR------------------------ immf_bpph1 ------------------GGYDRAAGTCR------------------------------- rpc_bpph1 MEE-------------IRKDLEKSDGLSFSGEPMSQEAVESLMEAMEH------------ rpc_bpp2_P -----------------FGQIAPALAHFG------------------------------- rpc2_bpp22 CSEDSFWLDVQGDSMTAPAGLSIPEGMIILVDPEVEPRNGKLVVAKLEGENEATFKKLVM dica_ecoli --VKNFNKLFEELLKARQRTNKR--------------- immf_bpph1 ---RQAL------------------------------- rpc_bpph1 -IVRQTQRINKKYTPKKYRNDDQE-------------- rpc_bpp2_P -----QNETTSPHS--GQKTG----------------- rpc2_bpp22 DAGRKFLKPLNPQYPMIEINGNCKIIGVVVDAKLANLP "*" is used to indicate identical residues while "." is used to show positions where all of the sequences are "similar" to each other. Note the alignment obtained by the old version CLUSTALV, when using the default parameters, looks like : CLUSTAL V multiple sequence alignment DNA binding domain -------------------- DICA_ECOLI METKNLTIGERIRYRRKNLKHTQRSLAKALKISHVSVSQWERGDSEPTGKI MMF_BPPH1 LDGK--LLGALIKDKRKEKHLKQTEMAKALGMSRTYLSDIENGRYLPSTK RPC_BPPH1 MT-----VGQRIKAIRKERKLTQVQLAEKANLSRSYLADIERDRYNPSLS RPC_BPP2 MS-N--TISEKIVLMRKSEYLSRQQLADLTGVPYGTLSYYESGRSTPPTD RPC2_BPP22 MNTQLM--GERIRARRKKLKIRQAALGKMVGVSNVAISQWERSETEPNGE . . * ** . .. .. .. * * DICA_ECOLI NLFALSKVLQCSPTWILFGD--------------EDKQPTP--------- IMMF_BPPH1 TLSRIAILINLDLNVLKMTEIQVVEE-GGYD------------------- RPC_BPPH1 TLEAVAGALGIQVSAIVGEETLIKEEQAEYNSKEEKD----------IAK RPC_BPP2 VM----------MNILQTPQF------TKYT------------------- RPC2_BPP22 NLLALSKALQCSPDYLLKGD--LSQTNVAYHSRHEPRGSYPLISWVSAGQ . . . DICA_ECOLI -----------------------------------PVEKPVALS-PKELE IMMF_BPPH1 --------------------------------------RAAGTCRRQAL- RPC_BPPH1 RMEEI----RKDLEK---------SDGLSFS--GEPMSQEAVESLMEAME RPC_BPP2 ---------------------------LWFM--TNQIAPE--------FG RPC2_BPP22 WMEAVEPYHKRAIENWHDTTVDCSEDSFWLDVQGDSMTAPAGLSIPEGMI DICA_ECOLI LL----------ELFNALPESEQDTQLAEM-----RARVKNFNKLFE--- IMMF_BPPH1 -------------------------------------------------- RPC_BPPH1 HIVRQ---------------------------------TQRINKKYTPKK RPC_BPP2 QIAPA---------------------------------LAHFGQNETTSP RPC2_BPP22 ILVDPEVEPRNGKLVVAKLEGENEATFKKLVMDAGRKFLKPLNPQYPMIE DICA_ECOLI --------ELLKARQRTNKR IMMF_BPPH1 -------------------- RPC_BPPH1 YRND-------------DQE RPC_BPP2 HSGQ-------------KTG RPC2_BPP22 INGNCKIIGVVVDAKLANLP The parameters that control the pairwise and multiple alignments (including gap penalties, weight matrix etc.) can all be changed from the menu. One can also choose between 5 different output formats: default clustal output (like above); PIR output which is the same as the input format but with gaps indicated by "-"'s; GCG msf format and Phylip format for Joe Felsensteins phylogenetic inference package and the GDE (Genetic Data Environnment) format. These can also be set from the menu.

Choosing the PIR format output is useful because one can input these alignments again; the positions of gaps will be preserved. The repr.pir contains the output alignment in the pir format.
Choosing the PHYLIP format is useful when using the PHYLIP package for phylogenetic analysis. The repr.phy contains the output alignment in the phylip format. This output alignment can be used with the phylip programs.
Choosing GCG/msf format is useful when using the gcg lineup of pretty programs to manipulate the alignment and to establish the consensus sequence. The repr.msf contains the output alignment in the gcg msf format. This file might be used by some gcg programs as consensus, lineup, etc.. to determine the consensus sequence or to manipulate the alignment.
Choosing the GDE format is useful to manipulate the multiple alignment with the GDE editor. The repr.gde contains the output alignment in the gde format.

One may choose to ONLY produce the ".dnd" guide tree file; alternatively, one can use an old ".dnd" file. This is useful if one is using numerous (e.g. 100's) sequences where the ".dnd" file is time consuming to produce. The dendogram file ".dnd" looks like : ( ( dica_ecoli:0.33448, rpc2_bpp22:0.39144) :0.06486, ( immf_bpph1:0.33455, rpc_bpph1:0.37331) :0.05357, rpc_bpp2_P:0.43077); Its representation is here.

Fredj Tekaia tekaia@pasteur.fr Goto Multiple Alignment Tutorial