Multiple Sequence Alignments
CLUSTALW tutorial
Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994)
CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment
through sequence weighting, positions-specific gap penalties and weight matrix
choice. Nucleic Acids Research, 22:4673-4680.
CLUSTAL W version 1.4 dated September 23, 1994:
Clustal W is a general purpose multiple alignment program for DNA or proteins.
Clustalw is produced by Julie D. Thompson, Toby Gibson of European Molecular
Biology Laboratory, Germany and Desmond Higgins of European Bioinformatics
Institute, Cambridge, UK. Algorithmic
Access to the last documentation of Clustalw 1.06
Multiple alignments are carried out in 3 stages:
1. All pairs of sequences are aligned separately (pairwise alignments) in
order to calculate a distance matrix giving the divergence of each pair of
sequences;
2. A guide tree is constructed from the distance matrix ;
3. The sequences are progressively aligned according to the hiearchy in the
guide tree.
Clustalw program is a major update and rewrite of clustalv program.
Improvements over clustalv :
- Reads GDE, MSF and Clustal format alignments as input ;
- Individual weights are assigned to each sequence in a partial alignment in
order to downweight near-duplicate sequences and upweight the most divergent
ones ;
- Amino acid substitution matrices are varied at different alignment stages
according to the divergence of the sequences to be aligned ;
- Residue specific gap penalties and locally reduced gap penalties in
hydrophilic regions encourage new gaps in potential loop regions rather than
regular secondary structure ;
- Positions in early alignments where gaps have been opened receive locally
reduced gap penalties to encourage the opening up of new gaps at these
positions ;
- The similary scores between pairs of sequences are clustered according
to the Neighbor-Joining (NJ) method and no more according to the UPGMA one.
- The New Hampshire (nested parentheses) tree format is used as default
for all trees (compatible with the PHYLIP package).
- Enhanced facilities for addition of new sequences to old alignments.
Input data file
In this tutorial, it is assumed that the user has access to the GCG package
and the SwissProt protein sequence database. To extract the sequences,
one needs to create a text file (using an editor e.g. emacs under UNIX)
with the names of the sequences on each line. One can then use the tofasta
command of the GCG package to extract these sequences from the database and
put them into a single file in FASTA format. This format can be used as input
to CLUSTALV or CLUSTALW programs.
Note : Sequences submitted to CLUSTALW should all be in 1 file.
Six formats are accepted:
NBRF/PIR, EMBL/SwissProt, Pearson (Fasta), GDE, Clustal, GCG/MSF.
Among these the fasta format is by far the simplest one, if the orginal
sequences are raw data.
Readseq is a very
useful program to convert sequence formats.
In the gcg package there are some sequence exchange programs which allow conversion
of sequences to or from the gcg format.
The sequence data examples used in this tutorial are drawn from Desmond Higgins
course.
Data set 1 : Cysteine (Thiol) proteases
Create a text file called "cath.list" with the following lines:
sw:aleu_horvu
sw:cath_human
sw:cath_rat
sw:catl_human
sw:catl_rat
sw:cys1_dicdi
sw:papa_carpa
This is just a list of SwissProt sequence entry names. One needs to extract
them into a single file in FASTA format; this is done by using the
tofasta
command as follows:
% tofasta @cath.list
The program will ask to suggest a name for the output file; call it "cath.pep".
Helix-turn-Helix DNA binding/repressor proteins
Repeat the above exercise and create a file called "repr.list" which should
contain the following lines:
sw:dica_ecoli
sw:immf_bpph1
sw:rpc_bpph1
sw:rpc_bpp2
sw:rpc2_bpp22
Then extract the sequences as before with the tofasta command:
% tofasta @repr.list
and call the output file "repr.pep". This file is shown below.
reper.pep should look like this:
>dica_ecoli P06966 escherichia coli. repressor protein of division inhibition gene dicb. 7/89
METKNLTIGERIRYRRKNLKHTQRSLAKALKISHVSVSQWERGDSEPTGKNLFALSKVLQ
CSPTWILFGDEDKQPTPPVEKPVALSPKELELLELFNALPESEQDTQLAEMRARVKNFNK
LFEELLKARQRTNKR
>immf_bpph1 P13772 bacteriophage phi-105. immf control region 10 kd protein. 4/90
LDGKKLGALIKDKRKEKHLKQTEMAKALGMSRTYLSDIENGRYLPSTKTLSRIAILINLD
LNVLKMTEIQVVEEGGYDRAAGTCRRQAL
>rpc_bpph1 P06153 bacteriophage phi-105. immunity repressor protein. 8/92
MTVGQRIKAIRKERKLTQVQLAEKANLSRSYLADIERDRYNPSLSTLEAVAGALGIQVSA
IVGEETLIKEEQAEYNSKEEKDIAKRMEEIRKDLEKSDGLSFSGEPMSQEAVESLMEAME
HIVRQTQRINKKYTPKKYRNDDQE
>rpc_bpp2 P04132 bacteriophage p2. repressor protein c. 5/92
MSNTISEKIVLMRKSEYLSRQQLADLTGVPYGTLSYYESGRSTPPTDVMMNILQTPQFTK
YTLWFMTNQIAPEFGQIAPALAHFGQNETTSPHSGQKTG
>rpc2_bpp22 P03035 bacteriophage p22, and bacteriophage p21 (bacteriophage 21). repressor
protein c2. 6/94
MNTQLMGERIRARRKKLKIRQAALGKMVGVSNVAISQWERSETEPNGENLLALSKALQCS
PDYLLKGDLSQTNVAYHSRHEPRGSYPLISWVSAGQWMEAVEPYHKRAIENWHDTTVDCS
EDSFWLDVQGDSMTAPAGLSIPEGMIILVDPEVEPRNGKLVVAKLEGENEATFKKLVMDA
GRKFLKPLNPQYPMIEINGNCKIIGVVVDAKLANLP
Note that if the considered sequences are raw files, the easyest
way to submit them to clustalw, is to present them in 1 file and
in this format.
The exact rules for sequence input are in the ClustalV.doc (not yet
ClustalW.doc) documentation file;
- One can use UPPER or lower case and the sequences can be DNA or PROTEIN;
- No ambiguity codes are allowed; a symbol is either a valid amino acid or
nucleotide or will be considered unknown;
- All spaces, digits etc. will be ignored except for the hyphen character
"-" which is used to represent gaps;
- Each sequence name begins with an angle bracket ">";
- The text on the line that starts with the ">" will be ignored;
it is treated as free text;
- the sequence is on the following lines and the next sequence starts with
a ">" etc.
Running Clustalw
To start up CLUSTALW just type clustalw (may be preceeded by
the pathname) :
% clustalw
The following main menu will apear.
**************************************************************
******** CLUSTAL W(1.4) Multiple Sequence Alignments ********
**************************************************************
1. Sequence Input From Disc
2. Multiple Alignments
3. Profile Alignments
4. Phylogenetic trees
S. Execute a system command
H. HELP
X. EXIT (leave program)
Your choice: 1
Type "1", press return and the program will ask for the name of a sequence file;
enter "repr.pep" or "cath.pep". The program will list the names of the
sequences, their lengths and its guess as to whether proteins or dna are
being aligned.
Sequences should all be in 1 file.
6 formats accepted:
NBRF/PIR, EMBL/SwissProt, Pearson (Fasta), GDE, Clustal, GCG/MSF.
Enter the name of the sequence file: repr.pep
Sequence format is Pearson
Sequences assumed to be PROTEIN
Sequence 1: dica_ecoli 135 aa
Sequence 2: immf_bpph1 89 aa
Sequence 3: rpc_bpph1 144 aa
Sequence 4: rpc_bpp2_P 99 aa
Sequence 5: rpc2_bpp22 216 aa
**************************************************************
******** CLUSTAL W(1.4) Multiple Sequence Alignments ********
**************************************************************
1. Sequence Input From Disc
2. Multiple Alignments
3. Profile Alignments
4. Phylogenetic trees
S. Execute a system command
H. HELP
X. EXIT (leave program)
Your choice: 2
To do a multiple alignment, enter "2" and the following menu is displyed:
****** MULTIPLE ALIGNMENT MENU ******
1. Do complete multiple alignment now (Slow/Accurate)
2. Produce guide tree file only
3. Do alignment using old guide tree file
4. Toggle Slow/Fast pairwise alignments = SLOW
5. Pairwise alignment parameters
6. Multiple alignment parameters
7. Reset gaps between alignments? = ON
8. Toggle screen display = ON
9. Output format options
S. Execute a system command
H. HELP
or press [RETURN] to go back to main menu
Your choice: 5
This menu contains some new features with regard to the clustlv program
(4, 7, 8).
Option 5 shows the default parameters.
********* PAIRWISE ALIGNMENT PARAMETERS *********
Slow/Accurate alignments:
1. Gap Open Penalty :10.00
2. Gap Extension Penalty :0.10
3. Protein weight matrix :BLOSUM30
Fast/Approximate alignments:
4. Gap penalty :3
5. K-tuple (word) size :1
6. No. of top diagonals :5
7. Window size :5
8. Toggle Slow/Fast pairwise alignments = SLOW
H. HELP
Enter number (or [RETURN] to exit):
****** MULTIPLE ALIGNMENT MENU ******
1. Do complete multiple alignment now (Slow/Accurate)
2. Produce guide tree file only
3. Do alignment using old guide tree file
4. Toggle Slow/Fast pairwise alignments = SLOW
5. Pairwise alignment parameters
6. Multiple alignment parameters
7. Reset gaps between alignments? = ON
8. Toggle screen display = ON
9. Output format options
S. Execute a system command
H. HELP
or press [RETURN] to go back to main menu
Your choice: 6
Hit return to see the default multiple alignement parameters.
****** MULTIPLE ALIGNMENT PARAMETERS ******
1. Gap Opening Penalty :10.00
2. Gap Extension Penalty :0.05
3. Delay divergent sequences :40 %
4. Toggle Transitions (DNA) :Weighted
5. Protein weight matrix :BLOSUM series
6. Use negative matrix :OFF
7. Protein Gap Parameters
H. HELP
Enter number (or [RETURN] to exit):
****** MULTIPLE ALIGNMENT MENU ******
1. Do complete multiple alignment now (Slow/Accurate)
2. Produce guide tree file only
3. Do alignment using old guide tree file
4. Toggle Slow/Fast pairwise alignments = SLOW
5. Pairwise alignment parameters
6. Multiple alignment parameters
7. Reset gaps between alignments? = ON
8. Toggle screen display = ON
9. Output format options
S. Execute a system command
H. HELP
or press [RETURN] to go back to main menu
Your choice: 9
Hit 2 to get the output of aligned sequences NBRF/PIR format,
3 in gcg/msf format, 4 in the PHYLIP interleaved format,
or hit 5 to get the output in the GDE format.
********* Format of Alignment Output *********
1. Toggle CLUSTAL format output = ON
2. Toggle NBRF/PIR format output = OFF
3. Toggle GCG/MSF format output = OFF
4. Toggle PHYLIP format output = OFF
5. Toggle GDE format output = OFF
6. Toggle GDE output case = LOWER
7. Toggle output order = INPUT FILE
8. Create alignment output file(s) now?
9. Toggle parameter output = OFF
H. HELP
Enter number (or [RETURN] to exit): 9
If we want to get the output in all proposed formats, after hitting
2 then 3 then 4 then 5, each followed by return, the format of alignment
menu looks like the following :
********* Format of Alignment Output *********
1. Toggle CLUSTAL format output = ON
2. Toggle NBRF/PIR format output = ON
3. Toggle GCG/MSF format output = ON
4. Toggle PHYLIP format output = ON
5. Toggle GDE format output = ON
6. Toggle GDE output case = LOWER
7. Toggle output order = INPUT FILE
8. Create alignment output file(s) now?
9. Toggle parameter output = ON
H. HELP
Enter number (or [RETURN] to exit):
Hit return to get the multiple alignment menu :
****** MULTIPLE ALIGNMENT MENU ******
1. Do complete multiple alignment now (Slow/Accurate)
2. Produce guide tree file only
3. Do alignment using old guide tree file
4. Toggle Slow/Fast pairwise alignments = SLOW
5. Pairwise alignment parameters
6. Multiple alignment parameters
7. Reset gaps between alignments? = ON
8. Toggle screen display = ON
9. Output format options
S. Execute a system command
H. HELP
or press [RETURN] to go back to main menu
Your choice: 8
****** MULTIPLE ALIGNMENT MENU ******
1. Do complete multiple alignment now (Slow/Accurate)
2. Produce guide tree file only
3. Do alignment using old guide tree file
4. Toggle Slow/Fast pairwise alignments = SLOW
5. Pairwise alignment parameters
6. Multiple alignment parameters
7. Reset gaps between alignments? = ON
8. Toggle screen display = OFF
9. Output format options
S. Execute a system command
H. HELP
or press [RETURN] to go back to main menu
Your choice: 1
To do a complete multiple alignment using default parameters, just
type "1" and hit return . The user will be prompted for 2 file names
(defaults are offered); one of these will be the same as the input
file but with the extension ".aln". This is the alignment output file.
The second one has the extension ".dnd" and this one will contain a
description of a "guide tree" used to guide the multiple alignment.
CLUSTAL W(1.4) multiple sequence alignment
Enter a name for the CLUSTAL output file [repr.aln]:
Enter a name for the NBRF/PIR output file [repr.pir]:
Enter a name for the GCG output file [repr.msf]:
Enter a name for the PHYLIP output file [repr.phy]:
Enter a name for the GDE output file [repr.gde]:
Enter name for GUIDE TREE file [repr.dnd]:
Enter a name for the parameter output file [repr.par]:
The alignment algorithm is as follows:
- First all pairs of sequence are aligned using a fast approximate
alignment method. A similarity score (percent identity) is calculated
from each alignment between every pair of sequences.
- These scores are used to calculate a dendogram i.e a tree which
describes the approximate relationships of the sequences to each other.
- Finally, the sequences are aligned in larger and larger groups
according to the branching order in the tree.
During the alignment process, the program will print the scores of each
fast 2 sequence alignments and then the scores of the progressive
alignments as the multiple alignment is built up. The format of the
".dnd" file is described in the CLUSTALV documentation.
Start of Pairwise alignments
Aligning...
Sequences (1:2) Aligned. Score: 17
Sequences (1:3) Aligned. Score: 21
Sequences (1:4) Aligned. Score: 16
Sequences (1:5) Aligned. Score: 27
Sequences (2:3) Aligned. Score: 29
Sequences (2:4) Aligned. Score: 20
Sequences (2:5) Aligned. Score: 14
Sequences (3:4) Aligned. Score: 12
Sequences (3:5) Aligned. Score: 11
Sequences (4:5) Aligned. Score: 12
Guide tree file created: [repr.dnd]
Start of Multiple Alignment
There are 4 groups
Aligning...
Group 1: Delayed
Group 2: Delayed
Group 3: Delayed
Group 4: Delayed
Sequence:2 Score:0
Sequence:3 Score:777
Sequence:1 Score:804
Sequence:5 Score:928
Sequence:4 Score:714
Alignment Score 1510
Consensus length = 218
CLUSTAL-Alignment file created [repr.aln]
The alignment output file "repr.aln" looks like this:
CLUSTAL W(1.4) multiple sequence alignment
DNA binding domain
--------------------
dica_ecoli METKNLTIGERIRYRRKNLKHTQRSLAKALKISHVSVSQWERGDSEPTGKNLFALSKVLQ
immf_bpph1 --LDGKKLGALIKDKRKEKHLKQTEMAKALGMSRTYLSDIENGRYLPSTKTLSRIAILIN
rpc_bpph1 -----MTVGQRIKAIRKERKLTQVQLAEKANLSRSYLADIERDRYNPSLSTLEAVAGALG
rpc_bpp2_P ---MSNTISEKIVLMRKSEYLSRQQLADLTGVPYGTLSYYESGRSTPPTDVMMNILQTPQ
rpc2_bpp22 --MNTQLMGERIRARRKKLKIRQAALGKMVGVSNVAISQWERSETEPNGENLLALSKALQ
. * ** . . . .. * * . .
dica_ecoli CSPTWILFGDE------------------------DKQPTPPVEKPVALSPKE-----LE
immf_bpph1 -----LDLNVL------------------------KMTEIQVVEE---------------
rpc_bpph1 -----IQVSAI------------------------VGEETLIKEEQAEYNSKEEKDIAKR
rpc_bpp2_P -----FTKYTL------------------------WFMTNQIAPE---------------
rpc2_bpp22 CSPDYLLKGDLSQTNVAYHSRHEPRGSYPLISWVSAGQWMEAVEPYHKRAIENWHDTTVD
dica_ecoli LLE-------------LFNALPESEQDTQLAEMRAR------------------------
immf_bpph1 ------------------GGYDRAAGTCR-------------------------------
rpc_bpph1 MEE-------------IRKDLEKSDGLSFSGEPMSQEAVESLMEAMEH------------
rpc_bpp2_P -----------------FGQIAPALAHFG-------------------------------
rpc2_bpp22 CSEDSFWLDVQGDSMTAPAGLSIPEGMIILVDPEVEPRNGKLVVAKLEGENEATFKKLVM
dica_ecoli --VKNFNKLFEELLKARQRTNKR---------------
immf_bpph1 ---RQAL-------------------------------
rpc_bpph1 -IVRQTQRINKKYTPKKYRNDDQE--------------
rpc_bpp2_P -----QNETTSPHS--GQKTG-----------------
rpc2_bpp22 DAGRKFLKPLNPQYPMIEINGNCKIIGVVVDAKLANLP
"*" is used to indicate identical residues while "." is used to show positions
where all of the sequences are "similar" to each other.
Note the alignment obtained by the old version CLUSTALV, when using
the default parameters, looks like :
CLUSTAL V multiple sequence alignment
DNA binding domain
--------------------
DICA_ECOLI METKNLTIGERIRYRRKNLKHTQRSLAKALKISHVSVSQWERGDSEPTGKI
MMF_BPPH1 LDGK--LLGALIKDKRKEKHLKQTEMAKALGMSRTYLSDIENGRYLPSTK
RPC_BPPH1 MT-----VGQRIKAIRKERKLTQVQLAEKANLSRSYLADIERDRYNPSLS
RPC_BPP2 MS-N--TISEKIVLMRKSEYLSRQQLADLTGVPYGTLSYYESGRSTPPTD
RPC2_BPP22 MNTQLM--GERIRARRKKLKIRQAALGKMVGVSNVAISQWERSETEPNGE
. . * ** . .. .. .. * *
DICA_ECOLI NLFALSKVLQCSPTWILFGD--------------EDKQPTP---------
IMMF_BPPH1 TLSRIAILINLDLNVLKMTEIQVVEE-GGYD-------------------
RPC_BPPH1 TLEAVAGALGIQVSAIVGEETLIKEEQAEYNSKEEKD----------IAK
RPC_BPP2 VM----------MNILQTPQF------TKYT-------------------
RPC2_BPP22 NLLALSKALQCSPDYLLKGD--LSQTNVAYHSRHEPRGSYPLISWVSAGQ
. . .
DICA_ECOLI -----------------------------------PVEKPVALS-PKELE
IMMF_BPPH1 --------------------------------------RAAGTCRRQAL-
RPC_BPPH1 RMEEI----RKDLEK---------SDGLSFS--GEPMSQEAVESLMEAME
RPC_BPP2 ---------------------------LWFM--TNQIAPE--------FG
RPC2_BPP22 WMEAVEPYHKRAIENWHDTTVDCSEDSFWLDVQGDSMTAPAGLSIPEGMI
DICA_ECOLI LL----------ELFNALPESEQDTQLAEM-----RARVKNFNKLFE---
IMMF_BPPH1 --------------------------------------------------
RPC_BPPH1 HIVRQ---------------------------------TQRINKKYTPKK
RPC_BPP2 QIAPA---------------------------------LAHFGQNETTSP
RPC2_BPP22 ILVDPEVEPRNGKLVVAKLEGENEATFKKLVMDAGRKFLKPLNPQYPMIE
DICA_ECOLI --------ELLKARQRTNKR
IMMF_BPPH1 --------------------
RPC_BPPH1 YRND-------------DQE
RPC_BPP2 HSGQ-------------KTG
RPC2_BPP22 INGNCKIIGVVVDAKLANLP
The parameters that control the pairwise and multiple alignments (including gap
penalties, weight matrix etc.) can all be changed from the menu.
One can also choose between 5 different output formats:
default clustal output (like above); PIR output which is the same as the input
format but with gaps indicated by "-"'s; GCG msf format and Phylip format for
Joe Felsensteins phylogenetic inference package and the GDE (Genetic Data
Environnment) format. These can also be set from the menu.
- Choosing the PIR format output is useful because one can input
these alignments again; the positions of gaps will be preserved.
The repr.pir contains the output alignment in the pir
format.
- Choosing the PHYLIP format is useful when using the PHYLIP package for
phylogenetic analysis.
The repr.phy contains the output alignment in the phylip
format.
This output alignment can be used with the phylip programs.
- Choosing GCG/msf format is useful when using the gcg lineup of pretty
programs to manipulate the alignment and to establish the consensus sequence.
The repr.msf contains the output alignment in the gcg msf
format.
This file might be used by some gcg programs as consensus, lineup, etc..
to
determine the consensus sequence or to manipulate the alignment.
- Choosing the GDE format is useful to manipulate the multiple alignment
with the GDE editor.
The repr.gde contains the output alignment in the gde
format.
One may choose to ONLY produce the ".dnd" guide tree file; alternatively, one
can use an old ".dnd" file. This is useful if one is using numerous
(e.g. 100's) sequences where the ".dnd" file is time consuming to produce.
The dendogram file ".dnd" looks like :
(
(
dica_ecoli:0.33448,
rpc2_bpp22:0.39144)
:0.06486,
(
immf_bpph1:0.33455,
rpc_bpph1:0.37331)
:0.05357,
rpc_bpp2_P:0.43077);
Its representation is here.
Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994)
CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment
through sequence weighting, positions-specific gap penalties and weight matrix
choice. Nucleic Acids Research, 22:4673-4680.
Fredj Tekaia tekaia@pasteur.fr Goto
Multiple Alignment
Tutorial