######################################################################
RepeatMasker
Developed by Arian Smit and Robert Hubley
Please refer to: Smit, AFA, Hubley, R. & Green, P "RepeatMasker" at
http://www.repeatmasker.org
For RepeatMasker database, see information on RepBase
The interspersed repeat databases are modified versions of
those found in "RepBase Update" (http://www.girinst.org/)
######################################################################
RepeatMasker is a program that screens DNA sequences for interspersed
repeats and low complexity DNA sequences. The output of the program is
a detailed annotation of the repeats that are present in the query
sequence as well as a modified version of the query sequence in which
all the annotated repeats have been masked (default: replaced by
Ns). Sequence comparisons in RepeatMasker are performed by the program
cross_match, an efficient implementation of the Smith-Waterman-Gotoh
algorithm developed by Phil Green, or by WU-Blast developed by Warren
Gish.
This help file discusses the following topics:
0 Basic input and output
1 Options
1.1 Species and contamination check options
1.2 Options affecting which repeats get masked
1.3 Speed and search parameters
1.4 Output and formatting
1.5 ProcessRepeats options
2 Methodology and quality of output
2.1 Methodology
2.2 Scoring matrices
2.3 Databases
2.4 Sensitivity and speed
2.5 Selectivity and matches to coding sequences
2.6 Low complexity DNA and simple repeats
3 How to read the results
3.1 The annotation (.out) file
3.2 Alignments
3.3 The summary (.tbl) file
4 Applications
4.1 Use in database searches
4.2 Identification of DNA source and bacterial insertions
4.3 DateRepeats - Masking lineage-specific repeats for genomic alignments
4.4 Use with gene prediction programs and other applications
5 References
0 INPUT and OUTPUT
Input format:
Sequences have to be in FASTA format:
>sequencename all kind of info
AGCGATCGCATCGAGCGCATTCGCATGGGG
>sequencename2 all kind of info
GCCCATGCGATCGAGCTTCGCTAGCATAGCGATCA
The program tolerates minor errors in the FASTA format and accepts raw
sequence files, but does not work with other formats like GenBank,
Staden, etc.
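To check what a file looks like before submitting it, a minimal FASTA
reader can be sketched as below. This is only an illustration of the
format described above, not the parser RepeatMasker itself uses; the
function name read_fasta and the placeholder name "unnamed" for raw
sequence without a header line are assumptions.

```python
def read_fasta(lines):
    """Yield (name, description, sequence) for each record in FASTA text."""
    name, desc, chunks = None, "", []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if name is not None:
                yield name, desc, "".join(chunks)
            header = line[1:].split(None, 1)
            name = header[0] if header else ""
            desc = header[1] if len(header) > 1 else ""
            chunks = []
        else:
            if name is None:
                # raw sequence file without a header line (assumed label)
                name = "unnamed"
            chunks.append(line)
    if name is not None:
        yield name, desc, "".join(chunks)


example = """>sequencename all kind of info
AGCGATCGCATCGAGCGCATTCGCATGGGG
>sequencename2 all kind of info
GCCCATGCGATCGAGCTTCGCTAGCATAGCGATCA
"""
records = list(read_fasta(example.splitlines()))
for name, desc, seq in records:
    print(name, len(seq))
```

Everything after the first white space on the header line is treated
as free-text description, as in the example records above.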
You can use RepeatMasker on a file containing multiple FASTA format
sequences and on multiple sequence files at the same time:
RepeatMasker *.fasta
This command will mask all files that end with .fasta in the current
directory and give separate reports for each file. Note that if you
have multiple small sequences it is considerably faster to run
RepeatMasker on one batch file than on many single sequence files. The
summary file will be more informative as well. However, analysis on
single files (when larger than 2 kb each) can be slightly more
accurate, since GC levels for each sequence will be calculated and
used to choose appropriate parameters.
Standard output:
RepeatMasker returns a .masked file containing the query sequence(s)
with all identified repeats and low complexity sequences masked. These
masked sequences are listed and annotated in the .out file. The masked
sequences are printed in the same order as they are in the submitted
file, whereas the sequences are presented alphabetically in the
annotation table. The .tbl file is a summary of the repeat content of
the analyzed sequence.
1 OPTIONS
1.1 Species options
-species Indicate source species of query DNA
-lib [filename] Allows the use of a custom library
Contamination checking options
-is_only only clips E coli insertion elements out of FASTA and .qual files
-is_clip clips IS elements before analysis (default: IS only reported)
-no_is skips bacterial insertion element check
-rodspec only checks for rodent specific repeats (no RepeatMasker run)
-primspec only checks for primate specific repeats (no RepeatMasker run)
For detailed explanation of the contamination detection options, see
"4.2 Identification of DNA source" below.
-species
Interspersed repeats mostly are copies of transposable elements in
different states of erosion. Thus, dependent on the time of activity
of the source transposable element, interspersed repeats generally are
specific to a (clade of) species, and different repeat libraries are
needed for different query species. Species and clade names are taken
from the NCBI Taxonomy database
(http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html). In
principle, all unique clade names occurring in this database can be
used. Examples are:
-species "sus scrofa"
-species chimpanzee
-species arabidopsis
-species canidae
-species mammals
Capitalization is ignored; multiple words need to be enclosed in
quotation marks.
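When RepeatMasker is launched from a script, the quoting of
multi-word species names is easy to get wrong. The sketch below
builds a shell-safe command line; the helper name
repeatmasker_command and the file name mysequence.fasta are
hypothetical, and only the -species option itself is taken from this
help file.

```python
import shlex


def repeatmasker_command(species, fasta_files):
    """Return a shell-safe RepeatMasker command line as a string."""
    args = ["RepeatMasker", "-species", species, *fasta_files]
    # shlex.quote adds quotes only where the shell needs them
    return " ".join(shlex.quote(arg) for arg in args)


print(repeatmasker_command("sus scrofa", ["mysequence.fasta"]))
print(repeatmasker_command("canidae", ["mysequence.fasta"]))
```

The shell treats single and double quotes alike for this purpose, so
the quoted form produced here is equivalent to the double-quoted
example above.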
RepeatMasker builds one or more repeat consensus files the first time
a species/group has been chosen, or when a new database has been
downloaded. These will be written in a subdirectory of the Libraries
directory named after the date of the repeat database version and the
Latin name of the clade. For example, "-species monocotyledons"
creates the file
"..../RepeatMasker/Libraries/20040616/liliopsida/specieslib".
Currently, only for mammalian species multiple files are created,
bearing names like "shortcutlib" and "longlib", which the queries are
compared to sequentially.
The creation of these files takes some time (a few seconds sometimes),
but the next time RepeatMasker is run on the same species these
existing files will be used. When WU-BLAST is used as the search
engine (see 1.3), blastable libraries are built, again as a one time
event for each species.
After multiple database updates, the libraries could hog some space,
and you may consider deleting the older
"..../RepeatMasker/Libraries/" directories.
The files contain all repeats of the RepeatMasker database that have
been found in the genome of the given species, or have been found in a
related species and are thought to predate the speciation time of the
two. For example, -species gorilla will create a gorilla repeat file
that is almost as big as the human file, because almost all repeats in
human predate the 6-10 million years that separate us from the
gorilla, though none of the consensus sequences have been derived from
gorilla DNA. A repeat file for hyraxes, for which order no repeats
have been submitted to the database yet, will contain all repeats
found in the human genome that are thought to be older than the origin
of most mammalian orders.
If a group of species is indicated, all repeats are included that are
found in any species belonging to this clade. Thus, "-species diptera"
leads to comparison against repeats found in the genomes of any
diptera species, currently primarily represented by fruitfly and
mosquitoes, and "-species murinae" compares the query to all known
murine repeats, including rat and mouse.
Not all "common" English names occur in the taxonomy database. For
example, "chimp", "squirrels", "grasses", or "carnivores" are not
present. The program will suggest functional names using Soundex, with
oftentimes unexpected results. Using Latin names is always safest.
util/queryRepeatDatabase.pl
The script queryRepeatDatabase.pl in the util subdirectory of the
RepeatMasker directory allows you to check if a species is covered and
which repeats the query will be compared to if the species indication
is used. For example,
util/queryRepeatDatabase.pl -species sorghum -stat
shows that, besides the universal simple repeat and bacterial
insertion element contamination checks, the query will be compared to
only 4 sorghum-specific repeats (some of the many maize/corn-specific
repeats may also occur in sorghum, but this has not yet been studied).
Type queryRepeatDatabase.pl for further options with this script.
These are the numbers and bp of repeat consensus sequences (excluding
simple repeats and RNAs) as of October 2005 for a few better
represented species:
species                      # of consensi     total bp
Mammalian-wide                         339       472647
Primates *                             401       686251
Rodents *                              462       717119
Carnivores *                           101       149290
Marsupials *                           105       162839
Chicken                                 92       177063
Fugu (pufferfish)                       95       175420
Danio (zebrafish)                      214       459777
Ciona (two sea squirts)                127       304646
Strongylocentrotus                      86       124328
Drosophila                             214       611814
Anopheles                              280       721533
Caenorhabditis (2 spec)                354       525523
Arabidopsis                            457      1460558
Oryza (rice)                           525      1154651
Wheats (Triticeae)                     198       494718
Zea (maize)                             65       206724
Thalassiosira diatom                    40        96637
Chlamydomonas algae                     65       137363
* Excluding mammalian-wide elements.
-lib
The majority of species are of course not yet covered in the repeat
databases and many are far from complete, but you may have your own
collection. At other times you may want to mask or study only a
particular type of repeat.
For these types of situations, you can use the -lib option to
specify a custom library of sequences to be masked in the query. The
library file needs to contain sequences in FASTA format. Unless a full
path is given on the command line the file is assumed to be in the
same directory as the sequence file.
The recommended format for IDs in a custom library is:
>repeatname#class/subclass
or simply
>repeatname#class
In this format, the data will be processed (overlapping repeats are
merged etc), alternative output (.ace or .gff) can be created and an
overview .tbl file will be created. Classes that will be displayed in
the .tbl file are 'SINE', 'LINE', 'LTR', 'DNA', 'Satellite', anything
with 'RNA' in it, 'Simple_repeat', and 'Other' or 'Unknown' (the
latter defaults when class is missing). Subclasses are plentiful. They
are not all tabulated in the .tbl file or necessarily spelled
identically as in the repeat files, so check the RepeatMasker.embl
file for names that can be parsed into the .tbl file.
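The recommended ID format can be checked before a run with a few
lines of code. The sketch below splits a library header into name,
class and subclass, falling back to 'Unknown' when the class is
missing, as described above. The function name parse_library_id and
the repeat names in the example are hypothetical; how RepeatMasker
itself parses these headers is not shown here.

```python
def parse_library_id(header):
    """Split '>repeatname#class/subclass' into (name, class, subclass)."""
    text = header.lstrip(">").split()[0]  # drop '>' and any trailing description
    name, _, classification = text.partition("#")
    if not classification:
        # class missing: 'Unknown' is the default mentioned in the text
        return name, "Unknown", None
    repeat_class, _, subclass = classification.partition("/")
    return name, repeat_class, subclass or None


for header in (">MyRepeat1#LINE/L1", ">MyRepeat2#SINE", ">MyRepeat3"):
    print(parse_library_id(header))
```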
You can combine the repeats available in the RepeatMasker library
with a custom set of consensus sequences. To accomplish this
use the queryRepeatDatabase.pl tool provided in the util
directory of the RepeatMasker distribution. [ Running the program
without any options will print the documentation to the screen. ]
Use this tool to extract RepeatMasker sequences and concatenate
them to your custom sequences in a new library file.
1.2 Masking options (options that determine what kind of repeats are masked)
-cutoff [number] sets cutoff score for masking repeats when using -lib
(default cutoff 225)
-nolow does not mask low complexity DNA or simple repeats
-l(ow) same as nolow (historical)
-(no)int only masks low complexity/simple repeats (no interspersed repeats)
-alu only masks Alus (and 7SLRNA, SVA and LTR5)(only for primate DNA)
-div [number] masks only those repeats that are less than [number] percent
diverged from the consensus sequence
-cutoff
When using a local library you may want to change the minimum score
for reporting a match. The default is 225. Lowering it below 200 will
usually start to give you significant numbers of false matches, while
raising it to 250 will guarantee that all matches are real. Note that
low complexity regions in otherwise complex repeat sequences in your
library are most likely to give false matches.
-nolow / -l(ow)
With the option -nolow or -l(ow) only interspersed repeats are
masked. By default simple tandem repeats and low complexity
(polypurine, AT-rich) regions are masked besides the interspersed
repeats. For database searches the default setting is recommended, but
sometimes, e.g. when using the masked sequence to predict the presence
of exons, it may be better to skip the low complexity masking.
-noint / -int
When using the -noint or -int option only low complexity DNA and
simple repeats will be masked in the query sequence.
Inexact simple repeats may be spanned and hidden by an interspersed
repeat annotation. In particular, most A-rich simple repeats derived
from the poly A tails of SINEs and LINEs are merged with the
annotation of the SINE or LINE (i.e. you can't tell there is a simple
repeat). Thus, if you're interested in finding the location of
potentially polymorphic simple repeats, this option is recommended.
-norna
Because of their close similarity to SINEs and the abundance of some
of their pseudogenes, RepeatMasker by default screens for matches to
small pol III transcribed RNAs (mostly tRNAs and snRNAs). When you're
interested in small RNA genes, you should use the -norna option that
leaves these sequences unmasked, while still masking SINEs.
-alu
-div
You can limit the masking and annotation to (primate) Alu repeats with
the -alu option and to a subset of less diverged (younger) repeats
with the option -div. For example,
"RepeatMasker -div 20 -mus mysequence"
will mask only those rodent repeats and simple repeats that are less
than 20% diverged from the consensus sequence and
"RepeatMasker -div 10 -alu mysequence"
will mask Alus that are less than 10% diverged from the Alu consensus
sequences and no other repeats.
The -div option may be used to limit the masking to those repeats that
are specific to a species group for use in subsequent comparison of
orthologous genomic loci. Notice that a more sophisticated method to
mask lineage-specific repeats (currently only in mammals) is now
available with the script DateRepeats (4.3).
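The effect of -div can be pictured as a simple filter on the percent
divergence of each match. The sketch below assumes matches are
already available as dictionaries with a "div" field; that data
layout and the divergence values are made up for illustration and are
not the RepeatMasker output format.

```python
def filter_by_divergence(matches, max_div):
    """Keep matches that are less than max_div percent diverged."""
    return [m for m in matches if m["div"] < max_div]


# Illustrative values only, not real RepeatMasker output
matches = [
    {"repeat": "AluY", "div": 4.2},
    {"repeat": "AluJb", "div": 17.5},
    {"repeat": "MIR", "div": 28.0},
]
young = filter_by_divergence(matches, 20)
print([m["repeat"] for m in young])
```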
1.3 Options affecting speed and search parameters
-q Quick search; 5-10% less sensitive, 3-4 times faster than default
-qq Rush job; about 10% less sensitive,
-s Slow search; 0-5% more sensitive, 2.5 times slower than default.
-pa(rallel) [number]
Number of processors to use in parallel (only works for
batch files or sequences larger than 50 kb)
-engine [crossmatch|wublast|decypher]
Select a non-default search engine to use. If not specified
RepeatMasker will use the default configured at install time.
-w(ublast) Use WU-blast, rather than cross_match as engine
**DEPRECATED** Use -engine [crossmatch|wublast|decypher] now.
-frag [number] Maximum sequence length masked without fragmenting
(default 40000).
-maxsize [nr] Maximum length for which IS- or repeat clipped sequences
can be produced (default 4000000). Memory requirements go
up with higher maxsize.
-gc [number] Use matrices calculated for 'number' percentage background
GC level.
-gccalc Program calculates the GC content even for batch files/small
sequences.
-nocut Skips the steps in which repeats are excised.
-noisy Prints cross_match progress report to screen (defaults to
.stderr file)
-s -q -qq
RepeatMasker can be run at four different sensitivity/speed levels,
with the option -q providing quick (less sensitive) and -s slow
(sensitive) results compared to default. The option -qq has been added
for when you're in a frightful hurry. Each higher gear is about 2-3
times faster, and 90% as sensitive as the next lower gear. See "2.4
Sensitivity and Speed" below for details.
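As a rough worked example of this rule of thumb, the sketch below
compounds the per-gear factors relative to the default setting. The
speed-up of 2.5 per gear is simply the midpoint of the 2-3 range
quoted above and is an assumption, as is the ordering of the gears in
the LEVELS table; the results are rough estimates, not measurements.

```python
# Number of gears above (+) or below (-) the default setting
LEVELS = {"-s": -1, "default": 0, "-q": 1, "-qq": 2}


def relative_to_default(option, speedup_per_gear=2.5, sensitivity_per_gear=0.9):
    """Return (speed, sensitivity) relative to a default run."""
    gears = LEVELS[option]
    return speedup_per_gear ** gears, sensitivity_per_gear ** gears


for option in LEVELS:
    speed, sensitivity = relative_to_default(option)
    print(f"{option:8s} speed x{speed:.2f}  sensitivity x{sensitivity:.2f}")
```

Under these assumptions a -qq run comes out at roughly six times the
default speed with about 80% of the default sensitivity. The option
table in this section gives somewhat different figures for the
individual settings, so treat the rule as a coarse guide only.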
-w(ublast)
**DEPRECATED** See -engine.
-engine [crossmatch|wublast|decypher]
By default, RepeatMasker uses the search engine configured
during installation as the default. To use the non-default
search engine you can specify it with the -engine parameter.
Before June 2004, the script MaskerAid (written by Joey Bedell, Ian
Korf and Warren Gish at the St Louis Washington University Genome
Center) was necessary to use WU-BLAST with RepeatMasker, but that
functionality is now built in. RepeatMasker includes a search engine
object that allows relatively straightforward integration of other
search engines. Currently only WU-BLAST has the flexibility to accept
all cross_match options.
For longer sequences, default RepeatMasker runs with WU-BLAST take
about as long as cross_match powered runs at -qq settings (see "2.4
Sensitivity and speed"). The speed settings have relatively little
effect on the speed when using WU-BLAST, with the fastest settings
1.25-1.75 times as fast as the slowest settings, while the sensitivity
increases significantly at the slower settings. Thus, I recommend
always running RepeatMasker in sensitive (-s) or default mode when
using WU-BLAST. I've made the difference in parameters between
sensitive and default settings larger at -w settings, to make these
speed options more meaningful and gain more sensitivity (with little
cost in speed).
Even with these more extreme parameters, the sensitivity can't quite
reach that of the sensitive settings using cross_match, but it comes
very close, and the huge difference in speed makes this option
very attractive.
The output format with the -w option is identical to default and
scores are comparable, as the same complexity adjustment is applied.
The only difference is that, when using the wublast option, hyphens
in the sequence are retained (in default mode all non-letters were
deleted from the sequence). WU-BLAST uses hyphens to indicate
insurmountable barriers and alignments will not span hyphens.
-pa(rallel)
For sequences over 50 kb long or files with multiple sequences,
RepeatMasker can use multiple processors. When you type:
RepeatMasker -par 10
a batch file of sequences will run with up to 10 sequences at a
time, until all sequences are done, while a file with one large
sequence will analyze the sequence in up to 10 fragments at the same
time. The minimum fragment size is 25 or 33 kb, the maximum 66 kb (all
sequences over 100 kb are divided in 33-66 kb fragments). For the
batch files no minimum size exists. Thus,
If the file contains:       RM runs in parallel:
one 60 kb sequence          two 30 kb fragments
one 400 kb sequence         ten 40 kb fragments
one 1 Mb sequence           ten 50 kb fragments, twice
ten 500 bp sequences        ten 500 bp sequences
two 500 kb sequences        ten 50 kb fragments, twice
Processing of the detected matches takes place after all batches or
fragments have been cross-matched with the databases.
Beware that, generally, you have a limited number of process IDs
(PIDs) allotted. RepeatMasker uses 4 PIDs for each parallel job, so if
you're allotted 64 user PIDs, you can 'only' run 16 fragments/batches
in parallel.
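The arithmetic behind this limit is simple enough to put in a helper
when choosing a value for -pa. The function name max_parallel_jobs is
hypothetical; the figure of 4 PIDs per job is the one quoted above.

```python
def max_parallel_jobs(allotted_pids, pids_per_job=4):
    """Largest number of parallel fragments/batches the PID limit allows."""
    return allotted_pids // pids_per_job


print(max_parallel_jobs(64))
```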
-frag
Even when the -par option is not used, RepeatMasker transparently
fragments sequences over 40 kb into fragments of equal sizes with 1 kb
overlaps. Similarly, sequence batches containing more than 51 kb are
subdivided in batches of 40 kb or less. The -frag option sets the
maximum fragment and batch size.
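One way to picture the fragmentation is the sketch below, which cuts a
sequence into the smallest number of near-equal pieces that fit the
maximum fragment size, with each piece overlapping the next by 1 kb.
This is an assumption about how such a split could be computed, not
the code RepeatMasker runs, so the exact boundaries it chooses may
differ. Coordinates are 0-based with an exclusive end.

```python
import math


def fragment_ranges(seq_len, max_frag=40000, overlap=1000):
    """Return (start, end) ranges of near-equal, overlapping fragments."""
    if seq_len <= max_frag:
        return [(0, seq_len)]  # short sequences are not fragmented
    # Fewest fragments such that each one, overlap included, fits max_frag
    n = math.ceil((seq_len - overlap) / (max_frag - overlap))
    step = math.ceil((seq_len - overlap) / n)  # distance between fragment starts
    ranges = []
    for i in range(n):
        start = i * step
        end = min(start + step + overlap, seq_len)
        ranges.append((start, end))
    return ranges


print(fragment_ranges(100000))
```

With the default values a 100 kb sequence is cut into three 34 kb
fragments, each sharing 1 kb with its neighbor.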
The only visible effect of the fragmentation is in the alignment
files, where alignments at the edges of the fragments can be
duplicated and/or truncated. The 1 kb overlap between fragments
almost guarantees that there is no loss in sensitivity at the
edges. Fragmentation initially was implemented to allow the size of
sequences and sequence batches to be unlimited. Cross_match can be
very memory intensive when SW alignments have to be performed in large
matrices. This may happen with short minmatch and large bandwidth
settings. Note that RepeatMasker should not croak when cross_match
runs out of memory; it will redo the failed search with a higher word
length or smaller bandwidth until it succeeds. However, this will lead
to gradually less sensitive comparisons.
Fragmentation also can improve repeat detection when a genomic
sequence contains large regions of DNA with significantly different GC
levels (isochores), since sets of scoring matrices are chosen based on
the GC level of a fragment.
Since April 2002 the maximum fragment size is hardwired to be half of
"maxsize" (see below).
-maxsize
To limit the memory requirements of the script an upper boundary to
the amount of sequence stored in a single array in the script is set
to 4 million bp. This parameter can be reduced with the -maxsize
option to a minimum of 500000, for severely memory-impaired computers.
The size of maxsize further determines the largest length single
sequence from which E. coli insertion sequences and full-length
repeats can be clipped. Increase the size of maxsize to allow removal
of IS elements from larger sequences, like:
RepeatMasker -is_clip -maxsize 9999999999 muntjakchromosome1
-gc
-gccalc
Neutral mutation patterns differ significantly depending on the GC
richness of a locus and we have calculated optimal scoring matrices
for the alignment to consensus sequences in a range of background GC
levels (see 2.2). Usually, RepeatMasker calculates the percentage of
the sequence consisting of Gs and Cs and uses the appropriate
matrices. However, the program defaults to using 'average' 43% GC
matrices when the query is shorter than 2000 bp or a batch file is
analyzed. This is because short sequences can diverge greatly from the
GC level of the locus. For example, CpG islands and exons are more GC
rich than the surrounding DNA, whereas a LINE-1 element can be more AT
rich than the background. In a batch file, RepeatMasker analyses all
sequences together with the same matrices. The percentage GC in all
the sequences combined may be inappropriate for some sequence entries;
using high GC level matrices in AT rich sequences (and vice versa) may
result in false masking.
One can override this behavior in two ways:
With the option -gc you can set the GC level to a certain percentage:
RepeatMasker -gc 37 mybatchofsequences.fa
lets the program use matrices appropriate for 37% GC background. The
batch could, for example, contain ESTs from a single locus with a
known GC level.
Alternatively, the -gccalc option forces RepeatMasker to use the
actual GC level of a short sequence or the average GC level of a batch
of sequences. The latter sequences, for example, may be contigs in a
sequencing project.
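The choice of GC level described above can be summarized in a short
sketch. The function names, the decision to count only A, C, G and T
in the denominator, and the use of the pooled GC level for a batch
under -gccalc are assumptions made for illustration; only the 43%
default, the 2000 bp threshold and the meaning of the two options
come from this help file.

```python
def gc_percent(seq):
    """Percentage of G and C among the unambiguous bases of seq."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return 100.0 * gc / acgt if acgt else 0.0


def background_gc(sequences, gc=None, gccalc=False, default=43.0, min_len=2000):
    """Pick the GC level used to select scoring matrices (sketch)."""
    if gc is not None:
        return float(gc)  # -gc: user-supplied level wins
    total_len = sum(len(s) for s in sequences)
    is_batch = len(sequences) > 1
    if not gccalc and (is_batch or total_len < min_len):
        return default  # short query or batch file: 'average' matrices
    return gc_percent("".join(sequences))  # measured level (-gccalc or long query)


print(background_gc(["ACGT" * 10]))               # short query
print(background_gc(["GGCC" * 10], gccalc=True))  # forced calculation
print(background_gc(["AT" * 50], gc=37))          # explicit -gc value
```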
-nocut
The option -nocut skips a step in the default procedure for human and
rodent queries, in which full-length younger inserts are spliced out of
the query to reconstruct a pre-insertion situation. RepeatMasker is
generally more sensitive and efficient including the deletion step as
it can unearth older repeats that were interrupted by these younger
elements.
1.4 Output options
-a shows the alignments in a .align output file; -ali(gnments) also works
-inv alignments are presented in the orientation of the repeat (with option -a)
-cut saves a sequence (in file.cut) from which full-length repeats are excised
(temporarily dysfunctional)
-small returns complete .masked sequence in lower case
-xsmall returns repetitive regions in lowercase (rest capitals) rather than masked
-x returns repetitive regions masked with Xs rather than Ns
-poly reports simple repeats that may be polymorphic (in file.poly)
-ace creates an additional output file in ACeDB format
-gff creates an additional output file in General Feature Format (GFF)
-u creates an untouched annotation file besides the manipulated file
-xm creates an additional output file in cross_match format (for parsing)
-fixed creates an (old style) annotation file with fixed width columns
-no_id leaves out final column with unique ID for each element
-e(xcln) calculates repeat densities (in .tbl) excluding runs of >25 Ns in query
-noisy prints cross_match progress report to screen (defaults to .stderr file)
-a / -ali(gnments)
-inv
Alignments are saved in a .align file when using the option -a. They
are shown in the orientation of the query sequence, unless you use the
option -inv as well, which will return alignments in the orientation
of the repeats (see 3.2 Alignments).
-cut
The -cut option to RepeatMasker is not supported in this release. It
will be rolled into a new annotation utility in the near future. If
you need this functionality sooner please send an email to Robert
Hubley ( rhubley@systemsbiology.org ). Thanks for your patience.
The option made the program save a file "file.cut" which contains
an intermediate sequence in the masking progress. In this sequence all
full-length elements, young LINE-1 3' ends, and close to perfect simple
repeats were deleted.
-x
When -x is used the repeat sequences are replaced by Xs instead of
Ns. This allows one to distinguish the masked areas from
possibly existing ambiguous bases in the original sequence. However,
when running BLAST searches (and maybe other programs) Xs are deleted
out of the query and the returned BLAST matches will have position
numbers not necessarily corresponding to that of the original
sequence.
-xsmall
When the option -xsmall is used a sequence is returned in the .masked
file in which repeat regions are in lower case and non-repetitive
regions are in capitals.
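The three masking styles (default Ns, Xs with -x, lower case with
-xsmall) differ only in what is written over the repeat positions.
The sketch below illustrates this on a toy sequence; the function
name mask and the 0-based, end-exclusive coordinates are assumptions,
and the code is not taken from RepeatMasker.

```python
def mask(seq, repeats, style="N"):
    """Mask (start, end) repeat ranges with 'N', 'X', or lower case ('xsmall')."""
    # For xsmall, non-repetitive sequence is returned in capitals
    out = list(seq.upper()) if style == "xsmall" else list(seq)
    for start, end in repeats:  # 0-based, end-exclusive (assumed convention)
        for i in range(start, min(end, len(out))):
            out[i] = out[i].lower() if style == "xsmall" else style
    return "".join(out)


query = "ACGTACGTAC"
print(mask(query, [(2, 5)]))            # default: Ns
print(mask(query, [(2, 5)], "X"))       # -x
print(mask(query, [(2, 5)], "xsmall"))  # -xsmall
```

Because the lower case style keeps the bases themselves, the original
sequence can be recovered from an -xsmall file, which is not possible
after masking with Ns or Xs.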
-poly
You can get a list of potentially polymorphic microsatellites with the
option -poly. This is simply a subset of the list in .out, with
dimeric to tetrameric repeats less than 10% diverged from perfection.
-xm
When using the -xm option an additional output file (.out.xm) is
created that contains the same information as the .out file (excluding
the low-complexity/simple DNA), but then in the original cross_match
format. This output is harder to read but there are programs that
require the exact cross_match output format.
-u
The script ProcessRepeats adjusts the original RepeatMasker output so
that the annotation more closely reflects reality. With the option -u
a .ori.out file is created that contains the original (but sorted)
cross_match summary lines.
-ace
With the -ace option the script creates an .ace file. This is merely a
suggestion. The columns in the table currently are:
Motif_homol RepeatMasker(method)
-gff
The script creates a .gff file with the annotation in 'General Feature
Format' (GFF). See http://www.sanger.ac.uk/Software/GFF for
details. The current output follows a Sanger convention:
RepeatMasker Similarity
. Target "Motif:"
In this line, 'RepeatMasker' becomes 'RepeatMasker_SINE' if the match
is against an Alu. I don't know why.
-fixed
Since April 1999 the column widths in the annotation table are
adjusted to the maximum length of any string occurring in a column;
this allows long sequence names to be spelled out completely.
Previously, a fixed column width table was returned, which can still
be obtained by using the -fixed option. Parsing should not be affected
by this change of default behavior, as the same number of columns with
the same formatted text are still separated by white space.
-no_id
Since September 2000 a column displaying a unique number (ID) for each
integrated element is printed by default. This used to be optional
(-id). Fragments of a single element, separated from each other by
subsequent insertions of other elements, deletions or recombinations,
carry the same number. This feature allows better interpretation of
the data and should greatly help proper graphical display of the
repeats.
The column follows all other columns, except for the (rare) indication
that an annotation overlaps another annotation (*). This change, which
was announced in the previous release, should not hinder most parsing
scripts. If it causes problems, the old format can be retrieved with
the option -no_id.
-excln
The percentages displayed in the .tbl file are calculated using a
total sequence length excluding runs of 25 Ns or more. This is useful
when analyzing draft sequences that are often concatenated contigs
separated by (sometimes very) long stretches of Ns. This option can
be used with ProcessRepeats as well. The number of Ns in long runs in
the query are apparent in the .tbl file, and you only need to run
ProcessRepeats with the option on the .cat file.
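The denominator used with -excln can be illustrated as follows. The
sketch removes every run of at least 25 Ns from the length before a
density is calculated. The function name effective_length is
hypothetical, and whether a run of exactly 25 Ns counts as 'long' is
an assumption based on the wording of this paragraph.

```python
import re


def effective_length(seq, min_run=25):
    """Sequence length excluding runs of min_run or more Ns."""
    long_runs = re.findall(r"[Nn]{%d,}" % min_run, seq)
    return len(seq) - sum(len(run) for run in long_runs)


# Two contigs joined by 30 Ns, followed by a short run of 10 Ns
draft = "ACGT" * 10 + "N" * 30 + "ACGT" * 5 + "N" * 10
print(len(draft), effective_length(draft))

masked_bp = 14  # hypothetical number of masked bases
print(round(100.0 * masked_bp / effective_length(draft), 1))
```

The short run of 10 Ns stays in the total, since only long stretches
are taken to be gaps between contigs.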
-noisy
RepeatMasker used to print the voluminous cross_match progress reports
to the screen. Since the Dec 1998 version this output is stored in a
.stderr file and a more informative much smaller progress report is
printed to the screen. The option -noisy allows one to see the
cross_match reports coming by on the screen (yeah).
1.5 ProcessRepeats options
When you have already run RepeatMasker and want to recreate the .out
or .tbl file, you only need to rerun ProcessRepeats on the .cat
file(s), which will take just a small fraction of the time required to
rerun RepeatMasker. Such a situation can occur when you've
accidentally deleted the .out or .tbl file or want additional or
differently formatted output files. Note that alignment files
cannot be created unless RepeatMasker was run with the -a option and
that the original .tbl and .out file will be overwritten unless you
rename them.
ProcessRepeats -species mus -nolow -gff -excln myhumongousmousesequence.cat
Repeat matches are processed differently for different query species,
so the -species mus option is necessary. With the -nolow option, the
.out file will not contain information on simple repeats and low
complexity DNA anymore. The -gff option creates an additional output
file in GFF format, and the -excln option displays the density of
repeats in the .tbl file as a percentage of those bp that are not
contained in long stretches of Ns.
The options/flags for ProcessRepeats are:
-species Same as for the RepeatMasker script
-lib skips most of processing, does not produce a .tbl file unless the
custom library is in the >name#class format.
-nolow does not display simple repeats or low_complexity DNA in the annotation
-noint skips steps specific to interspersed repeats, saving lots of time
-u creates an untouched annotation file besides the manipulated file
-xm creates an additional output file in cross_match format (for parsing)
-ace creates an additional output file in ACeDB format
-gff creates an additional output file in General Feature Format (GFF)
-poly creates an output file listing only potentially polymorphic simple repeats
-no_id leaves out final column with unique number for each element (was default)
-fixed creates an (old style) annotation file with fixed width columns
-excln calculates repeat densities excluding long stretches of Ns in the query
-orf2 results in sometimes negative coordinates for L1 elements; all L1 subfamilies
are aligned over the ORF2 region, sometimes improving interpretation of data
-a shows the alignments in a .align output file
2 METHODOLOGY AND QUALITY OF OUTPUT
2.1 Methodology
RepeatMasker compares the query sequence against one or more files of
FASTA sequences. The sequences in the libraries provided with
RepeatMasker are consensus sequences derived from alignment of
multiple copies of interspersed or satellite repeats. For interspersed
repeats, a consensus tends to approach the sequence of the
transposable element from which the repeat is derived.
Both cross_match and WU-blast perform their Smith-Waterman (SW)
alignments by first identifying exact word matches and restricting the
alignment to a band or matrix surrounding this exact
match(es). Overlapping matrices are merged. The speed settings of
RepeatMasker are purely changes in the minimum word length from which
an alignment can be seeded and, in some cases, changes in the width of
the band. A wider bandwidth allows more gaps in the alignment and,
more importantly, increases the likelihood that neighboring matrices
overlap.
Cross_match does a low complexity adjustment of the raw SW score. When
WU-blast is used, the RepeatMasker script performs this adjustment. Low
complexity matches are the primary cause of false matches, and this
adjustment contributes significantly to the high selectivity of
RepeatMasker (see 2.5).
As a result of the existence of many related consensus sequences in
the database, usually multiple repeats match one region in the query
at the same time. Generally, cross_match and WU-blast report to the
script only those matches that are less than 80-90% overlapped by a
higher scoring match. This implies that, at first approximation, names
are assigned to repeats based on the highest SW score. Given
appropriate consensus sequences and alignment parameters, this is
intuitively correct as well. However, the scripts have a lot of code
to improve on this first approximation, primarily to deal with partial
matches.
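The first approximation described above, in which a match is dropped
when a higher scoring match covers most of it, can be sketched as
below. The data layout, the fixed 80% threshold and the function name
report_matches are assumptions; the search engines and scripts use
more elaborate logic, particularly for partial matches. The repeat
names, coordinates and scores in the example are made up.

```python
def report_matches(matches, max_overlap=0.8):
    """Drop matches overlapped >= max_overlap by a higher scoring match."""
    ranked = sorted(matches, key=lambda m: m["score"], reverse=True)
    reported = []
    for i, m in enumerate(ranked):
        length = m["end"] - m["start"]
        covered = False
        for better in ranked[:i]:  # all higher scoring matches
            shared = min(m["end"], better["end"]) - max(m["start"], better["start"])
            if shared >= max_overlap * length:
                covered = True
                break
        if not covered:
            reported.append(m)
    return reported


# Illustrative values only
matches = [
    {"repeat": "AluSx", "start": 110, "end": 390, "score": 1800},
    {"repeat": "AluY", "start": 100, "end": 400, "score": 2000},
    {"repeat": "L1", "start": 350, "end": 700, "score": 900},
]
print([m["repeat"] for m in report_matches(matches)])
```

Here the lower scoring Alu match lies entirely within the higher
scoring one and is dropped, while the L1 match shares only a small
part of its length with the Alu and is kept.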
The cut-off SW score above which matches are reported is empirically
derived (see '2.5 selectivity' below). Note that there is no cut-off
divergence level; reported matches can be less than 60% identical.
The alignment parameters (substitution matrices, and gap initiation
and extension penalties) are derived from data harbored in multiple
alignments of a special subset of interspersed repeats. The derived
matrices are theoretically optimal for a series of conditions (see
below). The gap penalties are sub-optimal, primarily because gap
lengths have a non-linear distribution and are poorly represented by a
single gap-extension penalty.
For primate, rodent and other mammalian DNA, the query is compared to
consecutive subsets of repeat libraries. For primates, perfect simple
repeats, full-length Alus, full-length short interspersed repeats, and
young L1 3' ends are first (and in that order) clipped from the
sequence to expose underlying older elements. Subsequently, the query
is compared to most repeats, a set of ancient elements under
especially sensitive settings, a large set of long retroviral
sequences under faster settings (to save time), and AT-rich L1 3' ends
that may have been discarded earlier as low complexity
matches. Finally, simple repeats and low complexity regions are
masked.
2.2 Scoring matrices
We have calculated statistically optimal scoring matrices for the
alignment of neutrally diverging (non-selected) sequences in human DNA
to their original sequence. These matrices have been in use since the
May 1998 release. The matrices were derived from alignments of DNA
transposon fossils to their consensus sequences. A series of different
matrices are used dependent on the divergence level (14-25%) of the
repeats and the background GC level (35-53%, neutral mutation patterns
differ significantly in different isochores).
These matrices are (close to) optimal for human genomic sequences
longer than 10 kb, for which length the GC level usually is
representative of the isochore in which the sequence lives. However,
the GC level of small fragments can diverge a lot from their surroundings
(e.g. a fragment spanning a CpG island, a GC rich exon or an AT-rich
LINE-1 element) and RepeatMasker defaults to using matrices derived for
a 43% GC background when a sequence is shorter than 2000 bp or when a
batch file is submitted. When the appropriate background GC level is
known, this can be entered with the -gc option.
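The rule for choosing the background GC level can be summarized in a
short Python sketch. The function names and arguments are hypothetical
and are not part of RepeatMasker; the sketch only restates the
decisions described above (an explicit -gc value, the 43% default for
short sequences and batch files, the measured level otherwise) and
does not attempt to pick an actual matrix.

```python
def gc_percent(seq: str) -> float:
    """GC level in percent, counting only A, C, G and T."""
    s = seq.upper()
    gc = s.count("G") + s.count("C")
    at = s.count("A") + s.count("T")
    return 100.0 * gc / (gc + at) if gc + at else 0.0


def background_gc(seq, is_batch=False, user_gc=None,
                  default_gc=43.0, min_length=2000):
    """Return the GC level (percent) that the matrices would be chosen for."""
    if user_gc is not None:
        # corresponds to the user supplying a value with -gc
        return float(user_gc)
    if is_batch or len(seq) < min_length:
        # short fragments and batch files fall back to the 43% default
        return default_gc
    return gc_percent(seq)


long_seq = "GC" * 500 + "AT" * 1500   # 4000 bp, 25% GC
print(background_gc("GC" * 10))               # short sequence
print(background_gc(long_seq))                # measured level
print(background_gc(long_seq, is_batch=True)) # batch file
```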
(Note that these matrices are an integral portion of RepeatMasker and
are covered under the same restrictions as the scripts and databases
as described in the signed software agreement).
2.3 Repeat databases
The interspersed repeat databases provided in the RepeatMasker package
are maintained in synch with the repeat databases (RepBase Update)
copyrighted by the Genetic Information Research Institute
(G.I.R.I.). Whereas non-mammalian libraries currently are identical to
the RepBase Update FASTA files except for formatting and corrections,
mammalian databases are extensively modified. The modification
primarily entails inclusion of complete sets of subfamilies for Alu
and L1, modifications to avoid false matches and false annotations,
and subdivision in multiple sets for optimization of the analysis.
We transformed the RepBase database from a set of prototypes to a set
of consensus sequences (described in my dissertation) to allow both
determination of the origin of these repeats and improved detection. A
consensus properly derived from a multiple alignment of copies closely
approaches the original transposable element, since substitutions
accumulate by-and-large unselected in copies of transposable
elements. Because of the latter, a copy is on average twice as close
to the consensus as to any other copy. Consensus sequences are also
more sensitive search tools because directional substitution matrices
can be used (see above).
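The "twice as close" statement follows from simple arithmetic, which
the sketch below works out under loudly simplifying assumptions: two
copies diverge independently from the consensus, every site changes
with the same probability d, each substitution goes to any of the
three other bases with equal chance, and gaps are ignored. Real
substitution patterns are directional, so this is only a first
approximation. The function name is made up for the example.

```python
def expected_copy_to_copy_divergence(d: float) -> float:
    """Expected fraction of sites that differ between two copies that each
    differ from the consensus at a fraction d of their sites."""
    one_changed = 2 * d * (1 - d)          # exactly one copy changed here
    both_changed_differently = d * d * 2 / 3  # both changed, to different bases
    return one_changed + both_changed_differently


for d in (0.05, 0.10, 0.20):
    ratio = expected_copy_to_copy_divergence(d) / d
    print(f"divergence to consensus {d:.2f}: copy-to-copy is {ratio:.2f}x larger")
```

For small d the result approaches 2d, which is the factor of two
mentioned above; at higher divergence the factor drops a little
because some sites are hit in both copies.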
Consensus sequences would be identical to the original transposable
element if all copies were inserted at about the same time from a
single source. DNA transposon copies approach this ideal, but
retroposons (giving rise to most repeats in our genome) live for long
periods in a genome and evolve doing so. Thus, over time the sequence
of the transposable element has changed, and a single consensus does
not describe the original sequence of each copy. Also, usually at any
time multiple distinct sequences with a common origin, cousins if you
will, were active. This situation is reflected by the presence in the
databases of multiple subfamilies for the more common retroposons
(usually having the same name ending in a different number or letter).
The mammalian repeat libraries contain, besides consensus sequences
for transposon derived repeats, consensus satellite units, and a set
of *small structural RNA sequences*. The latter have created a large
number of processed pseudogenes in our genome, and in that way are
interspersed repeats.
2.4 Sensitivity and speed
The program can be run at four levels of sensitivity. The only
difference between these settings is the minimum match or word length
in the initial (not quite) hashing step of the cross_match program
(see the cross_match/PHRAP documentation). For mammalian queries, the
"slow" setting will find and mask 0-5% more repetitive DNA sequences
than by default, whereas the "quick" settings miss 5-10%, and the
"rush" (-qq) settings may miss 10-25% of the sequences masked by
default. The alignments may extend more or be somewhat more accurate
in the more sensitive settings as well.
The following are benchmark times for random 1 Mbp sequences from a
variety of species, run in parallel on four Pentium 4 2.4 GHz
processors with 3 GB RAM and the June 2004 RepeatMasker databases. The
percentage of the query masked is given in parentheses.
------------------------ cross_match ------------------------
Species WUBlast (Def) Rush Quick Default Slow
------- ------------- ------------- ------------- ------------- -------------
Human 02:54 (39.26) 01:54 (33.91) 05:05 (36.85) 22:15 (39.92) 57:54 (40.58)
Human-reversed 01:09 ( 1.98) 01:05 ( 2.00) 03:39 ( 2.06) 18:44 ( 2.07) 53:37 ( 2.09)
Chimpanzee 03:00 (40.83) 01:50 (35.24) 04:45 (38.70) 20:22 (41.59) 53:14 (42.24)
Mouse 03:31 (54.02) 01:47 (48.65) 04:21 (51.74) 18:54 (54.15) 47:26 (55.18)
Rat 04:46 (66.07) 02:05 (62.07) 04:32 (63.84) 19:41 (65.97) 48:23 (67.20)
Dog 02:24 (34.62) 01:32 (29.15) 03:07 (32.44) 12:29 (35.09) 30:14 (35.69)
Arabidopsis 01:01 ( 3.02) 00:51 ( 2.95) 04:41 ( 3.00) 46:52 ( 3.12) 1:46:53 ( 3.13)
Ciona savignyi 01:25 (15.64) 01:02 (13.12) 01:30 (14.45) 06:13 (15.90) 15:24 (16.30)
C. elegans 02:35 (22.63) 01:38 (20.84) 02:39 (22.52) 12:12 (23.21) 25:15 (23.59)
Drosophila 01:59 (47.21) 01:23 (43.08) 02:30 (45.60) 15:49 (47.51) 39:24 (48.38)
Chicken 00:42 ( 6.52) 00:35 ( 6.18) 00:58 ( 6.42) 04:59 ( 6.53) 11:48 ( 6.58)
Fugu 00:35 ( 5.89) 00:34 ( 5.40) 00:49 ( 5.70) 03:51 ( 5.89) 09:20 ( 6.05)
The human-reversed sequence is the "human" sequence reversed but not
complemented. 2% of this sequence is (properly) masked as simple
repeats or low complexity DNA.
Note that for many non-mammalian species the slower settings do not
dramatically increase the percentage recognized as interspersed
repeats. Most of the repeats in the databases for these species are
relatively young and thus are easily detected. This particular 1Mbp
Arabidopsis sequence is an extreme example, where at slow settings in
almost two hours only 1800 bp more is masked than at rush settings in
51 seconds (the Arabidopsis database is large).
The speed is also dependent on the repeat content of the sequence. For
human sequences, Alu rich sequences are analyzed fastest, LINE rich
sequences somewhat slower, repeat poor regions slower still, and long
satellite regions can take a while.
If you have several shorter sequences it is much faster to run
RepeatMasker on a batch file (all sequences in one file). On the above
computer, in rush mode (cross_match), a batch of ten 5 kb sequences
is analyzed in 23 seconds, twenty 5 kb sequences in 34 seconds, etc.
The user time for larger sequences or sequence batches (50 kb and up)
is linearly related to the length of the query due to the
fragmentation of the query sequence.
The increase in speed by using multiple processors is dependent on the
usage of the computer and the above-mentioned non-linear relationships
of sequence length and processing time. However, under the right
circumstances, using 2 processors can increase the speed close to
twofold, because the most time-consuming processes are performed in
parallel.
2.5 Selectivity and matches to coding sequences
The cutoff Smith-Waterman scores for masking interspersed repeats are
conservative, since masking of one short potentially interesting
region generally is more harmful than not masking a number of hard to
find matches. If there are any false matches, they tend to have
scores close to the cutoff, which is 225 for most repeats, 300 for the
low-complexity LINE-1 search*, and 180 for the very old MIR, LINE2 and
MER5 sequences.
* most LINE-1s are detected with a 225 cut-off, but in one step in
RepeatMasker the low-complexity score adjustment is turned off to find
ancient A-rich L1 elements.
With each release, we test for the occurrence of false matches in
randomized and in inverted (but not complemented) DNA including a
range of isochores from 36% to 54% GC. To retain seeds for Smith
Waterman alignments, sequences are randomized at the 10 bp word
level. Note that the inverted sequences retain the low complexity and
simple repeat patterns of the original sequences. Even at sensitive
settings, for which false matches are most likely, the 1998-2004
versions of RepeatMasker have reported no (false) matches at all to
interspersed repeats in the randomized or inverted sequences. No
simple repeats were reported in the randomized queries.
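Control sequences of the two kinds mentioned above could be produced
along the following lines. This is a sketch, not the procedure used for
the release tests: in particular, reading "randomized at the 10 bp word
level" as shuffling consecutive 10 bp words is an assumption, and the
function names and the fixed seed are made up for the example.

```python
import random


def reverse_only(seq: str) -> str:
    """Invert the sequence without complementing it."""
    return seq[::-1]


def shuffle_words(seq: str, word: int = 10, seed: int = 0) -> str:
    """Cut the sequence into consecutive words and shuffle their order.
    The words themselves stay intact, so short exact seeds survive."""
    words = [seq[i:i + word] for i in range(0, len(seq), word)]
    random.Random(seed).shuffle(words)
    return "".join(words)


query = "ACGTACGTTT" "GGGGGCCCCC" "ATATATATAT" "CAGCAGCAGC"
print(reverse_only(query))
print(shuffle_words(query))
```

Either control can then be run through RepeatMasker like any other
query; matches to interspersed repeats in such sequences are false by
construction.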
In a 1999 test, RepeatMasker returned only a single probably false
match (71 bp) when analyzing a batch of 4440 coding regions in human
mRNAs (7.2 Mb) at sensitive settings. The coding regions were
collected from GenBank, based on annotations, filtered for the
presence of complete ORFs and initiator methionines, and made more or
less non-redundant. When each coding region was analyzed individually
using the -gccalc option, 5 matches (414 bp, 0.006%) were falsely
masked (156 bp at default speed, 76 bp at quick settings). In this
analysis each sequence was analyzed with matrices chosen based on the
actual GC level, even for very short sequences, while in the batch
analysis of the coding regions the 'average' 43% GC matrices were
used.
The 1998 and later versions of RepeatMasker show somewhat more false
masking when a pre-1998 version of cross_match is used. These are
primarily the result of improper assumptions of the background
nucleotide frequency used in the scoring matrix calculation when
adjusting for the complexity of a match. Specifically, a very GC rich
region in an AT-rich isochore (like an exon) may improperly match a GC
rich repeat, since the scores for C/G matches are higher in the used
scoring matrix than for AT matches (calculated for this AT rich
background), whereas the old cross_match assumed a 50% GC
background in these calculations and equal scores for A/T and G/C
matches. The new version of cross_match reads the
correct nucleotide background level from the matrix used.
2.6 Simple repeats and low complexity DNA
Low-complexity DNA
By default, along with the interspersed repeats, RepeatMasker masks
low-complexity DNA. Simple repeats (micro-satellites) can originate at
any site in the genome, and therefore have an interspersed
character. Other low-complexity DNA, primarily poly-purine/
poly-pyrimidine stretches, or regions of extremely high AT or GC
content will result in spurious matches in some database searches as
well (especially in the ungapped BLASTN searches). For example,
extremely AT-rich regions consistently will give very low probability
matches to mitochondrial DNA in BLASTN searches. The settings are very
stringent, and we think that few if any sequences informative in
database searches are masked as low-complexity DNA. However, you can
skip the low-complexity DNA masking using the option -nolow or -l(ow).
Under the current settings a 100 bp stretch of DNA is masked when it
is >87% AT or >89% GC; a 30 bp stretch has to contain 29 A/T (or G/C)
nucleotides. The settings are slightly more stringent than the
original settings, partly because the gapped BLAST programs are less
sensitive to short regions of low complexity than the old gapless
BLAST. In coding regions I have not yet found extensive regions (>10
bp) masked as low complexity DNA that would not be masked by the
combined XNU and SEG filters routinely used in BLASTX.
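To make the two quoted thresholds concrete, here is a small Python
check. It is a sketch only: RepeatMasker does not find low complexity
DNA by scanning fixed windows like this, the function names are
invented, and only the two stretch lengths quoted above are covered
because no other thresholds are given here.

```python
def exceeds_threshold(stretch: str) -> bool:
    """True if a 100 bp or 30 bp stretch passes the composition
    thresholds quoted in the text."""
    s = stretch.upper()
    at = s.count("A") + s.count("T")
    gc = s.count("G") + s.count("C")
    if len(s) == 100:
        return at > 87 or gc > 89      # >87% AT or >89% GC
    if len(s) == 30:
        return at >= 29 or gc >= 29    # 29 of 30 nucleotides
    raise ValueError("only 100 bp and 30 bp stretches are covered")


def flagged_windows(seq: str, size: int):
    """Start offsets (0-based) of all windows of the given size that pass."""
    return [i for i in range(len(seq) - size + 1)
            if exceeds_threshold(seq[i:i + size])]


print(exceeds_threshold("A" * 88 + "C" * 12))   # 88% AT
print(exceeds_threshold("A" * 87 + "C" * 13))   # 87% AT, not enough
print(flagged_windows("C" * 5 + "AT" * 15 + "C" * 5, 30))
```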
Annotation of simple repeats
Although RepeatMasker does a good job in masking simple repeats to
avoid spurious matches in database searches, it is not written to find
and indicate all possibly polymorphic simple repeat sequences. Only
di- to pentameric and some hexameric repeats are scanned for and
simple repeats shorter than 20 bp are ignored. The -poly option prints
out a separate list of simple repeats of < 10% divergence from a
perfect repeat. However, even long perfect repeats may not be
presented in this list; e.g. two perfect 40 bp long (CA)n repeats
interrupted by 10 Ts are aligned in one piece and may be reported as
having > 10% divergence from the consensus. Many perfect hexameric or
longer unit repeats will be listed as more or less diverged smaller
unit repeats and may not appear in the .polyout file.
Also note that, in the default output, simple repeats expanded from
the poly A tails of Alus and LINE-1 are now included in the Alu or
LINE-1 annotation. This cleans up the annotation a bit and lets the
stand-alone poly A regions stand out (they may indicate the presence
of a processed pseudogene). However, even perfect simple repeats in
such tails will be hidden in the .out file.
A program optimized to quickly find all dimeric to pentameric repeats
is sputnik, available at http://espressosoftware.com/pages/sputnik.jsp.
Local duplications, tandem repeats and satellites.
Gary Benson's program "Tandem Repeat Finder" (another catchy name)
currently is the standard for finding satellites and all other direct
repeats (http://tandem.bu.edu/trf/trf.html).
Any local duplications (tandem, inverted, interrupted) can be detected
with the program miropeats (http://www.genome.ou.edu/miropeats.html),
which presents this similarity information graphically.
3 HOW TO READ THE RESULTS
3.1 The annotation (.out) file
The annotation file contains the cross_match summary lines. It lists
all best matches (above a set minimum score) between the query
sequence and any of the sequences in the repeat database or with low
complexity DNA. The term "best matches" reflects that a match is not
shown if its domain is over 80% or 90% contained within the domain of
a higher scoring match, where the "domain" of a match is the region in
the query sequence that is defined by the alignment start and
stop. These domains have been masked in the returned masked sequence
file. In the output, matches are ordered by query name, and for each
query by position of the start of the alignment.
Example:
SW perc perc perc query position in query matching repeat position in repeat
score div. del. ins. sequence begin end (left) repeat class/family begin end (left) ID
...
1320 15.6 6.2 0.0 HSU08988 6563 6781 (22462) C MER7A DNA/MER2_type (0) 337 104 20
12279 10.5 2.1 1.7 HSU08988 6782 7718 (21525) C Tigger1 DNA/MER2_type (0) 2418 1486 19
1769 12.9 6.6 1.9 HSU08988 7719 8022 (21221) C AluSx SINE/Alu (0) 317 1 17
12279 10.5 2.1 1.7 HSU08988 8023 8694 (20549) C Tigger1 DNA/MER2_type (932) 1486 818 19
2335 11.1 0.3 0.7 HSU08988 8695 9000 (20243) C AluSg SINE/Alu (5) 305 1 18
12279 10.5 2.1 1.7 HSU08988 9001 9695 (19548) C Tigger1 DNA/MER2_type (1600) 818 2 19
721 21.2 1.4 0.0 HSU08988 9696 9816 (19427) C MER7A DNA/MER2_type (224) 122 2 20
This is a sequence in which a Tigger1 DNA transposon has integrated
into a MER7 DNA transposon copy. Subsequently two Alus integrated in
the Tigger1 sequence. The first line is interpreted as follows:
1320 = Smith-Waterman score of the match, usually complexity adjusted
The SW scores are not always directly comparable. Sometimes
the complexity adjustment has been turned off, and a variety of
scoring-matrices are used dependent on repeat age and GC level.
15.6 = % divergence = mismatches/(matches+mismatches) **
6.2 = % of bases opposite a gap in the query sequence (deleted bp)
0.0 = % of bases opposite a gap in the repeat consensus (inserted bp)
HSU08988 = name of query sequence
6563 = starting position of match in query sequence
6781 = ending position of match in query sequence
(22462) = no. of bases in query sequence past the ending position of match
C = match is with the Complement of the repeat consensus sequence
MER7A = name of the matching interspersed repeat
DNA/MER2_type = the class of the repeat, in this case a DNA transposon
fossil of the MER2 group (see below for list and references)
(0) = no. of bases in (complement of) the repeat consensus sequence
prior to beginning of the match (0 means that the match extended
all the way to the end of the repeat consensus sequence)
337 = starting position of match in repeat consensus sequence
104 = ending position of match in repeat consensus sequence
20 = unique identifier for individual insertions
An asterisk (*) following the final column (see below example)
indicates that there is a higher-scoring match whose domain partly
(<80%) includes the domain of the current match.
** This has changed in August 2001: cross_match output gives the
percent mismatches/(matches+mismatches+unaligned bases in query). I
didn't think this definition is otherwise commonly used and most users
will assume the divergence level would be
mismatches/(matches+mismatches).
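For those who want to read the .out file in their own scripts, the
column description above translates into a parser along these lines.
It is a minimal sketch: the function and the dictionary keys are my
own naming, the header lines are assumed to have been skipped already,
and the three repeat position columns are stored in the order in which
they are described above for each orientation.

```python
def parse_out_line(line: str) -> dict:
    """Split one annotation line of a .out file into named fields."""
    f = line.split()
    if len(f) not in (15, 16):
        raise ValueError("expected 15 columns, optionally followed by '*'")

    def unparen(text: str) -> int:
        return int(text.strip("()"))

    complement = f[8] == "C"
    rec = {
        "sw_score": int(f[0]),
        "perc_div": float(f[1]),
        "perc_del": float(f[2]),
        "perc_ins": float(f[3]),
        "query": f[4],
        "query_begin": int(f[5]),
        "query_end": int(f[6]),
        "query_left": unparen(f[7]),
        "complement": complement,
        "repeat": f[9],
        "class_family": f[10],
        "id": f[14],
        # trailing asterisk: partly included in a higher scoring match
        "overlapped": len(f) == 16 and f[15] == "*",
    }
    if complement:   # (left) begin end
        rec["repeat_left"] = unparen(f[11])
        rec["repeat_begin"], rec["repeat_end"] = int(f[12]), int(f[13])
    else:            # begin end (left)
        rec["repeat_begin"], rec["repeat_end"] = int(f[11]), int(f[12])
        rec["repeat_left"] = unparen(f[13])
    return rec


first = parse_out_line(
    "1320 15.6 6.2 0.0 HSU08988 6563 6781 (22462) C MER7A DNA/MER2_type (0) 337 104 20"
)
print(first["repeat"], first["query_begin"], first["query_end"], first["id"])
```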
Note that the SW score and divergence numbers for the three Tigger1
lines are identical. This is because the information is derived from a
single alignment (the Alus were deleted from the query before the
alignment with the Tigger element was performed). The ProcessRepeats
script makes educated guesses whether any pair of fragments is derived
from the same element; if so, the fragments will have the same ID in
the last column. In this example it figured that the MER7A fragments
represent one insert.
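The ID column makes it easy to put the fragments of one insertion back
together in your own scripts. The sketch below does this for the
example above; the coordinates and IDs are copied from that example,
while the tuple layout and function names are just choices made for
the illustration.

```python
from collections import defaultdict

# (query begin, query end, repeat name, ID) taken from the example above
FRAGMENTS = [
    (6563, 6781, "MER7A", "20"),
    (6782, 7718, "Tigger1", "19"),
    (7719, 8022, "AluSx", "17"),
    (8023, 8694, "Tigger1", "19"),
    (8695, 9000, "AluSg", "18"),
    (9001, 9695, "Tigger1", "19"),
    (9696, 9816, "MER7A", "20"),
]


def summarize(fragments):
    """Collect the fragments of each insertion (same ID) and report the
    query span, the annotated bases and the number of pieces."""
    grouped = defaultdict(list)
    for begin, end, name, ident in fragments:
        grouped[ident].append((begin, end, name))
    summary = {}
    for ident, parts in grouped.items():
        summary[ident] = {
            "names": sorted({name for _, _, name in parts}),
            "span": (min(b for b, _, _ in parts), max(e for _, e, _ in parts)),
            "bases": sum(e - b + 1 for b, e, _ in parts),
            "pieces": len(parts),
        }
    return summary


for ident, info in sorted(summarize(FRAGMENTS).items()):
    print(ident, info)
```

The span of an insertion (e.g. the MER7A with ID 20) includes
everything that integrated into it later, whereas "bases" counts only
the fragments of the element itself.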
Here is another example that shows how much trouble ProcessRepeats
takes to defragment elements and how the ID can be useful in
interpreting the results:
7120 19.9 0.6 0.3 NT_001227 85631 87837 (19816) + L1PA16 LINE/L1 1 1885 (4964) 123
2503 14.9 6.5 0.7 NT_001227 87839 88241 (19412) + MSTA LTR/MaLR 1 428 (0) 100
867 12.9 2.7 0.0 NT_001227 88242 88388 (19265) + MSTA-int LTR/MaLR 1 151 (1500) 100 *
5219 19.5 2.9 0.6 NT_001227 88386 89342 (18311) + MSTA-int LTR/MaLR 629 1607 (44) 100
8003 3.5 0.8 0.0 NT_001227 89362 90773 (16880) C L1PA3 LINE/L1 (0) 6155 4745 103
7677 3.5 0.0 0.0 NT_001227 90795 94059 (13594) C L1PA3 LINE/L1 (0) 6155 2872 104
9050 6.5 0.4 0.1 NT_001227 94060 95127 (12526) C MER11C LTR/ERVK (0) 1071 1 106
7677 3.5 0.0 0.0 NT_001227 95128 97101 (10552) C L1PA3 LINE/L1 (3282) 2873 900 104
5619 7.8 0.3 0.9 NT_001227 97097 97865 (9788) C L1PA3 LINE/L1 (5370) 776 13 104 *
320 16.9 0.0 1.7 NT_001227 97876 97934 (9719) + MSTA-int LTR/MaLR 1594 1651 (0) 100
1475 19.0 4.8 5.6 NT_001227 97935 98255 (9398) + MSTA LTR/MaLR 1 323 (48) 100
2322 14.4 0.8 1.6 NT_001227 98256 98629 (9024) + THE1C LTR/MaLR 1 371 (0) 112
10051 12.9 3.5 4.3 NT_001227 98630 100221 (7432) + THE1C-int LTR/MaLR 1 1580 (0) 112
2359 15.7 0.3 1.9 NT_001227 100224 100598 (7055) + THE1C LTR/MaLR 3 371 (0) 112
1475 19.0 4.8 5.6 NT_001227 100599 100646 (7007) + MSTA LTR/MaLR 323 371 (0) 100
1360 19.4 8.2 1.7 NT_001227 100662 100955 (6698) + MSTA LTR/MaLR 114 426 (0) 113
11892 24.7 1.9 2.0 NT_001227 100968 101243 (6410) + L1PA16 LINE/L1 1881 2143 (4706) 123
2062 11.9 8.4 0.0 NT_001227 101244 101563 (6090) C L1PA12 LINE/L1 (10) 6164 5818 116
11892 24.7 1.9 2.0 NT_001227 101564 105425 (2228) + L1PA16 LINE/L1 2137 5989 (860) 123
257 0.0 0.0 2.9 NT_001227 105436 105469 (2184) + (TAA)n Simple 2 34 (0) 118
2189 18.2 0.2 0.7 NT_001227 105470 105893 (1760) + L1PA16 LINE/L1 6062 6483 (386) 123
255 6.1 0.0 0.0 NT_001227 105896 105928 (1725) + (TA)n Simple 1 33 (0) 120 *
369 0.0 0.0 0.0 NT_001227 105928 105968 (1685) + (GA)n Simple 2 42 (0) 121
305 18.8 0.0 1.0 NT_001227 105971 106066 (1587) + (TA)n Simple 2 96 (0) 122
1589 21.2 1.6 1.1 NT_001227 106068 106449 (1204) + L1PA16 LINE/L1 6485 6868 (1) 123
This entire 20,819 bp block of sequence consists of an L1PA16
(#123), in which 7 or 8 elements have integrated (it is unclear to me
if the MSTA #113 is a separate integration or a tandem
duplication). There are at least four layers, with MER11 (#106)
inserted in L1PA3 (#104) inserted in MSTA (#100, maybe in #113)
inserted in L1PA16. L1PA16 is already primate specific, so that all
these insertions took place during primate evolution.
The ID column helps much in deciphering the events. It also should be
a basis for graphic display of RepeatMasker output.
3.2 Alignments
When using the -a option, a .align file is created that contains
alignments of your query sequence to the matching repeat consensus
sequences. The alignments are given in the same order as listed in the
.out file. They are always in the orientation of the query; you can
use the -inv option to produce all alignments in the orientation of
the consensus sequence.
The alignments are in the cross_match/SWAT format, in which mismatches
rather than matches are indicated (transitions with an i and
transversions with a v). The description line preceding the alignment
is similar to that seen in the .out file. In the example of an
alignment below, an old retrovirus-like LTR (MLT1H) has been
interrupted by the more recent insertion of a short DNA transposon
(MADE1):
384 28.89 9.24 2.17 chr1_4622259_4622561 21 77 (225) MLT1H#LTR/MaLR 23 88 (461) 5
chr1_4622259_ 21 TGGCC-CAATTCTTTACCTCTC--TGCCTCTTGTGCCTTTTG-------G 60
- ? ii i -- iv i-i i -------i
MLT1H#LTR/MaL 23 TGGCCACAATTMTCCACCCCTCCCTGTATCC-ATGCCCTTTGCAATGTGA 71
chr1_4622259_ 61 CTTTGCCATTTCTTCTA 77
vii i i i
MLT1H#LTR/MaL 72 CTTTGCAGCTCCTCCCA 88
Transitions / transversions = Unknown
Gap_init rate = Unknown
557 11.25 0.00 0.00 chr1_4622259_4622561 78 157 (145) C MADE1#DNA/Mariner (0) 80 1 3
chr1_4622259_ 78 TTAGGTTGGTGCAAAAGTAATTGTGGGTTTTAGCATTTAAAGTAATACCA 127
i v iv ? iv
C MADE1#DNA/Mar 80 TTAGGTTGGTGCAAAAGTAATTGCGGTTTTTGCCATTRAAAGTAATGGCA 31
chr1_4622259_ 128 AAAACCACAACTACTTTTGCACCAACCTAA 157
i i
C MADE1#DNA/Mar 30 AAAACCGCAATTACTTTTGCACCAACCTAA 1
Transitions / transversions = Unknown
Gap_init rate = Unknown
384 28.89 9.24 2.17 chr1_4622259_4622561 158 283 (19) MLT1H#LTR/MaLR 89 218 (331) 5
chr1_4622259_ 158 TAGTAAAAGCAGAGGATAAT-----ATTCCTTGTCTTTGGGTTTGTCATG 202
i -- i i ii vv v ----- ii vv i i v i v
MLT1H#LTR/MaL 89 CA--AGAGGTGGAGTCTATTTCCCCACCCCTTGAATCTGGGCTGGCCTTG 136
chr1_4622259_ 203 TGACTCTCTTTGGCCATGGGAACATAGGCAAAAATGACT-TGTGCCCCTT 251
iv vvi ii - i i v- vv
MLT1H#LTR/MaL 137 TGACTTGCTTTGGCCAATAGAATGT-GGCAGAAGTGACGGTGTGCCAGTT 185
chr1_4622259_ 252 CTGAGCCCCGGCCTTGAGAGGTCTT-CATGCTT 283
iv ii i - i
MLT1H#LTR/MaL 186 CTGAGCCTAGGCCTCAAGAGGCCTTGCACGCTT 218
Transitions / transversions = Unknown
Gap_init rate = Unknown
Note that the description line is identical for the first and third
alignment. Before the query was compared to the MLT1H consensus,
RepeatMasker had recognized the MADE1 element and had removed it from
the query sequence to more or less reconstruct the pre-MADE1-insertion
situation. Thus, position 21 to 283 of the query could be aligned to
the MLT1H consensus in a single piece. Since 2004 (RepeatMasker3.0 and
up), such alignments are broken up to present all matches in serial
order.
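The marks on the middle line of each alignment block follow from the
two aligned bases. The sketch below regenerates such a line; it is an
illustration rather than the cross_match code, the function names are
invented, and the use of "?" for ambiguous consensus bases (such as M
or R) is inferred from the example above.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def mismatch_mark(query_base: str, consensus_base: str) -> str:
    """Mark for one alignment column: '-' gap, 'i' transition,
    'v' transversion, '?' ambiguous base, blank for a match."""
    q, c = query_base.upper(), consensus_base.upper()
    if q == "-" or c == "-":
        return "-"
    if q == c:
        return " "
    if q not in BASES or c not in BASES:
        return "?"
    if {q, c} <= PURINES or {q, c} <= PYRIMIDINES:
        return "i"   # purine <-> purine or pyrimidine <-> pyrimidine
    return "v"       # purine <-> pyrimidine


def mark_line(query: str, consensus: str) -> str:
    return "".join(mismatch_mark(q, c) for q, c in zip(query, consensus))


query = "TGGCC-CAATTCTTTACC"       # start of the first alignment above
consensus = "TGGCCACAATTMTCCACC"
print(query)
print(mark_line(query, consensus))
print(consensus)
```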
Alignments are especially useful for designing PCR primers in a region
full of repeats. It is possible to design primers contained in a
common repeat that still work in a whole genome, when the 3' end is in
a region that is very different from the consensus.
Discrepancies between alignments and the .out file
Discrepancies between alignments and annotation result from the
adjustments made by the ProcessRepeats script to produce more legible
annotation. This annotation also tends to be closer to the biological
reality than the raw cross_match output.
For example, adjustments often are necessary when a repeat is
fragmented through deletions, insertions, or an inversion. Many
subfamilies of repeats closely resemble each other, and when a repeat
is fragmented these fragments can be assigned different subfamily
names in the raw output. ProcessRepeats often can decide if fragments
are derived from the same integrated transposable element and which
subfamily name is appropriate (subsequently given to all
fragments). This can result in discrepancies in the repeat name and
matching positions in the consensus sequence (subfamily consensus
sequences differ in length).
In many cases matches are fused into one annotation. To give a few
common examples:
- In large sequences that are analyzed in fragments, consecutive
fragments overlap and repeats in these overlaps will appear twice
(partially or wholly) in the alignment file but are merged in the .out
file.
- A-rich simple repeats originated from the poly A tail of Alus and
LINE-1s are incorporated in the annotation of the Alu or LINE-1.
- There is an 'endless' number of subfamilies for retroposons which
can not all be represented in the databases and sometimes an element
is matched by overlapping pieces of two related subfamilies (which
will be merged).
- You may find large discrepancies in position numbering if an element
includes tandem repeat units. For example, MER109 contains multiple
~300 bp repeat units that can lead to overlapping matches. In the
annotation such matches are fused.
- Simple repeats or satellites that are longer than the number of
units represented in the repeat library will be represented by
multiple, generally overlapping alignments in the .align file, but only
a single annotation line in the .out file.
Specific LINE problems:
Some other discrepancies between alignments and annotations are
specific to LINE-like elements. These repeats usually do not appear as
complete elements in the consensus database. For LINE-1, this is
mostly due to the contrast in conservation over the length of its
sequence during its evolution in the mammalian genome; the ~3 kb ORF2
region of LINE-1 has been very conserved, whereas the untranslated
regions and ORF1 to a lesser degree have evolved very fast. Thus the
3' end or 5' end of an ancient LINE-1 does not even remotely resemble
that of the currently active LINE-1, whereas the coding region for
reverse transcriptase is closely related. Thus, many subfamilies have
been defined for both the 5' and 3' UTRs (48 and 55, resp.) of LINE-1
elements in human DNA, whereas only 11 ORF2 entries are present in the
database. Besides the fact that some 3' ends have multiple defined 5'
ends, and vice versa, the program would become very slow when each
query is compared to 55 full length (6 to 8 kb) LINE-1 elements.
Thus, LINE-1 elements are presented in the database in 3 pieces, and
the ProcessRepeats script puts these pieces together. As a result both
the names of the repeats and position numbering in the consensus
sequence are generally different in the alignments than in the output
file. The LINE2 elements are likewise broken up in 3' UTRs for
different subfamilies and one 5'UTR-ORF2 region.
Between LINE-1 subfamilies, the 3' UTR ranges from 500 bp to over 2000
bp (in L1MC/D3), and the length of the 5' UTR is even more variable,
even between subfamilies that show strong similarity in the 3' UTR. To
allow the LINE-1 fragments to be put together, all position numbers in
older LINE-1 subfamilies are normalized relative to the position of
ORF2 (the conserved part of LINE-1) in a complete L1PA2 element. Since
some older elements have much longer 5' UTRs or ORF1-ORF2 linker
regions than L1PA2, this often results in the assignment of negative
position numbers for the 5' end of LINEs. Since the March2000 release,
such positions and all positions in fragments thought to be part of
the same LINE-1 insert are readjusted to count from the 5' end (which
is not necessarily the very 5' end of the LINE-1 source gene, as these
are hard to derive for old elements). One problem with this approach
is that positions are not adjusted in detached 3' fragments that are
somehow not recognized by the program as originating from the same
insertion. Thereby, the common origin of the 5' fragments and 3'
fragments may become completely obscured. You can use the option
'-orf2' of ProcessRepeats to retrieve an output in which all LINE-1s
are numbered so that position 1 of ORF2 is aligned (resulting in
occasionally negative positions).
3.3 The summary (.tbl) file
The summary file is pretty much self-explanatory. Below is an example.
==================================================
file name: AC027410.fa
sequences: 1
total length: 152192 bp (148791 bp excl N-runs)
GC level: 39.59 %
bases masked: 88734 bp ( 59.64 %)
==================================================
number of length percentage
elements* occupied of sequence
--------------------------------------------------
SINEs: 195 45195 bp 30.37 %
ALUs 178 43249 bp 29.07 %
MIRs 17 1946 bp 1.31 %
LINEs: 54 31173 bp 20.95 %
LINE1 36 24602 bp 16.53 %
LINE2 18 6571 bp 4.42 %
L3/CR1 0 0 bp 0.00 %
LTR elements: 13 5833 bp 3.92 %
MaLRs 8 4079 bp 2.74 %
ERVL 0 0 bp 0.00 %
ERV_classI 5 1754 bp 1.18 %
ERV_classII 0 0 bp 0.00 %
DNA elements: 17 4459 bp 3.00 %
MER1_type 12 1903 bp 1.28 %
MER2_type 4 2466 bp 1.66 %
Unclassified: 0 0 bp 0.00 %
Total interspersed repeats: 86660 bp 58.24 %
Small RNA: 2 124 bp 0.08 %
Satellites: 0 0 bp 0.00 %
Simple repeats: 22 1151 bp 0.77 %
Low complexity: 22 799 bp 0.54 %
==================================================
* most repeats fragmented by insertions or deletions
have been counted as one element
Runs of >20 Ns in query were excluded in % calcs
The query species was assumed to be Pan troglodytes
RepeatMasker version 20040617 , default mode
run with cross_match version 0.990329
RepBase Update 9.04, RM database version 20040617
----------------------------------------------------
AC027410 was a draft sequence, with individual contigs separated by
poly N linkers. In this case, the option -excln was used, so that
these strings of Ns were ignored for the percent calculations.
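The effect of excluding the N-runs on the percentages can be checked
with a few lines of Python. This is a sketch with made-up function
names; the cut-off of more than 20 Ns is taken from the note in the
table above, and the last line simply redoes the arithmetic for the
"bases masked" row of that table.

```python
import re


def length_excluding_n_runs(seq: str, min_run: int = 21) -> int:
    """Sequence length after leaving out runs of at least min_run Ns
    (21 by default, i.e. runs of more than 20 Ns)."""
    pattern = re.compile("N{%d,}" % min_run, re.IGNORECASE)
    return len(seq) - sum(len(m.group()) for m in pattern.finditer(seq))


def percent_masked(bases_masked: int, effective_length: int) -> float:
    return round(100.0 * bases_masked / effective_length, 2)


toy = "ACGT" + "N" * 25 + "ACGT" + "N" * 20 + "AC"
print(len(toy), length_excluding_n_runs(toy))   # only the 25 N run is dropped
print(percent_masked(88734, 148791))            # numbers from the table above
```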
The classification in this table is well defined (see my reviews in
COGD) and forms a good basis for visual presentation and tabulation of
the repeats in your study.
We've been able to classify almost all human repeats, most of them
even in subclasses. The totals for the classes often are higher than
the sum of the subclasses, because not all elements fit in a subclass
and minor subclasses are not listed separately in the table (e.g. for
the human table the Mariner, Tc2, Piggybac, Zaphod, and Arthur
families of DNA transposons). The HAL1 element, derived from LINE-1,
is added to the LINE-1 total in this table.
Note that the "MER" subclasses have no relationship to each other. The
term MER (MEdium Reiterated repeats) was introduced for purely
administrative purposes to give the beast a name. The MER1 and MER2
groups were named after the first member of these groups identified as
an interspersed repeat in our genome. In the literature they're also
known as the Tigger and Charlie groups.
The nomenclature of mammalian repeats derived from retrovirus-like
elements is different from older versions. I've now divided this class
up in the traditional class I, class II (ERVK), class III (ERVL)
retroviruses and the ERVL-derived but very distinct non-autonomous
MaLR elements. Since 'class III' is not an accepted classification
yet, for now this class is called ERVL. The large MER4-group of
non-autonomous LTR elements merges seamlessly with class I endogenous
retroviruses, making it hard to define, and is now incorporated in the
latter group. The ERV classes are most readily distinguished by the
size of the insertion site duplication: 4 in class I, 6 in class II,
and 5 in class III, though there are some exceptions to this rule. My
LTR classification is not based on target site duplication sizes, but on
the encoded proteins in the internal sequences or, if these are not
known, on matches to LTRs with internal sequences.
As described above, the ProcessRepeats script tries very hard to find
out which repeat fragments were derived from the same insertion event
of a transposable element, but there still will be a slight
overestimate of the copy numbers.
There may be slight differences in the number of "bases masked" and
the sum of the bases annotated in this .tbl file. At this moment bases
are masked based on the unprocessed matches (as they are in the .cat
file) and most of the discrepancies are accounted for by unmasked
regions between flanking identical simple repeats, annotated as one
stretch if fewer than 10 bases separate them, and fragments of repeats
shorter than 10 bp which are not annotated but are masked.
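The merging rule for simple repeats mentioned above can be illustrated
with a short Python sketch. This is not the ProcessRepeats code; the
function name merge_simple_repeats and the (start, end, name) tuple
layout are assumptions made only for this illustration.

```python
def merge_simple_repeats(annotations, max_gap=10):
    """Merge neighbouring annotations of the same simple repeat.

    annotations: list of (start, end, name) tuples, 1-based inclusive.
    Two neighbours are joined when they carry the same name and fewer
    than max_gap bases separate them.
    """
    merged = []
    for start, end, name in sorted(annotations):
        if merged:
            prev_start, prev_end, prev_name = merged[-1]
            gap = start - prev_end - 1  # unmatched bases in between
            if name == prev_name and gap < max_gap:
                merged[-1] = (prev_start, max(prev_end, end), prev_name)
                continue
        merged.append((start, end, name))
    return merged


# Hypothetical matches: the first two are 5 bases apart and are joined.
hits = [(100, 140, "(CA)n"), (146, 180, "(CA)n"), (300, 320, "(CA)n")]
print(merge_simple_repeats(hits))
```

In this toy case the 5 bases between the first two matches end up
annotated as part of one stretch, although they were never matched
themselves, which is the kind of discrepancy described above.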
4 APPLICATIONS
4.1 Use in database searches
RepeatMasker is most commonly used to avoid spurious matches in
database searches. Generally this step is strongly recommended before
doing BLASTN or BLASTX equivalent searches with mammalian DNA
sequence.
The most common concern is of course whether RepeatMasker ever masks
coding regions.
We found that false matches in coding regions are extremely rare, but
did identify 38 genuine fragments of interspersed repeats (4214 bp) in
the (annotated) coding regions of the 4440 human mRNAs (7.2 Mb)
analyzed (excluding annotated coding sequences of LINE-1 elements and
endogenous retroviruses). We verified matches with lower scores by
comparing the translation products to closely homologous or redundant
entries in the database (the repeat-matching regions were always
exactly missing from those entries). In the majority of these cases, the sequences appear
to be improperly annotated or to represent either artificially or
naturally defective mRNAs (e.g. alternatively spliced exons comprised
of a small fragment of a repeat). Genuine overlaps of interspersed
repeats with coding sequences usually involve terminal regions of the
ORFs. Since the transposable element derived region is unique to the
protein in that (group of) species, the masking does not interfere
with database searches.
However, some cautionary comments are necessary. First, a few active
cellular genes are derived from transposable elements (see my list of
50 in our genome in Lander et al. 2001). Some of these genes will be
partially masked by a (related) transposon in the repeat database. EST
and cDNA matches beyond the masked region should alert you.
Also remember that, currently only for mammals, RepeatMasker screens
for small RNA (pseudo)genes because of their similarity to SINEs. The
number of matches to small RNAs is listed in the overview table;
(close to) exact matches are possibly active genes, although related
active genes not in the database may show diverged matches. If you're
interested in (small) RNA genes, you should use the -norna option to
leave these sequences unmasked, while SINEs will remain masked.
A final caution relates to the fact that 3' UTRs of transcripts are
about as dense in interspersed repeats as intergenic regions
are. Thus, many ESTs are completely masked as repetitive DNA. I
recommend that, when you compare a genomic sequence against the EST
database or use ESTs as a query in nucleotide searches, you search
with the unmasked sequence as well; use a long minimum match (word
length/word size) like 40 bp to identify exact matches and avoid most
background. Unfortunately the maximum word length that can be used in
the NCBI BLASTN program is 18 (due to memory limitations).
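To illustrate why a long minimum match removes most background, here
is a minimal Python sketch that reports only exact shared words of a
given size between two sequences. It is not how BLASTN works
internally; the function name shared_words and the toy sequences are
assumptions made for this example.

```python
def shared_words(query, subject, word_size=40):
    """Return (query_pos, subject_pos) pairs, 0-based, at which an
    exact word of word_size bases occurs in both sequences."""
    query = query.upper()
    subject = subject.upper()
    # Index every word of the subject by its sequence.
    index = {}
    for i in range(len(subject) - word_size + 1):
        index.setdefault(subject[i:i + word_size], []).append(i)
    hits = []
    for j in range(len(query) - word_size + 1):
        for i in index.get(query[j:j + word_size], []):
            hits.append((j, i))
    return hits


# Hypothetical 40-base stretch shared by a genomic query and an EST.
core = "ACGTACGGTCAGTTCAGGCTAACGTTAGCCATGGATCCGA"
genomic = "TTTTT" + core + "GGGGG"
est = "CC" + core + "AA"
print(shared_words(genomic, est))
```

A diverged repeat copy will rarely share 40 consecutive identical
bases with an unrelated copy of the same repeat, whereas a genuine
EST from the locus usually will.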
4.2 Identification of DNA source (contamination detection)
Bacterial insertion elements
Bacterial insertion sequences (IS elements) often pop up in foreign
sequences, as their activity in E. coli is not always successfully
suppressed during cloning. As late as 2002, human entries in the
'finished' section of GenBank contained over a hundred IS elements.
With each run, RepeatMasker includes a quick check for bacterial
insertion elements that may have inserted during cloning. You can turn
this off with the -no_is option. The -is_only option limits the run to
this check only.
When a full-length element is found and a target site duplication is
confirmed, its location is both reported to the screen and stored in a
.alert file. The latter also contains information on possible
mouse<->human contamination.
-is_clip, -is_only
With the -is_only and -is_clip options, the detected IS and one of the
flanking repeats are clipped out before comparison with the repeat
databases, restoring the sequence to its state before the cloning
artifact. The original
query FASTA file will remain unchanged. An insertion-sequence-clipped,
but otherwise unmasked query sequence is printed to .withoutIS.
For single sequences larger than 4 Mbp, the -maxsize option needs to
be set to a number larger than the sequence length to retrieve this
file.
With either of these options, a properly adjusted quality string is
printed to a file with the suffix .qual.withoutIS when a corresponding
PHRED quality file (.qual) is in the same directory. Note that these
names won't be such that the clipped sequence and quality file form a
pair for subsequent cross_match/PHRAP work. They need to be renamed,
as I assume one wants to do anyway.
Most but not all IS elements can be precisely cut out. The element may
be at the edge of a sequence, or (rarely) the element may have
inserted improperly, lacking target site duplications or missing
terminal bases (internal deletion products are generally handled
okay). These matches are reported, but are left untouched even in
-is_only or -is_clip mode.
The location of any IS element is both reported to the screen and
stored in an .alert file. The latter also contains information on
possible mouse<->human contamination.
Here are the specifics of IS element insertions:
IS1     8-10 bp duplication
IS2     5 bp duplication; published sequence was too short
IS3     3 bp duplication
IS4     No examples of clonal artifacts; no dup site info
IS5     4 bp duplication; preferred target TCTAGA
IS10    9 bp duplication; extreme preference for CGCTNAGCN; published
        sequences for IS5 & 10 were too long, included preferred target site
IS30    2 bp duplication
IS150   3 bp dup, with one exception (4 bp); strong pref for CAGNNTGGGGCY
IS186   10 or 11 bp dup
Tn1000  5 bp duplication
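As an illustration of how the duplication sizes in this table could
be used to confirm a target site duplication, here is a minimal
Python sketch. It is not the check RepeatMasker performs; the names
TSD_LENGTHS and confirmed_tsd, the coordinate convention and the toy
sequence are assumptions made for this example, and target site
preferences are ignored.

```python
# Duplication lengths copied from the table above. IS4 is left out
# because no duplication size is listed for it.
TSD_LENGTHS = {
    "IS1": (8, 9, 10),
    "IS2": (5,),
    "IS3": (3,),
    "IS5": (4,),
    "IS10": (9,),
    "IS30": (2,),
    "IS150": (3, 4),
    "IS186": (10, 11),
    "Tn1000": (5,),
}


def confirmed_tsd(sequence, is_start, is_end, element):
    """Return the duplicated target site if the bases directly left
    and right of the element (0-based start, exclusive end) are
    identical over a length listed for that element, else None."""
    for length in sorted(TSD_LENGTHS.get(element, ()), reverse=True):
        if is_start - length < 0 or is_end + length > len(sequence):
            continue  # element too close to the edge of the sequence
        left = sequence[is_start - length:is_start].upper()
        right = sequence[is_end:is_end + length].upper()
        if left == right:
            return left
    return None


# Toy query: 20 placeholder bases stand in for an IS5 copy flanked
# by a duplicated CTAG.
element = "ACGT" * 5
query = "GGG" + "CTAG" + element + "CTAG" + "TTT"
print(confirmed_tsd(query, 7, 27, "IS5"))
```

An element at the very edge of a sequence returns None here, in line
with the remark above that such copies cannot be precisely cut out.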
Human, mouse, or rat sequence contamination or mix-up.
A straightforward way to distinguish murine and human DNA is by
checking for either rodent-specific or primate-specific
repeats. Likewise, rodent or primate contamination in any other
mammalian or non-mammalian background can be picked up as well. If
your lab has, say, a rat and a pink fairy armadillo sequencing
project, rat DNA in a supposedly armadillo sequence can be picked up
quite reliably, depending on the length of the query.
When the option -rodspec or -primspec is used, RepeatMasker only
checks the query against a small library of repeats that have not
(yet) been observed in the 'other' species. The locations of the
matches are printed to .alert. This function will be expanded to
other mammals once these species start to be sequenced in
earnest.
I've checked for the specificity of the reported matches quite
extensively. Whenever two or more types of repeats are reported, the
odds are that the alert is correct. Very occasionally, a single
reported match could be a false alert. This is especially possible
when a 'new' mammalian species is analyzed, because, unbeknownst to
me, a related repeat may have amplified in such a genome.
Other species contamination.
When a supposedly mammalian clone is of non-mammalian origin, very few
if any interspersed repeats will be reported by RepeatMasker. Rodent
or primate genomic sequences are on average 40-50% dense in
recognizable interspersed repeats, so that any stretch of genomic DNA
of significant length (say 30 kb or more) showing less than 10%
density in interspersed repeats is of suspect origin. An automated
alert for such a situation is not included, as query sequences of
coding regions or transcripts, generally of very low repeat density,
would constantly cause an alert.
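Although no automated alert is included, the rule of thumb above is
easy to apply by hand. The following Python sketch is one way to do
so, under several assumptions: the function names are made up for
this example, masked bases are taken to be N (the default), X or
lower case, and every masked base is counted, so low complexity
regions and Ns already present in the query inflate the density.

```python
def repeat_density(masked_sequence):
    """Fraction of bases that are N, X or written in lower case."""
    if not masked_sequence:
        return 0.0
    masked = sum(1 for base in masked_sequence
                 if base in "NXnx" or base.islower())
    return masked / len(masked_sequence)


def suspect_origin(masked_sequence, min_length=30_000, min_density=0.10):
    """True for a sequence of at least min_length bases in which less
    than min_density of the bases are masked. The defaults follow the
    30 kb and 10% figures given in the text above."""
    return (len(masked_sequence) >= min_length
            and repeat_density(masked_sequence) < min_density)


# Toy sequences: 30 kb with about 7% masked, and 30 kb with a third masked.
sparse = "ACGT" * 7000 + "N" * 2000
dense = "ACGT" * 5000 + "N" * 10000
print(suspect_origin(sparse), suspect_origin(dense))
```

As noted above, coding regions and transcripts will trip this check
all the time, so it is only meaningful for genomic clones.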
4.3 DateRepeats - Masking lineage-specific repeats for genomic alignments
Since June 2003 each repeat consensus in the mammalian repeat
libraries has a phylogenetic label. The interspersed repeat is
expected to be found in all species belonging to the specified genus,
family, order, etc. The label is based on the presence or absence of
repeats at orthologous sites in different genomes and on the average
divergence of repeat copies from their derived consensus
sequence. This phylogeny will become much more accurate and refined
over time.
The tag allows RepeatMasker to compare queries only to repeats
expected to be found in the query species, without having to provide a
library for each species. For example, a rat query currently is
compared to repeats tagged with Rattus, Murinae, Muridae, Rodentia,
Eutheria, and Mammalia.
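The selection by phylogenetic label can be pictured with the Python
sketch below. Only the rat lineage is taken from the text; the
library entries, their labels and the function name
repeats_for_query are hypothetical, and the real libraries are not
stored in this form.

```python
# Lineage of the query species as given in the text for rat.
RAT_LINEAGE = ["Rattus", "Murinae", "Muridae", "Rodentia",
               "Eutheria", "Mammalia"]


def repeats_for_query(library, lineage):
    """Keep the repeats whose phylogenetic label lies on the lineage
    of the query species."""
    allowed = set(lineage)
    return [name for name, label in library if label in allowed]


# Hypothetical library entries: (repeat name, phylogenetic label).
toy_library = [
    ("repeatA", "Rattus"),
    ("repeatB", "Mus"),
    ("repeatC", "Rodentia"),
    ("repeatD", "Primates"),
    ("repeatE", "Mammalia"),
]
print(repeats_for_query(toy_library, RAT_LINEAGE))
```

One tagged library thus serves every species: only the lineage list
changes from query to query.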
It also allows one to mask only those interspersed repeats that have
arrived in a genome after the speciation of two species. For optimal
alignment of genomic sequences of two species, 'ancestral repeats'
that are located at orthologous sites in both genomes (unless deleted)
should not be masked, whereas 'lineage-specific' repeats should be
masked or clipped out. An experimental version of this RepeatMasker
feature has been used in the alignment of the mouse to the human
genome (Waterston et al. 2002). By clipping out rather than masking
the lineage-specific repeats the aligning fraction for the mouse
genome could be increased from 35% to 40% (see also Schwartz et al
2003, NAR 31:3518-24, and Thomas et al 2003, Nature 424:788-93).
As of the September 2003 version, the RepeatMasker package contains a
script "DateRepeats" that takes a RepeatMasker .out file and creates
annotation with added column(s) indicating if a repeat is expected to
be present in the indicated 'other species' as well as a sequence with
lineage-specific repeats masked only. The script currently works only
for a few mammalian species (human, mouse, rat, cat, dog, cow, pig,
horse, rabbit), but refinement is inevitable.
DateRepeats <RepeatMasker .out file> -query <species> -comp <species>
            [-comp <species> -mask <species>]
The required flags are:
-q -query the species that has been analyzed
-c -comp other mammalian species; can be used multiple times, adding extra
columns to the annotation in a
Optional parameters are:
-m -mask produces a sequence file with all lineage specific repeats masked, i.e.
those predicted to be in the -query and absent in the -mask species
(sequence and RepeatMasker files must be in same directory)
must correspond to one of the -comp species
-a -aggressive also mask those repeats unclear to be lineage specific or ancestral
-n -nolow does not mask (micro)satellites or low complexity DNA, which are
generally lineage specific, but hard to date
(-a and -n have no effect unless -m is used)
In the first release of this script the species for -q, -c, and -m
are limited to human, mouse, rat, cat, dog, cow, pig, horse and
rabbit.
For example the command line:
DateRepeats chr3_4000001_4005000.out -q mouse -c rat -c human
prints the following output to chr3_4000001_4005000.out_rat_human:
   SW perc perc perc query    position in query    matching repeat       position in repeat    rat hum
score div. del. ins. sequence begin   end (left)   repeat   class/family begin   end (left) ID
12436 19.0  1.7  7.7 chr3_400     9   536 (4464) + L1_Mur2  LINE/L1       1850  2408 (3469)  1 X   0
 2728  5.2  0.0  3.9 chr3_400   537   896 (4104) + ORR1A0   LTR/MaLR         1   346    (0)  2 0   0
12436 19.0  1.7  7.7 chr3_400   897  2441 (2559) + L1_Mur2  LINE/L1       2408  3665 (2212)  1 X   0
 5229  7.2  0.0  1.1 chr3_400  2442  3134 (1866) + L1Md_F2  LINE/L1       5877  6580    (2)  3 0   0
12436 19.0  1.7  7.7 chr3_400  3135  4259  (741) + L1_Mur2  LINE/L1       3665  4856 (1021)  1 X   0
  394 20.9  0.0  0.9 chr3_400  4260  4351  (649) + Lx2      LINE/L1       6892  6996    (1)  4 X   0
The X indicates that the repeat is expected to be present at
orthologous sites, while the 0 predicts an absence. None of the above
repeats are found in the human genome. Notice that a mouse-specific
ORR1 (#2) and L1 (#3) have inserted into a rodent-specific L1 (#1).
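For downstream work, output of this kind is easy to read with a few
lines of Python. The sketch below parses the rows of the example
above, groups the fragments by ID and lists the repeats without an X
in any comparison column. The function name parse_rows and the
dictionary keys are assumptions made for this example, and the
column positions are taken from this example only.

```python
ROWS = """\
12436 19.0 1.7 7.7 chr3_400 9 536 (4464) + L1_Mur2 LINE/L1 1850 2408 (3469) 1 X 0
2728 5.2 0.0 3.9 chr3_400 537 896 (4104) + ORR1A0 LTR/MaLR 1 346 (0) 2 0 0
12436 19.0 1.7 7.7 chr3_400 897 2441 (2559) + L1_Mur2 LINE/L1 2408 3665 (2212) 1 X 0
5229 7.2 0.0 1.1 chr3_400 2442 3134 (1866) + L1Md_F2 LINE/L1 5877 6580 (2) 3 0 0
12436 19.0 1.7 7.7 chr3_400 3135 4259 (741) + L1_Mur2 LINE/L1 3665 4856 (1021) 1 X 0
394 20.9 0.0 0.9 chr3_400 4260 4351 (649) + Lx2 LINE/L1 6892 6996 (1) 4 X 0
"""


def parse_rows(text, comp_species):
    """Parse annotation rows; comp_species lists the -comp species in
    the order in which their columns appear."""
    n = len(comp_species)
    records = []
    for line in text.splitlines():
        fields = line.split()
        if not fields or not fields[0].isdigit():
            continue  # skip blank lines and the two header lines
        records.append({
            "div": float(fields[1]),
            "begin": int(fields[5]),
            "end": int(fields[6]),
            "repeat": fields[9],
            "class": fields[10],
            "id": int(fields[-n - 1]),
            "flags": dict(zip(comp_species, fields[-n:])),
        })
    return records


records = parse_rows(ROWS, ["rat", "human"])

# Fragments that share an ID come from the same insertion event.
fragments = {}
for rec in records:
    fragments.setdefault(rec["id"], []).append((rec["begin"], rec["end"]))

# Repeats that are not marked X for any comparison species.
no_x = sorted({rec["id"] for rec in records
               if "X" not in rec["flags"].values()})
print(fragments[1], no_x)
```

Grouping by ID shows the three pieces of L1 #1 that were separated
by the two younger insertions.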
A few lines of the output for
DateRepeats chr19.fa.out -q rat -c mouse -c human -m mouse -a -n
are
1199 17.9 11.2  1.8 chr19 313706 313990 (58909535) C Lx9     LINE/L1         (9) 7635 7324 377 X 0
  23  0.0  0.0  0.0 chr19 314125 314147 (58909378) + AT_rich Low_complexity    1   23  (0) 378 - -
 726 17.5  1.3  8.9 chr19 314152 314308 (58909217) + B1_Rn   SINE/Alu          2  146  (2) 379 ? 0
 228 32.8  0.0  7.4 chr19 314355 314476 (58909049) C B3      SINE/B2        (77)  139   27 380 X 0
The chr19.fa.masked_vs_mouse file contains a sequence appropriately
masked for alignment against mouse. It has repeats 377 & 380
masked. The -n flag leaves the AT-rich region unmasked, while the -a
flag forces it to mask repeat 379 as well. B1_Rn is rat-specific, but
the 17.5% divergence from the consensus is much higher than the
average divergence level of non-functional DNA since the rat-mouse
split. It therefore gets the "?" assignment. The rules for assigning
question marks are arbitrary (2-fold lower divergence than expected,
1.5x higher than expected for a repeat at the boundary).
4.4 Use in gene prediction and other applications
Predicting genes from a masked sequence has several problems. First,
one should use the option -nolow to avoid masking low complexity
regions and trinucleotide repeats in coding regions. But even with
only interspersed repeats masked, gene prediction programs may fail to
identify exons correctly. As pointed out above, sometimes tail ends of
coding regions may have originated from transposable elements. Some
gene prediction programs suggest the extent of 3' UTRs. These will
often be overestimated in masked DNA, as many genuine poly A signals are
located in interspersed repeats. Finally, even if no coding regions
have been masked, splice sites may be compromised; e.g. the
polypyrimidine region that contributes to an acceptor splice site may
be contained within a repeat.
Thus, I generally recommend running a gene prediction program on
unmasked DNA (as well) and comparing the predicted genes and exons with
the RepeatMasker output. Some gene prediction programs allow you to
force certain exons out of the predictions (e.g. often the old ORFs of
LINE-1 elements and endogenous retroviruses are included in
genes). Work is also in progress at several sites to incorporate
RepeatMasker into gene prediction programs, in which cases matches to
repeats are weighted in along with the other parameters used.
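Comparing predicted exons with the RepeatMasker annotation comes
down to an interval overlap test. Here is a minimal Python sketch;
the function names, the coordinates and the 50% cut-off are all
assumptions chosen for illustration, not values recommended by
RepeatMasker or any gene prediction program.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Number of bases shared by two 1-based inclusive intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


def exons_in_repeats(exons, repeats, min_fraction=0.5):
    """List predicted exons of which at least min_fraction of the
    bases fall inside a single annotated repeat."""
    flagged = []
    for exon_start, exon_end in exons:
        exon_len = exon_end - exon_start + 1
        for rep_start, rep_end, rep_name in repeats:
            shared = overlap(exon_start, exon_end, rep_start, rep_end)
            if shared / exon_len >= min_fraction:
                flagged.append((exon_start, exon_end, rep_name, shared))
    return flagged


# Hypothetical predicted exons and repeat annotations.
exons = [(1100, 1250), (1280, 1400), (3000, 3100)]
repeats = [(1000, 1300, "L1"), (5000, 5300, "Alu")]
print(exons_in_repeats(exons, repeats))
```

Exons flagged this way are candidates for inspection, not for
automatic removal, since genuine coding regions can be derived from
transposable elements, as discussed above.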
Other uses
Many people mask repeats before designing primers or oligo probes from
sequence data. I've often been told that primers/probes designed from
regions unmasked by RepeatMasker have a much better success rate. A
cautionary note here is that unmasked regions are not necessarily
unique in the genome (e.g. many lower copy repeats are not in the
database yet) and experiments should be performed as if no filtering
against repeats has been done. The alignments can help in designing
primers from sequences that are completely masked. Regions that
diverge much from the consensus are less likely to misbehave than
others.
RepeatMasker is sometimes used during assembly of large genomic
sequences. This procedure probably is most useful in very Alu-rich
regions; in that situation I recommend masking only the Alus, and
maybe limiting the masking to those Alus less than 15% diverged (-div
15).
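The same selection can also be made afterwards from the annotation.
The Python sketch below picks Alu matches under a divergence
threshold out of rows in the .out layout shown in section 4.3. The
function name young_alus and the sample rows are hypothetical, and
the script only filters annotation lines; it does not mask anything.

```python
def young_alus(out_lines, max_div=15.0):
    """Return (sequence, begin, end, repeat) for every SINE/Alu match
    with a divergence below max_div percent."""
    selected = []
    for line in out_lines:
        fields = line.split()
        if len(fields) < 11 or not fields[0].isdigit():
            continue  # skip header and blank lines
        perc_div = float(fields[1])
        if fields[10] == "SINE/Alu" and perc_div < max_div:
            selected.append(
                (fields[4], int(fields[5]), int(fields[6]), fields[9]))
    return selected


# Hypothetical annotation rows, for illustration only.
rows = [
    "2310 8.1 0.3 0.0 seq1 100 400 (5000) + AluY SINE/Alu 1 301 (10) 1",
    "1650 16.2 1.0 0.5 seq1 900 1190 (4210) C AluJo SINE/Alu (12) 300 5 2",
    "5229 7.2 0.0 1.1 seq1 2442 3134 (2266) + L1Md_F2 LINE/L1 5877 6580 (2) 3",
]
print(young_alus(rows))
```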
There are plenty of other uses, e.g. analysis of repeats can reveal a
lot about the evolution of a locus (deletions vs. insertions,
inversions, approximate time of these events). When you're doing that
you're a specialist and don't need any help from this help file (maybe
from some of the literature cited below though).
5 REFERENCES
Reference for RepeatMasker
We appreciate it if you could refer to the web page (Smit,AFA &
Green,P RepeatMasker at http://www.repeatmasker.org) or otherwise to
Smit, AFA & Green, P., unpublished.
The EMBL format of the RepBase Update database contains references for
individual repeats, as well as annotation with respect to divergence
level, affiliation, copy number, etc. Much if not most of the
information in this database is not published elsewhere. It can be
accessed at http://www.girinst.org/.
We are trying to keep the nomenclature of the interspersed repeats in
the output of RepeatMasker identical to that of the reference
database. In most cases the names correspond to those most commonly
used in the literature.
There is much too much literature out there to list these days. My own
views on the repeat structure in mammalian genomes have most recently
been described in the following papers:
Waterston et al. (2002) Initial sequencing and comparative analysis of
the mouse genome. Nature 420(6915): 520-62.
Lander E. S., et al. (2001). Initial sequencing and analysis of the
human genome. Nature 409(6822): 860-921.
Smit, A.F.A. (1999) Interspersed repeats and other mementos of
transposable elements in mammalian genomes. Curr Opin Genet Devel 9
(6), 657-663.
If you have ideas for improvements or found a problem, drop a note
at asmit@hoh.biotech.washington.edu or afasmit@pacbell.net
/*****************************************************************************
# Copyright (C) 1996-2003 by Arian Smit
# All rights reserved.
#
# The software and databases should not be redistributed or used for
# any commercial purpose, including commercially funded sequencing,
# without written permission from Geospiza Inc, Seattle
# (http://www.geospiza.com/)
/*****************************************************************************