SOAPdenovo
Introduction
SOAPdenovo is a novel short-read assembly method that can build a de
novo draft assembly for the human-sized genomes. The program is
specially designed to assemble Illumina GA short reads. It creates new
opportunities for building reference sequences and carrying out
accurate analyses of unexplored genomes in a cost effective way.
System Requirement
SOAPdenovo aims for large plant and animal genomes, although it also
works well on bacteria and fungi genomes. It runs on 64-bit Linux
system with a minimum of 5G physical memory. For big genomes like
human, about 150 GB memory would be required.
Download
Release 1.0 Only precompiled binary version available now. (http://soap.genomics.org.cn/soapdenovo.html)
Installation
- Download the SOAPdenovo tar package;
- Unpack it;
- There are one executable file "soapdenovo", one demo
configure file "example.contig" and this readme file.
How to use it
1. Configuration file
For big genome projects with deep sequencing, the data is usually
organized as multiple read sequence files generated from multiple
libraries. The configuration file tells the assembler where to find
these files and the relevant information. "example.config" is an
example of such a file.
The configuration file has a section for global information, and then
multiple library sections. Right now only "max_rd_len" is included in
the global information section. Any read longer than max_rd_len will be
cut to this length. The library information and the information of
sequencing data generated from the library should be organized in the
corresponding library section. Each library section starts with tag
[LIB] and includes the following items:
- avg_ins
This value indicates the average insert size of this library or the
peak value position in the insert size distribution figure.
- reverse_seq
This option takes value 0 or 1. It tells the assembler if the read
sequences need to be complementarily reversed. Illumima GA produces two
types of paired-end libraries: a) forward-reverse, generated from
fragmented DNA ends with typical insert size less than 500 bp; b)
forward-forward, generated from circularizing libraries with typical
insert size greater than 2 Kb. The parameter "reverse_seq" should be
set to indicate this: 0, forward-reverse; 1, forward-forward.
- asm_flags=3
This indicator decides in which part(s) the reads are used. It takes
value 1(only contig assembly), 2 (only scaffold assembly), 3(both
contig and scaffold assembly), or 4 (only gap closure).
- rd_len_cutoff
The assembler will cut the reads from the current library to this
length.
- rank
It takes integer values and decides in which order the reads are used
for scaffold assembly. Libraries with the same "rank" are used at the
same time during scaffold assembly.
- pair_num_cutoff
This parameter is the cutoff value of pair number for a reliable
connection between two contigs or pre-scaffolds.
- map_len
This takes effect in the "map" step and is the minimun alignment length
between a read and a contig required for a reliable read location.
The assembler accepts read file in two formats: FASTA or FASTQ.
Mate-pair relationship could be indicated in two ways: two sequence
files with reads in the same order belonging to a pair, or two adjacent
reads in a single file (FASTA only) belonging to a pair. In the
configuration file single end files are indicated by "f=/path/filename"
or "q=/pah/filename" for fasta or fastq formats separately. Paired
reads in two fasta sequence files are indicated by "f1=" and "f2=".
While paired reads in two fastq sequences files are indicated by "q1="
and "q2=". Paired reads in a single fasta sequence file is indicated by
"p=" item.
All the above items in each library section are optional. The assembler
assigns default values for most of them. If you are not sure how to set
a parameter, you can remove it from your configuration file.
2. Get it started
Once the configuration file is available, a typical way to run the
assembler is:
./soapdenovo all -s
config_file -K 25 -R -o graph_prefix
User can also choose to run the assembly process step by step as:
./soapdenovo pregraph -s
config_file -K 25 [-R -d -p] -o graph_prefix
./soapdenovo contig -g graph_prefix [-R -M 1 -D]
./soapdenovo map -s config_file -g graph_prefix [-p]
./soapdenovo scaff -g graph_prefix [-F -u -G -p]
3. Options:
-s STR configuration file -o STR output graph file prefix -g STR input graph file prefix -K INT K-mer size [default 23, min 13, max63] -p INT multithreads, n threads [default 8] -R use reads to solve tiny repeats [default no] -d INT remove low-frequency K-mers with frequency no larger than [default 0] -D INT remove edges with coverage no larger that [default 1] -M INT strength of merging similar sequences during contiging [default 1, min 0, max 3] -F intra-scaffold gap closure [default no] -u un-mask high coverage contigs before scaffolding [defaut mask] -G INT allowed length difference between estimated and filled gap -L minimum contigs length used for scaffolding
4. Output files
- 4.1 These files are output as assembly results:
- a. *.contig contig sequences without using mate pair
information
- b. *.scafSeq scaffold sequences (final contig sequences
can be
extracted by breaking down scaffold sequences at gap regions)
- 4.2 There are some other files that provide useful
information for
advanced users, which are listed in Appendix B.
5. FAQ
5.1 How to set K-mer size?
The program accepts odd numbers between 13 and 31. Larger K-mers would
have higher rate of uniqueness in the genome and would make the graph
simpler, but it requires deep sequencing depth and longer read length
to guarantee the overlap at any genomic location.
5.2 How to set library rank?
SOAPdenovo will use the pair-end libraries with insert size from
smaller to larger to construct scaffolds. Libraries with the same rank
would be used at the same time. For example, in a dataset of a human
genome, we set five ranks for five libraries with insert size 200-bp,
500-bp, 2-Kb, 5-Kb and 10-Kb, separately. It is desired that the pairs
in each rank provide adequate physical coverage of the genome.
APPENDIX A: an example.config
# maximal read length
max_rd_len=50
[LIB]
# average insert size
avg_ins=200
# if sequence needs to be
reversed reverse_seq=0
# in which part(s) the reads are used
asm_flags=3
# use only first 50 bps of each read
rd_len_cutoff=50
# in which order the reads are used while scaffolding
rank=1
# cutoff of pair number for a reliable connection (default 3)
pair_num_cutoff=3
# minimum aligned length to contigs for a reliable read location
(default 32)
map_len=32
# fastq file for read 1
q1=/path/**LIBNAMEA**/fastq_read_1.fq
# fastq file for read 2 always follows fastq file for read 1
q2=/path/**LIBNAMEA**/fastq_read_2.fq
# fasta file for read 1
f1=/path/**LIBNAMEA**/fasta_read_1.fa
# fastq file for read 2 always follows fastq file for read 1
f2=/path/**LIBNAMEA**/fasta_read_2.fa
# fastq file for single reads
q=/path/**LIBNAMEA**/fastq_read_single.fq
# fasta file for single reads
f=/path/**LIBNAMEA**/fasta_read_single.fa
# a single fasta file for paired reads
p=/path/**LIBNAMEA**/pairs_in_one_file.fa
[LIB]
avg_ins=2000
reverse_seq=1
asm_flags=2
rank=2
# cutoff of pair number for a reliable connection
# (default 5 for large
insert size)
pair_num_cutoff=5
# minimum aligned length to contigs for a reliable read location
# (default 35 for large insert size)
map_len=35
q1=/path/**LIBNAMEB**/fastq_read_1.fq
q2=/path/**LIBNAMEB**/fastq_read_2.fq
q=/path/**LIBNAMEB**/fastq_read_single.fq
f=/path/**LIBNAMEB**/fasta_read_single.fa
Appendix B: output files
1. Output files from the command "pregraph"
- a. *.kmerFreq
Each row shows the number of Kmers with a frequency equals the row
number.
- b. *.edge
Each record gives the information of an edge in
the pre-graph: length, Kmers on both ends, average kmer coverage,
whether
it's reverse-complementarily identical and the sequence.
- c. *.markOnEdge & *.path
These two files are for using reads to solve small repeats
- e. *.preArc
Connections between edges which are established by the read paths.
- f. *.vertex
Kmers at the ends of edges.
- g. *.preGraphBasic
Some basic information about the pre-graph: number of vertex, K value,
number of edges, maximum read length etc.
2. Output files from the command "contig"
- a. *.contig
Contig information: corresponding edge index, length, kmer coverage,
whether it's tip and the sequence. Either a contig or its reverse
complementry counterpart is included. Each reverse complementary contig
index is indicated in the *.ContigIndex file.
- b. *.Arc
Arcs coming out of each edge and their corresponding coverage by reads
- c. *.updated.edge
Some information for each edge in graph: length, Kmers at both ends,
index difference between the reverse-complementary edge and this one.
- d. *.ContigIndex
Each record gives information about each contig in the *.contig: it's
edge index, length, the index difference between its
reverse-complementary counterpart and itself.
3. Output files from the command "map"
- a. *.peGrads
Information for each clone library: insert-size, read index upper
bound, rank and pair number cutoff for a reliable link. This file can
be revised manually for scaffolding tuning.
- b. *.readOnContig
Read locations on contigs. Here contigs are referred by their edge
index. Howerver about half of them are not listed in the *.contig file
for their reverse-complementary counterparts are included already.
- c. *.readInGap
This file includes reads that could be located in gaps between contigs.
This information will be used to close gaps in scaffolds.
4. Output files from the command "scaff"
- a. *.newContigIndex
Contigs are sorted according their length before scaffolding. Their new
index are listed in this file. This is useful if one wants to
corresponds contigs in *.contig with those in *.links.
- b. *.links
Links between contigs which are established by read pairs. New index
are used.
- c. *.scaf_gap
Contigs in gaps found by contig graph outputted by the contiging
procedure. Here new index are used.
- d. *.scaf
Contigs for each scaffold: contig index (concordant to index in
*.contig), approximate start position on scaffold, orientation, contig
length, and its links to others.
- e. *.gapSeq
Gap sequences between contigs.
- f. *.scafSeq
Sequence of each scaffold.
|