BEDTools

[ Source: http://code.google.com/p/bedtools/ ]

Summary

The BEDTools utilities allow one to address common genomics tasks such as finding feature overlaps and computing coverage. The utilities are largely based on four widely-used file formats: BED, GFF/GTF, VCF, and SAM/BAM. Using BEDTools, one can develop sophisticated pipelines that answer complicated research questions by "streaming" several BEDTools together. The following are examples of common questions that one can address with BEDTools.

  1. Intersecting two BED files in search of overlapping features.
  2. Culling/refining/computing coverage for BAM alignments based on genome features.
  3. Merging overlapping features.
  4. Screening for paired-end (PE) overlaps between PE sequences and existing genomic features.
  5. Calculating the depth and breadth of sequence coverage across defined "windows" in a genome.
  6. Screening for overlaps between "split" alignments and genomic features.

Because the BEDTools accept input from “standard input (stdin)”, one can “stream / pipe” several commands together to carry out more complicated analyses. The tools also allow fine control over how output is reported. Most recently, I have added support for sequence alignments in BAM format (http://samtools.sourceforge.net/), for features in VCF and GFF format, and for “blocked” BED format. The tools are quite fast and typically finish within a few seconds, even for large datasets.

Brief example

As stated, much of the power in BEDTools comes from the ability to pipe multiple BEDTools together with UNIX commands. The following example will hopefully illustrate this strength.

Example: Imagine you have a BED file of SNP calls that were generated from some fancy new variant detection method. You are now doing an initial screen of the results. The SNP calls are genome-wide and of varied support and biological interest. The BED file of SNP calls might look like this, where the name field is the observed alleles and the score is the depth:

$ head snps.bed
chr1 100 101 A/G 100
chr1 200 201 C/G 1000
...
chrX 300 301 C/T 500

Let's say you want to quickly find all transitions that are in exons. Using BEDTools and egrep, the command would be:

$ egrep "A/G|C/T" snps.bed | intersectBed -a stdin -b exons.bed > snpsInExons.bed
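The egrep pattern above matches anywhere on the line and only the allele order shown. As a hedged alternative, the filter can be anchored to the name column with awk. This sketch assumes that alleles may be written in either order (the source does not say so) and that the file is tab-delimited. The file names demo_snps.bed and demo_transitions.bed and the records in them are made up for illustration.

```shell
# Hypothetical demo data: chrom, start, end, alleles, depth (tab-delimited).
printf 'chr1\t100\t101\tA/G\t100\n'  >  demo_snps.bed
printf 'chr1\t200\t201\tC/G\t1000\n' >> demo_snps.bed
printf 'chr2\t150\t151\tT/C\t40\n'   >> demo_snps.bed
printf 'chrX\t300\t301\tC/T\t500\n'  >> demo_snps.bed

# Keep rows whose 4th column is a transition, in either allele order.
awk -F'\t' '$4 ~ /^(A\/G|G\/A|C\/T|T\/C)$/' demo_snps.bed > demo_transitions.bed

cat demo_transitions.bed
```

The awk step can stand in for egrep at the head of the pipeline, with its output piped to "intersectBed -a stdin -b exons.bed" exactly as in the command above.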

Great, but now you want to get to the interesting bits for your big paper, so you want to screen for novel variants by excluding SNP calls that are already in dbSNP. In this case, the "-v" option reports only those SNPs passed to intersectBed that do NOT overlap an entry in the dbSNP file.

$ egrep "A/G|C/T" snps.bed | intersectBed -a stdin -b exons.bed | intersectBed -v -a stdin -b dbSnp130.bed > novelSnpsInExons.bed

You subsequently detect an artifact where false positives are enriched in SNPs having coverage > 100. You refine your original query accordingly.

$ awk '$5 < 100' snps.bed | egrep "A/G|C/T" | intersectBed -a stdin -b exons.bed | intersectBed -v -a stdin -b dbSnp130.bed > bonafideNovelSnpsInExons.bed
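Before running the full pipeline, it can help to try the awk depth filter alone on a few records. The sketch below uses made-up data and hypothetical file names (depth_demo.bed and friends). It illustrates that '$5 < 100' also discards a call whose depth is exactly 100, whereas '$5 <= 100' would keep it. Which cutoff is appropriate is a judgment call that the source does not settle.

```shell
# Hypothetical records: chrom, start, end, alleles, depth (tab-delimited).
printf 'chr1\t100\t101\tA/G\t100\n' >  depth_demo.bed
printf 'chr1\t200\t201\tC/T\t99\n'  >> depth_demo.bed
printf 'chr2\t300\t301\tA/G\t101\n' >> depth_demo.bed

# Strict cutoff, as in the command above: depth must be below 100.
awk -F'\t' '$5 < 100' depth_demo.bed > depth_strict.bed

# Inclusive cutoff: the exact complement of "coverage > 100".
awk -F'\t' '$5 <= 100' depth_demo.bed > depth_inclusive.bed

echo "strict:    $(wc -l < depth_strict.bed | tr -d ' ') record(s) kept"
echo "inclusive: $(wc -l < depth_inclusive.bed | tr -d ' ') record(s) kept"
```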

Table of supported utilities

(BAM) denotes tools that support BAM alignment files.

Utility                  Description
intersectBed (BAM)       Returns overlaps between two BED/GFF/VCF files.
pairToBed (BAM)          Returns overlaps between a paired-end BED file and a regular BED/VCF/GFF file.
bamToBed (BAM)           Converts BAM alignments to BED6, BED12, or BEDPE format.
bedToBam (BAM)           Converts BED/GFF/VCF features to BAM format.
bed12ToBed6              Converts "blocked" BED12 features to discrete BED6 features.
bedToIgv                 Creates IGV batch scripts for taking multiple snapshots from BED/GFF/VCF features.
coverageBed (BAM)        Summarizes the depth and breadth of coverage of features in one BED file versus features (e.g., "windows", exons, etc.) defined in another BED/GFF/VCF file.
genomeCoverageBed (BAM)  Creates a histogram, BedGraph, or "per base" report of genome coverage.
unionBedGraphs           Combines multiple BedGraph files into a single file, allowing coverage and other comparisons between them.
annotateBed              Annotates one BED/VCF/GFF file with overlaps from many others.
groupBy                  Summarizes data in a file/stream based on common columns.
overlap                  Returns the number of base pairs of overlap between two features on the same line.
pairToPair               Returns overlaps between two paired-end BED files.
closestBed               Returns the closest feature to each entry in a BED/GFF/VCF file.
subtractBed              Removes the portion of an interval that is overlapped by another feature.
windowBed (BAM)          Returns overlaps between two BED/VCF/GFF files based on a user-defined window.
mergeBed                 Merges overlapping features into a single feature.
complementBed            Returns all intervals not spanned by the features in a BED/GFF/VCF file.
fastaFromBed             Creates FASTA sequences based on intervals in a BED/GFF/VCF file.
maskFastaFromBed         Masks a FASTA file based on BED coordinates.
shuffleBed               Randomly permutes the locations of a BED file within a genome.
slopBed                  Adjusts each BED entry by a requested number of base pairs.
sortBed                  Sorts a BED file by chrom, then start position. Other sort orders are also available.
linksBed                 Creates an HTML file of links to the UCSC or a custom browser.
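
To make the mergeBed entry in the table above more concrete, here is a rough sketch of what "merging overlapping features" means, written with sort and awk. It is not mergeBed's implementation. The file names and records are hypothetical, and the sketch merges only features that share at least one base; how the real tool treats features that merely touch end-to-start should be checked in the manual.

```shell
# Hypothetical, unsorted features: chrom, start, end (tab-delimited).
printf 'chr1\t100\t200\n' >  merge_demo.bed
printf 'chr1\t500\t600\n' >> merge_demo.bed
printf 'chr1\t150\t300\n' >> merge_demo.bed
printf 'chr2\t10\t20\n'   >> merge_demo.bed
printf 'chr1\t250\t400\n' >> merge_demo.bed

# Sort by chromosome, then numerically by start, so that overlapping
# features become neighbours. Then sweep once, extending the current
# interval while the next feature starts before the current one ends.
sort -k1,1 -k2,2n merge_demo.bed | awk -F'\t' -v OFS='\t' '
    NR == 1 { chrom = $1; start = $2 + 0; stop = $3 + 0; next }
    $1 == chrom && $2 + 0 < stop {     # shares at least one base
        if ($3 + 0 > stop) stop = $3 + 0
        next
    }
    {
        print chrom, start, stop       # emit the finished interval
        chrom = $1; start = $2 + 0; stop = $3 + 0
    }
    END { if (NR > 0) print chrom, start, stop }
' > merged_demo.bed

cat merged_demo.bed
```

Sorting first is what lets a single pass suffice: once the input is ordered by start, a feature can only overlap the interval currently being built, never an earlier one.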


Documentation

Please read the BEDTools manual as well as the Usage and Advanced Usage pages. If you still have questions or issues, please use the BEDTools discussion list.

Notes regarding usage

  1. All BEDTools load the "B" file into memory and process the "A" file one feature at a time against the features in "B". Therefore, when possible, one should set the smaller of the two files as the "B" file. For example, finding overlaps between a list of 30,000 genes and 100 million aligned sequences will work much more efficiently with the genes file set as BED file "B".
  2. Most of the BEDTools have optional parameters that confer fine control over reporting and over the subtleties of each tool. We suggest you look through them; if something you need is missing, please let us know.
  3. Most of the BEDTools allow the "A" file to be passed via standard input for use in UNIX "streams" or "pipelines". In order to do this, use "-a stdin". For example:

     $ cat reads.bed | intersectBed -a stdin -b genes.bed > readsToGenes.bed
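
The first and third notes can be combined in a small helper that picks the smaller file for "-b" and streams the other file through stdin. This is only a sketch: the function name smaller_bed, the demo file names, and the use of line count as a proxy for file size are all assumptions made for illustration. Note also that swapping "A" and "B" changes which file's records are reported, so it is only appropriate when either orientation answers the question at hand.

```shell
# Hypothetical inputs standing in for a large reads file and a small genes file.
printf 'chr1\t10\t60\nchr1\t40\t90\nchr1\t500\t550\n' > demo_reads.bed
printf 'chr1\t0\t100\n' > demo_genes.bed

# Print whichever of two files has fewer lines (a rough proxy for fewer features).
smaller_bed() {
    n1=$(wc -l < "$1" | tr -d ' ')
    n2=$(wc -l < "$2" | tr -d ' ')
    if [ "$n1" -le "$n2" ]; then echo "$1"; else echo "$2"; fi
}

b_file=$(smaller_bed demo_reads.bed demo_genes.bed)
echo "suggested -b file: $b_file"

# Run the real tool only if it is installed; otherwise just report.
if command -v intersectBed >/dev/null 2>&1; then
    cat demo_reads.bed | intersectBed -a stdin -b "$b_file" > demo_readsToGenes.bed
else
    echo "intersectBed not found on PATH; skipping the overlap step" >&2
fi
```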

Citation

Quinlan, AR and Hall, IM, 2010. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 26, 6, pp. 841–842.

Contact

BEDTools was developed and is maintained by Aaron Quinlan, a postdoctoral fellow in Ira Hall's laboratory at The University of Virginia. Questions should be posted to the BEDTools discussion list. Alternatively, contact Aaron via email (firstlast at gmail.com).