Blast Match of Query Sequences to Pig Genome Build 10.2 Genes by Their Map Locations
-- A Pilot Analysis of 3 Pig Array Platforms for Genome Enabled Annotations
Results:
README:
PURPOSE:

Determine how the elements of the two pig Affymetrix (Affy) arrays and the pig oligo array align with annotated genes on pig genome assembly build 10.2. The aim is to enrich the gene annotation of these arrays and to facilitate annotation comparisons between the platforms.

DATA:

- Affy 2005: Affymetrix "new" pig array designed in 2005. 23,935 consensus sequences were used.
- Affy 2010: Affymetrix "new" pig array designed in 2010. 1,142,126 sequences were combined from 3 data sets:
    o SNOWBALL_array_seqs.fa
    o SNOWBALL_consensus.fa -> unique_coding_seqs_for_array_v4.fa
    o SNOWBALL_miRNA.fa -> miRNAs_array_seqs_v4.fa
  (Data forwarded from Chris Tuggle, cktuggle@iastate.edu)
- Oligo 2006: The "70-mer" oligonucleotide array designed in 2006 by a consortium group. Download: http://www.pigoligoarray.org/
  Note that many of the 18,224 downloaded sequences are longer than 70 bases.

APPROACH:

The elements of the pig Affy arrays and the pig oligo array were mapped by blast (NCBI blastall, v.2.2.22) against SSC Build 10.2. The mapping results were then used to query the NCBI Gene DB for genes that overlap the map coordinates.

The blast criteria were set with these empirical thresholds:

* Cut-off e-value: < 1e-3
* Identity > 80%
* Minimum alignment length > 30bp

Although the blastall options were set to report only the top hits, sub-optimal hits often "leaked" into the output, so a Perl/MySQL procedure was developed to enforce these criteria.

To analyze the overlaps between the blast map coordinates and the genes on pig build 10.2, a local database was set up with Ensembl pig gene mapping data.
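The blast thresholds and the top-hit enforcement described above can be sketched in Python as follows. This is only an illustration, not the Perl/MySQL procedure used in this analysis. It assumes the hits are in the 12-column tabular format written by "blastall -m 8", and it assumes that "top hit" means the highest bit score per query. The names Hit, parse_tabular, passes and top_hits are hypothetical.

```python
from typing import Dict, Iterable, List, NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    identity: float   # percent identity
    length: int       # alignment length in bp
    s_start: int      # match start on the genome (subject)
    s_end: int        # match end on the genome (subject)
    evalue: float
    bitscore: float


def parse_tabular(lines: Iterable[str]) -> List[Hit]:
    """Parse 12-column tabular blast output (assumed: blastall -m 8)."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        f = line.split("\t")
        hits.append(Hit(f[0], f[1], float(f[2]), int(f[3]),
                        int(f[8]), int(f[9]), float(f[10]), float(f[11])))
    return hits


def passes(hit: Hit, max_evalue: float = 1e-3,
           min_identity: float = 80.0, min_length: int = 30) -> bool:
    """Apply the three empirical thresholds listed in the README."""
    return (hit.evalue < max_evalue
            and hit.identity > min_identity
            and hit.length > min_length)


def top_hits(hits: Iterable[Hit]) -> Dict[str, Hit]:
    """Keep one passing hit per query (assumption: highest bit score wins)."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        if not passes(hit):
            continue
        current = best.get(hit.query)
        if current is None or hit.bitscore > current.bitscore:
            best[hit.query] = hit
    return best
```

Filtering after the blast run, rather than relying on the blastall options alone, mirrors the reason given above: sub-optimal hits can still appear in the raw output.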
The overlap analysis was performed by querying the local database for the locations of known Ensembl genes and comparing them with the blast match coordinates of the query sequences to find reliable overlaps.

Suppose a stretch of genome sequence is represented by "--", on which a gene region is bordered by "#" on each side and labeled "i-->j", and a matched query sequence is represented by "==" and labeled "a-->b". The possible overlap situations, which also indicate the matched strand information (+/-, +/+, -/-), are illustrated below:

                i             j
    5' ---------#-------------#-------- 3'
(1) 5' -----a===#=====b-------#-------- 3'   ..a..i..b..j..
(2) 5' ---------#----a========#==b----- 3'   ..i..a..j..b..
(3) 5' ---------#--a======b---#-------- 3'   ..i..a..b..j..
(4) 5' ------a==#=============#==b----- 3'   ..a..i..j..b..
(5) 5' ------b==#=============#==a----- 3'   ..b..i..j..a..
(6) 5' ------b==#========a----#-------- 3'   ..b..i..a..j..
(7) 5' ---------#---b=========#==a----- 3'   ..i..b..j..a..
(8) 5' ---------#---b=====a---#-------- 3'   ..i..b..a..j..
                i             j

An arbitrary minimum overlap of 50 bp was also required to increase confidence in the overlap matches.

RESULTS:

The mapping data are available from the NAGRP shared data repository:
http://www.animalgenome.org/repository/pig/Genome_build_10.2_mappings/

Two Perl scripts were developed to (1) calculate and identify acceptable overlaps, and (2) format the data to facilitate further evaluation. The output consists of 2 files that contain identical results in different formats.

Note that some query sequences match more than one gene. Query sequence names are suffixed with a serial number following a double colon (::), so users can, for example, sort the list by query sequence name to bring the entries for the same query sequence together. A serial number higher than 1 indicates multiple gene matches.
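The overlap test described under APPROACH (the eight cases in the diagram, together with the 50 bp minimum) can be sketched in Python as follows. This is an illustration only, not the Perl scripts used in this analysis. It assumes 1-based inclusive coordinates with i <= j, that a match with a > b lies in reverse orientation on the genome (cases 5 to 8), and that an overlap of exactly 50 bp is accepted. The names Overlap and find_overlap are hypothetical.

```python
from typing import NamedTuple, Optional


class Overlap(NamedTuple):
    start: int        # first overlapping genome position
    end: int          # last overlapping genome position
    length: int       # overlap length in bp
    orientation: str  # "+" if a <= b, "-" if a > b (assumed convention)


def find_overlap(i: int, j: int, a: int, b: int,
                 min_overlap: int = 50) -> Optional[Overlap]:
    """Return the overlap between gene i..j and blast match a..b, or None.

    All eight cases in the diagram reduce to one interval intersection
    once the match coordinates are put in ascending order.
    """
    lo, hi = min(a, b), max(a, b)
    start = max(i, lo)
    end = min(j, hi)
    length = end - start + 1  # zero or negative when there is no overlap
    if length < min_overlap:
        return None
    orientation = "+" if a <= b else "-"
    return Overlap(start, end, length, orientation)
```

For example, with a gene at 1000-2000, a match at 900-1100 (case 1) overlaps positions 1000-1100, and a match reported as 2100-900 (case 5) covers the whole gene in reverse orientation.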
Output format 1, file name: "genes.match.byPlatform1.xlsx"
----------------------------
This format is useful for comparing the matched elements of the 3 platforms on the same gene. The columns are:

- ENS_stable_ids: Ensembl stable ID.
- External Names: Ensembl "external" names, often but not always HGNC symbols.
- Genome_locations: Represented as "chromosome:start-end".
- Matched data sets: Full or partial overlaps of genome locations.
  [Syntax: Seq_ID(overlap(strand):start-end(e-value))]
    o Affy 2005 matches
    o Affy 2010 matches
    o pig_Oligo matches

Output format 2, file name: "genes.match.byPlatform2.xlsx"
----------------------------
This format is useful for user analysis, because the data can be sorted in any way a user wishes. The columns are:

- ENS_stable_ids: Ensembl stable ID.
- External Names: Ensembl "external" names, often but not always HGNC symbols.
- Genome_locations: Represented as "chromosome:start-end".
- Gene length (bp)
- Data sets: Contains one of (1) Affy 2005 matches; (2) Affy 2010 matches; (3) Pig_Oligo matches.
- Sequence names: The ID of one of the array element sequences.
- Strand (query/subject)
- Overlap length (bp)
- Overlap coordinates (bp)
- e-values: Blast e-values.

WORKING DIRECTORY:

~/projects/Tuggle_oligolocatn/

KNOWN BUGS:

(none at this time)

FUTURE WORK:

The methods and scripts used in this analysis may be further developed into a publicly available tool that lets users upload custom sequence data sets, or known coordinates, and returns a list of matched genes with related information, possibly linking to GBrowse for visualization.

Please send feedback and comments to cktuggle@iastate.edu or zhu@iastate.edu

-- Zhiliang Hu
May 29 09:42:18 CDT 2012
© 2003-2025:
USA · USDA · NRPSP8 · Program to Accelerate Animal Genomics Applications.