Modification history: 2005-06-23 (New) wu-blastall now tries to make rational settings of the BLASTMAT and BLASTFILTER environment variables, if they are not already set, before invoking the actual search program. 2005-06-09 (New) Support added for Mac OS 10.4.1 on Intel i386 processors and "Universal" binaries simultaneously supporting ppc, ppc64 and i386. 2005-05-10 (Fix) The output obtained with mformat=list had the entries for PostScript and neighborhood words swapped. The description of mformat in parameters.html has been corrected, as well. (Fix) Segmentation faults readily arose with short BLASTN query sequences and long word lengths, if the neighborhood word score threshold, T, was set. (Change) Parameter values are again reported in the event a FATAL error is encountered. This had been the practice prior to the addition of support for multiple output formats, but it got lost in the shuffle. 2005-05-09 (Fix) A thread deadlock could arise if the progress=# option was used along with multiple processors (cpus > 1). 2005-04-20 (Fix) In the default output format, when a single gap extended for a distance greater than the length of the current line of output, for those lines of the alignment that contained the extended gap, off-by-one errors in the coordinate numbers were reported for the gapped sequence (the sequence with the hyphens inserted). (Fix) Improved handling of interrupts when searching with multi-query files. (Change) Minor speed increase for some searches on at least some platforms. 2005-04-08 (Fix) When an invalid format was specified with the mformat option, the search program exited nonzero but did not describe the reason. 2005-04-06 (Fix) xdget was not honoring -m# format requests. 2005-04-05 (Fix) Corrected a bug in command line parsing that would erroneously cause a FATAL message to be reported concerning the lack of "asn1" support. 2005-03-30 (Fix) wu-blastall script was accepting invalid arguments to the -p option... and then failing.
(Fix) In tabular output, identifiers beginning with backslash (\) also needed to be escaped with a backslash, just as those beginning with a pound sign (#) needed to be. (Fix) One warning message concerning the memory required for a search was reporting its requirements in KiB, while stating the units were MiB. Thus, the requirements seemed to be 1024-fold larger than necessary. 2005-03-26 (Fix) In XML output, ampersand (&) was not being properly escaped! (Change) Non-printable ASCII characters in XML output are now all escaped as "&#x;". (Change) encoding="UTF-8" attribute now appears in XML output. (Change) An XML comment pointing to the use of the "xmlcompact" option is now included when the user has not invoked this option. 2005-03-25 (Fix) XML documents produced with mformat=7 were well-formed (compliant) but not strictly conforming to NCBI_BlastOutput.dtd. One entity was omitted from output and another was reported in the wrong order relative to other entities in the same block. "Parameters_matrix" was in the wrong order, relative to "Hit_expect", and is now correctly reported first. "Hit_accession" had been omitted and is now reported, but it is not always instantiated with the same information as the NCBI software reports; deviations tend to arise when the sequence identifier string does not formally contain an "accession" field. The fall-back action by WU is to report the same information for "Hit_accession" as for "Hit_id"; these fields will both be empty (null string) if no identifier is available. (New) Support for "xmlcompact" option, to eliminate newlines and indentation in XML output that improve readability by humans when using viewers that don't understand XML structure, but often comprise a substantial fraction of the output bytes and do nothing for viewers that /do/ understand the structure (e.g., many web browsers).
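The 2005-03-26 entries on XML escaping can be illustrated with a small sketch. This is not the code used by the search programs; the function name escape_xml_text is hypothetical, and the hexadecimal character-reference form used for non-printable characters is an assumption, since the entry does not spell out the exact form. The point it shows is that the ampersand must be escaped along with the angle brackets, otherwise text such as "A & B" yields a document that is not well-formed.

```python
def escape_xml_text(text: str) -> str:
    """Escape a string for use as XML character data (illustrative sketch only)."""
    pieces = []
    for ch in text:
        if ch == "&":
            # The ampersand starts every entity reference, so it must be escaped itself.
            pieces.append("&amp;")
        elif ch == "<":
            pieces.append("&lt;")
        elif ch == ">":
            pieces.append("&gt;")
        elif ch in "\t\n\r" or 0x20 <= ord(ch) < 0x7F:
            # Printable ASCII and ordinary whitespace pass through unchanged.
            pieces.append(ch)
        else:
            # Assumption: everything else becomes a hexadecimal character reference.
            # Note that XML 1.0 parsers may still reject references to most
            # control characters below 0x20.
            pieces.append("&#x%X;" % ord(ch))
    return "".join(pieces)


if __name__ == "__main__":
    print(escape_xml_text("kinase A & B <fragment>"))
```

Because each input character is examined exactly once, an ampersand that is already part of an entity in the input is escaped again; that is the desired behavior when the input is raw defline text rather than XML.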
2005-03-23 (actually 2005-03-22 afternoon) (Change) In XML output, the "warnings", "notes" and "errors" options are now also obeyed. (Change) In XML output, individual messages are delimited from each other by a newline character in entities, such that the message keywords NOTE, WARNING, ERROR, FATAL and EXIT all begin in column one. (Fix) In XML output, the Control-A (hex 0x01) characters often found in nrdb multi-sequence deflines (but invalid in XML 1.0) are replaced with &gt; (>). 2005-03-22 (Fix) Simultaneous output of multiple formats (multiple mformat specifications) sometimes produced truncated results (e.g., missing Parameters and Statistics section) for some format combinations. (New) Preliminary support for XML output, using NCBI DTD (mformat=7). (New) wu-blastall supports -m7 (xml). (Change) Messages sent to tabular output now always include the query ID, independent of the msgstyle parameter setting. msgstyle now only controls the message style in the default output format (mformat=1). 2005-03-14 (Fix) CPU times reported by binaries built for the Linux 2.4 kernel were incorrect when executed under a Linux 2.6 kernel. 2005-03-13 (New) Support added for the "qframe" option, to restrict BLASTX and TBLASTX searches to a specific reading frame (-3, -2, -1, +1, +2, +3) of the query sequence. (New) Partial support added to wu-blastall for the -m option. 2005-03-10 (New) Support added for the "mformat" command line option, to choose from a variety of output formats, including new tabular formats. Multiple formats can be selected and produced during a single program run, as long as different output files are assigned to each format. The syntax for this option is mformat=#[,outfile]. The default is mformat=1. mformat=0 will clear any prior mformat specifications appearing to the left on the command line. "outfile" can be omitted and implies the use of standard output or whatever output file is indicated with the -o option.
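The left-to-right semantics of the mformat=#[,outfile] syntax described above can be sketched as follows. This is only an illustration of the rules as stated in the entry, not the parser used by the search programs. The function name collect_mformats is hypothetical, and two behaviors are assumptions: that a repeated format number is overridden by its later occurrence, and that the default (mformat=1) applies when no specification survives.

```python
from typing import Dict, List, Optional


def collect_mformats(args: List[str]) -> Dict[int, Optional[str]]:
    """Map each selected format number to its output file (None = stdout or -o file)."""
    selected: Dict[int, Optional[str]] = {}
    for arg in args:
        if not arg.startswith("mformat="):
            continue  # other command line options are ignored in this sketch
        spec = arg[len("mformat="):]
        number, _, outfile = spec.partition(",")
        fmt = int(number)
        if fmt == 0:
            # mformat=0 clears every specification that appeared to its left.
            selected.clear()
            continue
        # An omitted outfile means standard output (or the -o file).
        selected[fmt] = outfile or None
    if not selected:
        # Assumption: with nothing selected, the default format 1 is used.
        selected[1] = None
    return selected


if __name__ == "__main__":
    print(collect_mformats(["mformat=2,hits.tab", "mformat=7,report.xml"]))
```

The file names hits.tab and report.xml are placeholders; the only requirement stated in the entry is that each format be assigned a different output file.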
(New) Support for the "msgstyle" command line option has been added, to select an alternate style of reporting informatory messages. msgstyle=0 is the default. msgstyle=1 will cause informatory messages to be reported on a single line without wrapping and for the identity of the query sequence to be included in messages when the identity is available. 2005-03-03 (Change) When the "echofilter" option is specified, in the FASTA-format reports of the query sequence that are produced in response, each strand or reading frame of the query is now appropriately labelled on the defline. If the query was indeed processed by a filter, the name reported for the sequence is "Filtered#", where # is replaced by the strand or reading frame. If no filter was used, the name for the sequence is given as "Unfiltered#". The reading frame for peptide query sequences in the BLASTP and TBLASTN search modes is shown as +0. 2005-02-23 (New) Warnings are displayed when user settings of S2 or gapS2 are reduced by the software to maintain consistency with settings of S. (Fix) Improved consistency maintained between command line settings of E, S, S2, and gapS2. The S2 and gapS2 score thresholds may be altered (downward) as a result of this change. The change is therefore conservative, in that nothing will be lost from the results, but searches might well run slower if one of S2 or gapS2 is reduced. (Fix) Values displayed for E2 and gapE2 are now consistent with the values used for S2 and gapS2. Importantly, the score thresholds used are not changed as a result of this specific fix; the E2 and gapE2 values reported have merely been corrected. 2005-01-21 (New) Support for -m parameter added to xdformat and xdget to allow the user to select different output formats when dumping or retrieving sequences. 2005-01-14 (Fix) Output was aborted under IRIX when redirected to a pipe. 2004-12-10 (Fix) Better estimation of maximum memory available on Mac OS X systems. 
2004-11-20 (Fix) A "gapS" parameter that was advertised in program usage output was not actually being supported. Support has been added instead for a "gaps" option, that reverses the action of any "nogaps" option that may have been specified previously on the command line. ("nogaps" causes the default gapped alignment phase of the search programs to be skipped, so only ungapped alignments are produced). 2004-11-11 (Fix) Minor format string incompatibility in xdformat. (New) The included parameters.html file has been updated with descriptions of more options and parameters. The usage display now points users to the on-line web page http://blast.wustl.edu/blast/parameters.html. 2004-11-10 (Change) Marginal speed increase. 2004-11-08 (Change) Another speed improvement when searching XDF nucleotide database sequences that contain ambiguity codes. 2004-11-06 (Change) Small speed improvement when searching XDF nucleotide database sequences that contain ambiguity codes. 2004-11-04 (Change) Further improvement in the changes made 2004-11-03. 2004-11-03 (Change) Often improved speed when reporting BLASTN results that involve long database sequences. 2004-11-01 (Fix) For Linux 2.4 systems, which use "Linux threads", not POSIX threads, protection was added against potential accumulation of zombie processes. 2004-10-29 (Fix) Carriage return characters at end of lines (typical of MS-DOS or Windows files) were not being stripped from sequence descriptions in FASTA-format files. (The sequences themselves were parsed fine). 2004-10-26 (Fix) Genetic code initialization bug slipped in after 2004-10-23 and before its release on 2004-10-25. (Fix) Residual bug parsing the C=genetic code command line option. (Fix) Spurious diagnostic output was sometimes produced when dbslice option was used. 2004-10-23 (Fix) Command line settings of the genetic code were not effective in TBLASTN. 
2004-10-22 (Fix) For word lengths 7 or greater, searches against nucleotide sequences containing ambiguity codes could fail due to a segmentation fault. 2004-10-18 (Fix) Warnings about the number of descriptions of database sequences not being reported due to the limiting value of the V parameter were over-counting by a factor of 2. (Fix) The gspmax parameter (not used by default) was being used to count gapped HSPs incorrectly when set to a non-zero value. This could result in desired HSPs being discarded. 2004-10-15 (Fix) The changes made on 2004-10-13 have been backed out and replaced by code that ensures seed word detection when searching compressed sequences using any value of wink. While the speed of BLASTN will suffer when wink is used (relative to previous versions), the behavior of the program should be more in line with user expectations -- as well as be more sensitive -- and will not require users to understand the nuanced logical pitfalls of particular parameter combinations when working with compressed sequences. (Fix) As of the changes made two days ago (2004-10-13), an obnoxious error message concerning the setting of the wink parameter was emitted by BLASTN, when wink was not even set by the user. 2004-10-13 (Fix) In the BLASTN search mode, when searching a nucleotide database in its compressed form, tests have been added to ensure that the value for wink is odd; and that the sum of the word length, W, plus wink is no more than 1/4-th the length of the query sequence. Otherwise, even long stretches of absolute identity can be missed entirely, simply due to phase mismatch in the compression between the query and database. If the user-specified value for wink does not satisfy these criteria, its value is automatically reduced and a warning is issued. 2004-10-07 (Change) Made the error message displayed by xdget more helpful when the sequence identifier index file it needs does not exist. 
2004-10-06 (Fix) In some cases where a database had had its identifiers indexed on a different computer architecture than the one on which xdget was being executed, a cross-platform incompatibility existed in xdget, which would prevent the program from opening the index file. 2004-10-01 (New) Added support for "globalexit" option. When multiple query sequences are being processed and any of them encounters a FATAL error, if the globalexit option was specified, the line "EXIT CODE 12" will be appended to the output. This extra line of output only appears if a FATAL error was encountered. As described in the note on 2004-09-29, if a FATAL error is encountered at any time during a multi-query job, the BLAST process will provide a testable exit status 12. This new option merely causes the exit status to be saved in the output, where it can be interrogated later. 2004-09-30 (New) Added support for "haltonfatal" option. When multiple query sequences are to be searched using a multi-sequence input file, the default behavior is to continue with the next query in the input when a FATAL error is encountered. When haltonfatal is specified, the entire run is halted at the current query when a FATAL error is encountered; the testable exit status will be the EXIT CODE associated with the fatal error (not 12, as described below on 2004-09-29, when haltonfatal is not specified). 2004-09-29 (Fix) Previously, when a multi-sequence query file was specified, if the last query in the file completed without error, the overall program would exit zero (suggesting no error), even if one or more earlier queries did encounter fatal errors. The program now consistently exits with a distinct, non-zero exit status (exit status 12) if a fatal error arises with any of the query sequences in a multi-sequence situation. In such cases, the specific success and failure codes for the individual searches must be obtained by parsing the output for EXIT CODE lines. 
("EXIT CODE 0" will be displayed in cases of success). To be clear, exit status 12 is not displayed on any of the EXIT CODE lines, but is only detectable by the program that invoked BLAST by testing the exit status of the BLAST process. Exit status 12 is used for no other purpose than to signal the occurrence of one or more fatal errors during the processing of multiple tasks, such as multiple query sequences in a single input file. (Fix) One fatal exit code of BLASTN was changed from 23 to 16, to make it consistent with the behavior of the four other search modes. The accompanying text in the fatal message was not changed. 2004-08-31 (Fix) Cosmetic bug in database filenames reported on "Database:" line, whereby double directory name delimiters (//) might be displayed. 2004-08-26 (Change) Altered one warning message to be a bit clearer. 2004-08-22 (Fix) Poisson P-values were sometimes computed incorrectly, for very low-scoring alignments. 2004-08-12 (Change) Slightly updated database I/O routines. 2004-08-10 (Fix) Lingering 64-bit crashing bug that wasn't addressed on 2004-07-14. 2004-08-04 (Fix) Fixed a bug introduced yesterday. 2004-08-03 (Change) Slight speed increase. 2004-08-01 (Fix) When hspmax was exceeded, the warned increase in the ungapped HSP score threshold could have been lower than the actual increase. (Change) The -c options of xdformat (to replace bad letter codes) and xdget (to choose a genetic code) have been renamed -C. 2004-07-31 (Change) Slightly reduced memory consumption and increased speed. 2004-07-28 (Fix) Corrected misspellings and a few omissions of available command line options in the usage instructions reported by xdformat and xdget. 2004-07-16 (Fix) The -mmio option was not working properly to turn off memory-mapped I/O on some platforms (principally Linux) and caused database-not-found errors instead. 
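The exit-status conventions described in the 2004-09-29 through 2004-10-01 entries can be checked from a calling script along these lines. This is a hedged sketch, not part of the distribution: the function names are hypothetical, and it assumes each "EXIT CODE n" message sits alone on its own line of the report. Because status 12 is reserved for signalling that some query failed, any "EXIT CODE 12" line (as appended by the globalexit option) is excluded from the per-query codes.

```python
import re
from typing import List

# Assumption: each message has the form "EXIT CODE <number>" on a line of its own.
EXIT_LINE = re.compile(r"^EXIT CODE (\d+)[ \t]*$", re.MULTILINE)

MULTI_QUERY_FAILURE = 12  # reserved status: at least one query hit a FATAL error


def per_query_exit_codes(report: str) -> List[int]:
    """Return the EXIT CODE values found in a report, excluding the reserved 12."""
    codes = [int(value) for value in EXIT_LINE.findall(report)]
    return [code for code in codes if code != MULTI_QUERY_FAILURE]


def failed_codes(process_status: int, report: str) -> List[int]:
    """Return the non-zero per-query codes when the process reported status 12."""
    if process_status != MULTI_QUERY_FAILURE:
        return []
    return [code for code in per_query_exit_codes(report) if code != 0]


if __name__ == "__main__":
    sample = "report 1\nEXIT CODE 0\nreport 2\nEXIT CODE 16\nEXIT CODE 12\n"
    print(failed_codes(12, sample))
```

When haltonfatal is in effect, the process status is the fatal code itself rather than 12, so a caller would test the status directly instead of parsing the report.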
2004-07-14 (Fix) Program crashes (segmentation faults) would often occur when using 64-bit binaries to analyze query sequences longer than the longest database sequence. (No impact on 32-bit binaries). 2004-07-12 (Change) Better estimation of the amount of free memory available under Linux. 2004-07-11 (New) Added warnings when long query and database sequences are to be analyzed and the hspsepQmax and hspsepSmax parameters, respectively, have not been set. (Fix) Better signal handling, especially under Linux 2.4 in the absence of native POSIX threads support. (Fix) Accounting of CPU time was incorrect under Linux 2.6 when using multiple CPUs/threads. 2004-06-29 (Change) Estimates of memory requirements are more optimistic, thus allowing more CPUs to be employed in some cases. 2004-06-25 (Change) Improved speed under Linux 2.6 on i686 and i786 platforms. (Fix) sysblast.sample file had been omitted from distributions. 2004-06-23 (Change) Further (marginal) improvement in Sum statistics calculations and the link lists reported. 2004-06-22 (Change) Generally improved Sum statistics calculations, both in accuracy and speed. (Fix) Links information was sometimes corrupted. Added caveat to the description of the "links" option in parameters.html, concerning the potential inaccuracy of the HSPs listed for sets other than the most significant set. 2004-06-18 (Change) Support for ={min,max} for integer and floating-point command line parameter values. Support for ={infinity,-infinity} for floating-point values. 2004-06-16 (Fix) Grouping of low-scoring HSPs into consistent sets was not performed reliably. 2004-05-26 (Fix) Corrected the descriptions of the gspmax and spoutmax parameters in README.html to indicate that the default value for both is zero (0), not 1000. (The default value for hspmax is 1000).
2004-05-22 (Change) Under Mac OS X, the datasize resource limit is set to unlimited, to avoid some unusual habits of this operating system: setting a default datasize limit of 8 MB and not enforcing datasize. 2004-05-16 (Fix) In one warning message, memory sizes were being reported in units of KiB while stated to be in units of MiB. 2004-05-15 (Fix) A false I/O error was being reported on some platforms for piped output. 2004-05-14 (New) Out-of-memory errors in the search programs now refer the user to web pages for assistance. (Fix) The -wstrict option had only supported the classical one-hit BLAST algorithm, not the two-hit BLAST algorithm (re: the hitdist option), but now supports both. 2004-05-12 (Change) Added I/O safety checks for typically rare situations in xdformat. 2004-05-11 (Change) Contextually aware warnings and error messages are spewed forth in some out-of-memory situations. (New) nrdb and patdb now automatically detect and read compressed FASTA input filenames (assuming gunzip, zcat and unzip are in the command path). (Change) Updated the values for lambda, K and H associated with the BLOSUM62 scoring matrix in wu-blastall. 2004-05-10 (Change) Significantly reduced memory requirements. 2004-05-07 (New) Support for an optional "memmax" parameter in /etc/sysblast, to establish a limit on the amount of memory used by each individual BLAST job. On a system-wide basis, the memmax limit overrides any datasize resource limit established by a user's terminal shell. 2004-05-04 (New) Added more tests for memory requirements, making multithreaded use more reliable and convenient. 2004-05-02 (Change) Significantly reduced memory requirements and, generally, a bit of improvement in search speed. 2004-04-29 (Change) Speed tweak. 2004-04-29 (Change) Tweaked wu-blastall for some cases of nondefault scoring matrices being used.
2004-04-28 (Change) The gapsepqmax and gapsepsmax parameters are deprecated, to be replaced by hspsepqmax and hspsepsmax, respectively, which are now for use with both gapped and ungapped alignments. A warning to this effect is now produced if either gapsepqmax or gapsepsmax is used. 2004-04-26 (Change) Improved search speed for large jobs under typical usage, with the default sensitive parameters. 2004-04-20 (Fix) Some Linux systems would mung the display of the query sequence's effective length if it was larger than about 10^9. In addition, values beyond approximately 10^18 (typically obtained via the command line option Y=#) would not be displayed correctly on any system. 2004-04-17 (Change) wu-blastall updated with support for -t option. (Fix) Reading of compressed FASTA files was broken on some platforms. 2004-04-13 (Fix) A potentially crashing and corrupting bug was introduced at the last minute to blastn, tblastn and tblastx on 2004-04-10. (Fix) The old setdb and pressdb programs were broken in the 2004-04-10 release -- they could not create their output files. 2004-04-10 (Fix) User settings of the hspsepqmax, hspsepsmax, gapsepqmax and gapsepsmax parameters were not consistently exploited for improving search sensitivity. When comparing long sequences (in the extreme case, whole chromosome or whole genome sequences, but also with great benefit for any sequences longer than most genes in the species under investigation), these parameters are useful for restricting consistent groups of alignments to being clustered within relatively short, gene-sized regions and (supposedly) increasing their statistical significance accordingly. Due to the aforementioned inconsistent use of *sepmax* values, however, significant groups of alignments could have been missed, as if the *sepmax* parameters had not actually been used. (The P-values reported were the expected, improved-significance values, which masked the presence of this bug).
Highly similar alignments were unlikely to be missed in any case, but marginally significant alignments (e.g., short exons) within a group were likely to be missed. In the worst case, an entire group of alignments (a complete database "hit") could have been missed, if all members of the group were only marginally significant. [Note: If none of the *sepmax* parameters was used, this bug had no effect on results]. (Fix) When any of the *sepmax* parameters were used in conjunction with the -links option, the Links output line was sometimes truncated. (Fix) Erroneous alignments could be produced when the pingpong option was used. (This bug was introduced with the fix of another bug on 2004-03-24). (Fix) Less likely to segmentation fault when out-of-memory condition arises. (New) Added support for the Soffset= option, as an adjunct to the old Qoffset= option. Soffset causes coordinate numbers reported for all Subject sequences to be adjusted by the specified integer quantity. 2004-04-08 (Fix) Made two WARNINGs more accurate: when HSPs or GSPs were discarded because hspmax or gspmax was exceeded, instances when the alignments were discarded without actually increasing the associated threshold score (S2 or gapS2, respectively) are now warned about distinctly from cases when the threshold was transiently increased. 2004-04-07 (New) BLASTA can now read compressed query sequence files. 2004-03-27 (Fix) In relation to the support added recently for compressed input files in xdformat, gb2fasta, gt2fasta, and sp2fasta, compressed filenames containing special characters were not parsed correctly. 2004-03-24 (Fix) At low frequency, gapped alignments could be truncated, such that either the full extent of similarity was not displayed or (in extreme cases) no alignment at all would be reported because the affected score was below the threshold.
2004-03-23 (New) xdformat, gb2fasta, gt2fasta, and sp2fasta now recognize input file name extensions that are suggestive of the files containing compressed data. If the user specifies an input filename that ends with .Z, .z, -z, _z, .GZ, .gz, -gz, etc., the contents of the file are automatically piped through "gunzip". If the input filename ends with .zip or -zip, its contents are piped through "unzip" instead. Gunzip and unzip must of course be in a user's PATH for this to succeed. (Change) wu-formatdb now displays the native xdformat command that is executed. (Fix) Better command line parsing by wu-formatdb, wu-blastall, and the "seg" filter programs. 2004-03-03 (New) The -wstrict option has been created. When -wstrict is invoked, all ungapped alignments found during the ungapped phase of a search are required to contain an identical word hit (in the usual case of BLASTN usage) or a neighborhood word hit (in the case of TBLASTN and TBLASTX), when searching a nucleotide database sequence that contains one or more ambiguity codes. When a database sequence contains one or more ambiguity codes, candidate alignments are first identified in a variant of the database sequence that contains entirely specific residue codes. Later, the ambiguity codes are put in place, which can obliterate the BLAST seed word hit that was originally used to find an alignment; nevertheless, the software by default will save and continue to work with an alignment seeded by a now non-existent hit, as long as the alignment continues to satisfy the score threshold, because sensitivity is often more important than strict adherence to the BLAST algorithm. The -wstrict option forces each ungapped alignment to contain a seed hit even after ambiguity codes have been put in place. Consequently, some alignments may be discarded when -wstrict is specified. This has downstream effects on the gapped alignments reported, because ungapped alignments provide the seeds for the gapped.
The -wstrict option has no effect whatsoever on BLASTX and has no effect on BLASTP when gapped alignments (the default) are to be produced. Only when BLASTP is invoked with the -nogaps option does -wstrict turn off an otherwise unused, brute-force search step that the program performs in its BLAST 1.4-compatibility search mode. This brute force search step involves linear dynamic programming performed along the entire length of any diagonal found to contain an HSP. This heuristic was added to ungapped BLASTP 1.4 for increased sensitivity but omitted from standard BLASTP 2.0 operations for increased speed in the presence of the more-sensitive gapped alignment method. 2004-03-01 (Fix) Using BLASTN with a short word length (W < 7), sporadic crashes (segmentation faults) could arise when searching a nucleotide database sequence containing ambiguity codes. The likelihood of a crash increased with decreasing word length. 2004-02-07 (Change) Status messages from xdformat are more informative and consistent. (Change) xdformat examines the command line for duplicate input file names or database names; and complains and exits non-zero, if a duplicate is found. 2004-02-01 (Change) In xdformat and xdget, the upper bound has been increased on the in-memory cache size supported for sequence identifier indexes, particularly when these programs are executed in a 64-bit virtual addressing environment. See the -M option of xdformat and xdget (which are actually the same program, just invoked by two different names to yield two different behaviors). The default cache size for sequence identifier indexes (.xni and .xpi files) is 512M. (A smaller cache may be used under resource-limiting conditions). When a larger index will be produced, speed will continue to increase with increasing cache size (specified with the -M option), until the cache is as large as the ultimate size of the .x[np]i index file. (Change) Indexing of sequence identifiers by xdformat is slightly faster. 
2004-01-30 (Fix) Index files used by the xdget program to retrieve sequences by identifier (i.e., the .xni and .xpi files produced by xdformat with its -I and -X options) were previously limited to being 4 GB in size. Furthermore, in situations requiring an index file larger than 4 GB, the contents of the file were silently corrupted, which made the index unusable. (Fix) The Start: and End: times reported by xdformat lacked data for minutes and seconds. 2004-01-28 (Change) For large databases, like nr or GenBank, xdformat now indexes sequence identifiers significantly faster, when invoked with either the -I option during database creation or -X (used for index creation at a later time or index reconstruction). 2004-01-21 (Fix) Contrary to its usage display, xdformat would not accept multiple database names when the -i operational mode was specified. 2004-01-20 (Change) Hopefully improved memory management in the nrdb program, to reduce memory fragmentation and increase the efficiency of memory utilization. 2004-01-15 (New) The syntax for dbslice usage is expanded to include a range of slices in the form dbslice=a-b/n, where 0 < a <= b <= n. For example, the expression "dbslice=11-20/500" designates slices 11 through (and including) 20 to be searched, out of 500 total slices. This permits a database to be more equitably divided between cluster nodes, when individual nodes have different performance characteristics. 2004-01-12 (New) The dbslice=m/n option allows the database to be conveniently sliced at run time into n equivalent-sized partitions (counting sequences, not residues), where only the m-th partition (1 <= m <= n) is searched. 2004-01-07 (*** slipstream update for some distributions***) (Fix) Some distributions for Linux platforms lacked a complete all upper- and all lower-case complement of amino acid scoring matrix files. E.g., the file "blosum62" was present in matrix/aa but "BLOSUM62" was missing.
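As an illustration of the 2004-01-15 entry, the range form dbslice=a-b/n lets a job scheduler hand faster nodes a larger share of the database. The sketch below is one possible way to compute such ranges and is not part of the distribution: the function name dbslice_args and the notion of a per-node relative speed weight are assumptions made for the example. It produces contiguous, non-overlapping ranges that together cover slices 1 through n and satisfy 0 < a <= b <= n.

```python
from typing import List


def dbslice_args(total_slices: int, weights: List[float]) -> List[str]:
    """Build one dbslice=a-b/n argument per node, sized in proportion to its weight."""
    if not weights or any(w <= 0 for w in weights):
        raise ValueError("every node needs a positive weight")
    if total_slices < len(weights):
        raise ValueError("need at least one slice per node")

    total_weight = sum(weights)
    args = []
    start = 1
    cumulative = 0.0
    for index, weight in enumerate(weights):
        cumulative += weight
        remaining_nodes = len(weights) - index - 1
        if remaining_nodes == 0:
            end = total_slices  # the last node always finishes the database
        else:
            end = round(total_slices * cumulative / total_weight)
            end = max(end, start)                          # at least one slice here
            end = min(end, total_slices - remaining_nodes)  # leave one for each later node
        args.append(f"dbslice={start}-{end}/{total_slices}")
        start = end + 1
    return args


if __name__ == "__main__":
    # Three nodes, the third assumed to be twice as fast as the other two.
    for arg in dbslice_args(500, [1, 1, 2]):
        print(arg)
```

Since slices are equal in sequence count rather than residue count, the actual work per slice can still vary; using many more slices than nodes gives the proportional split finer granularity.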
2003-12-15 (Fix) Too-short alignments were sometimes produced because gapped alignments were sometimes not extended in the reverse direction as far as expected (or as far as should be), given how far the alignments had been extended in the forward direction. 2003-10-22 (Fix) On encountering invalid nucleotide codes in its input, the "dust" filter program would sometimes crash. Any invalid code encountered is now treated internally to dust like an "N", although it will appear unaltered in the output (unless the nucleotide is masked). 2003-10-21 (Change) Any subsequent -matrix= command line options specified after the first were ignored by BLASTA. Now the last one specified is the matrix used for the search, instead of the first one. (New) Specifying -altscore=none on the command line will clear or nullify any prior altscore specifications on the command line. (Change) Better diagnostic message from xdformat when database file size exceeds the precision of file offsets being used. 2003-10-03 (Fix) The pam program raised array bounds errors for pam distances > 255. It now works as expected for PAM distances up to 4095(!). 2003-09-22 (Fix) When executed with the -X (index) option, xdformat was exiting non-zero, even if no error was actually detected. 2003-09-10 (Fix) Corrected the check for unambiguous output database file names in xdformat (use of the -o or -a option), when stdin or multiple input files are specified. 2003-09-04 (Change) Pragmatic default cache size selection for indexes in xdformat/xdget. The -M option, if used, overrides the default. 2003-08-21 (Change) Databases with a version assigned via the -v option of xdformat now have that version string reported at the end of blast search output. "Version: ..." will appear only when a version was assigned. (Change) Tweaked settings of resource limits. 2003-08-14 (Fix) Neighborhood word lists were sometimes not managed properly, since the 2003-03-27 release.
This resulted in segmentation faults at low frequency. 2003-05-16 (New) A new "spoutmax" option limits the number of segment pairs reported per database sequence. The default is no limit (spoutmax=0). (Change) The "hspmax" and "gspmax" options now strictly limit the number of ungapped HSPs and gapped HSPs, respectively, that are saved for subsequent processing steps. The default value for hspmax remains 1000, while the default gspmax=0 imposes no limit. 2003-04-11 (New) More consistency checks of parameter settings. 2003-03-27 (Change) Minimal-to-greatly improved speed for most searches, with most improvement seen for larger searches. 2003-02-20 (Change) Improved speed of BLASTN with long queries and the default word length. (New) Implemented the "cdb" command line option to force BLASTN to search databases in their compressed form. **** Version [2003-02-16] Posted **** 2003-02-16 (Change) Improved speed and slightly reduced memory requirements for BLASTN, with short queries when the word length used is 6 < W < 11. This is achieved by searching the compressed form of the database sequences. Sensitivity in regions of the database sequences containing ambiguity codes may be compromised, however, relative to the previous default behavior of searching uncompressed database sequences with all ambiguity codes instantiated. (New) Implemented the "ucdb" command line option to have BLASTN unconditionally search databases in UnCompressed form, with ambiguity codes instantiated. This new option complements the change in behavior described above, so users can still obtain the previous default behavior. This option can significantly improve speed for long queries and database sequences -- albeit at the expense of memory. 
CAUTION: ucdb causes the database to be unconditionally searched in uncompressed form, regardless of word length or query length; this may result in increased memory use and execution time, in cases where the database would ordinarily be searched in its compressed form. This option offers improved sensitivity for databases in XDF format; no improvement will be seen for databases in the original BLAST 1.4 format, even though additional memory will still be used. (Change) Small speed-up in the "dust" complexity filter. **** Version [2003-02-04] Posted **** 2003-02-03 (Change) A small rearrangement to the source code for XDF database I/O has been found that avoids the Intel ecc compiler problem described yesterday. This now permits high optimization to be used. **** Version [2003-02-02] Posted **** 2003-02-02 (Fix) An apparent bug in the Intel C ("ecc") compiler optimizer for Linux on Itanium (IA64) was observed to produce ERROR messages and cause BLASTN to miss database hits potentially at high frequency (perhaps 10% or more). Aberrant code produced by the compiler at its highest optimization level was localized to a single function involved in the processing of nucleotide ambiguity codes during XDF database I/O. Reducing the optimization level seems to have resolved the matter. 2003-01-30 (Change) In protein-level search modes, the default neighborhood word score threshold, T, for W=2 is now set to the same value as when W=3. Previously, the default behavior for word lengths other than 3 and 4 was not to use any neighborhood words at all, just identical word hits. **** Version [2003-01-28] Posted **** 2003-01-28 (Fix) blasta: floating point exceptions often arose under Tru64 UNIX when the poissonp option was used. (Fix) xdformat: when indexing a database, the appearance of redundant or duplicate IDs in the input FASTA data could lead to an unnecessarily fatal condition being raised. 2003-01-27 (Fix) xdformat: corrected the auto-sizing of file offsets. 
**** Version [2003-01-18] Posted **** 2003-01-18 (Fix) For the alignments of a given database sequence reported in the output, when multiple HSPs had the same score and at least one of these same-scoring HSPs was ascribed an E-value of 0, these same-scoring HSPs may not have been sorted relative to each other by E-value. (Note: this had no effect on the relative ranking of database sequences). 2003-01-17 (Fix) A bug in the gapped alignment routines could sometimes cause satisfactory alignments to be missed. The problem could only arise when searching a database sequence containing one or more ambiguity codes; with typical scoring systems that are used, the problem could also only arise when ambiguity codes other than N (any) were present in the immediate vicinity of the aligned segment of the database sequence. 2003-01-16 (Fix) A bug in the gapped alignment procedures used in the BLASTN search mode produced sporadic FATAL errors ("Non-positive score returned from ExpandX"). The bug was introduced in the 2002-11-12 release, when the pingpong option was added. 2003-01-11 (Fix) Corrected inconsistencies in command line parsing that would cause some command line expressions to be rejected. 2002-12-17 (Fix) Cleaned up wu-blastall script's support for -A and -P options. (Change) Adjusted cutoff scores used by the wu-blastall script to be closer to the new values used by NCBI blastall, November 2002 release. **** Version [2002-12-07] Posted **** 2002-12-07 (Fix) Multi-sequence query files could catapult the search programs into endlessly searching the database with the same queries over and over and over... (since the [2002-11-12] release). **** Version [2002-11-21] Posted **** 2002-11-21 (Fix?) Eliminated the 64-bit virtual addressing "improvement" from IRIX binaries. I couldn't adequately test them. 
**** Version [2002-11-18] Posted **** 2002-11-18 (Fix) The improvement in 64-bit virtual addressing mode that was introduced on 2002-10-25 caused the search programs to fail on some platforms (e.g., Tru64 and HP-UX, but not Solaris or Linux) when large database files were involved. **** Version [2002-11-15] Posted **** 2002-11-15 (Fix) The -t option was being rejected by xdformat, when processing peptide sequence databases. This bug arose in the 2002-11-09 release. **** Version [2002-11-12] Posted **** 2002-11-12 (New) A new "pingpong" option invokes extra processing to help ensure the alignments produced are locally optimal. In essence, one time-saving heuristic is eliminated. However, the use of this option typically adds 3-10% to the execution time without altering or improving the results. On rare occasion, though, an alignment and its associated alignment score may be improved. **** Version [2002-11-09] Posted **** 2002-11-09 (New) Added empirical lambda, K, H values for gapped alignments using the "pupy" matrix and gap penalties Q=10 R=10 and Q=20 R=10 with BLASTN. 2002-11-07 (New) xdget now supports a -t option, to have retrieved nucleotide sequences translated in the standard genetic code or in an alternate code specified with the new -c option. For nucleotide sequences, if both start and end coordinates are specified (-a and -b options) and start > end, the reverse complement is implied, rather than being treated as an error (as it still is for peptide sequences). (Change) Synchronized genetic code definitions with the NCBI Toolbox. **** Version [2002-11-04] Posted **** 2002-11-04 (Fix) The "seqtest" option was ignored if it was specified prior to the Z=# option (to the left of Z) on the command line. 2002-10-31 (New) The "noedge_effect" command line option turns off Altschul's edge effect in statistical significance calculations. 
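Illustration for the 2002-11-07 xdget entry above: a minimal Python sketch of the coordinate rule it describes, in which start > end implies the reverse complement for nucleotide sequences but remains an error for peptide sequences. This is not xdget's code. The function name `extract_segment` is hypothetical, 1-based inclusive coordinates are an assumption, and only A, C, G and T are complemented (ambiguity codes are left unchanged for brevity).

```python
# Hypothetical sketch of the xdget start/end rule; not the actual implementation.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def extract_segment(seq, start, end, is_nucleotide=True):
    """Return the segment from start to end (assumed 1-based, inclusive)."""
    if start < 1 or end < 1 or max(start, end) > len(seq):
        raise ValueError("coordinates out of range")
    if start <= end:
        return seq[start - 1:end]
    if not is_nucleotide:
        # For peptide sequences, start > end is still treated as an error.
        raise ValueError("start > end is not allowed for peptide sequences")
    # start > end on a nucleotide sequence: reverse complement is implied.
    return seq[end - 1:start].translate(COMPLEMENT)[::-1]
```

For example, `extract_segment("AACGT", 4, 2)` takes the segment "ACG" and returns its reverse complement.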
(Change) "xdformat -i", which retrieves descriptive information about a database, now reports duplicate, redundant, and missing identifier counts for the database, if the -I option is included on the command line. 2002-10-28 (New) Added a -sort_by_subjectlength option, to sort results by subject (database sequence) length, from longest to shortest. (Fix) An error message displayed on integer input errors from the command line was wrong. **** Version [2002-10-25] Posted **** 2002-10-25 (Change) Slight improvement in efficiency of the applications compiled to use 64-bit virtual addressing (P64). **** Version [2002-10-20] Posted **** 2002-10-20 (Fix) Errors in the reported alignments could arise when the -qoffset option was used. An internal consistency check would fail in these cases, such that any occurrence of these errors was always associated with the appearance of an ERROR message in the output; however, some users may unadvisedly have been using the -errors option, which suppresses such ERROR messages. When these errors arose, the alignment scores and start and end points were correct (optimal), such that the P-values and E-value statistics were unaffected, but the path of the alignment from start to end was incorrect (suboptimal) and this would result in suboptimal values reported for the number and percent Identities and Positives. 2002-10-15 (New) Added support for "-shortqueryok" option, to make situations where the query sequence is shorter than the word length a non-fatal error. (Change) More than 4 threads can now be explicitly requested in the BLASTN search mode, using the cpus=<n> option, with n > 4. BLASTN still will use no more than 4 threads even on computers configured with more than 4 processors -- unless so requested. **** Version [2002-10-11] Posted **** 2002-10-11 (New) xdget can now output specific sequence segments and optionally reverse-complement nucleotide sequences. 
When a segment is reported, the coordinates are appended to the defline with an "SQ" tag. If the sequence is reverse-complemented, this is indicated at the end of the defline by an "RC" tag. (Fix) wu-blastall was not interpreting the -f option in the same manner as blastall. **** Version [2002-10-07] Posted **** 2002-10-07 (Fix) Corrected the internal definition of the "M" nucleotide ambiguity code, from erroneous C/T to the correct A/C. **** Version [2002-10-05] Posted **** 2002-10-05 (Fix) Command line settings of the Y parameter (effective length of the query) were being ignored, dating back to the introduction of support for segmented query sequences (2000-05-20). 2002-09-26 (Change) Added support to xdformat/xdget for indexing of the new third-party annotation tags in FASTA identifiers: tpd, tpe, and tpg. (tpd = DDBJ, tpe = EMBL, tpg = GenBank) **** Version [2002-09-16] Posted **** (Fix) The -mmio option was broken by the Mac OS X work-around for large files made on 2002-09-10. This option is principally used to help diagnose problems, though, and not for routine use. **** Version [2002-09-10] Posted **** (Fix) Work-around found to support "large" files (files > 2 GB in size) over NFS by Mac OS X NFS clients. **** Version [2002-09-09] Posted **** 2002-09-09 (Change) Selenocysteine residues (IUPAC code 'U') are now scored by default like unknown residues (IUPAC code 'X'), with the exception that U-U pairs score 0. These default scores can be overridden by providing explicit scores for U in the scoring matrix file. Previously, even though the software has been "selenocysteine-aware" and allowed scores for U to be specified in the scoring matrix for years, none of the scoring matrices distributed in the WU BLAST package has ever specified substitution scores for U; and the default substitution scores involving U were chosen to be large negative values, such that selenocysteine could never appear in an alignment except in a gap. 
The new default scores will often permit U to appear in alignments aligned with a U or aligned with other residues, not just in gaps. (Change) The fraction of "Identities" (identical residues) reported for a sequence alignment is now computed slightly differently. Previously, an aligned pair of residues was called an "identity" if-and-only-if the substitution score was positive and the two residue codes were the same. Under the new rules, the substitution score is no longer relevant. An aligned residue pair will be called an "identity" only on condition that (1) the residue codes are the same and (2) the residue codes are not ambiguity codes (e.g., not B, Z, or X for amino acid sequences). The new rules permit aligned pairs of selenocysteine (U) residues to be counted as identities, even if the scoring matrix specifies a non-positive substitution score for a U-U pair. Computation of "Positives" remains unchanged. **** Version [2002-09-06] Posted **** 2002-09-06 (Fix) On Tru64 platforms only, some searches could abort due to detection of a floating point output error. 2002-09-04 (Change) gt2fasta now reports SOURCE information rather than ORGANISM, in response to the NCBI moving organellar qualifiers from ORGANISM to SOURCE in GenBank Major Release 131. (Change) Allow /etc/sysblast to be configured to prevent BLAST execution entirely on a computer, by setting cpusmax=<a negative integer>. (Fix) Slightly better rollback recovery if/when xdformat is interrupted. **** Version [2002-08-28] Posted **** 2002-08-28 (Fix) CPU time reporting was inaccurate under Linux when multiple threads were used. **** Version [2002-08-14] Posted **** 2002-08-14 (New) Support is provided for a system-wide configuration file named "/etc/sysblast". Parameters "cpus=<int>", "cpusmax=<int>", and "nice=<int>" can be set as desired, one parameter per line. See the accompanying file README.html for details. 
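Illustration for the 2002-09-09 change to "Identities" above: a minimal Python sketch of the new counting rule, in which a column is an identity when the two residue codes are the same and are not ambiguity codes, regardless of the substitution score. The function name, the use of "-" as the gap character, and upper-case input are assumptions for the example; the ambiguity set B, Z, X is the one the entry gives for amino acid sequences.

```python
# Hypothetical sketch of the post-2002-09-09 identity rule; not WU-BLAST source.
PROTEIN_AMBIGUITY_CODES = frozenset("BZX")

def count_identities(query_row, subject_row,
                     ambiguity_codes=PROTEIN_AMBIGUITY_CODES):
    """Count identities between two aligned rows of equal length."""
    if len(query_row) != len(subject_row):
        raise ValueError("aligned rows must have the same length")
    return sum(
        1
        for q, s in zip(query_row, subject_row)
        # Same code, not a gap, not an ambiguity code; the score is ignored.
        if q == s and q != "-" and q not in ambiguity_codes
    )
```

Under this rule a U-U column counts as an identity even when the matrix scores it as 0, while an X-X column does not.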
**** Version [2002-06-24] Posted 2002-08-10 (slipstream) **** 2002-06-24 (slip-stream revision for IA64 platforms only, 2002-08-10) (Fix) Data conversion problem on some IA64 platforms when manipulating indexed databases with xdformat and xdget. **** Version [2002-06-24] Posted **** 2002-06-24 (Change) The maximum allowed value for the E=# command line option was increased from 10000 to DBL_MAX. "DBL_MAX" is a platform or hardware specific limit on double precision floating point values that is often greater than 1e300 (10**300). **** Version [2002-06-07] Posted **** 2002-06-07 (Fix) The cpus=# option, if specified, was not being honored by the search programs -- they proceeded to use the default number of CPUs or threads regardless. This bug was introduced in the 2002-05-15 release. **** Version [2002-05-29] Posted **** 2002-05-29 (Fix) When copious output was produced, users were advised to use a low complexity filter, even when the wordmask option had been specified. **** Version [2002-05-15] Posted **** 2002-05-15 (Fix) Another failure mode repaired where an invalid context in the query would produce a segmentation fault. This bug was originally fixed in the 2002-03-30 release, but re-introduced in a different form in the 2002-04-02 release. 2002-04-26 (Fix) When appending sequences to an existing database with xdformat, the originally set title of the database was not maintained, being instead replaced by the name of the current FASTA input file (unless the -t option was specified when appending). (Change) Default sort order of subject sequences has been improved. In the case that the best P-values are identical (e.g., 0), the subject for which the highest scoring alignment was found is reported first. (Change) Made Sum P-values a little more accurate when hspsep[sq]max and gapsep[sq]max parameters are used. **** Version [2002-04-15] Posted **** 2002-04-15 (Fix) Parsing of -altscore command line options was broken in 2002-04-14 release. 
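Illustration for the 2002-04-26 change to subject sort order above: a minimal Python sketch in which subjects are ordered by best P-value and, when those are identical, the subject with the highest-scoring alignment comes first. The record layout (`Subject` with `name`, `best_p`, `top_score`) is hypothetical, and any further tie-breaking used by the search programs is not described in the entry and is not modelled here.

```python
# Hypothetical sketch of the tie-breaking rule; field names are assumptions.
from collections import namedtuple

Subject = namedtuple("Subject", ["name", "best_p", "top_score"])

def sort_subjects(subjects):
    """Sort by ascending best P-value, then by descending top score."""
    return sorted(subjects, key=lambda s: (s.best_p, -s.top_score))
```

With two subjects whose best P-values are both 0, the one with the larger `top_score` is reported first.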
**** Version [2002-04-14] Posted **** 2002-04-14 (Fix) More robust command line parsing, to reduce user input errors. **** Version [2002-04-12] Posted **** 2002-04-12 (Fix) Segmentation fault in xdformat when indexing empty identifiers (e.g., gb||). (New) Support for Intel Pentium4 (i786) under Linux. **** Version [2002-04-11] Posted **** 2002-04-11 (Fix) Parsing of FASTA deflines was errant when deflines began with white space instead of an identifier(s). (Fix) No longer trapping SIGFXFSZ (filesize) or SIGPIPE. **** Version [2002-04-06] Posted **** 2002-04-06 (Change) Made the filter= and wordmask= specifications more flexible, as far as how custom filters are supported. See http://blast.wustl.edu/blast/README.html#Filters for details. **** Version [2002-04-02] Posted **** 2002-04-02 (New) Significant speed improvement in all search modes on many classes of large problems. (New) Created "gspmax" command line option to govern the number of gapped HSPs reported. The "hspmax" command line option now strictly governs the number of ungapped HSPs that feed the gapped alignment phase. hspmax=0 and gspmax=0 imply no limit. **** Version [2002-03-30] Posted **** 2002-03-30 (Fix) Segmentation violation always occurred whenever some (but not all) contexts were "invalid" (e.g., could not satisfy the cutoff score). (Change) Small performance improvement. 2002-03-25 (Change) Tweaked wu-blastall script. 2002-03-22 (Fix) When dumping a database into FASTA format (-r option), xdformat was exiting non-zero when no error condition existed. **** Version [2002-03-19] Posted **** 2002-03-11 (Change) Sum statistics now take into consideration any settings of the following command line options: hspsepqmax, hspsepsmax, gapsepqmax, gapsepsmax. See the on-line README for the parameters' descriptions at http://blast.wustl.edu/blast/README.html These options impose distance limitations that are now factored into the search space size used in computing Sum P-values. 
The result can be improved sensitivity, when the set distances are shorter than the query and/or subject sequences. If none of these options has been specified in the past, then no change in P-values will be observed. 2002-03-06 (Change) Improved BLASTN search speed on some platforms (perhaps most notably on Solaris/SPARC), by recovering speed lost during an extensive code reorganization undergone in January 2001. 2002-02-20 (Change) Added new "dbchunks" command line option, to allow the database to be split into an arbitrary number of chunks for assignment to threads. Higher values may be advised when the database contains sequences that vary widely in length or composition. (Change) The effective number of database "chunks" was restored to 500 from its previous value of 1000. Chunks had been 500 for eons before raising it to 1000 during the past year. Raising it to 1000 has proven particularly inefficient on EST database searches, even though genomic searches may have proceeded more smoothly. (Change) Tweaked the wu-blastall script. (Change) Only one warning of Karlin-Altschul parameters not available, instead of one report for every reading frame or strand of the query. 2002-02-07 (Fix) BLASTN search mode was effectively filtering the reverse complement of the query sequence twice, in the case that both strands of the query were being used (which is the default) along with certain kinds of filter or wordmask programs. When a filter program such as "seg" (or more specifically "nseg" in this case of nucleotide sequences) was used -- a program which can further mask an already masked sequence -- anomalous results were sometimes obtained for the reverse complementary strand, along with occasional segmentation violations. This bug only affected BLASTN in conjunction with filter programs that behave like nseg (e.g., not the version of dust that is distributed with WU-BLAST 2.0). 
2002-01-14 (Fix) Multi-sequence query files will no longer cause the blast search program to halt when a zero-length sequence is encountered. 2002-01-11 (Fix) Parsing of accessions from RefSeq flat files. 2001-11-20 (Fix) Fixed MAJOR BUG introduced in BLASTA sometime after 2001-11-16 release. 2001-11-19 (Change) Updated xdget's usage display to include mention of the -M option. 2001-11-18 (Fix) Rather than reporting the error and continuing with the next requested identifier (if any), when presented with an identifier of a class not found in the database or of improper syntax, xdget would report "Index error" and exit nonzero. 2001-11-16 (Fix) If a database had not originally been indexed, the first time xdformat was run with the -X option on the database, an empty index was created. Any subsequent invocations with -X would create a proper index, but the first time should have done it, too! 2001-11-14 (New) Added -F option to XDGET to write an ASCII formfeed or newpage character (Control-L) followed by a newline, and then flush the output stream after each request. This facilitates interaction with XDGET over a two-way pipe, such as between a client and server, in such a manner that deadlock can be avoided where each program gets stuck waiting for input from the other. (Fix) Marginal speed improvement in XDGET startup time. ********* Released version dated [2001-11-12] (slipstream release) 2001-11-13 (Fix) Database read error on large databases (file offsets > 4 bytes). ********* Released version dated [2001-11-12] 2001-11-11 (New) The wu-formatdb script now supports the indexing options (-o and -s) of the NCBI formatdb program. 2001-10-31 (Fix) The cpus=# option was being ignored under HP-UX. 2001-10-25 (New) Parsing of SV (sequence version) line in EMBL database files by the sp2fasta program. 2001-10-24 (New) Indexing of sequences by identifier in XDF databases, using enhanced "xdformat" program. Existing databases can be indexed without reformatting. 
See enclosed README.html for further details. (New) Retrieval of sequences by identifier with new program "xdget". See enclosed README.html for further details. (Fix) Sporadic problem reading long definition lines from XDF databases. ********* Released version dated [2001-10-01] 2001-10-01 (Fix) Tru64 UNIX incompatibility with 32-bit ("p32") binaries built under version 4.0 for execution under version 5.0. (Fix) Code cleaned up for 100% compatibility with Mac OS X 10.1. ********* Released version dated [2001-09-23] 2001-09-23 (Fix) Long filename problem under HP-UX 11.0. (Fix) More error checks. (Fix) Tweaked defline parser. 2001-09-13 (Fix) Fixed a bug in blasta's parser of database definition lines that was introduced 2001-09-09. (Fix) Added more error checks to xdformat. This also corrects the dumping of any sequences back into FASTA format that contain NULL (nonexistent) definition lines. 2001-09-09 (Fix) Further tweak to get completely around Tru64 problem first thought to have been addressed on 2001-06-07. A problem (leading to FATAL error) was only apparent when using multi-sequence query files, only sporadically after the first query sequence, and only under Compaq Tru64 UNIX. 2001-09-04 (Fix) The query was not displayed with the advertised lower-case indication of soft masked (word masked) residues. Only hard masked (filtered) residues were indicated. 2001-09-02 (Fix) Free memory error introduced 2001-08-27. 2001-08-27 (Fix) Added more protections against anomalous data. (Fix) Improved compile- and link-time options for use of large memory under IBM AIX. 2001-06-07 (Change) More consistency/error checking on database I/O. (Fix) Tweaked code to avoid what appears to be a bug in Tru64's buffered I/O routines that led to sporadic crashes and other indecent behavior. 2001-06-05 (Fix) Added support for "ref" (RefSeq) identifiers, which had been missing. 
(Fix) Unrecognized sequence identifier tags (such as "dog" in the identifier string "gi|1583|gb|AC1583|AC1583|dog|WOOF001") now cause the left-to-right parsing of identifiers to halt and no longer result in a space character (instead of the vertical bar) being displayed in BLAST output after the last recognized tag in the string. A warning or error message really ought to be displayed when unrecognized identifiers are encountered, but that would be inconsistent with previous behavior. (Perhaps it will be better anyway to warn of this problem if it's encountered while building the BLAST database). (New) pir2fasta now supports a -a option, to have ACCESSION omitted from the output identifiers. ********* Released version dated [2001-06-01] 2001-06-01 (Fix) xdformat can now find databases in the current working directory, without having to specify an explicit path to it. (Fix) A resource allocation problem existed under some command shells (only tcsh observed) on some computing platforms that led to an out-of-memory condition when query sequence filtering (or wordmasking) was requested. 2001-05-26 (Change) The pressdb and setdb programs in the software distributions have been relegated to pressdb.real and setdb.real, to be replaced by soft links named "pressdb" and "setdb" that point to xdformat. ********* Released version dated [2001-05-18] 2001-05-21 (New) Now bundling the "patdb" utility program, for removing redundancy from FASTA sequence database files. The program has been in routine use in the lab for about 7 years. Patdb performs a similar function to the older "nrdb" program, but it has the optional (via the -s option) ability to identify not just 100% identical sequences over their entire length but perfect (100% identical) substrings, as well. The algorithmic techniques employed include Patricia trees and deterministic finite-state automata (see Gish, 1989, http://blast.wustl.edu/blast-1.4/gish/doc/dfa.3.pdf). 
The net result is a program that is not appreciably different in speed from nrdb, yet it can glean an extra few percentage points of compression, depending on the input. However, patdb is less well suited to working with large nucleotide sequence data sets than is the nrdb program, because patdb works in memory with all sequences at 1 residue per byte, whereas nrdb can compact nt. sequences to 2 or 4 nucleotides per byte. Typical usage for protein sequence databases might be "patdb -s 20". Start and end coordinates of perfect substrings are currently appended to the defline for the associated sequence. While the lab has benefited from using patdb, before blindly using this program yourself, the speed benefits of substring elimination should be weighed against the potential post-database-search complications arising from hits against longer database sequences that contain as a perfect substring the actual sequence(s) of interest. 2001-05-18 (Change) The echofilter option now causes the query sequence to be reported in the output, regardless of whether any of the filter, lcfilter, wordmask, or lcmask options have been specified, but only after application of any/all requested filters and masks. Previously, the query was reported in the output iff filter or lcfilter was specified. As before, "masked" letters are displayed in lower-case, whereas "filtered" letters produced by the bundled low-complexity filter programs (e.g., seg, dust, xnu) will be displayed respectively as X or N, for amino acid and nucleic acid alphabets. 2001-05-17 (Fix) Removed a minor inefficiency in working with XDF nucleotide sequence databases with ambiguity codes. Accompanying this inefficiency on some (but by no means all) computing platforms and only with some databases (e.g., the UCSC whole chromosome sequences), BLASTN run time may previously have been severely impacted. 
(Fix) Corrected a bug in factorial calculations that was introduced to the bundled seg, pseg, and nseg programs on 2001-05-02. This is not expected to have any apparent effect, but simply makes the code correct. 2001-05-06 (Fix) Corrected the sizes (memory use) reported for DFA structures. ********* Released version dated [2001-05-02] 2001-05-02 (Fix) Fixed the source of segmentation violations in all of the bundled seg filter programs (seg/nseg/pseg), most easily seen under Linux when filtering long query sequences. 2001-04-28 (New) Added "evalues" option to have E-values instead of P-values reported in the first section of output; a currently redundant "pvalues" option is also available. (Fix) Cleaned up command line parsing. 2001-04-20 (Fix) More cross-platform large file access code clean up. ********* Released version dated [2001-04-12] 2001-04-12 (Change) The platform description reported at the beginning of program output now indicates sizes for integer, long, and pointer data types as ILP32 or ILP64. Large file (>2 GB) support is indicated by F64. (Fix) Fixed configuration for large file support under HP-UX when 32-bit virtual addressing is being used (re: the "-n32" distributions). (Fix) System processor count was incorrectly assessed under HP-UX. (Fix) Now trapping SIGPIPE. 2001-04-10 (Fix) Fixed configuration for large file support under IRIX when 32-bit virtual addressing is being used (re: the "-n32" distributions). (Change) Slight change to interrupt handler in xdformat. 2001-04-03 (Change) Eliminated temporary files altogether in the filtering of query sequences (re: the "filter" and "wordfilter" options). Filter programs now must read (write) sequences from (to) standard input (output), which is often signified by "-", "stdin", or "stdout" in UNIX parlance. ********* Released version dated [2001-03-31] 2001-03-31 (Fix) Potential crashing bug fixed in TBLASTN and TBLASTX search modes that was introduced in the [12-Dec-2000] release. 
********* Released version dated [2001-03-27] 2001-03-27 (Fix) Temporary files (such as those created during sequence filtering) are now cleaned up (deleted) even if the search program is interrupted. 2001-03-23 (Change) Switch from using tmpnam() to tempnam(), so temporary files can be relocated if necessary to another directory, using the TMPDIR environment variable. (Fix) Made error message more specific to the situation when a temporary file can not be created or written to, either due to lack of permissions or insufficient free disk storage available. ********* Released version dated [2001-02-28] 2001-02-28 (Fix) BLASTA refused to search virtual databases if the input query file contained multiple sequences. 2001-02-20 (Fix) Corrected/clarified some of the usage information displayed by xdformat. 2001-02-15 (Fix) Sped up and reduced the memory requirements of the "dust" (Tatusov & Lipman) external filter program, particularly when operating on huge sequences. ********* Released version dated [2001-02-12] 2001-02-12 (Fix) Alignment errors could arise with "segmented" query sequences -- that is, with query sequences containing one or more hyphens -- if and only if the matching database sequence contained one or more ambiguity codes. If lucky, a prominent ERROR message would be displayed, pointing to a severe bug in the software, but most of the time when this problem arose, a satisfying alignment would simply be skipped without notice. (New) Added "nosegs" option (not to be confused with "noseqs") to turn off the default behavior of segmenting query sequences at any hyphen characters. See README.html for further description. (Change) Made "gapall" the default behavior, which will significantly slow down some searches, but this should be compensated at least in part by recent speed increases. To obtain the previous default behavior, set gapE=2000 on the command line. 
2001-02-01 (Fix) The value of the command line option "vdbdescmax" was not being interpreted properly. 2001-01-11 (Fix) Reduced memory requirements for long sequences, although no change in speed is expected, except on systems with very limited memory, where speed should be marginally improved. (Change) The H=# option is now used to specify the Karlin-Altschul statistics H parameter value (in units of nats), for the evaluation of ungapped alignment scores. Previously, the H option had been used to turn on/off a histogram plot of the distribution of ungapped alignment scores. 2001-01-09 (Fix) Incorrect score sometimes displayed in one of the WARNING messages. ********* Released version dated [2001-01-03] 2001-01-03 (New) The PHAT scoring matrices of Ng, Henikoff, and Henikoff are now included. See Bioinformatics 16:760-766 (2000). NOTE: Empirical values for lambda, K, and H when searching for gapped alignments with these matrices are NOT currently available. 2001-01-02 (Fix) Crashing bug when both the "kap" option and either the "topcomboN" or "topcomboE" options were simultaneously specified. (New) Better reporting of exceptional conditions. ********* Released version dated [2000-12-13] 2000-12-13 (Fix) A crashing bug (created on 2000-12-12) if the compat1.4 option was specified. ********* Released version dated [2000-12-12] 2000-12-12 Speed bump in all search modes. Latest version of wu-blastall was inadvertently setting W=3 for BLASTN searches, making them terribly slow. 2000-12-07 Fixed the condition upon which a NOTE suggesting usage of a low-complexity filter was displayed. 2000-12-05 Made search programs a bit more intelligent about how to find filter programs when the BLASTFILTER environment variable is not set. 2000-12-02 Fixed a bug in the hitdist (2-hit BLAST) algorithm that overcounted word hits and slowed down searches a bit while only very marginally increasing sensitivity. 
2000-12-01 xdformat with -i option was not displaying the database's Release date, if a Release date had been set with the -d option. ********* Released version dated [2000-11-09] 2000-11-09 Fixed a significant HSP sort bug. In all search modes but BLASTP, HSPs may not have been sorted by score, while the primary sort key (strand) was correctly performed. 2000-11-08 When dbrecmin or dbrecmax was specified, the starting record number in the database was reported erroneously as being 1 greater than the actual starting record number. (The starting record was indeed the one requested on the command line). User settings for dbrecmin and dbrecmax were not being validated against the actual number of records in the database, nor were they being compared for their relative values making sense (dbrecmin <= dbrecmax). If the (WU)BLASTFILTER or (WU)BLASTMAT environment variables are not set, filter programs and scoring matrix files will now be found and used, if they are located respectively in filter/ and matrix/ subdirectories of the directory where the BLAST search program resides. This permits the BLAST software distributions to be unpacked and used immediately, without having to set these environment variables first -- the filter programs and scoring matrices should be found in the expected subdirectories after unpacking. 2000-11-06 If "filter=none" was specified on the command line, this fact was not reflected in the Parameters displayed in the search program output. ********* Released version dated [2000-11-04] 2000-11-04 Added support for controlling the maximum _absolute_ length of overlap between "consistent" HSPs, to complement the existing "olf" and "golf" parameters used to express the maximum overlap as a _fraction_ of the overall alignment length. The new parameters for expressing absolute length of overlap (measured in units of residues) are: "olmax" and "golmax", for ungapped and gapped HSPs, respectively. 
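Illustration for the 2000-11-08 entry above on locating filter programs and scoring matrices: a minimal Python sketch of the described fallback, in which the environment variable is used if it is set and, otherwise, a subdirectory (filter/ or matrix/) of the directory holding the search program is used. The function name is hypothetical, and the order in which the WU-prefixed and unprefixed variable names are consulted is an assumption, since the entry does not state it.

```python
# Hypothetical sketch of the fallback lookup; not the actual WU-BLAST logic.
import os

def resolve_resource_dir(env, var_names, program_path, subdir):
    """Return a directory for filters or matrices, or None if none is found."""
    for name in var_names:  # e.g. ("WUBLASTFILTER", "BLASTFILTER"); order assumed
        value = env.get(name)
        if value:
            return value
    # Fall back to <directory of the search program>/<subdir>.
    program_dir = os.path.dirname(os.path.abspath(program_path))
    candidate = os.path.join(program_dir, subdir)
    return candidate if os.path.isdir(candidate) else None
```

A call such as `resolve_resource_dir(os.environ, ("WUBLASTMAT", "BLASTMAT"), program_path, "matrix")` would apply the same rule to scoring matrices.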
Fixed an adverse interaction between the topcombo options and the newer "links" option that could lead to erroneous link numbers -- or possibly even program crashes -- when topcombo and links were used together. ********* Released version dated [2000-11-03] 2000-11-03 Fixed an adverse interaction between the external "seg" filter program and BLASTA that caused seg to crash under the Linux operating system (and only under Linux) when BLASTA invoked it for filtering the query sequence. The same interaction could be constructed under other operating systems, though, so the potential bug has been fixed for all operating systems. If an error occurs during query filtering, the temporary file used to store the unfiltered query sequence now gets removed (unlinked) consistently. ********* Released version dated [2000-10-26] 2000-10-26 When an environment variable is not set, instead of reporting "<NOT-FOUND>", getenv now reports no value; and when the environment variable is set to an empty string, an empty string ("") is reported as its value. ********* Released version dated [2000-10-23] 2000-10-23 On fully 64-bit platforms (Alpha, Ultra64, MIPS R10000, HP PA-RISC), memory use for long query and long database sequences has been reduced by up to half, with a small attendant increase in speed. Added support for "links" option, to display consistent links of alignments. 2000-10-05 Added -i option to xdformat, to obtain information about an existing XDF database. The option works in setdb- and pressdb-compatibility mode, too, on databases that are internally XDF. 2000-09-29 Failures to fork child processes (e.g., due to the system-wide process table being full) are now logged using the syslog facility and re-tried after a 5 minute sleep. 2000-09-25 Ported to Mac OS X Public Beta 1. 2000-09-04 Produce only a warning, when the maximum ungapped score is less than the gapped alignment score threshold (gapS2). If S is less than S2, S2 is no longer reduced to S. 
This brings some consistency to comparisons of sequences, independently of the database size. Added "maskextra=<n>" option, to mask <n> residues flanking those masked by the lcmask or wordmask=<masker> options. 2000-08-31 Command line options are often reported now, even when a fatal error is encountered, to facilitate diagnosis. 2000-08-30 Added the "getenv" command line option, for interrogating the value of an environment variable. Example usage: getenv=BLASTMAT ********* Released version dated [2000-08-27] 2000-08-27 Fixed an incorrect interaction between the Z and seqtest options, when both were used. Lower case word masks in a nucleotide query sequence, activated by the "lcmask" option, are now propagated to the conceptual translation products in BLASTX and TBLASTX. 2000-08-26 Added "vdbdescmax <n>" option (default n=1) to limit the depth of recursion in describing virtual database components in the output. Setting this limit to 0 means "no limit" and will cause all component databases to be described. Added "putenv" option for setting environment variables in BLASTA. As a security precaution against WWW users setting paths to undesirable directory locations, the "endputenv" option was added; any putenv options that follow it on the command line are ignored. Added -E option to xdformat, for setting environment variables on the command line in a similar fashion to the "putenv" option of BLASTA. The layout of date strings in the output of BLASTA and xdformat can now be entirely controlled by the user, via the CFTIME environment variable, under operating systems that commonly support this mechanism. As an example of how this feature is useful, dates can now be displayed in ISO 8601 standard format by setting the environment variable CFTIME to "%Y-%m-%dT%T" before running BLAST, or by specifying putenv=CFTIME="%Y-%m-%dT%T" on the BLAST command line.
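As a rough illustration of the ISO 8601 layout quoted above, the following Python sketch formats a fixed timestamp with the standard library's time.strftime. It is not part of BLAST: the helper name format_iso8601 is hypothetical, and %T is written out as %H:%M:%S (its POSIX meaning) because not every platform's strftime accepts the shorthand. A fixed UTC instant is used so the result is reproducible.

```python
import time

# Portable spelling of the "%Y-%m-%dT%T" layout: in POSIX strftime,
# %T is shorthand for %H:%M:%S.
ISO_8601_LAYOUT = "%Y-%m-%dT%H:%M:%S"


def format_iso8601(seconds_since_epoch):
    """Format a timestamp (interpreted as UTC) in the ISO 8601 layout.

    Hypothetical helper for illustration only.
    """
    return time.strftime(ISO_8601_LAYOUT, time.gmtime(seconds_since_epoch))


if __name__ == "__main__":
    # The Unix epoch, rendered in the layout described in the entry above.
    print(format_iso8601(0))
```

The same layout string, with %T in place of %H:%M:%S, is what the entry suggests placing in CFTIME on systems that honor it.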
In addition, setting the TZ environment variable to "GMT-0" may cause the date and time to be reported in Coordinated Universal Time (UTC); in this case, appending a "Z" (for zero or Zulu) to the CFTIME specification will make it clear that UTC is being used, as in putenv="CFTIME=%Y-%m-%dT%TZ" putenv=TZ="GMT-0" 2000-08-21 Added descriptions of nwstart and nwlen parameters to the usage information for blastn, tblastn and tblastx search modes. While these parameters have been available in all search modes, they had not been advertised as such by the usage information. 2000-08-20 Restored reporting of neighborhood word counts when the "stats" option is specified. Activated "wink" option for BLASTN searches, which had mistakenly not been activated earlier. This has an effect on BLASTN's speed now! 2000-08-16 Added support for new "lcfilter" and "lcmask" options. lcfilter causes lower case letters in the query sequence to be converted to the appropriate "unknown residue" code (N for nucleotide sequences and X for protein sequences). lcfilter is similar to the NCBI blastall program's -U option. lcmask causes lower case letters to be masked from neighborhood word generation, without altering the sequence itself (e.g., see the wordmask=<masker> option). Added support for -U to the wu-blastall script, which converts it to the WU-BLAST equivalent "lcfilter". 2000-08-15 Fixed bugs in long deflines output by gt2fasta. gt2fasta now also appends /note= information (if available) to the DEFINITION string, when no /gene= or /product= information is available. 2000-08-11 Added support for multi-sequence query files. When the query FASTA file contains multiple sequences, they are individually compared against the database using all of the specified options. All results are sent to the same output stream.
The individual results are delimited from one another by a single ASCII form feed character (control-L), allowing text pagers (such as "more") to be used conveniently to browse through each set of results. The cpu times and start/end times reported are for individual searches, except the last query's which is the total time for all searches. Added support for new command line options: qrecmin, qrecmax. If it is desired to compare only the first query sequence, specify qrecmax=1. 2000-08-10 Added support for -F "m ..." option to the wu-blastall script and use the BLASTA -kap option for all search modes. 2000-08-09 Added support for virtual databases, specified on the command line as white space-delimited lists of real database names. Example: if the protein sequences from the GenBank "pri", "rod", and "mam" divisions are organized in separate databases, then all 3 databases can be searched at one time with the following command: blastp "pri rod mam" query.aa Virtual databases can be comprised of real databases in either XDF or the classical BLAST 1.4 database formats; however, all real databases in a given virtual database must currently be of the same format. 2000-08-09 Added support for new "wordmask=<masker>" option, used to mask words from the neighborhood word list without altering the sequence itself (as sequence _filters_ would do). The acceptable maskers include the same list of filters as the classical "filter=<filter>" option. For example, wordmask=dust should be equivalent to the NCBI's -F "m D" option. Multiple wordmasks may be specified and (just as with the filter= option) wordmask=none cancels any wordmask specifications appearing earlier on the command line. Added check for all search contexts being wordless. 2000-08-08 Added support for multiple (basically unlimited) filter= specifications on a single BLAST command line. 
Filters are run separately on the native query, so the order in which multiple filters are applied does not alter the outcome (with the exception noted below). Each filter's result is OR-ed against the others. The "echofilter" option only displays the final result upon OR-ing all of the filter outputs. Exception: specifying "filter=none" effectively wipes out all filter specifications coming before it (to the left) on the command line. This for example allows a default filter to be specified in a script, which can then be completely overridden by a subsequent specification that might be optionally provided by the user. In the following case, the initial seg filter is cancelled by filter=none, to be replaced by xnu. blastp nr query.aa filter=seg filter=none filter=xnu 2000-08-02 Changed default gap penalties in blastn search mode of wu-blastall, to coincide with new defaults in the NCBI's blastall. ********* Released version dated [2000-08-01] 2000-08-01 Fixed bug in reading of file offsets from large XDF database files. Databases should not need to be reformatted. 2000-07-29 Eliminated all justification of deflines displayed in the one-line descriptions portion of output, to increase the amount of information displayed here. Display "----NO-DESCRIPTION-AVAILABLE----" when the description for a sequence is zero-length. ********* Released version dated [2000-07-27] 2000-07-27 Fixed one-letter truncation of sequence IDs when no additional text was present on the defline besides the ID. Eliminated an occasional, annoying warning from xdformat. 2000-07-25 Bumped the maximum permitted value for the word length (W command line parameter) up to 1024, from the previous maximum of 32. Added -e option to wu-blastall, which surprisingly had been neglected. 2000-06-19 Added knowledge of the NCBI's new coiled-coil protein filter ("ccp") to the BLASTA program, for invocation via a simple "filter=ccp" command line option. 
However, the user must explicitly build this filter themselves from the NCBI Toolbox (see makenet.unx). Why a simple analysis program should be doing network I/O, I dunno. 2000-06-16 Fixed VERSION line GI parse bug in gb2fasta. ********* Released version dated [2000-06-15] 2000-06-15 BLAST search programs would fail to open a pressdb database unless the FASTA file was present. Restored support for the -p option in nrdb program. 2000-06-07 xdformat automatically increases the precision with which file offsets are stored (re: the -O option), when the input data can be determined in advance to warrant such an increase. 2000-06-06 Added more largefile error protections to xdformat. 2000-06-02 XDF database routines had not expected zero-length deflines in the FASTA input data. 2000-06-01 Added support for "-kap" option for Karlin-Altschul (1990) statistics without the consideration of multiple hits as with Poisson or Sum statistics. This option complements the -poissonp and -sump options. 2000-05-20 Added support for segmented query sequences. Default is to segment on boundaries denoted by hyphens ('-'). Use -segment option to turn it off. 2000-04-23 Some lower-scoring HSPs spanning higher-scoring HSPs were occasionally being reported. 2000-04-21 xdformat now accommodates multiple database names when dumping XDF databases into FASTA format (using the -r option). This is useful, for example, when the dumped output is to be piped directly into another program, rather than having to save intermediate files to disk. xdformat accepts "-o outfile" option along with -r, to specify the name of the FASTA output file, overriding the default stdout. 2000-04-07 Fixed the parsing of command line options in sp2fasta. Fixed misassignment of "sp" tag to some nucleic acid sequence entries. ********* Released version dated [2000-04-05] 2000-04-05 Fixed trivial bug in TBLASTN search mode that prevented it from performing searches.
2000-04-02 Fixed a bug in the memory-mapped I/O used with XDF databases that could prevent XDF databases from being read properly if they contained any individual sequence(s) longer than about 2 Mbp or 512Kaa. Fixed the -c option of xdformat, which wasn't accepting nt. ambiguity codes as replacement characters. 2000-03-17 Cleaned up warnings produced in conjunction with nonnegative expected scores (re: the -nonnegok and -novalidctxok options). Users of the -novalidctxok option should probably make sure they're also using the -nonnegok option, too. 2000-02-27 Changed (actually corrected) the parameter name "edegrade" to "topcomboE" in the blast program usage display. 2000-01-27 Fixed a conflict between the use of "topcombo" post-processing (topcomboN or topcomboE options) and all of the sort_by_ options except the default sort criterion, sort_by_pvalue. 2000-01-25 Fixed a disabling bug in XDF support for small-file operating systems (those that are limited to files 2 GB or less in size, e.g., Linux-X86 and Solaris pre-2.6). 2000-01-24 Fixed a crashing bug when the query sequence contained one or more gap characters (-). When query filtering was used, the same bug manifested itself as a pre- and post-filtering mismatch reported in the query sequence length. Corrected an xdformat error message to report the correct input sequence number where the error arose (it had been off by 1). 1999-12-16 Fixed determination of "High Score" for the one-line description section of output. Fixed sort_by_highscore bug. Indent xdformat output. 1999-12-14 Fixed memory mapping problem under HP-UX. 1999-12-13 Fixed BlastStr free error. 1999-12-10 Added support for XDF (eXtended Database Format) databases. 1999-11-23 Standardized on new "WUSTLna" alphabet, which takes the "NCBI4na" alphabet and adds the letter code `X', which is interpreted as "any nucleotide" but can be scored differently than an `N'. 1999-11-22 Fixed minor command line parsing bug. 
1999-11-18 Fixed bug in protein-level comparison modes (BLASTP/BLASTX/TBLASTN/TBLASTX) in the (uncommon) case where word lengths larger than W=6 were requested. 1999-11-17 Put checks for corrupt (too large) BLAST database files in BLASTA that may have been produced by earlier versions of pressdb and setdb. 1999-11-16 Restored ability to compute lambda,K,H when the IDENTITY scoring matrix is requested. 1999-11-15 Added checks to pressdb for excursion beyond 1 GB (4 gigabases) for the *.csq file, 4 GB file size for the *.nhd file, and 4 GB size for the FASTA file. The program will now fail if any of these limits is exceeded, as the BLAST 1.4 database format does not support larger sizes. Added checks to setdb for excursion beyond 4 GB in the *.bsq and *.ahd files. 1999-11-11 Added version, date and platform descriptions to the trivial usage output of setdb and pressdb. ("trivial usage" == when the program is invoked without any arguments or options). 1999-11-10 Converted nrdb from tallying its statistics with "[signed] long" values to "unsigned long", to increase precision another factor of 2 to 2**32 - 1. 1999-10-11 Cleaned and sped up xnu filter program. 1999-10-10 Marginally sped up Sum statistics calculations. 1999-10-07 gb2fasta and gt2fasta now report the accession.version from the VERSION line, when available. 1999-10-06 Plugged a memory leak in old NCBI seg/nseg/pseg filter programs. 1999-09-29 Fixed a small bug in topcombo processing when some E-values were 0. 1999-09-28 Fixed a recently introduced bug in command line parsing of nwstart and nwlen options. 1999-09-14 Incorporated a different fix for the thread synchronization bug originally fixed on 1999-09-01, so individual threads can start searching a hair faster. 1999-09-09 Reverted to previous statistical calculation, which may overestimate the significance of hits but will produce more twilight zone output for users who are interested.
The expected length of an alignment, relative to the lengths of the query and database sequences, had been weighted less heavily, yielding conservative P-values. 1999-09-01 Fixed a thread synchronization bug that caused sporadic "score mismatch" errors and occasional crashes, when multiple cpus/threads were used. Moved computation of default gapE after the point where E is determined, so user changes to E/S will affect gapE. Turned off support for gapS option, as it has been ignored, until a way to support it is figured out. 1999-08-26 Fixed a recently introduced bug in the -mmio option. Improved efficiency of gapped alignment processing. 1999-08-18 Fixed potential crashing bug when "postsw" option is used (this option is currently available only in the blastp search mode). 1999-08-10 Fix-ups to avoid a floating point exception arising under Linux for Alpha. Database names specified on the command line can now contain a relative path, which is tacked onto any directory name(s) specified in the BLASTDB environment variable. Example: if setenv BLASTDB /usr/db; and the database name specified on the command line is "human/chr1"; then the human chr1 database will be found if it resides at the location /usr/db/human/chr1. 1999-08-05 Made Sum p-value calculation more robust to potential floating point exceptions -- overflow and underflow -- in particular to stabilize Linux on Alpha. 1999-06-26 Fixed crashing memory bug in BLASTX, TBLASTX, and BLASTN when filter= option resulted in completely masked (or nearly so) reading frames. 1999-06-23 Fixed a slow-speed bug in TBLASTN and TBLASTX search modes. 1999-06-07 Restored original interpretation of hitdist, but retained new lower limit. 1999-05-23 Migrated "wink" code into distribution/production. Changed the interpretation slightly of hitdist and relaxed the lower limit on its allowed value. If one was using hitdist=n before, they should use hitdist=n-1 now to maintain 100% consistency in their results.
The value for hitdist can now be as small as the wordlength, W. Setting hitdist=W in BLASTN effectively demands a word match be twice as long as W to seed the alignments. 1999-05-18 Added an ftell error check to pressdb.c. Removed the unused inclusion of <ndbm.h> from gish/include/gishsys.h, for RedHat Linux 6.0 compatibility. 1999-05-16 gt2fasta now parses /protein_id="ACCESSION.VERSION" from CDS features. 1999-05-14 sp2fasta was assigning "sp" tags to circular DNA and RNA sequences. 1999-05-12 Search programs weren't looking in current working directory for databases when BLASTDB environment variable wasn't set. 1999-04-09 Updated gb2fasta and gt2fasta to parse GI identifiers from the new VERSION line and /db_xref="GI:#" qualifiers introduced in GenBank Release 111.0. Started including nrdb program in archives of executables, for FASTA database compression. Started distributing seg, xnu, and dust sequence filtering programs in the filter/ subdirectory. 1999-03-01 altscore modifications to scoring matrix were not honored when nondefault scoring matrix was used. 1999-02-23 Fixed assignment of K and L (K and lambda values for ungapped alignments) from the command line. 1999-02-22 Fixed assignment of gapK, gapL, and gapH values from the command line. Eliminated possible negative 0 (-0 [sic]) probabilities reported when Poisson statistics used. 1998-11-21 Fixed TBLASTX's HSP cutoff score (was using S instead of S2). Implemented "nwstart" and "nwlen" across all of the search programs. They had been available only for BLASTP and BLASTX. 1998-11-17 Fixed error in parsing gapL, gapK, and gapH. 1998-11-04 Normalized all note/warning/error/fatal message reporting. 1998-11-03 Replaced standard popen() used to execute filter programs with a home-built version that doesn't spew noise to stderr and cause parsers to gag. 1998-10-30 Converted all remaining calls to fprintf(stderr... in the *blast* search programs to standard ERROR, WARNING and FATAL messages.
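The database lookup behavior described in the 1999-08-10 and 1999-05-12 entries above can be pictured with a small Python sketch. This is only an illustration, not the search programs' actual code: the function name candidate_database_paths is hypothetical, and the use of ":" to separate multiple directories in BLASTDB is an assumption (the entries say only that BLASTDB may name one or more directories). The current working directory is tried last.

```python
import posixpath


def candidate_database_paths(db_name, blastdb=None, cwd="."):
    """List the places a database name might be looked for, in order.

    Hypothetical helper. Assumes BLASTDB is a ":"-separated list of
    directories; this separator is an assumption, not taken from the notes.
    """
    directories = [d for d in (blastdb or "").split(":") if d]
    # The current working directory is searched last, and is the only
    # location searched when BLASTDB is unset or empty.
    directories.append(cwd)
    # A relative path in the database name (e.g. "human/chr1") is simply
    # tacked onto each directory.
    return [posixpath.join(d, db_name) for d in directories]


if __name__ == "__main__":
    for path in candidate_database_paths("human/chr1", blastdb="/usr/db"):
        print(path)
```

With BLASTDB set to /usr/db and the name "human/chr1", the first candidate is /usr/db/human/chr1, matching the example given in the 1999-08-10 entry.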
1998-10-23 Fixed potential for UMR error in BLASTN when only bottom strand is being searched. 1998-10-22 Removed possible source of floating point errors. 1998-10-22 Added test to ensure gap penalties Q >= R. 1998-10-21 Added "ERROR" messages to the list of message types reportable, which now also include WARNING and FATAL. 1998-10-20 Fixed a segmentation fault bug in blastn, tblastn and tblastx when database sequences contained ambiguity codes; in addition, blastn required a nondefault wordlength W < 11 to evoke this bug. 1998-10-14 Added determinism to the compressed databases produced by pressdb in the presence of nucleotide ambiguity codes in the input sequences. Stopped truncating the very last nucleotide from the very last sequence in the input FASTA file to pressdb, when the last line of the file didn't end with a newline character. 1998-10-03 Fixed bug in nucleotide sequence neighborhood word generation (i.e., in BLASTN) when a query filter (e.g., dust or seg) is used. 1998-08-27 Slightly better elimination of overlapping/redundant alignments. 1998-08-13 Fixed duplicate gi identifier bug in gt2fasta.c. 1998-07-02 Implemented "-novalidctxok" option and added description of the "-nonnegok" option to the program usage display. 1998-06-17 Set the default value for the "progress" command line option to 0, because most users may not be using this feature (which was meant to produce keepalive messages in client/server environments) and some unpatched operating systems may otherwise produce spurious Alarm Clock errors. 1998-06-15 Fixed sort_by_highscore. 1998-04-08 Increased max. number of threads or processors to 64 (still subject to the number of processors available in the installed computer). Tweaked E-value computations for Sum and Poisson statistics. 1998-04-06 Made tweak in code to create single "blasta" executable. The blastp, blastn, blastx etc. executables are just soft links to blasta. 1998-04-01 Fixed -span option for gapped alignments.
-span1 sorta works. 1998-03-24 Added first vestiges of dynamic link library (DLL) support for output. 1998-02-17 The optional Poisson statistics weren't returning correct results; and turning off consistency also wreaked havoc. Thanks to Zhirong Bao for pointing out this bug. Somewhat improved the usage information displayed when the search programs are invoked without options. 1998-02-08 Sped up FASTA reading routines a hair for BLASTN, TBLASTN and TBLASTX. 1998-02-05 Posted version 2.0a19 1998-02-04 Fixed a bug Mike Cherry reported that sometimes produced a FATAL error in TBLASTN (and TBLASTX) on the very last sequence in a nt. database, if that sequence contains any ambiguity codes. It's conceivable that this same bug could cause a segmentation fault under some conditions when examining the longest sequence in the database. Small amount of cruft removed. 1998-02-02 Posted version 2.0a18 1998-01-31 The included "pam" program can optionally report floating point (fractional) values. 1998-01-28 The "Searching" crash problem under Linux might be fixed -- we shall see! 1998-01-16 Scoring matrix files may now contain floating point values. Scoring of alignments is still performed using integral values. Fractional values are rounded to the nearest integer, e.g. 1.5 is rounded up to 2 and -1.5 is rounded down to -2. 1998-01-08 Fixed HSP list truncation procedure when there are more HSPs than hspmax allows. In the programs that search more than one strand, HSPs on the minus strand were sometimes discarded when they were more significant than HSPs on the plus strand that were being retained. 1997-12-07 Fixed buffer over-run in gt2fasta. Fixed empty database bug in setdb. gb2fasta now parses PID lines, in case input is "GenPept". Fixed cosmetic bug in the display of "V=#" value in a WARNING text for DEC Alpha platforms. 1997-11-12 Fixed the accumulation of matches beyond the number reportable, which consumed unnecessary memory.
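The rounding of fractional scoring matrix values described in the 1998-01-16 entry above can be sketched in Python. This is an illustration under an assumption: ties are taken to round away from zero, which agrees with the two examples given (1.5 to 2, -1.5 to -2) but is not otherwise confirmed by these notes. The function name is hypothetical. Python's built-in round() is avoided because it rounds ties to the nearest even integer, so round(2.5) gives 2.

```python
import math


def round_half_away_from_zero(value):
    """Round to the nearest integer, sending ties away from zero.

    Hypothetical helper; the tie-breaking rule is an assumption that is
    consistent with the examples 1.5 -> 2 and -1.5 -> -2.
    """
    magnitude = math.floor(abs(value) + 0.5)
    return int(math.copysign(magnitude, value))


if __name__ == "__main__":
    for score in (1.5, -1.5, 2.5, -0.4, 3.2):
        print(score, round_half_away_from_zero(score))
```

Under this rule a fractional matrix entry such as 2.5 would be scored as 3, whereas a ties-to-even rule would score it as 2; which of the two the programs use for such values is not stated in the entry.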
1997-11-10 Added tests for maximum achievable score in each context or reading frame. Searches are not attempted if the cutoff score cannot be achieved. Value specified for gapH on the command line was erroneously being plugged in for gapK -- fixed. 1997-10-30 Added knowledge of the "dust" low-complexity filter to BLASTN, so users can specify "filter=dust" command line option. This filter program must still be installed in the /usr/ncbi/blast/filter directory -- or in whatever directory the BLASTFILTER environment variable points to -- just like all other filters (i.e., seg, nseg, and xnu). Current users of dust will need to update their copy, as well, because dust was not calling exit(0), leading to an undefined exit status that BLASTN interpreted as an error occurring in dust. Roman Tatusov has modified the dust source code posted at the NCBI; and modified source code has been posted on the WU-BLAST Archives. 1997-10-30 Top combinations of HSPs are now sorted by their Group when topcombon feature is used. 1997-10-21 Posted version 2.0a17 Deleted a straggling test left behind from debugging that could cause BLASTN, TBLASTN, and TBLASTX to abort searches -- "Non-positive score returned from ExpandX" -- particularly when searching ambiguity-code-containing sequences like ESTs. 1997-10-15 Posted version 2.0a16 Fixed bug in alignment span detection when comparing gapped vs. ungapped alignments. 1997-10-14 When unacceptable nt. codes were encountered in the input FASTA file, pressdb wasn't reporting the proper error. 1997-10-13 Made fixes to POSIX threads support, which may improve threads performance under Digital UNIX 4.0. 1997-10-11 Speed tweak to BLASTP. Speed tweak-ette to the other search programs. 1997-10-05 Expanded pressdb error messages. Added a platform description to the "Build" string in the introductory output from the search programs -- e.g., "sol2.5-x86" -- and reordered the month, day, and year in the build date. 1997-09-27 Optimized BLASTN a little.
Added double-hit method to BLASTN. Cleaned up a little the tabular display of Parameters. Fixed pattern recognition of some string=string command line parameters, e.g. "nogap" or "nogaps" are now acceptable. 1997-09-23 Fixed the behavior of Z parameter in BLASTP. It was being ignored. 1997-09-22 Made the search programs better able to work in some obscure cases with scoring matrix files that are incompletely specified, in that scores are not provided for absolutely all acceptable letter pairs. 1997-09-21 Posted version 2.0a14. 1997-09-18 Sped up BLASTX, TBLASTN and TBLASTX a little. 1997-09-12 Fixed error in HSP linked list management that on rare occasions caused crashes in the code introduced 6/12/97. In some rare instances, BLASTX was crashing in Solaris qsort, and Purify reported UMR errors in the Solaris qsort() library function. Crashing and UMR errors went away when HeapSort was substituted. PureAtria staff say Solaris qsort() is safe, but my experience says otherwise, so I'm going back to using Old Reliable, HeapSort. 1997-06-11 Fixed minor error in Smith-Waterman score test. Added berror() function for reporting non-fatal ERROR messages, in addition to the existing WARNING messages and FATAL errors. Some internal tests that formerly would have produced FATAL error reports will now simply report the ERROR and continue execution. New "-errors" command line option suppresses ERROR messages, in case they get in someone's way. Got rid of the annoying copyright notice being sent to /dev/tty. 1997-06-12 Eliminated reports of superfluous, inferior alignments. 1997-06-11 Modified memfile.c for HP/Convex SPP compatibility. 1997-06-10 Posted version 2.0a13 Added "postsw" option for Smith-Waterman algorithm to be applied to pairs of sequences that will be reported by BLASTP. The S-W score and alignment, if different from the 2-d BLAST score and alignment, are used to re-rank the database matches before output. 
Eliminated reports of some superfluous, inferior alignments contained within longer ones. Added error checks to all read, write, and seek operations in pressdb and setdb. 1997-06-09 Posted version 2.0a12 1997-05-31 Fixed interactions between gapE2/gapS2 and E2/S2 command line parameters. 1997-05-29 Speed bump for BLASTP, BLASTX, TBLASTN, and TBLASTX (not BLASTN). 1997-05-22 Speed tweak for BLASTP, BLASTX, TBLASTN, and TBLASTX (not BLASTN). 1997-05-15 Posted version 2.0a10 Speed tweak. 1997-05-12 Posted version 2.0a9 Word-hit statistics gathering is now OFF by default, since it consumes about 2% of total cpu time and most users never use the results. Use the -stats option to turn this feature back on. (This reverses the usage of the -stats option, which formerly was used to turn OFF the statistics gathering). In BLASTP, the full-diagonal search for ungapped alignments is skipped when the gapped alignment procedure is in effect -- saves a few % cpu time. Made the number of blank lines output between ungapped and gapped HSP alignments consistently 1. Fixed an inconsistency in the mid-lines of BLASTN alignments. Residue codes instead of vertical bars (|) were sometimes being displayed when no gaps were present in the alignment. The convention is supposed to be that residue codes appear only when there are one or more gaps in the alignment. 1997-04-14 Added nonnegok option for permitting nonnegative expected score cases to halt without exiting nonzero. 1997-02-25 Posted version 2.0a8 1997-02-20 Fixed HSP memory management bug that tended to cause crashes after 100% search completion when the list of database matches needed to be truncated. Removed HSP memory management bug in HSPTruncate related to fwdptr/revptr. Removed duplicate free of a KarlinBlk at end of blastn. Removed memory leak of scoring matrix name info. Added more timing statistics to the end of output. 1997-02-06 Eliminated any reports of exact duplicate gapped alignments when "span" option is used.
Added -s option for simple sequence identifiers to gb2fasta, gt2fasta, sp2fasta, and pir2fasta programs. Added -g option to omit NCBI gi identifiers in output from gb2fasta and gt2fasta. 1997-01-23 Posted version 2.0a7 Fixed GSP (gapped alignment segment pair) consistency check, which worked inconsistently when -span or -span1 command line options were used. No effect on HSP consistency in best P-value calculations, and no effect when span and span1 options were not used. 1996-12-13 Posted version 2.0a6 Fixed a minor file permissions error in setdb and pressdb. 1996-12-04 Added "noseqs" option to produce abbreviated output that may be still parseable by legacy parsers. 1996-12-03 Posted version 2.0a5 Added a "compat1.4" option to revert easily to version 1.4-like behavior, but with relevant bugs fixed. Improved the distribution of database sequences to the threads. Tweaked the search progress indicator so it always goes to "100%" even for databases of less than 100 sequences. 1996-11-27 Initial posting of version 2.0a4. Found and fixed another file addressing bug that could occasionally cause BLASTP and BLASTN to crash. 1996-11-24 Fixed a file addressing error that could yield segmentation faults with the initial 2.0a3 release, particularly when searching small databases. Slip-stream revision posted. 1996-11-22 Fixed a long-standing, occasional inconsistency in the sum statistics reported (since version 1.4). 1996-11-19 Initial posting of version 2.0a3 1996-11-19 When the BLASTDB environment variable has been set, which is a path of database-containing directories, the current working directory is automatically appended to the path. This provides some backward compatibility with previous versions of BLAST software, which looked in the current working directory by default. 1996-11-19 Incorrect bounding diagonals were often being used to constrain alignments with database subsequences for display. 
This affected the appearance of the alignments reported by those programs that search nucleotide sequence databases (BLASTN, TBLASTN, and TBLASTX) -- the programs that buffer database sequences in pieces for display. SCORE_ERROR messages would be seen when the error arose, but the scores reported as "Score = #" and used in the statistics were not affected. 1996-11-14 sp2fasta parses NCBI gi identifiers from the SWISS-PROT 34 flat file. 1996-11-13 Decreased the granularity of the threads. 1996-11-12 Minor rework of database access routines, to reduce virtual address space requirements. 1996-11-12 Removed two sources of slowness in BLASTN 2.0 relative to version 1.4. First, a high default value of 0.5 was being used for E2, which is 10-fold higher than the default value used in BLASTN version 1.4. Worst case, this could slow the program down by a factor of 10. Second, the default word length W has been increased to 11, restoring it to the same default value used by BLASTN 1.4. While these changes reduce the sensitivity of the program, they make direct comparisons easier of the relative performance of versions 1.4 and 2. 1996-11-12 Fixed a bug in sequence numbering (in BLAST version 2.0 ONLY) that caused the right-side coordinate numbers to be in error by 2 nucleotide positions in alignments of translated sequence. This bug could affect both the Query and Subject coordinate numbering, but only on the right side, not the coordinates displayed immediately following the "Query:" and "Sbjct:" strings. Coordinates were only wrong when the alignment contained one or more gaps; and the bug only affected the numbering of sequences that had undergone translation prior to being compared -- e.g., only the query sequence in a BLASTX search. 1996-10-29 Fixed the display of gapped alignments involving long sequences. With coordinate numbers greater than 5 digits in length, the alignments were skewed to the right. 
1996-10-28 Sped up the gapping version of BLASTN and verified that it works properly when wordlength W is varied. SCORE_ERROR bug/feature (sometimes seen with database sequences that contain ambiguity codes) is now history. Increased BLASTN's default value for W from 10 to 11, so it is the same default value used by BLASTN version 1.4, to facilitate and equalize the inevitable comparisons to be made between the two versions. For additional speed, W can now be increased up to 32, albeit at a significant decrease in sensitivity and increase in memory use; the time saved during the search can also be lost in setting up for the search with long word lengths. 1996-10-27 Implemented gapK, gapL and gapH command line options to enable the user to manually set values for the Karlin-Altschul statistics' K, lambda and H parameters used in evaluating the significance of gapped alignment scores. The units of gapL and gapH are nats/score and nats/alignment position, respectively. (1 nat ~= 1.443 bits; 1 bit ~= 0.693 nats) For any of the 3 parameters' values that are not set on the command line, their default values will be obtained from precomputed tables as before. 1996-10-20 Added -mmio option to turn off memory-mapped I/O in all of the *BLAST* programs. For some users, this means the programs may coexist better with other programs or with other users on a shared system (e.g., on a system that is not a dedicated blast server). As a part of using this option, consumption of virtual memory address space is also reduced, which is becoming increasingly important as database files grow in size; some operating systems or system administrators will not necessarily allow per-process memory needs to increase concordantly; but frequently the shell's "limit" command can be used to increase "memorysize" and "datasize" limits, rather than resorting to turning off memory-mapped I/O. 
The potential for a problem arises most often with nucleotide sequence database files, when the original FASTA-format file is available. When holding all of the nt. sequences of GenBank, a single FASTA file is currently about 1 GB in size. Memory-mapped I/O is still used by all of the programs by default, as it is faster and doesn't seem to be a problem for most users. 1996-10-18 Added Lambda, K, H entries for gapped alignments with BLOSUM80 scoring matrix. Precomputed values exist for Q=7, 5<=R<=7; Q=8, 4<=R<=8; Q=9, 3<=R<=9; Q=10, 2<=R<=10; Q=11, 2<=R<=11; Q=12, 2<=R<=12. 1996-09-17 Fixed an anomaly that arose at low frequency with the gapped blast heuristic. 1996-09-10 Changed blast sort routine to avoid possible arithmetic overflow on some platforms (e.g., Solaris for x86). 1996-09-03 Brought all genetic codes into synchrony with the NCBI Version 3.3. 1996-07-09 Fixed crashing of pressdb when the FASTA input file was zero-length. * * * * * * * * * * 1996-05-10 Posted WU-BLAST 2.0d1, the first publicly available BLAST with gapped alignments and statistics. Announced in talk at Cold Spring Harbor Genome Mapping and Sequencing conference and on Usenet bionet.software newsgroup. * * * * * * * * * * 1996-05-09 Added an "identity" scoring matrix for BLASTN searches. Not perfect, though, it ascribes a penalty of only -10000 to mismatches. It's possible then to have one mismatch every 10 KB or so and still achieve a positive score. 1996-04-29 Fixed statistical calculation in the case of multiple consistent HSPs and sum statistics. When r consistent alignments were combined, the p-values computed were too low by a factor of about r!. 1996-02-13 Added "Edegrade" command line parameter for regulating the quality of HSP combinations reported per database sequence. 1995-11-04 Fixed a bug in the parsing of sequence identifiers that could yield incorrectly justified text in the initial, one-line summary section of blast program output. 
When this bug arose, there were 25 columns of white space at the beginning of each line. 1995-11-03 Updated the list of built-in genetic codes in blast/blast/gcode.h using the latest NCBI Toolbox ASN.1 data (toolbox/data/gc.prt). 1995-10-26 Fixed a multiprocessing bug in the blast programs that could arise when searching small databases (<500 sequences). 1995-10-03 Added support for NCBI (Wootton & Federhen) "nseg" program on the BLASTN command line, using "-filter seg" option. 1995-09-27 Added "-WashU" tag to the program version numbers, to ensure there is no mistaking WashU distribution of these programs from the NCBI distribution. 1995-09-26 Fixed a long-standing bug in pressdb regarding which sequences are tagged as having "ambiguous" nucleotide codes. Thanks to Colin Watanabe at Genentech for pointing this out. 1995-09-18 The PRESSDB program (pressdb.c) can now append sequences to an existing BLAST database, using the -a option. (The SETDB program has not been so modified yet). 1995-08-22 The file locking described on 6/7/95 has been disabled at least temporarily because it is not functioning in the intended manner with files that reside on NFS-mounted partitions. 1995-08-14 gb2fasta now parses NCBI "gi" identifiers from the GenBank flat files. 1995-06-07 See note on 8/22/95! Database file locking has been added to the BLAST search programs and to the database maintenance programs setdb and pressdb, to eliminate (or optionally reduce) the opportunity for collisions between database search and database maintenance activities. Previously, a setdb or pressdb invocation would cause active BLAST searches of the same database to fail. File locking now prevents the blastable database files from being modified by setdb/pressdb until they are no longer in use by a search program. This doesn't necessarily come without some risk. 
With strict file locking in force (the default), deadlock or near-deadlock may now be a concern within a production environment, as multiple simultaneous BLAST search production lines involving one database can effectively block setdb or pressdb forever -- unless all production lines happen to finish their searches at the same time. Having all production lines finish at virtually the same time may be an infrequent event if more than just a couple are running. This new situation seems more desirable, though, than not using file locks and unwittingly allowing setdb and pressdb to blow away databases out from under any searches. As an aid to diagnosing deadlock situations should they arise, when blocked, setdb and pressdb report their blocked status every 60 seconds. If deadlock is a real problem, one can revert to the former, ungoverned situation by completely disabling file locking with the new -l option to the setdb/pressdb programs. Significant file lock protection can still be obtained, though -- and without the risk of deadlock -- by using the -b option to setdb/pressdb instead of completely disabling it with -l. The -b option simply blocks any subsequently invoked BLAST searches until the current setdb/pressdb operation is finished, however any search that happened to be in progress when setdb/pressdb was invoked will get trashed. Through the use of locks, it is possible to update databases that are actively being searched or that reside on-line in a production area, without the need for off-line, ancillary working storage equivalent to a full copy of the database. N.B. One area not addressed by the present file locking is that of the FASTA-format nt. sequence file accessed by BLASTN, TBLASTN, and TBLASTX, which still causes problems if updated in the middle of a search. 1995-06-01 Fixed a long-standing deadlock problem in the Solaris multithreaded executables (and more recently the OSF/1 executables). 
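The interaction between searches and setdb/pressdb described in the 1995-06-07 entry can be illustrated with advisory locks. This is only a sketch under stated assumptions: it uses POSIX flock() through Python's fcntl module as a stand-in, since the locking primitive actually used by the BLAST programs is not stated here, and the function names shared_lock and try_exclusive are hypothetical. As the 1995-08-22 entry warns, behavior on NFS-mounted partitions may differ.

```python
import fcntl

def shared_lock(path):
    """Take a shared lock, as a search program might while reading a database.
    Blocks while a maintenance program holds the exclusive lock."""
    f = open(path, "rb")
    fcntl.flock(f, fcntl.LOCK_SH)
    return f  # closing the file releases the lock

def try_exclusive(path):
    """Try to take an exclusive lock without blocking, as a maintenance
    program might before rewriting a database. Returns the open file on
    success, or None if any search still holds a shared lock."""
    f = open(path, "a+b")
    try:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        f.close()
        return None
    return f
```

A maintenance program following the strict default would call try_exclusive() in a loop, reporting its blocked status between attempts, which is where the risk of waiting indefinitely on overlapping searches comes from.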
1995-05-28 Removed the link between X & S that existed in blastapp/lib/context.c. 1995-05-24 Threads support (parallel processing) added for DEC OSF/1 3.0 (Digital UNIX). 1995-05-20 Switched to using Robinson&Robinson (PNAS 1991) amino acid residue frequencies. Fixed a minor slowness problem in BLASTN, TBLASTN, and TBLASTX (all of the programs that would access the FASTA-format database file, doing so more often than necessary). Changed the name of the recently added "pgsper" command line option to the simpler name "progress". It's now described in the documentation file, blast.1, too. 1995-04-26 Added "-pgsper #" command line option to adjust the time-out period in progress messages. Alarm clock errors when using Solaris threads prompted the creation of this parameter. To avoid any possibility of the alarm clock error, set a time-out of 0. Changed basename() to misc_basename() for Linux compatibility. 1995-03-30 Made memory management a little more flexible and robust. V & B command line options are supported in the ASN.1 form of the output now. Made changes for VMS compatibility kindly suggested by Scott Rose (GCG, Madison, WI). 1995-03-08 pressdb and setdb now parse arbitrarily large FASTA input databases, expanding their memory buffers as much as necessary. No more need to modify ENTRY_MAX. 1995-03-07 I lied on 2/1/95. Solaris threads support promises to be robust now. Famous last words. 1995-02-13 The dfa library was consolidated into the gish library. 1995-02-01 Too optimistic on 1/24/95 -- the Solaris threads/alarm problem was not fixed then. It truly seems to be fixed now. Also, fixed a bug in BLASTN's calculation of the Karlin-Altschul K value. Plus some slight performance improvements to BLASTN, TBLASTN and TBLASTX, related to the FASTA file access; because of this improvement, BLASTN is set to use up to 4 processors by default instead of the previous default of 3. 1995-01-24 Fixed (for the last time?!) 
the interaction between Solaris threads and SIGALRM signals in the "gish" library. 1994-12-19 Fixed a multiprocessing bug in all of the programs. The bug would often produce crashes (segmentation faults) when searching tiny databases. hsp_max is now used to truncate HSP lists _after_ statistical significance estimates have been made and after the list has been sorted for output. 1994-12-16 Fixed handling of gap characters in the query sequence by blastx, tblastn, and tblastx. 1994-12-15 blastp was stripping gap characters (-) from the query sequence. Fixed. 1994-10-16 Fixed a severe bug in the support for multiprocessing under Solaris 2. Some of the code involved in this bug fix is in the "gish" library. Program version numbers are unchanged by this fix, but the code release date displayed in the programs' introductory output is updated to today's date. 1994-10-06 First "final copy" release of BLAST 1.4 software. 1994-10-04 Changed "-overlap", "-overlap1", and "-overlap2" command line option names to "-span", "-span1", and "-span2", respectively. "-span2" is the default. 1994-09-30 I'm now employed by the Department of Genetics, Washington University School of Medicine, St. Louis, MO 63108. 1993-09-03 Fixed bug in gb2fasta's concatenation of long definitions. 1993-08-08 Added -qoffset option to BLASTP, BLASTX, TBLASTN, and BLASTN, to permit segments of long sequences to be used as queries and still have their residues numbered correctly in alignments. 1993-07-28 Changed the format of substitution matrix files read by BLASTP, BLASTX, TBLASTN and BLAST3. Substitution scores in the matrix files can now properly have non-integral values. The blast programs still do their scoring using integral data types. Upon being read by the blast programs, each score value is rounded to the nearest integer. Matrices in the new format are generated by the pam program. Fixed the display of query sequence segments in BLASTX when its -codoninfo option is invoked.
1993-07-07 Prompted by Erik Sonnhammer, a "-overlap2" command line option (also available as simply "-over2") was added to make the criteria for HSP overlap detection tighter. This option has a positive effect on the number of HSPs reported (fewer of them will satisfy the overlap2 criteria) for sequences that contain internal repeats, but will have a negative effect on their associated statistics. The additionally reported HSPs may have Poisson statistics inappropriately applied, because the HSPs may be incompatible with others in the same global alignment and hence can not be considered as independent events. For query sequences too short to satisfy the cutoffs or expectation thresholds, the minimum acceptable expect values that were reported by BLASTP, BLASTN, and TBLASTN were incorrect, now fixed. 1993-07-02 Changed the way the cutoff score, S, and expectation cutoff, E, are reported. All output is now filtered based on its estimated statistical significance (E value), rather than using cutoff scores directly. 1993-06-22 Fixed bug in consistp.c's implementation of R(i,3) found by Phil Green. Followed another suggestion of Phil Green's for making Poisson probability calculations more efficient. 1993-06-21 Fixed bug in the calculation of "consistent N counts" for those HSPs found on minus strands in BLASTN, BLASTX, and TBLASTN. Plus strand hit counts were not affected. Pressdb on 64-bit platforms now produces databases that are readable on all platforms. 1993-06-16 Fixed a conflict between static and global variables in bldaa.c and bldxa.c This produced a bug in the blast software under DEC Alpha OSF/1. 1993-06-09 Added "-gapdecayrate" parameter (default=0.5), as suggested by Phil Green (Washington University, St. Louis). This parameter defines a geometric progression used to adjust Poisson probabilities upward, to account for the fact that many values for the N parameter in Poisson P(N) are considered when choosing the "best" alignments. 
If r is the decay rate (0 < r < 1) for the progression and n is the number of segments under consideration, then the number of gaps is n-1 and the Poisson probabilities will be _divided_ by the quantity: (1-r) * r^(n-1). For n=1 (one HSP) and the default r=0.5, the adjustment is by a factor of 1/(1-0.5) = 2. Fixed a bug in lib/consistp.c that produced undetected overflows in factorial calculations. This was occasionally problematic in TBLASTN queries with hits against extremely long database sequences. 1993-05-09 In TBLASTN, fixed discrepancies in alignments when a database sequence contained one or more ambiguity (non-ACGT) codes. Previously, the original FASTA format database sequence was only examined at the end of the search; now it is examined during the search, so that it is known up front what the real alignment score and extent of alignment is. The HSP cutoff score in TBLASTN is now S2. Previously, there had to be at least one match scoring at least as high as S, after which the database sequence was re-scanned using a cutoff of S2. Now each database sequence is scanned only once, using the lower cutoff. Better sensitivity results for short exons. Something not done now, however, is to scan the entire diagonal on which an HSP is found. 1993-05-08 Fixed severe bug in BLASTN. Word hits on the plus- and minus- strands were being managed in a single pool, rather than separate pools. Consequence: hits on one strand could obscure hits on the other strand. In typical use, this would rarely cause a problem because of the improbably long wordlength used by BLASTN (W=12) and the requirement for the word hits to appear in a particular order. This bug was present since BLASTN's inception. In BLASTN, fixed discrepancies in alignments when a database sequence contained one or more ambiguity (non-ACGT) codes.
Previously, the original FASTA format database sequence was only examined at the end of the search; now it is examined during the search, so that it is known up front what the real alignment score and extent of alignment is. 1993-05-06 Fixed a bug introduced to BLASTN on 5/4/93, wherein the first residue in the complementary strand (i.e., the complementary residue to the last residue on the "plus" strand) was not initialized. This bug would reveal itself iff the query contained one or more non-ACGT codes and the first residue on the complementary strand should have continued a match with a database sequence. Tweaked the default value of E2 upward from 0.1 to 0.15, in reaction to the bug-fix on 5/5/93 which had raised the value of S2 calculated from E2. 1993-05-05 Stupid bug fixed in all blast programs. The units that had been assumed for the Karlin-Altschul H statistic in the function stolen() were "nats per position", whereas the karlin() function was calculating H in units of "bits per position". The karlin() function was modified to calculate H in nats, and all equations that were functions of H and had been (correctly) assuming H was in units of bits were modified to account for the change to nats. H is still reported in units of bits, because of the automated parsers in the world. The consequences of this error were (1) that the expected length estimated for an alignment of any particular score was too short by a factor of log(2); and (2) the probability estimates reported by the programs were often higher (lower in statistical significance) than they should have been. 1993-05-04 In BLASTN, ambiguous nucleotides in the query sequence are handled consistently throughout the program as mismatching all other letters, so that, e.g., strings of N's can be used to mask a query sequence. In addition, gap letters (hyphens) in the query sequence will never appear in an alignment (although they may appear in the database sequence half of an alignment).
Ambiguity codes in the database sequences (only) can still lead to discrepancies between the scores obtained during the search and the scores reported after the search. 1993-04-23 Recently, in all of the blast programs, a "consistent" N parameter was used in the Poisson statistics, to reflect the number of HSPs likely to be consistent with one another in the same gapped alignment. Now, all of the blast programs build upon this by using another enhancement of Stephen Altschul's, which is to adjust the Poisson probabilities downwards (making them more significant) to account for the consistency requirement. There is no effect on single-HSP probabilities. Some reordering of the database sequences will be observed in the output, with multiple-hit cases often moving up a few notches relative to the single-hit cases. With the consistency-adjusted Poisson P-values, sensitivity is expected to be marginally improved, being practically confined to matches which would anyway come close to satisfying the statistical significance threshold. If the threshold is set at a point within or just above background, it will be more common to see the new program report false positives than the previous version. Improved sensitivity will also be noticed more often with longer sequences, which provide greater opportunity to accumulate multiple hits with a single database sequence. The consistency feature (which includes both the consistent N and consistent Poisson statistics) can be turned off with the "-consistency" command line option. The statistics of consistent HSPs is discussed by Karlin and Altschul in a manuscript recently submitted to Proc. Natl. Acad. Sci. USA. 1993-04-06 HSP == high-scoring segment pair, the unit of BLAST output In all of the BLAST programs, the Poisson event count (or the N parameter used in the Poisson statistics) assigned to each HSP is now estimated more accurately, using positional information as well as scores. 
A simple midpoint rule of Stephen Altschul's design is used to estimate the number of HSPs that would be consistent with each other in the same gapped alignment. Let (x,y) represent the location in 2-dimensional space of the midpoint of an HSP. In a "consistent" set of HSPs, if the HSPs are sorted in increasing order of their x coordinates, then the y coordinates of the sorted list also produce a strictly increasing sequence. For any given HSP, the maximum number of other HSPs that can be made consistent with it (plus 1 for the HSP under consideration) becomes the Poisson N parameter. The effect of this change is to reduce the number of false positives reported (improved selectivity), which sets the stage for the following... In BLASTP and TBLASTN, a much lower cutoff score (S2 instead of S) for reporting HSPs is used in conjunction with the consistent event count. HSPs are filtered from the output based on their statistical significance as estimated using Poisson statistics. Due to Altschul's consistency rule, a lower cutoff score can be used without introducing too much extra noise in the output, while providing increased sensitivity in detecting homologs in the presence of insertion/deletion errors and mutations. This change has not yet been documented in the blast manual page, and the values of S2 and E2 (E2 defined to be the number of chance matches expected when comparing two random sequences each 300 amino acids in length) can not currently be modified from their default values through the NCBI BLAST E-mail Service. With previous versions of BLASTP and TBLASTN, a database sequence had to produce at least one segment (HSP) scoring at least as high as the cutoff score, S, in order to be reported. And if this high threshold was met, the database sequence was scanned a second time using a lower cutoff, S2. This repeat scanning no longer occurs--all database sequences are scanned using the lower cutoff. 
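The midpoint rule above can be illustrated with a short Python sketch. It assumes each HSP is represented only by its (x, y) midpoint, and the function name consistent_counts is hypothetical; this is a plain quadratic-time illustration of the stated rule, not the code used in the BLAST programs.

```python
def consistent_counts(midpoints):
    """For each HSP midpoint (x, y), return the size of the largest set of
    HSPs, including that HSP, whose x and y coordinates both increase
    strictly when the set is sorted by x."""
    n = len(midpoints)
    order = sorted(range(n), key=lambda i: midpoints[i])
    # up[i]: longest consistent chain that ends at HSP i
    up = [1] * n
    for a, i in enumerate(order):
        xi, yi = midpoints[i]
        for j in order[:a]:
            xj, yj = midpoints[j]
            if xj < xi and yj < yi:
                up[i] = max(up[i], up[j] + 1)
    # down[i]: longest consistent chain that starts at HSP i
    down = [1] * n
    for a in range(n - 1, -1, -1):
        i = order[a]
        xi, yi = midpoints[i]
        for j in order[a + 1:]:
            xj, yj = midpoints[j]
            if xj > xi and yj > yi:
                down[i] = max(down[i], down[j] + 1)
    # The HSP itself is counted in both chains, so subtract one.
    return [up[i] + down[i] - 1 for i in range(n)]
```

For the midpoints (10,10), (20,20), (30,5) and (40,40), the third HSP lies off the diagonal trend of the others and can be chained only with the last one, so it receives a count of 2 while the others receive 3.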
The former cutoff score parameter, S, and expect parameter, E, now establish a threshold of statistical significance that must be satisfied by the Poisson P-values of the HSPs regardless of their individual scores. The evaluation of HSPs works like this: if a single database sequence yields one or more HSPs each scoring S2 or higher with the query, the list of HSPs is first sorted by score just as before; consistent event counts are then assigned; Poisson probabilities are calculated; and finally the list is truncated after the last HSP having a Poisson P-value that satisfies the S or E significance threshold. If no Poisson P-values satisfy the threshold, then the whole list is thrown away and none of the HSPs is reported. S might be thought of as the score that must be achieved by an HSP observed in isolation (Poisson event count = 1) for it to be reported. While use of a lower cutoff score is the default for BLASTP and TBLASTN, a similar low cutoff has been made an option for BLASTX, which may become the future default. It is presently only an option because it is feared that some automated parsers of BLASTX output might break if the lower cutoff method was suddenly instituted as the default. To invoke the option in BLASTX, specify a value for either E2 or S2 on the BLASTX command line. E2 is the number of HSPs expected to be observed by chance when comparing a random sequence 100 codons in length against another random sequence 300 amino acids in length. A suggested starting choice for E2 is 0.1. This change to BLASTX has not yet been documented in the blast manual page, and the option is also not presently selectable through the NCBI BLAST E-mail Service. A lower cutoff was not introduced to BLASTN, because the sensitivity of this program with its fixed wordlength W=12 is low. BLAST3 has always used a low cutoff. 
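The evaluation order described above (sort by score, compute Poisson P-values, truncate after the last HSP that satisfies the threshold) can be sketched in Python as follows. The dictionary layout, the function name prune_hsps, and the caller-supplied p_value function are assumptions for illustration; the P-value calculation itself, including the consistent event counts, is left to the caller.

```python
def prune_hsps(hsps, p_value, p_threshold):
    """Return the HSPs of one database sequence that would be reported.

    hsps        -- list of dicts, each with a 'score' key (all >= S2)
    p_value     -- hypothetical callable giving the Poisson P-value of an HSP
    p_threshold -- significance threshold derived from S or E
    """
    # Sort by score, highest first, as before.
    ranked = sorted(hsps, key=lambda h: h["score"], reverse=True)
    # Find the last HSP in the sorted list whose P-value meets the threshold.
    last_ok = -1
    for idx, hsp in enumerate(ranked):
        if p_value(hsp) <= p_threshold:
            last_ok = idx
    # Truncate after that HSP; if none qualified, nothing is reported.
    return ranked[:last_ok + 1]
```

Note that a high-scoring HSP with a poor P-value is still kept when a lower-scoring HSP further down the list satisfies the threshold, because the list is only cut after the last qualifying entry.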
Symmetric multiprocessing can now be employed by the BLAST programs under SunSoft's Solaris 2.2 operating system, as well as the previous Silicon Graphics' IRIX operating system. The code has only been tested under a beta release of Solaris 2.2. Code is also included to putatively use threads in an OSF/1 environment such as Digital's OSF/1 on the Alpha AXP platform, however it has not been possible to test this code. Many more enhancements in the software are included, not all of which are documented yet or bundled here--e.g., support for the low-compositional complexity SEG filter of Wootton and Federhen (wootton@ncbi.nlm.nih.gov) and the short-periodicity repeat XNU filter of Claverie and States (jmc@ncbi.nlm.nih.gov). Also, optional use by BLAST of codon bias information read from *.cdi files (States and Gish, manuscript submitted). The interfaces to these features are not well developed, subject to change, and are presently provided "as is" in an effort to expedite moving the earlier-mentioned improvements into users' hands. 1993-03-25 The default neighborhood word score threshold (T parameter) was raised a notch in TBLASTN only, to obtain a roughly compensatory increase in speed for the performance hit that was incurred in the switch to using the new default BLOSUM62 matrix on 3/19/93. 1993-03-19 Changed the default substitution matrix used by BLASTP, BLASTX, TBLASTN and BLAST3 from PAM120 to BLOSUM62. Speed declines by about 30-40% as a result. 1993-03-05 Changed the format of the sequence identifiers output by the programs gb2fasta, gt2fasta, pir2fasta, and sp2fasta. LOCUS and ACCESSION identifiers are now included. 1992-12-08 sp2fasta now strips carriage-return characters from the definition lines, so the program now works well when parsing sequences files on the EMBL CD-ROM. 1992-11-16 BLASTP prunes its hitlists at the point where the expectation E/S is no longer satisfied. 
E2/S2 is now the cutoff for saving HSPs for subsequent pruning by the E/S criterion; after pruning, no HSPs may remain. Noise is reduced by the pruning, and better sensitivity is obtained by using a lower cutoff score followed by filtering on Poisson P-values. 1992-11-05 Moved lib/shmutil.c and lib/mfile.c into the "gish" library, and removed the USE_SHM macro. 1992-11-04 Renamed include/blast.h to include/blastapp.h, to prepare for migration to using a blast function library which contains blast.h. 1992-10-26 Fixed a bug in searcha.inc regarding the handling of segmented sequences in BLASTP and TBLASTN. During examination of a diagonal for hits while ignoring X, the programs had been halting the diagonal search when a gap character was encountered in either the query or the database sequence. 1992-10-02 Made code compatible with architectures having 8-byte long integers, e.g. DEC Alpha. 1992-10-01 Added gt2fasta program for extracting coding sequence (CDS) feature translations from files in the GenBank(R) flat file format, saving the results in a FASTA format file. 1992-09-07 Moved bulk of the low-level multiprocessing support into the "gish" library. 1992-09-04 Corrected a bug in lib/hsppool.c that caused occasional bus errors and segmentation violations. 1992-09-04 Added several BLOSUM matrix files to the distribution. Moved all matrix files into a new "matrix" subdirectory. Renamed BLASTPAM environment variable to BLASTMAT, and changed its default value from "/usr/ncbi/blast/pam" to "/usr/ncbi/blast/matrix". 1992-09-03 Corrected the substitution scores for B-X and Z-X reported by pam program. Current version of pam is 1.0.5. 1992-08-25 Made the software compatible with DEC Ultrix and other operating systems running on "little endian" platforms. BLAST databases, which contain binary encoded integers, can be shared between big and little endian platforms. Big endian platforms will be only marginally more efficient. 
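Sharing binary-encoded integers between big and little endian platforms, as in the 1992-08-25 entry, comes down to fixing one byte order in the file and converting on read. The Python sketch below assumes 32-bit unsigned integers stored big-endian; that choice is inferred from the remark about big endian platforms being marginally more efficient, and the actual database layout is not described here. The function names are hypothetical.

```python
import struct

def write_u32_be(value):
    """Encode an unsigned 32-bit integer in big-endian byte order,
    regardless of the byte order of the host machine."""
    return struct.pack(">I", value)

def read_u32_be(buf, offset=0):
    """Decode an unsigned 32-bit big-endian integer from a byte buffer."""
    return struct.unpack_from(">I", buf, offset)[0]
```

On a big endian host the stored bytes already match the native representation, so no byte swapping is needed; a little endian host pays for a swap on every integer read.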
1992-08-14 Changed one fatal error message to what should have been merely a warning in BLASTN. Added a warning message to BLASTP and TBLASTN. No change in version numbers. Default value for the H (histogram) parameter is now 0 to omit reporting the histogram. 1992-08-05 Fixed a bug in the single-processor version of blast3(out3.c) that produced an infinite loop. (How does this bug keep reappearing??) 1992-07-01 Corrected a bug in lib/getseq.c that would cause BLASTN and TBLASTN to crash when reporting hits on single-processor platforms when the compressed nucleotide database file *.csq was loaded in shared memory. No effect if shared memory was not actively in use. 1992-06-18 In blastx, corrected the statistic reported for the highest observed score in each reading frame. 1992-06-16 Added several Hitlist sorting options to each of the BLAST programs except BLAST3. -sort_by_pvalue is the default for all. -sort_by_count sorts by the number of HSPs in each database sequence's hitlist. -sort_by_highscore sorts by the highest HSP score in a hitlist. -sort_by_totalscore sorts by the total of all HSP scores in a hitlist. Example: blastp pir myquery -sort_by_totalscore 1992-06-25 Corrected the way averaging was performed to calculate substitution scores against letters B and Z in the matrices produced by the pam program (pam.c). Standard Dayhoff PAM-250 matrix is now included in the distribution, under the filename "dayhoff". 1992-05-15 Fixed a bug in blast3 that caused it to produce an unexpected number of pair-wise alignments. Often no pairwise alignments were displayed at all. This bug had no effect on the 3-way alignments produced. 1992-04-17 Fixed a bug in the single-processor version of blast3(out3.c) that produced an infinite loop. 1992-04-08 Pressdb still requires sequence lines to be of equal length (except for the last line of each sequence, which can be shorter), but it now tolerates one or more blank lines at the end of each sequence. 
1992-04-02 Added function etop(), which uses new function fct_expm1() in the gish library, to calculate probabilities from expect values. Changed the letter 'X' in the nucleotide alphabet to '-', which is supposed to represent a gap (as it does in the amino acid alphabet), but currently is treated by BLASTN like a mismatch character. 1992-03-31 Added a "gap" character, '-', to the amino acid alphabet used by BLASTP, BLASTX, TBLASTN, and BLAST3, which breaks alignments into separate segments. BLASTN does not support gap characters. Fixed a severe bug in the multiprocessing version of TBLASTN: the translate() function failed to set s_len, the database sequence length, in frame 1. Until the gap letter was introduced to the amino acid alphabet today, it is not clear that this deficiency caused any problems. It certainly did not affect the results on uniprocessing platforms. 1992-03-30 Fixed bug in blastn's overlap checking function, ovlap_n(), that caused minus-strand HSPs to be reported that were intended to be filtered out. Merged versions of pvals_a(), pvals_n(), and pvals_t() into a single pvals() function. Fixed a bug in pressdb that would appear only if each sequence in the input FASTA-format database file resided on a single (possibly very long) line. 1992-03-29 blastp, blastn, blastx, tblastn, and blast3 have no theoretical limit on the line length in the query sequence file; setdb and pressdb have no theoretical limit on the length of lines in the input FASTA database files. Several programs were modified to accommodate a change in the gish library's misc/basename() function--an updated copy of the gish library must be obtained for compatibility. 1992-03-28 Better handling by TBLASTN of cases where the database sequence contains nucleotide ambiguity codes. Now neither BLASTN nor TBLASTN requires the original FASTA-format nucleotide sequence database file. Long strings that had been static are now allocated dynamically. 
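The reason for routing etop() (1992-04-02 entry) through an expm1-style function can be shown in a few lines of Python. This sketch assumes the usual Poisson relation P = 1 - exp(-E) between an expect value E and the probability of at least one chance hit; it uses Python's math.expm1 in place of the gish library's fct_expm1(), whose exact interface is not described here.

```python
import math

def etop(expect):
    """Convert an expect value E to P = 1 - exp(-E).

    expm1(x) computes exp(x) - 1 without the cancellation error that the
    direct formula suffers when x is close to zero."""
    return -math.expm1(-expect)

def etop_naive(expect):
    """Direct formula, shown only for comparison."""
    return 1.0 - math.exp(-expect)
```

For very small expect values the naive form loses all precision: with E = 1e-20, exp(-E) rounds to exactly 1.0 in double precision, so etop_naive returns 0.0, whereas etop returns a value very close to 1e-20.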
1992-03-27 Better handling by BLASTN of cases where the database sequence contains ambiguity letters. BLASTN now does not require the original FASTA-format nucleotide sequence database file. (TBLASTN still does, however). 1992-03-09 Faster K calculations now performed. Accuracy is 2+ decimal places for the PAM120 and 2- places for PAM250. This generally translates into only a small error (<1%) in the dependent P-values, expectations, and bit scores, which seems acceptable for an approximate 20-fold improvement in the speed of calculating K. Furthermore, the error in K is on the high side, so P-values etc. tend to be conservative. The speed is achieved by performing fewer iterations in the main K loop and compensating for this by adding in several corrective terms from a geometric progression of Altschul's design. 1992-02-20 Made changes to the Makefiles. Verified that all required libraries (ncbi, gish, dfa) and programs can be built. New copies of all dependent source code should be gotten. 1992-02-18 Switched the BLAST application programs over to using a new version of the dfa library. The new dfa library is required. 1992-02-10 Changed SGI IRIX compiler optimization flag from -O3 to -O2 in main copy of Makefile.sgi, for compatibility with IRIX 4.0. 1992-01-23 Fixed bug in sp2fasta.c that caused the last character of each DE line to be omitted. 1992-01-17 Minor bug fix in lib/mfile.c and a major bug fix in BLAST3's out3.c. Both bugs were introduced recently; the former one prevented compilation of mfile.c; the latter one sent the 3-way search phase of BLAST3 into an infinite loop on single-processor architectures. Version numbers are not being incremented. 1991-12-31 In searchn.inc, which is used by BLASTN, the strand (frame) of each HSP was not being set. 1991-12-30 Added sp2fasta utility for converting SWISS-PROT text format into FASTA format. 1991-12-29 Fixed bug in blastx.c and others, in vicinity of isspace() macro usage. 
1991-12-24 Fixed filesize bug in shmutil.c. Only applicable to users of shared memory. 1991-12-23 Improved command line parsing. New -overlap option added to all blast programs to turn off HSP overlap detection and removal. 1991-12-18 Improved signal handling in multiprocessing situations. 1991-12-11 Fixed bug in blast3.print_p which arose if USE_MPROC was _not_ defined and the database was not resident in shared memory. Fixed semaphore SETVAL bug in shmutil.c and minor bug in memfile.c. 1991-11-13 The mode parameter of mfile.mfil_open() was not being passed to fopen() when USE_SHM was undefined. 1991-11-11 Neglected to initialize the pts[] array to NULL pointers in blast3.c. 1991-10-23 Fixed frame reference bug in blastx.print_parms. 1991-10-04 Hits on opposite strands of a query or database sequence are now considered to be distinguishable events, and so are counted separately in the Poisson statistics calculations. The default value for E used by BLASTP, BLASTN, BLASTX, and TBLASTN has been reduced from 25 down to 10, to avoid reporting quite so many hits which are statistically insignificant under the random sequence model. The experienced user may well want to routinely use an even lower value for E, e.g. E=1 or E=2. 1991-09-27 BLASTN is now rigid in its interpretation of matching/mismatching. A residue must be one of A, C, G, or T (U) to match any other residue. And T now matches U. There is no concept of a partial match with BLASTN. For example, R (purine) does not half-match with a G or A, but rather is scored as a complete MISMATCH. The blast.1 manual page has been improved. 1991-09-25 Improved reporting of individual HSP statistics (including the number of bits of information associated with the alignment scores), and a more consistent report style across all blast programs. 
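The rigid matching rule from the 1991-09-27 entry can be sketched as a simple predicate. This is an illustration only, not the BLASTN source: the function names are hypothetical, and folding letters to upper case is an assumption (the 1991-09-19 entry suggests the software is meant to be case-independent). No score values are shown because the entry does not state them.

```python
# The only residues that can take part in a match under the rigid rule.
UNAMBIGUOUS = frozenset("ACGT")


def normalize(residue: str) -> str:
    """Fold case (an assumption) and treat U as T, since T now matches U."""
    r = residue.upper()
    return "T" if r == "U" else r


def residues_match(a: str, b: str) -> bool:
    """Return True only for identical, unambiguous residues.

    Ambiguity codes such as R or N never match anything, not even
    themselves: there is no partial match.
    """
    a, b = normalize(a), normalize(b)
    return a == b and a in UNAMBIGUOUS


if __name__ == "__main__":
    for pair in (("T", "U"), ("R", "G"), ("R", "A"), ("N", "N"), ("a", "A")):
        print(pair, residues_match(*pair))
```

Under this reading, R against G or A is a complete mismatch, as the entry states, and an ambiguity letter aligned with itself is also a mismatch.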
1991-09-23 Marginal improvement in speed of BLASTP and TBLASTN (re: zero-ing of diagonal hit structures in search_aa()), with a concomitant correction to the hit statistics reported by these programs. Only a minor change was made with respect to BLAST3, but since all three of these programs include the same searcha.inc file, the version number on BLAST3 was bumped up one. 1991-09-20 Better compatibility with Cray UNICOS (version 7.0). 1991-09-19 Removed one last dependency of the software on the alphabetical case of residues in the FASTA databases. This change was localized to one line in blastn.c. 1991-01-06 Only the frequencies of occurrence of unambiguous letters (non-X for protein and non-N for nucleotide sequences) are used to calculate the Karlin parameters K and Lambda (and H). This change can lead to occasional warning messages (usually not fatal errors and not serious) about the score probabilities not adding up to 1.0. The "pam" v1.0.3 utility program now calculates a weighted average substitution score against the ambiguity letter X; a command line option permits the user to set a constant substitution score instead. Several .h and .c files had some ANSI-incompatibilities fixed; in particular "Boolean" parameters were changed to "int" because of the use of old-style function declarations. 1991-01-02 Fixed severe multiprocessing bug in TBLASTN--has no effect on uniprocessing.