BLAST2 parser (c) 1999-2001 Arne Mueller
****************************************


[VERSION 1.2.3], Fri Aug 17 16:06:07 BST 2001


A parser for NCBI GAP-BLAST (BLAST2) and PSI-BLAST, written in and
for Python.



This software is distributed under the terms of biopython:

                Biopython License Agreement

 Permission to use, copy, modify, and distribute this software and
 its documentation with or without modifications and for any purpose
 and without fee is hereby granted, provided that any copyright
 notices appear in all copies and that both those copyright notices
 and this permission notice appear in supporting documentation, and
 that the names of the contributors or copyright holders not be used
 in advertising or publicity pertaining to distribution of the software
 without specific prior permission.

 THE CONTRIBUTORS AND COPYRIGHT HOLDERS OF THIS SOFTWARE DISCLAIM ALL
 WARRANTIES WITH REGARD TO THIS SOFTWARE, INCLUDING ALL IMPLIED
 WARRANTIES OF MERCHANTABILITY AND FITNESS, IN NO EVENT SHALL THE
 CONTRIBUTORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY SPECIAL, INDIRECT
 OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM
 LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT,
 NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION
 WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.

See http://www.biopython.org for details of the biopython project.

Files included in this distribution:

README			this file
fio.py			source for buffered file input/output
misc.py			other (useful) routines and classes 
blast.py		source for the blast parser
blastflt.py		source for an example program using fio.py and blast.py
Rv3829c.blast		example BLAST version 2.10.0 output 
Rv3829c_blast.flt	Rv3829c.blast processed with blastflt.py
Rv3829c.psiblast	example PSI-BLAST version 2.10.0 output 
Rv3829c_psiblast.flt	Rv3829c.psiblast processed with blastflt.py


INTRODUCTION:
-------------

The blast parser has been developed to automatically post-process
output from BLAST/PSI-BLAST as implemented in the blastpgp program
from NCBI (ftp://ncbi.nlm.nih.gov/blast/). Output files from PSI-BLAST
can easily be larger than 50 MB. For an automated genome annotation
project I had to scan more than 4000 protein sequences with PSI-BLAST
against a database. This raises three problems: 1. The overall output
is extremely large. 2. The relevant information is hidden among huge
redundancy in the files. 3. The relevant information has to be
represented as a datastructure in programs that evaluate the results
of homology searches.

In my case I wanted all non-redundant sequence hits to each of the
genomic queries that have an e-value <= 0.0005. The PSI-BLAST result
files contained up to 20 iterations, and each of these iterations may
contain ALL sequences of the former iteration (but with different
alignments and scores/e-values). Therefore I have written the program
'blastflt.py', included in this package, which uses the BLAST parser to
get the sequence name, the database the sequence comes from, score,
percent sequence identity, start/end of the alignment with respect
to query/subject (target) and the alignment itself. The alignments
are represented as 'stacked' multiple sequence alignments containing
leading and trailing gaps (all sequences have the same length) and
intervening gaps. Gaps in the query are represented as lower case
letters in the subject sequence. In short, the whole result looks a
bit like mview's 'new' format
(http://mathbio.nimr.mrc.ac.uk/~nbrown/mview/).

Here's a fragment of an example output of blastflt.py:

NR   Rv3829c         E          SCORE   ID  QSTRT QEND SSTRT SEND MTGYDAIQA
1.1  emb:CAB10023|   0.000e+00  1031.00 94  1     536   1    536  MTGYDAIVA
2.1  gi:2749982      3.000e-33  143.00  24  4     514   15   533  ---YDAII-
3.1  dbj:BAA10561|   9.000e-23  108.00  24  5     516   5    529  ----DVVI-
4.1  dbj:BAA80328.1| 3.000e-20  100.00  23  9     515   1    501  ---------
5.1  dbj:BAA10798|   5.000e-12  73.00   22  4     466   7    443  ---YDAIV-

The first line describes the query; the remaining lines are the hits
extracted from the BLAST file, and each hit gets a single line. The
processed output can be read with an editor and contains all the
information I needed for my project. It is much more compact than the
original BLAST output.
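
Because the filtered output is whitespace delimited, it can be read back
with a few lines of Python. The following is a minimal sketch, assuming
the ten-column layout of the fragment above. The names FltRow and
parse_flt_line are illustrative and are not part of blastflt.py.

```python
from typing import NamedTuple


class FltRow(NamedTuple):
    """One hit line of blastflt.py output (hypothetical helper type)."""
    nr: str          # value of the NR column, e.g. "2.1"
    name: str        # database and sequence identifier
    evalue: float
    score: float
    pid: float       # percent sequence identity (ID column)
    qstart: int
    qend: int
    sstart: int
    send: int
    alignment: str   # this hit's row of the stacked alignment


def parse_flt_line(line: str) -> FltRow:
    """Split one hit line into typed fields (assumes exactly ten columns)."""
    nr, name, e, score, pid, qs, qe, ss, se, aln = line.split()
    return FltRow(nr, name, float(e), float(score), float(pid),
                  int(qs), int(qe), int(ss), int(se), aln)


row = parse_flt_line(
    "2.1  gi:2749982      3.000e-33  143.00  24  4     514   15   533  ---YDAII-"
)
print(row.name, row.evalue, row.qstart, row.qend)
```

The first line of a file (the query line with the column labels) does not
contain numeric fields and would have to be skipped before calling
parse_flt_line.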

The file Rv3829c.fasta is a query sequence which was processed with
blastpgp, and Rv3829c.(psi)blast is the (PSI-)BLAST output. The
filtered output (from blastflt.py) can be found in the two files with
extension 'flt'.

You can rerun the program:

$ ./blastflt.py --fasta Rv3829c.fasta --blast Rv3829c.blast > Rv3829c_blast.flt
parsing blast file Rv3829c.blast ...
generating output ...
$ ./blastflt.py --fasta Rv3829c.fasta --blast Rv3829c.psiblast > Rv3829c_psiblast.flt
parsing blast file Rv3829c.psiblast ...
adding iteration 1 ...
adding iteration 2 ...
adding iteration 3 ...
adding iteration 4 ...
adding iteration 5 ...
generating output ...

The resulting multiple sequence alignment (or stacked alignment) contains
the full length query sequence and 713 subject sequences.

A new feature in version 1.2 of the blast parser is to cluster redundant
hits and then take just one representative for each of these clusters,
e.g. the above 713 hits can be reduced to 550 non-redundant hits with the
following command:

$ ./blastflt.py --fasta Rv3829c.fasta --blast Rv3829c.psiblast  --cluster --keep '\|(pdb|sp)\|'> Rv3829c_psiblast.flt2
parsing blast file Rv3829c.psiblast ...
adding iteration 1 ...
adding iteration 2 ...
adding iteration 3 ...
adding iteration 4 ...
adding iteration 5 ...
cpu time: 76.34 sec
Rv3829c: clustering redundant hits ...
Rv3829c: sequences = 713, clusters = 550, hits = 541
cpu time for clustering: 68.64 sec
generating output ...

This clusters all hits from source databases other than pdb or sp (swissprot).
You may be especially interested in hits to pdb or swissprot, so for these
full redundancy is accepted. All other hits are put into the same cluster if
they share at least 80% sequence identity and if they align within 10 residues
offset against each other in the query sequence (i.e. they are roughly in the
same region of the query). In addition, all sequences in the same cluster
overlap by at least 90%.
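
The criteria above can be sketched as a simple greedy clustering. This is
not the code from misc.py: the type Hsp, the functions identity, overlap,
similar and cluster, and the sequence names are made up for illustration.
The sketch also assumes that "within 10 residues offset" applies to both
the start and the end of the alignment in the query, and it compares each
new row only with the first member of a cluster.

```python
import re
from typing import List, NamedTuple


class Hsp(NamedTuple):
    """Hypothetical stand-in for one row of the stacked alignment."""
    name: str
    qstart: int
    qend: int
    row: str  # stacked alignment row, '-' marks a gap


def identity(a: str, b: str) -> float:
    """Percent identity over the columns where both rows have a residue."""
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper())
             if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def overlap(a: Hsp, b: Hsp) -> float:
    """Percent of the shorter query range that is covered by both rows."""
    shared = min(a.qend, b.qend) - max(a.qstart, b.qstart) + 1
    shorter = min(a.qend - a.qstart, b.qend - b.qstart) + 1
    return 100.0 * max(shared, 0) / shorter


def similar(a: Hsp, b: Hsp, min_id=80.0, max_offset=10, min_overlap=90.0) -> bool:
    return (abs(a.qstart - b.qstart) <= max_offset
            and abs(a.qend - b.qend) <= max_offset
            and overlap(a, b) >= min_overlap
            and identity(a.row, b.row) >= min_id)


def cluster(hsps: List[Hsp], keep: str = r"\|(pdb|sp)\|") -> List[List[Hsp]]:
    keep_re = re.compile(keep)
    clusters = []        # every cluster, in order of creation
    open_clusters = []   # clusters that may still receive new members
    for h in hsps:
        if keep_re.search(h.name):
            clusters.append([h])   # kept hits are never merged
            continue
        for c in open_clusters:
            if similar(c[0], h):   # compare with the representative only
                c.append(h)
                break
        else:
            c = [h]
            clusters.append(c)
            open_clusters.append(c)
    return clusters


hsps = [
    Hsp("gi|1|sp|P1|A", 1, 20, "MTGYDAIQAMTGYDAIQAMT"),
    Hsp("gi|2|emb|C1|", 1, 20, "MTGYDAIQAMTGYDAIQAMT"),
    Hsp("gi|3|emb|C2|", 2, 20, "-TGYDAIQAMTGYDAIVAMT"),
    Hsp("gi|4|gb|G1|", 1, 20, "MKKKKKKKKKKKKKKKKKMT"),
]
clusters = cluster(hsps)
print([len(c) for c in clusters])
```

The first member of each cluster acts as its representative. A stricter
variant would compare a candidate with every member of a cluster, at the
cost of more comparisons.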

The above output tells you that from 713 initial hits 550 clusters were built
(representing 541 hits; there are more clusters than hits because there may be
different clusters per HSP of a hit. In a blast output the number of HSPs is
always >= the number of hits!).

NOTE: clustering is rather slow at the moment!

(Stderr is logged to file 'blastflt.log' in the current directory.)

WHAT THE BLAST-PARSER OFFERS:
-----------------------------

The parser needs a file object of class FIO (included in this package)
that represents the BLAST/PSI-BLAST output. BLAST output is treated as
PSI-BLAST output with only one iteration. The parser reads in the file
and represents it as a datastructure which is directly accessible for
you (the programmer).


THE BLAST/PSI-BLAST DATASTRUCTURE:
----------------------------------

The datastructure is a tree. The root is the entire BLAST result,
which contains a list of iterations (only one for BLAST). Each
iteration contains a dictionary of hits (hit name + database as
key). Each hit in an iteration contains a list of HSPs (high scoring
pairs). Each HSP contains an alignment, e-value, percent identity, etc.
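
The shape of this tree can be sketched with a few small classes. This is
only an illustration of the nesting described above, not the real classes
from blast.py: the attribute names (iterations, hits, hsps, evalue, score,
pid) and the function hits_below are assumptions, and the last hit in the
example data is invented.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Tuple


@dataclass
class HSP:
    evalue: float
    score: float
    pid: float            # percent identity
    alignment: str = ""


@dataclass
class Hit:
    id: str
    hsps: List[HSP] = field(default_factory=list)


@dataclass
class Iteration:
    hits: Dict[str, Hit] = field(default_factory=dict)   # keyed by hit id


@dataclass
class Blast:
    iterations: List[Iteration] = field(default_factory=list)


def hits_below(blast: Blast, max_evalue: float = 0.0005) -> Iterator[Tuple[int, str, HSP]]:
    """Walk the tree and yield (iteration number, hit id, HSP) for every
    HSP whose e-value is <= max_evalue."""
    for number, iteration in enumerate(blast.iterations, start=1):
        for hit_id, hit in iteration.hits.items():
            for hsp in hit.hsps:
                if hsp.evalue <= max_evalue:
                    yield number, hit_id, hsp


# A plain BLAST run: one iteration. The first two hits use values from the
# example output above; the third hit is invented to show the filter.
blast = Blast([Iteration({
    "emb:CAB10023|": Hit("emb:CAB10023|", [HSP(0.0, 1031.0, 94)]),
    "gi:2749982": Hit("gi:2749982", [HSP(3e-33, 143.0, 24)]),
    "example:weak": Hit("example:weak", [HSP(0.5, 30.0, 18)]),
})])
selected = [(n, hit_id) for n, hit_id, _ in hits_below(blast)]
print(selected)
```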


HOW YOU CAN USE THE PARSER:
---------------------------

Please have a look at the files fio.py, blast.py and blastflt.py for more
information.

CHANGES:
--------

blast.py, since version 1.0:

DATE: 16.11.99
   - self.tokens contains all tokens of the above parsing state (e.g. iteration
   contains all tokens of blast). All tokens of the above state are associated
   with method self.exitParser. This makes parsing much more flexible since the
   state parser (e.g. Hit) can jump out as soon as it detects something that
   belongs to its outer parsing state.
   - Class Blast no longer gets an optional argument IterationClass used to generate
   Iteration objects, also Iteration no longer gets a HitClass and Hit no longer gets
   a HSP class to generate HSP objects.
   - The blast object gets an optional iteration, hit and hsp class which is
   used to generate the appropriate objects.
   - added drift check to blast class. Sometimes PSI-BLAST loses hits collected
   during the first iteration. The blast-object class provides a method to stop
   collecting hits once a drift is detected (e.g. when hits of the first iteration
   get lost in iteration 4, parsing will be aborted after iteration 4).
   - attributes hit.name and hit.db are removed. Only hit.id exists. That makes
   parsing more flexible with respect to different database formats (e.g. NCBI
   NRPROT)
   - change of ending parsing state, blast objects are persistent and can be
   pickled.
   - the number of hits in the summary block doesn't have to equal the number of
   sequences in the alignment block (-v 0 -b 2000 is possible and vice versa)
   - parsing and storage of blast run information, blast header and footer are stored
   in blast object.
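
The drift check mentioned in the list above can be illustrated as follows.
This is a rough sketch of the idea only, assuming that a drift means "a
hit of iteration 1 is missing from a later iteration". The function
first_drift and the hit identifiers are made up, and the 'drift' method in
blast.py may use different criteria.

```python
from typing import List, Optional, Set


def first_drift(iterations: List[Set[str]]) -> Optional[int]:
    """Return the 1-based number of the first iteration that lacks a hit
    found in iteration 1, or None if no first-iteration hit is ever lost."""
    if not iterations:
        return None
    reference = iterations[0]
    for number, hit_ids in enumerate(iterations[1:], start=2):
        if not reference <= hit_ids:   # some first-iteration hit is missing
            return number
    return None


# Hit ids per iteration (invented identifiers for illustration).
run = [{"a", "b"}, {"a", "b", "c"}, {"a", "b", "c", "d"}, {"a", "c", "d"}, {"c", "d"}]
drift_at = first_drift(run)
# Keep everything up to and including the iteration in which the drift shows up.
kept = run if drift_at is None else run[:drift_at]
print(drift_at, len(kept))
```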
   
DATE: 19.11.99
   - Compilation of patterns for Tokens takes place inside the class and not in
   object construction (__init__). That means the regular expressions are compiled
   only once! Classes inheriting from these classes can still change the tokens list
   and define their own patterns.
   - class Token now accepts a string or a precompiled re object as first argument
   (necessary for implementation of the previous item).
   - parseAlignments in Class hit is changed from recursive to iterative implementation
   to avoid large execution stack
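
The effect of compiling the patterns in the class body can be shown with a
small sketch. The classes below are simplified stand-ins and do not
reproduce the real Token class or parser classes from blast.py. The
handler names and the three patterns are assumptions chosen for
illustration.

```python
import re


class Token:
    """Pair a pattern with a handler name (simplified, hypothetical)."""

    def __init__(self, pattern, handler):
        # Accept either a pattern string or an already compiled re object.
        self.regex = re.compile(pattern) if isinstance(pattern, str) else pattern
        self.handler = handler


class HitParser:
    # Built when the class body is executed, i.e. once per class and
    # not once per HitParser instance.
    tokens = [
        Token(r"^>", "new_hit"),
        Token(re.compile(r"^\s*Score\s*="), "new_hsp"),
    ]

    def classify(self, line):
        """Return the handler name of the first matching token, or None."""
        for token in self.tokens:
            if token.regex.search(line):
                return token.handler
        return None


class MyHitParser(HitParser):
    # A subclass can still replace or extend the token list.
    tokens = HitParser.tokens + [Token(r"^\s*Database:", "footer")]


print(HitParser().classify(">gi|2749982"))
print(MyHitParser().classify("  Database: nr"))
```

All instances of a class share the same token list, so creating many
parser objects does not recompile any regular expression.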

blastflt.py, since version 1.0:

DATE: 07.11.99
   - changed commandline options from short to long format
   - added commandline options: minpid, evalue, help - see help text or source
   code for details.
   - added function 'Thresholds' to delete data outside given thresholds
   'evalue' or 'pid' from blast structure
   - function 'getHSPlist'. See function's docstring for details. Currently
   not used (only for development purpose)

blastflt.py, since version 1.1:

DATE: 14.12.99
   - Removing redundant hits from iterations (in class MyIteration) conflicted
     with driftfilter (when activated) because redundant hits removed from an
     iteration will be recognized as a drift. Fixed that bug by setting
     redundant hits to None and removing these after parsing is complete.

CHANGES in version 1.2:

    (the gapBlast and PSI-blast features of this version were tested with NCBI
     blastpgp 2.0.11 and 2.0.12)

  - New module misc.py with some useful routines (class independent) and new
    class 'AlignmentContainer' which takes several sequences and clusters them.
    This is useful to make a multiple sequence alignment non-redundant. This is
    now used by blastflt.py.

  - blastflt.py: several bug fixes + optional non-redundancy filter of the 
    resulting multiple sequence alignment. You can set thresholds for e-value
    and percent sequence identity that are kept in the blastflt output.

  - blast.py: handles tblastn and blastn and blast output from multiple runs.

CHANGES in version 1.2.2 - 1.2.3:

  - Changed values for blast object's 'status' attribute

  - added option to Blast constructor, e.g. b = Blast(f, limit = 500)
    limit list of HSPs to 500

  - changed method 'drift' and added methods 'checklimit' and 'HitCmp' to
    class Blast.

  - Bug fix to deal with blastpgp-2.2.*

  - blastflt.py: Merged in changes from Stephan Herschel from 29.01.2000.
    New switch 'nostacked' to turn off stacking of pair alignments.

CHANGES in version 1.2.4:

  - blastflt.py: added option '--descrip' to append description line to output

  - blast.py: fixed a problem with integer overflow in "letters in databases"

  PLEASE look at the modules themselves (docstrings and comments) to get more
  detailed information about changes + fixes + examples.


Acknowledgements: Thanks to Stephan Herschel from ProCeryon Biosciences GmbH
for providing changes and giving feedback.

CONTACT:
--------

Help needed? Questions, suggestions, bugs and enhancements to:
Arne Mueller, arne.muller@aventis.com
