"""blast2 (gap blast), TBLASTN & PSI-BLAST parser

   blast.py (c) 1999 - 2003 Arne Mueller

   NAME   : blast.py
   TYPE   : module 
   PYTHON : 1.5.2
   VERSION: 1.2.2
   AUTHOR : Arne Mueller
   DATE   : 16.04.99

   DESCRIPTION:
   A blast file is completely parsed and represented as a tree
   data structure. The root of the tree is the blast object (b),
   which contains a list of iterations (i) which in turn keeps
   a dictionary with hits (h) with 'database:name' as key. Each
   hit contains an array with hsp's (hsp). The following classes
   are provided (for a description look at the doc string of
   these classes):

   Blast, Iteration, Hit, HSP

   EXAMPLE:
   from fio import *
   from blast import *
   f = FIO('blast_file', 'r')      # buffered file object 
   b = Blast(f)                    # create empty blast object 
   b.createBlastTree()             # generate the blast tree

   # iterate over the whole data structure
   for i in b.iter:
        for h in i.hits.keys():
            for hsp in i.hits[h].hsp:
                 print i.hits[h].id, hsp.score
   hsp = b.createHSPlist()         # get a list with all HSPs

   NOTES:
   Only blast output generated with pairwise alignments (-m 0)
   is handled; for other alignment formats the effect is undefined (not
   implemented)! The file objects used in this module have to be
   of class FIO of module fio.

   Since blastpgp-2.0.12 the sequence entries (name and description) from
   a binary formatted ncbi-nrprot are separated by a new line (for sequences
   with more than one redundant entry). The current version of this parser
   only parses the first line (starting with '>'), which is used to compile
   the id and the description of the hit. The other sequence names and
   descriptions for this entry are ignored.

   The module was tested with blastpgp version 2.0.6, 2.0.8, 2.0.9,
   2.0.10, 2.0.11, 2.0.12, 2.0.13, 2.0.14, 2.1.*, 2.2.*

   TODO:
   o Implement a dynamic token specific match routine that doesn't
   use regular expressions where they can be avoided, because regexes
   are too slow in python!
   o Add proper exception handling to cope with corrupted blast files.
   o Out-source regular expressions for Tokens into a separate file so
   the blast parser itself is independent of any (smaller) changes in
   the output format.
   references and Class references?
   o Speed up parsing! Starting points:
         - Currently all regular expressions are compiled when constructing
         a new object! That's not always required - it may be possible to
         put some regex compilation into class name space
         - Tokens (regexes) are sort of inherited from blast-obj to
           iteration-obj to hit-obj to hsp-obj! There might be a way to
           optimize Token testing (find a shortcut).
   o there's no support for PSI-tblastn (in blast since version 2.2.*)
   
   CHANGES:
   ========
   
   VERSION 1.0:
   ------------

   DATE: 20.09.99
   - in class HSP: added self.q_start, self.q_end and self.s_start,
     self.s_end, to store information about position of alignment with
     respect to query AND subject. self.start/end is still supported and
     is the same as self.q_start/self.q_end. (See also class docstring.)
     
   DATE: 13.10.99
   - added attribute dbsep to constructor of class Parser. dbsep
     specifies the string that separates the database name from the
     sequence name in blast output. The string is set to '\|' (has to be
     escaped). It recognizes entries of the form sp|P36334|VGL2_CVHOC,
     where sp is the database (swissprot) and P36334|VGL2_CVHOC is a
     sequence name. All regular expressions including the database
     separator were updated by '...%s...'%self.dbsep ! NOTE: this is only
     a quick workaround, for the long term it's useful to set the blast
     object's dbsep to a certain value so that all dependent objects are
     updated automatically!

   DATE: 04.11.99
   - revised the change from 13.10.99, 'dbsep' is not included in the Parser
     __init__ function anymore - it's now a class attribute. In a client
     program you can now set the database/name separator by doing:
        Parser.dbsep = string

   DATE: 07.11.99
   - parseAlignment in class Hit now sets self.e, self.score, self.status
   to the values of its first HSP if not already set by the parseHits function in
   class Iteration. That is to cope with situations where the number of hits listed
   in the summary block is smaller than the number of hits listed in the alignment block.

   VERSION 1.1:
   ------------

   DATE: 16.11.99
   - self.tokens contains all tokens of the above parsing state (e.g. iteration
   contains all tokens of blast), all tokens of the above state are associated
   with method self.exitParser. This makes parsing much more flexible since the
   state parser (e.g. Hit) can jump out as soon as it detects something that
   belongs to its outer parsing state.
   - Class Blast doesn't get an optional argument IterationClass used to generate
   Iteration objects, also Iteration doesn't get a HitClass and Hit doesn't get a
   HSP class to generate HSP objects.
   - The blast object gets an optional iteration, hit and hsp class which is
   used to generate the appropriate objects.
   - added drift check to blast class. Sometimes PSI-BLAST loses hits collected
   during the first iteration. The blast-object class provides a method to stop
   collecting hits once a drift is detected (e.g. when hits of the first iteration
   get lost in iteration 4, parsing will be aborted after iteration 4).
   - attributes hit.name and hit.db are removed. Only hit.id exists. That makes
   parsing more flexible with respect to different database formats (e.g. NCBI
   NRPROT)
   - change of ending parsing state, blast objects are persistent and can be
   pickled.
   - the number of hits in the summary block doesn't have to equal the number of
   sequences in the alignment block (-v 0 -b 2000 is possible and vice versa)
   - parsing and storage of blast run information, blast header and footer are stored
   in blast object.
   
   DATE: 19.11.99
   - Compilation of patterns for Tokens takes place inside the class and not in
   object construction (__init__). That means the regular expressions are compiled
   only once! Classes inheriting from these classes can still change the tokens list
   and define their own patterns.
   - class Token now accepts a string or a pre-compiled re object as first argument
   (necessary for implementation of the previous item).
   - parseAlignments in class Hit is changed from recursive to iterative implementation
   to avoid a large execution stack

   DATE: 14.12.99
   - added method getGaps to class HSP

   VERSION 1.2:
   ------------

   DATE: 15.02.2000
   Added support for TBLASTN. If a tblastn file is parsed the blast.type is set to
   'TBLASTN' (for BLAST2 or PSI-BLAST it's 'BLASTP'). In each HSP the frame is stored
   in self.frame.

   NOTE: In TBLASTN a target sequence that matches in different translation frames
   is treated as different HSPs of that sequence (Hit). It's also possible that one
   translation frame can have multiple HSPs.

   VERSION 1.2.1:
   --------------

   DATE: 13.04.2000
   Changed method HSP.getGaps() - this now returns two lists of gaps in a tuple:
   The first element is the gap list for the query, the second for the subject. The gap
   list keeps the start position and the length (i.e. extent) of the gaps, e.g.

   gaps = [23,2,34,1] where gaps[0] is the start of the first gap, gaps[1]
   is its length, gaps[2] is the start of the next gap, etc. ...
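
   A flat gap list of this form can be regrouped into (start, length)
   pairs. The helper below is only an illustrative sketch; pair_gaps is a
   hypothetical name and is not part of this module:

```python
def pair_gaps(flat):
    # Group a flat [start, length, start, length, ...] list into tuples.
    # pair_gaps is a hypothetical helper, not provided by blast.py.
    if len(flat) % 2 != 0:
        raise ValueError('gap list must hold start/length pairs')
    return [(flat[i], flat[i + 1]) for i in range(0, len(flat), 2)]

# The example list from the text above.
pairs = pair_gaps([23, 2, 34, 1])
```

   With the list from the text, pairs holds one tuple per gap, which is
   usually easier to loop over than the flat list.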

   DATE: 14.04.2000
   An Iteration object now sets the status attribute in the blast object that was
   passed to the Iteration constructor. As soon as an iteration detects results (i.e.
   hits) it sets the status attribute to 'collecting'; when the 'CONVERGED' keyword
   is detected the status is set to 'converged'. The initial status of a blast object
   is 'empty' - if this is the final status, there were no hits found for the query!
   
   Changed method Blast.getVersion(): if a version string is detected (e.g.
   BLASTN 2.0.11 [Jan-20-2000]) and Blast.version is already set, this indicates
   the end of parsing and the last parsed line is put back into the FIO object. This
   is useful to read in multiple blast files (files containing the results of several
   blast runs). Together with the new 'test()' method in FIO you can use that in the
   following way:

   f = FIO('blast.hits.out')
   multi_blast = []
   while f.test():
      current = Blast(f)
      current.createBlastTree()
      multi_blast.append(current)
   
   DATE: 04.05.2000
   Bugfix in Iteration.__init__ and Blast.parseIterations, Blast.exitParser. When
   running PSI-BLAST and there are no hits found in iteration 1, a second trial is still
   done (don't ask me why!). That means you've got two 'Searching...' lines and two
   '***** No hits found ******' in your PSI-BLAST output. This caused the parser to
   create two empty iterations (which I think is an incorrect behaviour). Now, when
   there's a 'No hits found' detected in an iteration, parsing that iteration is
   aborted and it is appended to the blast object's iter list. The blast object's status
   is 'empty' (if there is no non-empty iteration!).

   The blast object's status is 'crash' if there was no 'S2' string found in the blast
   output. This is usually the last line of a blast file (part of the statistics).

   DATE: 05.05.2000
   In Iteration.parseAlignments() hits now get the attribute 'old' or 'new'
   depending on whether a hit is new or was found previously. Note this feature
   depends on -v == -b options in PSI-BLAST!!!

   VERSION 1.2.2:
   --------------

   DATE: 30.01.2001

   Added status 'blast' and 'limit', the parser's status (blast.status) can now have
   7 different values:
   'blast' means output comes NOT from Psi-Blast, but maybe from blast
           - output is ok (i.e. this is a valid blast or tblast run)
   'empty' means no hits have been found (in PsiBlast, Blast, tBlastN)
   'crash' invalid output - something is missing
   'drift' if the drift-filter was used and a drift was detected
   'limit' if an HSP limit was set and the limit was reached
   'collecting' PsiBlast only, model did not converge
   'converged'  PsiBlast only, model converged

   The Blast object can be created with a limit: b = Blast(f, limit = 500).
   In Blast/PsiBlast every iteration is checked not to exceed this limit; otherwise
   the hits of this iteration are sorted via Blast.HitCmp and the list of hits is
   truncated to the limit, no further iterations are processed, and the parser proceeds
   to the end of the Blast/Psi-Blast run (i.e. skips further iterations).

   You can reduce the number of alignments in a blast-output by setting the -b
   option to a limit, but blast will not report if it reached the limit!

   If a drift was detected the parser proceeds the same way as for an exceeded
   limit.

   VERSION 1.2.3:
   --------------

   DATE: 15.05.2001

   Added method GetLKH to Blast class to parse the 'Lambda K H' statistics line

   DATE: 17.08.2001
   
   Adjusted method 'GetNumber' in class Blast to cope with the blastpgp-2.2 series.
   The method now removes ',' in the (integer) numbers reported in the blast
   statistics, so that these numbers are in fact integers!
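
   The idea can be sketched as follows. This is only an assumption about
   the approach, not the actual GetNumber code; parse_grouped_int is a
   hypothetical name:

```python
def parse_grouped_int(text):
    # Drop the ',' thousands separators, then convert to an integer.
    # Hypothetical helper; the module's own GetNumber may differ.
    return int(text.replace(',', ''))

# A number written the way the blast statistics section groups digits.
letters = parse_grouped_int('1,234,567')
```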

   VERSION 1.2.4:
   --------------

   DATE: 17.07.2003

   dbletters can be very big and is now stored in a long integer.
   
"""
from re import *
import sys
import string
import types
import time


###
### Class definitions
###
###############################################################################

class RegString:

    """Regular Expressions & Strings

       TYPE       : class
       CONSTRUCTOR: list
       DESCRIPTION:
       Extension to regular expressions and strings.

       NOTE:
       Still under construction
    """

    def catBuffer(self, buffer):
        "Returns the concatenation of all elements of its argument, which is a list!"
        cat = ''
        for i in buffer:
            cat = cat + i
        return cat

###############################################################################

class Token:

    """Token for the module's Parser class 

       TYPE       : class
       CONSTRUCTOR: 1. a regular expression, either an uncompiled pattern
                       string or a pre-compiled object of module re
                    2. reference to a function that's associated
                       with the regular expression
       DESCRIPTION:
       This class represents a token that's used by the module's
       Parser class. A token keeps a regular expression which is
       actually the token (or keyword) and a reference to a function
       that's associated with that keyword. The function is called
       with the token object as argument.

       A Token object provides 4 members:
       self.token : a regular expression object (see module re)
       self.action: reference to the associated function
       self.m     : a matching object (see module re)
       self.str   : the last string that was passed for comparison
                    with the token's regular expression

       EXAMPLE:
       token = Token('^data: ', self.processData)
    """
    
    def __init__(self, token, action=None):
        if type(token) == types.StringType:
            self.token = compile(token)
        else:
            self.token = token
        self.action = action
        self.m = None
        self.str = None

    def match(self, str):
        "returns self object if argument (string) matches token else None"
        if not str: return None
        self.str = str
        self.m = self.token.search(str)
        if self.m:
            return self
        else:
            return None

###############################################################################

class Parser(RegString):

    """Simple keyword based Parser  

       TYPE       : class
       CONSTRUCTOR: object of class FIO
       DESCRIPTION:
       The Parser class is intended to be inherited by other classes.
       It can be used to search a text file provided by an FIO object
       for keywords (tokens) given by self.tokens, which is a list of
       valid objects of class Token. An object which uses the parser
       class has to provide the list self.tokens! See the function
       doc strings for different methods to parse text.
       
       EXAMPLE:
       class Talk(Parser):
           def __init__(self, f):
               self.f = f
               Parser.__init__(self, f) # initialize the Parser!
               # the list of tokens to be recognized
               self.tokens = [ Token('^Question: ', self.processQuestion),
                               Token('^Answer: ', self.processAnswer) ]
               self.parse() # call the parser to process text
           # function to execute when found an 'Answer' token
           # (processQuestion is defined in the same way)
           def processAnswer(self, token):
               print 'Answer is: ', token.str
               self.parse() # recall the parser to get next token

       NOTE:
       Be careful using the above example way to parse text because it's
       recursive! It's also possible to use 'while' loops to do the job!
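
       A loop-based variant of the keyword dispatch can be sketched as
       below. It is a standalone illustration that does not use Parser,
       Token or FIO; dispatch_lines and handlers are hypothetical names:

```python
import re

def dispatch_lines(lines, handlers):
    # handlers is a list of (compiled pattern, callable) pairs, tried in order.
    # Each line is handled by the first matching pattern; other lines are skipped.
    results = []
    for line in lines:
        for pattern, action in handlers:
            m = pattern.search(line)
            if m:
                results.append(action(m))
                break
    return results

handlers = [(re.compile('^Question: (.*)$'), lambda m: ('Q', m.group(1))),
            (re.compile('^Answer: (.*)$'), lambda m: ('A', m.group(1)))]
found = dispatch_lines(['Question: why?', 'noise', 'Answer: because'], handlers)
```

       Because the loop never calls itself, the call stack stays flat no
       matter how many lines are processed.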
    """
    
    def __init__(self, f):
        self.f = f

    def parserError(self, token=None):
        "Prints its argument (token object or None) and exits the program."
        if token:
            sys.stderr.write('Parser Error at: %s, %s\n' % (token.token.pattern, token.str))
        else:
            sys.stderr.write('unknown parser Error!\n')
        sys.exit(1)

    def parseInt(self, i):
        "Returns its string argument as integer."
        return string.atoi(i)

    def parseFloat(self, f):
        "Returns its string argument as float."
        if f[0] == 'e': f = sub('^e', '1e', f)
        return string.atof(f)

    def nextContensLine(self):
        "Returns next non-blank line of object's file stream."
        while( 1 ):
            l = self.f.getLine()
            if not l or string.split(l):
                return l
            
    def match(self, str):
        "Looks for a parser token that matches argument (string) and returns it else None."
        for i in self.tokens:
            if i.match(str):
                return i
        return None

    def readToToken(self, token):
        "Reads up to the next line matching argument (token); returns list of read lines, or None if no line matches."
        buffer = []
        l = self.nextContensLine()
        while( l ):
            if token.match(l):
                self.f.ungetLine(l)
                return buffer
            buffer.append(l)
            l = self.nextContensLine()
        return None    

    def returnToken(self, token):
        "Returns (token, n) for the next line matching 'token', n being the number of skipped lines, or None."
        l = self.nextContensLine()
        n = 0
        while( l ):
            if token.match(l):
                return token, n
            l = self.nextContensLine()
            n = n + 1
        return None   

    def searchToken(self):
        "Reads to next line matching a token in token list, returns token or None."
        l = self.nextContensLine()
        while( l ):
            found = self.match(l)
            if found: return found
            l = self.nextContensLine()
        return None

    def delToken(self, token):
        "Removes argument (token) from token list (self.tokens)"
        n = 0
        for i in self.tokens[:]:
            if i == token:
                del self.tokens[n:n+1]
                return
            n = n + 1

    def parse(self):
        "Calls 'searchToken' and executes function associated with token, returns found token."
        # sys.stderr.write('parsing token ...\n')
        t = self.searchToken()
        if t:
            t.action(t)
        else:
            self.exitParser(None)
        return t 
        
    def dummy(self, token):
        "A Dummy function, doing nothing."
        pass    

    def ExitTokens(self, tokens):
        "Returns copies of argument's tokens (list) whose action is self.exitParser."
        exit_tokens = []
        for t in tokens:
            exit_tokens.append(Token(t.token, self.exitParser))
        return exit_tokens
    
    def exitParser(self, token):
        "Exits current parsing state, resets token list and puts argument (token) back to file stream."
        # sys.stderr.write('exiting iteration ...\n')
        if token: self.f.ungetLine(token.str)
        delattr(self, 'f') # get rid of file descriptor (useful for pickling!)
        self.tokens = []
        return token
        
###############################################################################

class HSP(Parser):

    """HSP (High-scoring Segment Pair) for a hit

       TYPE       : class
       CONSTRUCTOR: a hit object to which the HSP object will belong
       DESCRIPTION:
       Objects of this class keep information about all the hsp entries
       of a hit and provide methods to parse/handle the hsp's. An hsp
       is the list of alignments for a hit including its associated
       information like e value, score, %id.

       See function's doc strings for details.

       HSP objects provide 9 members:
       1. self.score   : score of hsp (float)
       2. self.e       : evalue of hsp (float)
       3. self.identity: percent id of hsp (int)
       4. self.q_start : first query residue in alignment (int)
       5. self.q_end   : last query residue in alignment (int)
       6. self.s_start : first subject residue in alignment (int)
       7. self.s_end   : last subject residue in alignment (int)
       8. self.query   : (gapped) query sequence of alignment (string)
       9. self.subject : (gapped) subject/hit sequence of alignment (string)
       optional:
       10. self.frame  : translation frame in TBLASTN mode
       
       EXAMPLE:
       Function parseHSP should be the function to call when parsing
       all hsp entries for a hit.
    """

    ClassTokens = [compile('^Query: (\d+)\s*([a-zA-Z\-\?\.\*]+) (\d+)$'),
                   compile('^Sbjct: (\d+)\s*([a-zA-Z\-\?\.\*]+) (\d+)$'),
                   compile('Score\s+=\s+(\S+).+?\s+(\S+)$'),
                   compile('Identities .+?\((\d+)%\)'),
                   compile('Frame\s+=\s+(\S+)')]
    
    def __init__(self, hit):
        self.score = None
        self.e = None
        self.identity = None
        self.query = ''
        self.subject = ''
        self.start = 0
        self.end = 0
        self.q_start = 0
        self.q_end = 0
        self.s_start = 0
        self.s_end = 0
        
        Parser.__init__(self, hit.f)
        self.tokens  = [ Token(HSP.ClassTokens[0], self.parseAlignment),
                         Token(HSP.ClassTokens[1], self.dummy),
                         Token(HSP.ClassTokens[2], self.parseScoring),
                         Token(HSP.ClassTokens[3], self.parseIdentity),
                         Token(HSP.ClassTokens[4],
                               self.parseFrame)] + self.ExitTokens(hit.tokens)
      
    def parseHSP(self, token):
        "Start parsing an hsp given by argument (token)."
        #sys.stderr.write('parsing hsp ...\n')
        # token.action = self.exitParser
        self.parse()

    def parseScoring(self, token):
        "Extract score and e value from argument (token)."
        #sys.stderr.write('parsing hsp scoring ...\n')
        self.score = self.parseFloat(token.m.group(1))
        self.e = self.parseFloat(token.m.group(2))
        token.action = self.exitParser
        self.parse()
        
    def parseIdentity(self, token):
        "Extract percent identity from argument (token)."
        #sys.stderr.write('parsing hsp identities ...\n')
        self.identity = self.parseInt(token.m.group(1))
        token.action = self.exitParser
        self.parse()
        
    def parseFrame(self, token):
        "Extracts the translation frame when in TBLASTN mode."
        #sys.stderr.write('parsing frame ...\n')
        self.frame = token.m.group(1)
        token.action = self.exitParser
        self.parse()
    
    def parseAlignment(self, token):
        "Parses start,end, query, subject of hsp starting with argument (token)."
        #sys.stderr.write('parsing hsp alignment ...\n')
            
        ### set start and end position of alignment
        ### with respect to query sequence
        if not self.q_start: self.q_start = self.parseInt(token.m.group(1))
        self.q_end = self.parseInt(token.m.group(3))

        ### query sequence
        self.query = self.query + token.m.group(2)

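        ### subject/hit sequence
        found = self.returnToken(self.tokens[1])
        ### returnToken yields None if no 'Sbjct:' line follows
        if not found: self.parserError(self.tokens[1])
        token, n = found
        if n > 1: self.parserError(self.tokens[1])
        ### set start and end position of alignment
        ### with respect to subject sequence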
        if not self.s_start: self.s_start = self.parseInt(token.m.group(1))
        self.s_end = self.parseInt(token.m.group(3))
        ### append alignment line to subject
        self.subject = self.subject + token.m.group(2) 
        ### next token
        self.parse()

    
    def getGaps(self):
        "Returns the list of gaps for query and subject in a tuple"
        return (self._getGaps(self.query), self._getGaps(self.subject))
        
    def _getGaps(self, sequence):
        "internal routine, used by 'getGaps'"
        gaps   = []
        start  = -1
        n      = 0
        length = 0
        for i in sequence:
            if i == '-':
                if start == -1: start = n
                length = length + 1
            else:
                n = n + 1
                if start > -1:
                    gaps = gaps + [start, length]
                    length = 0
                    start = -1
        if start > -1: gaps = gaps + [start, length]
        return gaps

    def exitParser(self, token):
        "Exits parsing state and puts back argument's (token) string to stream."
        #sys.stderr.write('parsing hsp exitParser ...\n')
        self.start = self.q_start
        self.end = self.q_end
        if len(self.query) != len(self.subject):
            self.parserError(self.tokens[1])
        Parser.exitParser(self, token)
        
###############################################################################

class Hit(Parser):

    """Blast Hit

       TYPE       : class
       CONSTRUCTOR: An iteration object the new Hit object will belong to 
                   
       DESCRIPTION:
       Objects of this class keep information about a Hit and
       provide methods to parse/handle hits. A Hit is an entry
       in the detailed part of blast output (the alignments). 

       See function's doc strings for details.

       Hit objects provide 8 members:
       1. self.id     : identifier of hit ('database_name:name_of_hit')
       2. self.db     : database name (NOT supported anymore !!!)
       3. self.name   : name of hit (NOT supported anymore !!!)
       4. self.descr  : complete description string of hit sequence
       5. self.length : full length of hit/subject sequence
       6. self.hsp    : list of hsp entries for hit (usually only one hsp)
       the following members are not set by class methods, but may be set
       by the object's caller (Iteration):
       7. self.score  : blast score (in bits) for best hsp of hit (float)
       8. self.e      : e value for best hsp of hit (float)
       
       EXAMPLE:
       Function parseAlignment should be the function to call when parsing
       a hit.
    """

    ClassTokens = [compile('Length\s+=\s+(\d+)$'),
                   compile('Score\s+=')]

    def __init__(self, iteration):
        self.HSP = iteration.HSP # class to construct HSPs within this hit
        # self.DbName = compile('(^[\w\d]+)[\|\:](\S+)$') # needed in getDbNamefromId()
        # self.name = ''
        # self.db = ''
        self.id = ''
        self.score = None
        self.e = None
        self.descr = ''
        self.length = 0
        self.hsp = []
        Parser.__init__(self, iteration.f)
        self.tokens = [ Token(Hit.ClassTokens[0], self.parseLength),
                        Token(Hit.ClassTokens[1], self.parseHSP) ] + self.ExitTokens(iteration.tokens)

    ### getDbNamefromId not used yet
    def getDbNamefromId(self):
        "Returns name of database and name of sequence in a tuple (based on self.id)."
        m = self.DbName.search(self.id)
        if m:
            return m.groups()
        else:
            return ('','')

    def parseAlignment(self, token):
        "Parses a hit in the alignment block, argument (token) keeps first line of hit."
        #sys.stderr.write('parsing alignment ...\n')
        ### maybe set id, name, db
        if not self.id  : self.id   = token.m.group(1)
        # if not (self.db and self.name): self.db, self.name =  self.getDbNamefromId()
        ### this part creates the complete 'description'
        ### string from the alignment header
        token.str = sub(token.token.pattern, '', token.str)
        buffer = self.readToToken(self.tokens[0])
        if buffer == None: self.parserError(self.tokens[0])
        buffer[:0] = [token.str]
        description = self.catBuffer(buffer)
        self.descr = sub('\s+', ' ', description)
        self.parse()

    def parseLength(self, token):
        "Extracts the Length from argument (token)."
        #sys.stderr.write('parsing length ...\n')
        self.length = self.parseInt(token.m.group(1))
        token.action = self.exitParser
        self.parse()
        
    def parseHSP(self, token):
        "Creates and parses an hsp using the HSP class taken from the iteration (self.HSP)."
        #sys.stderr.write('parsing HSP ...\n')
        self.f.ungetLine(token.str)
        hsp = self.HSP(self)
        hsp.parseHSP(token)
        self.hsp.append(hsp)
        self.parse()
        
###############################################################################
        
class Iteration(Parser):

    """Iteration, handles blast/psiblast iterations

       TYPE       : class
       CONSTRUCTOR: blast object the iteration belongs to
       DESCRIPTION:
       Objects of this class keep information about an iteration and
       provide methods to parse/handle hits.

       See function's doc strings for details.

       Iteration objects provide 4 members:
       1. self.it    : serial number of that iteration
       2. self.status: 'old' for hits found in previous iterations
                       or 'new' for new hits
       3. self.hits  : dictionary with 'database:name_of_hit' as key
                       containing objects of class Hit or of the hit
                       class passed to the Blast constructor
       4. self.last_insert_id: Id (key) of most recently inserted hit

       EXAMPLE:
       Function parseBlock should be the function to call when parsing
       a complete iteration.
    """
    
    def __init__(self, blast):
        self.blast = blast
        self.Hit = blast.Hit # class to construct Hit within this iteration
        self.HSP = blast.HSP # class to construct HSPs within hits of this iteration
        self.it = 0
        self.status = 'new'
        self.hits = {}
        self.last_insert_id = ''
        Parser.__init__(self, blast.f)
        x = time.clock() 
        self.tokens = [ Token('^>(\S+)', self.parseAlignments),
                        Token('^CONVERGED', self.SetConverged),
                        Token('^(\S+)\s+.+?\s+(\d+)\s+((?:\d+(?:\.\d+)?)|(?:\d*e[\+-]?\d+))$',
                              self.parseHits),
                        Token('\*\s+No\s+hits\s+found', self.NoHits),
                        Token('^Results from round (\d+)$', self.addIt),
                        Token('^Sequences used', self.parseOld),
                        Token('^Sequences not', self.parseNew),
                        Token('^Sequences producing', self.parseBlock) ] + self.ExitTokens(blast.tokens)

    def NoHits(self, token):
        self.blast.status = 'empty'
        self.exitParser(None)
        
    def parseBlock(self, token):
        "Parse a whole Iteration (or contents of simple blast file)."
        # sys.stderr.write('parsing block ...\n')
        self.status = 'new'
        self.parse()
        
    def parseOld(self, token):
        "Sets hit's status to 'old' and parses hit (given by argument token)."
        # sys.stderr.write('parsing old hits ...\n')
        self.status = 'old'
        self.parse()
       
    def parseNew(self, token):
        "Sets hit's status to 'new' and parses hit (given by argument token)."
        # sys.stderr.write('parsing new hits ...\n')
        self.status = 'new'
        self.parse()

    def parseAlignments(self, token):
        """
        Parses alignment part of an iteration, may add hits to dictionary.
        Return the id of the last inserted hit. 
        """
        # sys.stderr.write('parsing alignments in Iteration ...\n')        
        original = token
        ### parseAlignments is not recursive to avoid large execution stacks!
        while token == original:
            id = token.m.group(1)
            if not self.hits.has_key(id):
                self.hits[id] = self.Hit(self)
                self.hits[id].status = self.status
                self.last_insert_id = id
            self.hits[id].parseAlignment(token)
            ### set some attributes if not already set
            if self.hits[id].e == None:
                self.hits[id].e = self.hits[id].hsp[0].e
            if self.hits[id].score == None:
                self.hits[id].score = self.hits[id].hsp[0].score
            token = self.searchToken()
        if token: token.action(token)
        return id
        
    def addIt(self, token):
        "Extracts serial number of iteration from argument (token)."
        self.it = self.parseInt(token.m.group(1))
        # sys.stderr.write('adding iteration %d ...\n' % (self.it))
        self.blast.status = 'collecting'
        self.parse()

    def SetConverged(self, token):
        self.blast.status = 'converged'
        self.parse()
        
    def parseHits(self, token):
        "Parses hits in summary block, argument (token) has to be a hit token."
        ### this function doesn't work recursively
        ### because it could easily put thousands
        ### of function calls onto execution stack!
        # sys.stderr.write('parsing hits ...\n')
        original = token
        while token == original:
            id = token.m.group(1)
            self.hits[id] = self.Hit(self)
            self.hits[id].id = id
            # self.hits[id].db, self.hits[id].name = self.hits[id].getDbNamefromId()
            self.hits[id].score = self.parseFloat(token.m.group(2))
            self.hits[id].e = self.parseFloat(token.m.group(3))
            self.hits[id].status = self.status
            self.last_insert_id = id
            token = self.searchToken()
        if token: token.action(token)
    
###############################################################################

class Blast(Parser):

    """Blast, handling blast output

       TYPE       : class
       CONSTRUCTOR: 1. FIO object
                    2. keyword: iteration, the class to be used to
                    construct iteration objects (default is class Iteration
                    of this module)
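                    3. keyword: hit, the class to be used to
                    construct hit objects (default is class Hit of this
                    module)
                    4. keyword: hsp, the class to be used to
                    construct hsp objects (default is class HSP of this
                    module)
                    5. keyword: limit, the maximum number of HSPs in an
                    iteration before parsing is aborted (default is None,
                    i.e. no limit)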
                    
       DESCRIPTION:
       Objects of this class are the root of the blast tree structure.
       See function's doc strings for details.

       Blast objects provide one member:
       1. self.iter : a list of iterations
       [ ... ] all information from the blast header and footer is stored
       as object attributes!

       EXAMPLE:
       See module's doc string for an example.
    """

    ### exceptions
    Limit = 'Limit'
    Drift = 'Drift'

    def __init__(self, f, iteration=Iteration, hit=Hit, hsp=HSP, limit=None):
        ### defaults for classes to be used to generate objects
        self.Iteration = iteration # Iteration class
        self.Hit = hit  # Hit class
        self.HSP = hsp # HSP class
        self.tokens = [Token('^(BLASTN|TBLASTN|BLASTP)\S* (\S+)', self.GetVersion),
                       Token('^Query= (\S+)', self.GetQueryName),
                       Token('^Database: (\S+)', self.GetDBname),
                       Token('^\s*\((\d+) letters\)$', self.GetQueryLength),
                       Token('(\S+) sequences; (\S+) total letters$', self.GetDBsize),
                       Token('^Searching\.*done$', self.parseIteration),
                       Token('Posted date:\s+(.+)$', self.GetDate),
                       Token('^Lambda\s+K\s+H$', self.GetLKH),
                       Token('^Matrix:\s+(\S+)', self.GetMatrix),
                       Token('^Gap Penalties: Existence:\s*(\d+), Extension:\s*(\d+)', self.GetGapPenals),
                       Token('^Number of\s+(.+)$', self.GetNumber),
                       Token('^effective\s+(.+)$', self.GetEffective),
                       Token('^([AT]):\s+(\d+)', self.GetAT),
                       Token('^([XS])([123]):\s+(\d+)\s+\(\s*(\S+) bits\)', self.GetBits)]
        Parser.__init__(self, f)
        self.iter = []
        self.driftcutoff = 0.0005 # set this to your h value in PSI-BLAST runs!
        self.status = 'empty' # empty, collecting, converged, drifted, blast, limit, crash
        self.lost = None # name of hit if drift was detected
        self.limit = limit # maximum number of HSPs in an iteration before aborting
        ### in header/beginning of blast output
        self.qname       = None # Query name
        self.qlength     = None # length of Query
        self.type        = None # blast family (BLASTP or TBLASTN or BLASTN)
        self.version     = None # Blast version
        self.db          = None # database used (full path)
        self.dbsequences = None # number of sequences in db
        self.dbletters   = None # number of letters in sequences
        ### footer/summary of blast output
        self.date               = None # Posted date of blast run
        self.Lambda             = None # Lambda constant in statistics
        self.K                  = None # kappa in statistics
        self.H                  = None # entropy in statistics
        self.matrix             = None # initial matrix
        self.gap                = None # Gap Penalties: Existence
        self.gapext             = None # Gap Penalties: Extension
        self.n_hits             = None # Number of Hits to DB
        self.n_seq              = None # Number of Sequences:
        self.n_ext              = None # Number of extensions
        self.n_good_ext         = None # Number of successful extensions
        self.n_seq_better_eval  = None # Number of sequences better than evalue
        self.n_hsp_better_eval  = None # Number of HSP's < evalue without gapping
        self.n_hsp_gap_pre      = None # Number of HSP's gapped in prelim test
        self.n_hsp_att_gap_pre  = None # ...  attempted gapping in prelim test
        self.n_hsp_gapped       = None #  Number of HSP's gapped (non-prelim)
        self.eff_qlength        = None # effective length of query
        self.eff_hsp_length     = None # effective HSP length
        self.eff_db_length      = None # effective length of database
        self.eff_search_space   = None # effective search space:
        self.eff_space_used     = None # effective search space used
        ### other statistics of blast footer
        self.T  = None 
        self.A  = None 
        self.X1 = (None, None) 
        self.X2 = (None, None)
        self.X3 = (None, None)
        self.S1 = (None, None)
        self.S2 = (None, None)

    def createBlastTree(self):
        "Creates the blast data structure (tree)."
        self.time_start = time.clock()
        self.parse()

    def GetVersion(self, token):
        "BLAST version (e.g. 2.0.10)"
        if self.version:
            self.exitParser(token)
            return
        self.type = token.m.group(1)
        self.version = token.m.group(2)
        self.parse()

    def GetQueryName(self, token):
        "Internal, gets the query's name."
        self.qname = token.m.group(1)
        self.parse()

    def GetQueryLength(self, token):
        "get length of query"
        self.qlength = string.atoi(token.m.group(1))
        self.parse()

    def GetDBname(self, token):
        "Name and location of database"
        self.db = token.m.group(1)
        self.parse()

    def GetDBsize(self, token):
        "size of database in sequences and letters (residues)"
        self.dbsequences = string.replace(token.m.group(1), ',', '')
        self.dbletters = string.replace(token.m.group(2), ',', '')
        self.dbsequences = string.atoi(self.dbsequences)
        self.dbletters = string.atol(self.dbletters)
        self.parse()

    def GetDate(self, token):
        "Date of Blast run"
        self.date = token.m.group(1)
        self.parse()

    def GetMatrix(self, token):
        "Matrix for % id calculation in alignments"
        self.matrix = token.m.group(1)
        self.parse()

    def GetLKH(self, token):
        """
        Gets Lambda, K (kappa) and H (entropy) -
        only the 1st 'Lambda' line is parsed!
        """
        if self.Lambda == None:
            l = self.nextContensLine()
            l, k, h =  string.split(string.strip(l))
            self.Lambda = string.atof(l)
            self.K = string.atof(k)
            self.H = string.atof(h)
        self.parse()

    def GetGapPenals(self, token):
        "Existence and extension penalties for gaps"
        self.gap = string.atoi(token.m.group(1))
        self.gapext = string.atoi(token.m.group(2))
        self.parse()

    def GetNumber(self, token):
        """
        Gets all 'Number of ...' lines in statistics of blast footer.
        """
        words = string.split(token.m.group(1))
        word = string.replace(words[len(words)-1], ',', '')
        value = string.atoi(word)
        if   words[0] == 'Hits'         : self.n_hits            = value  
        elif words[0] == 'Sequences:'   : self.n_seq             = value
        elif words[0] == 'extensions:'  : self.n_ext             = value
        elif words[0] == 'successful'   : self.n_good_ext        = value
        elif words[0] == 'sequences'    : self.n_seq_better_eval = value
        elif words[1] == 'better'       : self.n_hsp_better_eval = value
        elif words[1] == 'successfully' : self.n_hsp_gap_pre     = value
        elif words[2] == 'attempted'    : self.n_hsp_att_gap_pre = value
        elif words[2] == '(non-prelim):': self.n_hsp_gapped      = value
        self.parse()

    def GetEffective(self, token):
        """
        Gets all 'effective ...' lines in statistics of blast footer.
        """
        words = string.split(token.m.group(1))
        value = words[len(words)-1]
        if   words[0] == 'HSP'      :
            self.eff_hsp_length   = string.atoi(value)
        elif words[2] == 'query:'   :
            self.eff_qlength      = string.atoi(value)
        elif words[2] == 'database:':
            value                 = string.replace(value, ',', '')
            self.eff_db_length    = string.atoi(value)
        elif words[1] == 'space:'   :
            self.eff_search_space = string.atol(value)
        elif words[2] == 'used:'     :
            self.eff_space_used   = string.atol(value)
        self.parse()

    def GetAT(self, token):
        "Gets the 'T:' and 'A:' values in statistics of blast footer."
        x = token.m.group(1)
        if x == 'T':
            self.T = string.atoi((token.m.group(2)))
        elif x == 'A':
            self.A = string.atoi((token.m.group(2)))
        self.parse()

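    def GetBits(self, token):
        "Gets the 'X1:', 'X2:', 'X3:', 'S1:' and 'S2:' lines as (value, bits) tuples."
        val = string.atoi(token.m.group(3))
        bits = string.atof(token.m.group(4))
        ### setattr avoids building and exec'ing source code from file contents
        setattr(self, token.m.group(1) + token.m.group(2), (val, bits))
        self.parse()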

    def createHitList(self):
        "Returns a list of all hits in blast object."
        hitlist = []
        for i in self.iter:
            for hit in i.hits.values():
                hitlist.append(hit)
        return hitlist

    def createHSPlist(self):
        """
        Returns a list of all HSPs in blast object.
        
        Each HSP gets a reference (hit) to its corresponding hit object
        to give access to name, db, etc attributes of the hit.
        Note, this modifies the blast datastructure!
        """
        hsplist = []
        for i in self.iter:
            for hit in i.hits.values():
                for hsp in hit.hsp:
                    hsp.hit = hit
                    hsplist.append(hsp)
        return hsplist

    def checklimit(self, this_iter):
        "Returns the number of HSPs in this_iter if it exceeds self.limit, else 0."
        if this_iter:
            ### count HSPs, not hits: each hit may hold several HSPs
            n = 0
            for hit in this_iter.hits.values():
                n = n + len(hit.hsp)
            if n > self.limit:
                sys.stderr.write('iteration %d has %d HSPs, limit is %d\n' % (this_iter.it, n, self.limit))
                return n
        return 0

    def drift(self, this_iter):
        """
        Checks for drift (PSI-BLAST feature only). Detects when hits with evalue <=
        self.driftcutoff of the first iteration are not found in later iterations.
        Returns the key of the first lost hit, or '' if no drift was detected.
        """
        if not self.iter:
            sys.stderr.write('initializing driftfilter with evalue %g\n'%self.driftcutoff)
            self.first_iter = []
            for hit_key in this_iter.hits.keys():
                if this_iter.hits[hit_key].e <= self.driftcutoff:
                    self.first_iter.append(hit_key)
            return ''
        elif hasattr(self, 'first_iter'):
            for hit_key in self.first_iter:
                if not this_iter.hits.has_key(hit_key):
                    return hit_key
        return ''
    
    def HitCmp(self, a, b):
        "comparison function used by sort for Hit dictionaries"
        ### compare by e-value first
        if a.e < b.e: return -1
        if a.e > b.e: return  1
        ### then compare by score
        if a.score < b.score: return  1
        if a.score > b.score: return -1
        ### if an HSP is available compare by identity
        if a.hsp and b.hsp:
            if a.hsp[0].identity < b.hsp[0].identity: return  1
            if a.hsp[0].identity > b.hsp[0].identity: return -1
        ### same rank
        return 0 
    
    def parseIteration(self, token):
        "Internal function, don't use it directly!"
        # sys.stderr.write('parsing Iteration ...\n')
        self.status = 'blast'
        it = self.Iteration(self)
        it.parseBlock(token)
        if self.limit:
            if self.checklimit(it):
                self.status = 'limit'
                ### sort list of hits and truncate to limit
                # sys.stderr.write("proceeding with status set to 'limit' ...\n")
                hits = it.hits.values()
                hits.sort(self.HitCmp)
                it.hits = {}
                m = 0
                for hit in  hits:
                    m = m + len(hit.hsp)
                    if m >= self.limit: break # the list of HSPs within a hit will not be truncated
                    it.hits[hit.id] = hit
                self.iter.append(it)
                self.readToToken(self.tokens[6]) # skip further iterations
                # return
        hit_key = ''
        if self.drift:
            hit_key = self.drift(it)
            if hit_key:
                sys.stderr.write('%s: drift detected in iteration %d (%s)\n' % (
                    self.qname, it.it, hit_key))
                self.status = 'drifted'
                self.lost = hit_key
                # self.readToToken(self.tokens[6]) # skip further iterations
                # return
                self.readToToken(self.tokens[6])
        if not hit_key and it.hits and self.status not in ('drifted', 'limit'):
            self.iter.append(it)
        # sys.stderr.write("hits in iteration: %d\n" % len(it.hits))
        self.parse()

    def exitParser(self, token):
        "Exits parsing state and finalizes blast file processing."
        #sys.stderr.write('parsing hsp exitParser ...\n')
        if not (self.status == 'limit' or self.status == 'drifted') and not self.S2[0]:
            ### 'S2' is the last line in a blast output, if it's not
            ### there something's wrong (or the run reached limit or drifted)!
            self.status = 'crash'
        if  self.__dict__.has_key('first_iter'):
            delattr(self, 'first_iter')
        Parser.exitParser(self, token)
        # sys.stderr.write('cpu time: %g sec\n'%(time.clock() - self.time_start))
        
###############################################################################
