
-----------------------------------------------
Capturing expert knowledge with argumentation:
a case study in bioinformatics
-----------------------------------------------

Benjamin R. Jefferys(1), Lawrence A. Kelley(1), Marek J. Sergot(1),
John Fox(2), Michael J. E. Sternberg(1)

(1) - Imperial College, London
(2) - Cancer Research UK, London

-----------------------------------------------



This is the output of 3DPSSM for the 123 searches performed to validate the
argumentation system. Each query protein sequence was taken from the database
itself, so the first match is always, trivially, the query protein itself.
Any of the remaining 19 matches with the same SCOP superfamily as the rank 1
match is therefore a "correct prediction": a homologue has been found which
could be a good model for the protein.

Each filename is a 16-digit hexadecimal number with ".xml" on the end.

The file contents are formatted as XML, written by the Perl XML::Simple
module from a Perl data structure constructed by parsing the various files
which make up the output of 3DPSSM. The XML format is therefore quite naive
and inefficient, but easily parsed. If using Perl, XML::Simple can read the
XML back in and reconstruct the original data structure precisely.

The format should be obvious from looking at a file, but here is some
explanation. Each section below has a tag name as its heading, followed by
the attributes the tag may have, then the other tags it may contain. The
top-level tag is <opt>. The textual order of most tags is significant, and
should therefore be preserved on parsing.
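
For readers not using Perl, the following is a minimal sketch of reading one
file with Python's standard xml.etree.ElementTree module, which keeps child
elements in document order. The sample data is hypothetical and heavily
shortened (a real file holds 20 <matches> elements), and the "-" value for
tm is only a placeholder for "not transmembrane", not a documented value.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened sample in the layout described in this README.
SAMPLE = """<opt numHomologues="12">
  <matches name="d1a77_1" rank="1" qlen="2" tlen="2">
    <template aa="M" core="0.8" simple="0" struct="H" tm="M" />
    <template aa="K" core="0.1" simple="0" struct="C" tm="-" />
    <queryToTemplate>0</queryToTemplate>
    <queryToTemplate>1</queryToTemplate>
    <templateToQuery>0</templateToQuery>
    <templateToQuery>1</templateToQuery>
  </matches>
  <query aa="M" simple="0" struct="H" tm="M" />
  <query aa="K" simple="0" struct="C" tm="-" />
</opt>"""

root = ET.fromstring(SAMPLE)

# Attribute values arrive as strings and must be converted explicitly.
num_homologues = int(root.get("numHomologues"))

# Iterating over an element yields its children in document order,
# so the order of residues is preserved.
child_tags = [child.tag for child in root]
print(num_homologues, child_tags)
```

For a file on disk, ET.parse(path).getroot() returns the same kind of root
element as ET.fromstring does here.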

For more information see:

http://www.sbg.bio.ic.ac.uk/~brj03/argumentation/paper/

For further help email:

benjamin dot jefferys at imperial dot ac dot uk

-----------------------------------------------
<opt>
This is the single top-level tag

ATTRIBUTES
numHomologues:
  (unsigned integer) number of homologues which the query sequence PSSM is
  made from

CONTAINS
20 * <matches>
Multiple <query>

-----------------------------------------------
<matches>
Each one of these represents a SINGLE match of a template against the query.
 
ATTRIBUTES
name:
  (string) name of the template; usually looks like a SCOP identifier, e.g.
  d1a77_1

rank:
  (unsigned integer) rank of the template match in the 3DPSSM result table. 
  Where rank=1, the match is exactly the sequence that was used as a query.

eval3D:
  (float) 3DPSSM E-value

confidence3d:
  (float) 3DPSSM percentage confidence that the match is correct, derived 
  from the E-value.

numHomologues:
  (unsigned integer) number of homologues which the template PSSM is made 
  from

qlen:
  (unsigned integer) number of amino acids in query sequence

tlen:
  (unsigned integer) number of amino acids in template (matched) sequence

segOutput:
  (string) output from SEG low-complexity masking tool, for debugging 
  purposes

rZs, corescore, csc, normsc, lognormsc:
  ignore

method:
  ignore - indication of algorithm used to find match

CONTAINS
tlen * <template> 
qlen * <queryToTemplate>
tlen * <templateToQuery>
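
As an illustration of the attributes above, here is a sketch that collects
the main fields of each <matches> element and separates the trivial rank 1
self-match from the remaining candidates. The function name match_summaries,
the second template name and all numeric values are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; child elements are omitted for brevity.
SAMPLE = """<opt numHomologues="12">
  <matches name="d1a77_1" rank="1" eval3D="1e-10" confidence3d="95"
           qlen="2" tlen="2" />
  <matches name="d1b43a1" rank="2" eval3D="0.05" confidence3d="90"
           qlen="2" tlen="3" />
</opt>"""

def match_summaries(root):
    """Return one dict per <matches> element, in document order."""
    summaries = []
    for m in root.findall("matches"):
        summaries.append({
            "name": m.get("name"),
            "rank": int(m.get("rank")),
            "eval3D": float(m.get("eval3D")),
            "confidence3d": float(m.get("confidence3d")),
            "qlen": int(m.get("qlen")),
            "tlen": int(m.get("tlen")),
        })
    return summaries

summaries = match_summaries(ET.fromstring(SAMPLE))

# Rank 1 is the query matched against itself, so skip it.
candidates = [s for s in summaries if s["rank"] > 1]
print([c["name"] for c in candidates])
```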

-----------------------------------------------
<template>
There is one of these for each amino acid in the template. These appear in
<matches> in the same order as in the template protein sequence.

ATTRIBUTES
aa:
  (string) the single-letter symbol for the amino acid

core:
  (float) buriedness of this amino acid in the protein, from 0 (surface) to 
  1 (core)

simple:
  0 if amino acid is part of a statistically complex region
  1 if amino acid is part of a statistically simple region
  ... as determined by seg

struct:
  Secondary structure this amino acid is part of:
  C if coil
  H if alpha helix
  E if beta strand

tm:
  M if amino acid is part of a transmembrane region
  ... otherwise it is not

CONTAINS
Nothing
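
Because the <template> elements appear in sequence order, the template
sequence and its secondary structure string can be rebuilt by concatenating
attributes. The sketch below shows one way to do this. The sample values are
invented, "-" is a placeholder for a non-transmembrane residue, and the 0.5
cutoff used to call a residue buried is an arbitrary choice for the example,
not a threshold used by 3DPSSM.

```python
import xml.etree.ElementTree as ET

# Hypothetical single match with four template residues.
SAMPLE = """<matches name="d1a77_1" rank="1" tlen="4">
  <template aa="M" core="0.82" simple="0" struct="C" tm="-" />
  <template aa="L" core="0.91" simple="0" struct="H" tm="M" />
  <template aa="V" core="0.40" simple="0" struct="H" tm="M" />
  <template aa="G" core="0.05" simple="1" struct="E" tm="-" />
</matches>"""

def template_strings(match, core_cutoff=0.5):
    """Rebuild sequence, structure string and buried positions."""
    residues = match.findall("template")
    sequence = "".join(r.get("aa") for r in residues)
    structure = "".join(r.get("struct") for r in residues)
    # Positions are counted from 0 within the template sequence.
    buried = [i for i, r in enumerate(residues)
              if float(r.get("core")) >= core_cutoff]
    return sequence, structure, buried

sequence, structure, buried = template_strings(ET.fromstring(SAMPLE))
print(sequence, structure, buried)
```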

-----------------------------------------------
<queryToTemplate>
Inverse of <templateToQuery>. There is one of these for each amino acid in the
query. These appear in <matches> in the same order as the amino acids in the
query protein sequence.

ATTRIBUTES
None

CONTAINS
(signed integer) index in the template sequence of the amino acid that this
amino acid in the query sequence is aligned with. -1 if this amino acid is not
aligned to anything in the template sequence (gap in template sequence).

-----------------------------------------------
<templateToQuery>
Inverse of <queryToTemplate>. There is one of these for each amino acid in the
template. These appear in <matches> in the same order as the amino acids in
the template protein sequence.

ATTRIBUTES
None

CONTAINS
(signed integer) index in the query sequence of the amino acid that this amino
acid in the template sequence is aligned with. -1 if this amino acid is not
aligned to anything in the query sequence (gap in query sequence).
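
The two mappings can be read into plain lists and checked against each other.
The sketch below assumes that indices count from 0; this README does not
state the index base, so verify it against a real file before relying on it.
The function names and sample data are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical match: query of length 3, template of length 4.
SAMPLE = """<matches name="d1a77_1" rank="2" qlen="3" tlen="4">
  <queryToTemplate>0</queryToTemplate>
  <queryToTemplate>-1</queryToTemplate>
  <queryToTemplate>2</queryToTemplate>
  <templateToQuery>0</templateToQuery>
  <templateToQuery>-1</templateToQuery>
  <templateToQuery>2</templateToQuery>
  <templateToQuery>-1</templateToQuery>
</matches>"""

def read_mapping(match, tag):
    """Read the integer content of every <tag> child, in document order."""
    return [int(el.text) for el in match.findall(tag)]

def aligned_pairs(q2t):
    """List (query index, template index) pairs, skipping gaps (-1)."""
    return [(qi, ti) for qi, ti in enumerate(q2t) if ti != -1]

def is_inverse(q2t, t2q):
    """Check that the mappings agree, ASSUMING 0-based indices."""
    forward = all(t2q[ti] == qi for qi, ti in enumerate(q2t) if ti != -1)
    backward = all(q2t[qi] == ti for ti, qi in enumerate(t2q) if qi != -1)
    return forward and backward

match = ET.fromstring(SAMPLE)
q2t = read_mapping(match, "queryToTemplate")
t2q = read_mapping(match, "templateToQuery")
pairs = aligned_pairs(q2t)
consistent = is_inverse(q2t, t2q)
print(pairs, consistent)
```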

-----------------------------------------------
<query>
There is one of these for each amino acid in the query. These appear in <opt>
in the same order as in the query protein sequence.

ATTRIBUTES
aa:
  (string) the single-letter symbol for the amino acid

simple:
  0 if amino acid is part of a statistically complex region
  1 if amino acid is part of a statistically simple region
  ... as determined by seg

struct:
  Secondary structure this amino acid is predicted to be part of:
  C = coil
  H = alpha helix
  E = beta strand

tm:
  M if amino acid is part of a transmembrane region
  ... otherwise it is not
  ... as predicted by TMHMM

something:
  ignore
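
As a final illustration, the <query> elements can be used to rebuild the
query sequence and a low-complexity-masked copy of it. This is only a
sketch: the sample data is invented, "-" is a placeholder for a
non-transmembrane residue, and replacing masked residues with "x" is a
display choice made for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with four query residues and no <matches>.
SAMPLE = """<opt numHomologues="12">
  <query aa="M" simple="0" struct="C" tm="-" />
  <query aa="S" simple="1" struct="C" tm="-" />
  <query aa="S" simple="1" struct="C" tm="-" />
  <query aa="K" simple="0" struct="H" tm="-" />
</opt>"""

def query_sequence(root):
    """Concatenate the aa attribute of each <query>, in document order."""
    return "".join(r.get("aa") for r in root.findall("query"))

def masked_query(root, mask_char="x"):
    """Replace residues flagged simple="1" with mask_char."""
    return "".join(mask_char if r.get("simple") == "1" else r.get("aa")
                   for r in root.findall("query"))

root = ET.fromstring(SAMPLE)
sequence = query_sequence(root)
masked = masked_query(root)
print(sequence, masked)
```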






