Bacterial Operon and Gene Prediction.

FgenesB - Suite of Bacterial Operon and Gene Finding Programs

FgenesB is the most accurate ab initio prokaryotic gene prediction engine in our tests (see Table 1 at the bottom for a comparison with two other popular gene prediction programs). The FgenesB gene prediction algorithm is based on Markov chain models of coding regions and of translation and termination sites. The program uses genome-specific parameters learned by the FGENESB-train script, which requires only the DNA sequence of the genome of interest as input. (If you need parameters for a new bacterium, please contact Softberry.) FgenesB also includes a simplified prediction of operons based only on the distances between predicted genes.
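To illustrate what a purely distance-based operon prediction can look like, here is a minimal Python sketch. It is not the FgenesB implementation: the `Gene` class, the `group_into_operons` function, the same-strand requirement and the `max_gap` threshold of 100 bp are all assumptions made for this example, and the coordinates are invented.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Gene:
    start: int   # 1-based start coordinate
    end: int     # 1-based end coordinate (inclusive)
    strand: str  # "+" or "-"


def group_into_operons(genes: List[Gene], max_gap: int = 100) -> List[List[Gene]]:
    """Group neighbouring genes into putative operons.

    Two adjacent genes are placed in the same operon when they lie on the
    same strand and the intergenic gap is at most max_gap bases.
    Both criteria are illustrative assumptions, not FgenesB parameters.
    """
    operons: List[List[Gene]] = []
    for gene in sorted(genes, key=lambda g: g.start):
        if operons:
            prev = operons[-1][-1]
            gap = gene.start - prev.end - 1  # negative if the genes overlap
            if gene.strand == prev.strand and gap <= max_gap:
                operons[-1].append(gene)
                continue
        operons.append([gene])
    return operons


# Invented coordinates, for illustration only.
genes = [
    Gene(100, 900, "+"),
    Gene(950, 1800, "+"),
    Gene(2500, 3300, "+"),
    Gene(3310, 4000, "-"),
]
operons = group_into_operons(genes, max_gap=100)
for number, operon in enumerate(operons, start=1):
    print(number, [(g.start, g.end, g.strand) for g in operon])
```

With these invented genes, the first two genes fall into one putative operon, while the third (large gap) and the fourth (opposite strand) each start a new one.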

For community sequence annotation, the ABsplit program (www.softberry.com/berry.phtml?topic=absplit&group=programs&subgroup=gfindb) can be used to separate archaebacterial and eubacterial sequences.

FgenesB was used in the first published bacterial community annotation project: see Tyson et al. (2004) Nature 428(6978), 37-43.

Example of FgenesB output:


     1     1 Op  1  21/0.000   +    CDS        407 -      1747   1311 
     2     1 Op  2   3/0.019   +    CDS       1926 -      3065   1237 
     3     2 Op  1   4/0.002   +    CDS       3193 -      3405    278
     4     2 Op  2   4/0.002   +    CDS       3418 -      4545    899
     5     2 Op  3  16/0.000   +    CDS       4578 -      6506   2148
     6     2 Op  4     .       +    CDS       6595 -      9066   2957
     7     3 Op  1     .       -    CDS      14175 -     14363    158 
     8     3 Op  2     .       -    CDS      14353 -     15249    351
     9     3 Op  3     .       -    CDS      15170 -     15352     99 
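The output lines above are whitespace-separated, so they can be read with a few lines of Python. The sketch below is only an illustration: the field names in `Prediction` are our guesses at the column meanings (gene number, operon number, position in operon, an unlabeled field kept as raw text, strand, feature type, start, end, score), because the columns are not labeled in the example, and `parse_line` is a hypothetical helper, not part of FgenesB.

```python
from typing import List, NamedTuple


class Prediction(NamedTuple):
    gene: int       # running gene number (assumed meaning)
    operon: int     # operon number (assumed meaning)
    position: int   # position of the gene within its operon (assumed meaning)
    extra: str      # unlabeled field such as "21/0.000" or ".", kept as text
    strand: str     # "+" or "-"
    feature: str    # feature type, "CDS" in the example
    start: int
    end: int
    score: float    # last column (assumed to be a score)


def parse_line(line: str) -> Prediction:
    """Parse one line of the example output into a Prediction."""
    f = line.split()
    if len(f) != 11 or f[2] != "Op" or f[8] != "-":
        raise ValueError(f"unexpected line format: {line!r}")
    return Prediction(int(f[0]), int(f[1]), int(f[3]), f[4], f[5], f[6],
                      int(f[7]), int(f[9]), float(f[10]))


# Two lines copied from the example output above.
SAMPLE = """\
     1     1 Op  1  21/0.000   +    CDS        407 -      1747   1311
     6     2 Op  4     .       +    CDS       6595 -      9066   2957
"""

predictions: List[Prediction] = [
    parse_line(line) for line in SAMPLE.splitlines() if line.strip()
]
for p in predictions:
    print(p.gene, p.operon, p.strand, p.start, p.end, p.end - p.start + 1)
```

The last printed value is the length of the predicted CDS in bases, computed from the start and end coordinates.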


Table 1. Accuracy of prediction estimated on the B. subtilis sequence. Frequency of genes starting from a start codon other than the first: 19.1%.

Borodovsky et al. (see the GeneMark web pages (opal.biology.gatech.edu/GeneMark/genemarks.cgi)) calculated accuracy for all genes and constructed three sets of difficult short genes (L ≤ 300 bp) that have protein similarity support. These genes were used to demonstrate that short genes can also be predicted reasonably well. The first set (51set) has 51 genes with at least 10 strong similarities to known proteins; 72set has 72 genes with at least two strong similarities; and 123set has 123 genes with at least one protein homolog.

Here are the prediction results on these three sets for GeneMarkS and Glimmer (calculated in Nucleic Acids Research, 2001, Vol. 29, No. 12, 2607-2618) and for FgenesB (calculated by Softberry, using three iterations of the FgenesB-train script):


                 Sn, % (exact      Sn, % (exact + overlapping
                 predictions)      predictions)

 123set:
 Glimmer         57.0              91.1
 GeneMarkS       82.9              91.9
 FgenesB         89.3              98.4

 72set:
 Glimmer         57.0              91.7
 GeneMarkS       88.9              94.4
 FgenesB         91.5              98.6

 51set:
 Glimmer         51.0              88.2
 GeneMarkS       90.2              94.1
 FgenesB         92.0              98.0

 All genes of the B. subtilis genome (GenBank annotation):
 Glimmer         62.4              98.1
 GeneMarkS       83.2              96.7
 FgenesB         83.8              98.7

Please note that many genes in GenBank were annotated using the GeneMark program, which likely leads to an overestimation of its accuracy.
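For readers who want to compute comparable figures on their own data, the following Python sketch shows one way to calculate the two sensitivity (Sn) values. It rests on assumptions that the table itself does not spell out: an "exact" prediction is taken to match the annotated start, end and strand, and an "overlapping" prediction is taken to be any prediction on the same strand that shares at least one base with the annotated gene. The `sensitivity` function and the coordinates are hypothetical and are not the procedure used to produce Table 1.

```python
from typing import List, Tuple

Gene = Tuple[int, int, str]  # (start, end, strand), coordinates inclusive


def sensitivity(annotated: List[Gene], predicted: List[Gene]) -> Tuple[float, float]:
    """Return (Sn exact, Sn exact + overlapping) in percent.

    Definitions of "exact" and "overlapping" are assumptions made for
    this sketch; see the accompanying text.
    """
    if not annotated:
        raise ValueError("annotated gene list is empty")
    predicted_set = set(predicted)
    exact = 0
    exact_or_overlap = 0
    for start, end, strand in annotated:
        if (start, end, strand) in predicted_set:
            exact += 1
            exact_or_overlap += 1
        elif any(s == strand and ps <= end and pe >= start
                 for ps, pe, s in predicted):
            exact_or_overlap += 1
    n = len(annotated)
    return 100.0 * exact / n, 100.0 * exact_or_overlap / n


# Invented coordinates, for illustration only.
annotated = [(100, 400, "+"), (500, 800, "+"), (900, 1200, "-"), (1300, 1500, "+")]
predicted = [(100, 400, "+"), (530, 800, "+"), (900, 1200, "+")]

sn_exact, sn_overlap = sensitivity(annotated, predicted)
print(f"Sn exact: {sn_exact:.1f}%  Sn exact+overlapping: {sn_overlap:.1f}%")
```

In this invented example one of four annotated genes is predicted exactly and one more is overlapped by a prediction with a different start; the prediction on the wrong strand and the missed gene do not count.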