New Fast Tool for Aligning Proteins with Genome and Accurately Reconstructing Exon-intron Gene Structure

The ProtMap program maps a set of protein sequences to a genomic sequence, producing gene structures and the corresponding alignments of coding exons with similar or identical protein queries. ProtMap takes a genomic sequence and a set of protein sequences as input and reconstructs gene structure based on protein identity or homology, in contrast to the set of unordered alignment fragments generated by Blast. The program is very fast, and it produces gene structures similar to those of the GeneWise program, which was roughly 90 to 1200 times slower on the test sets in Table 1. Accuracy can be improved substantially by running Fgenesh+ on ProtMap output (see Table 2 for an accuracy comparison).

ProtMap is used as part of Softberry's automatic genome annotation pipeline, Fgenesh++C. We also use it to generate putative gene models for training gene-finding parameters on new genomes for which few or no known genes are available. ProtMap is also very useful for finding pseudogenes, which appear as corrupted gene structures that map to known protein sequences.

Figure 1. Example of mapping a protein sequence to human chromosome 19.


L:3000000    Sequence Chr19 [cut:1 3000000]
[DD] Sequence:       1(      1), S:      105.56, L:1739
IPI:IPI00170643.1|SWISS-PROT:Q8TEK3-1 Tax_Id=9606 Splice isoform 2 of Q8TEK3
Summ of block lengths: 1284, Alignment bounds:
On first  sequence: start   2146727, end   2167197, length 20471
On second sequence: start       263, end      1682, length 1420
Blocks of alignment: 21       
    1 E: 2146727      70 [ca GT] P: 2146727     263 L: 23, G: 101.574  S:14.75
    2 E: 2147573     107 [AG GT] P: 2147575     287 L: 35, G: 103.465, S:18.56
    3 E: 2148934      42 [AG GT] P: 2148934     322 L: 14, G: 103.043, S:11.68
    4 E: 2150399     111 [AG GT] P: 2150399     336 L: 37, G: 102.130, S:18.82
    5 E: 2150620     235 [AG GT] P: 2150620     373 L: 78, G: 101.500, S:27.15
    6 E: 2151098     114 [AG GT] P: 2151100     452 L: 37, G: 106.924, S:19.76
    7 E: 2151750      92 [AG GT] P: 2151752     490 L: 30, G: 101.424, S:16.82
    8 E: 2153538     102 [AG GT] P: 2153538     520 L: 34, G: 100.496, S:17.73
    9 E: 2153848     138 [AG GT] P: 2153848     554 L: 46, G:  99.003, S:20.30
   10 E: 2154470     126 [AG GT] P: 2154470     600 L: 42, G: 101.283, S:19.87

          1        11   2146713   2146723   2146739   2146769
          gatcacagaggctgg(..)agtgtctgtgtttca?[GGRIVSSKPFAPLNFRINSRNLSg
          ---------------(..)evdhqlkerfanmke  GGRIVSSKPFAPLNFRINSRNLS-
        248       248       249       259       267       277

    2146797   2146806   2147558   2147568   2147581   2147611
          ]gtaagaaactctcat(..)ctgtggctcctgcag[acIGTIMRVVELSPLKGSVSWTGK
           ---------------(..)--------------- -dIGTIMRVVELSPLKGSVSWTGK
        286       286       286       286       289       299

    2147641   2147671   2147686   2148919   2148926   2148937
          PVSYYLHTIDRTI]gtgagtatctcgctg(..)ctttcttctttttag[LENYFSSLKNP
          PVSYYLHTIDRTI ---------------(..)--------------- LENYFSSLKNP
        309       319       322       322       322       323

    2148967   2148982   2150384   2150391   2150402   2150432
          KLR]gtaagtttgtgtgtt(..)ctgctctccttccag[EEQEAARRRQQRESKSNAATP
          KLR ---------------(..)--------------- EEQEAARRRQQRESKSNAATP
        333       336       336       336       337       347

    2150462   2150492   2150513   2150523   2150609   2150619
          TKGPEGKVAGPADAPM]gtaaggccccagcct(..)ccttgtgtcctccag[DSGAEEEK
          TKGPEGKVAGPADAPM ---------------(..)--------------- DSGAEEEK
        357       367       373       373       373       373
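
The block lines in Figure 1 can be read into a small data structure for downstream use. The following is a minimal sketch in Python, not part of ProtMap itself. The regular expression and the meaning assigned to each field (exon start and length after "E:", genomic and protein positions after "P:", amino acid length after "L:") are assumptions inferred from the example output above; the names `Block` and `parse_blocks` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List

# Pattern inferred from the block lines in Figure 1; not an official format spec.
# The comma after the G: value is optional because block 1 in the figure lacks it.
BLOCK_RE = re.compile(
    r"^\s*(\d+)\s+E:\s*(\d+)\s+(\d+)\s+\[(\S+)\s+(\S+)\]\s+"
    r"P:\s*(\d+)\s+(\d+)\s+L:\s*(\d+),\s*G:\s*([\d.]+),?\s*S:\s*([\d.]+)\s*$"
)


@dataclass
class Block:
    index: int        # block number
    exon_start: int   # assumed: genomic start of the exon (E:, first number)
    exon_length: int  # assumed: exon length in nucleotides (E:, second number)
    left_site: str    # dinucleotide shown before the exon, e.g. "AG"
    right_site: str   # dinucleotide shown after the exon, e.g. "GT"
    align_start: int  # assumed: genomic position of the first aligned codon (P:)
    protein_pos: int  # assumed: position in the protein query (P:, second number)
    aa_length: int    # assumed: aligned length in amino acids (L:)
    g_value: float    # value reported after G:, meaning not stated in the text
    score: float      # value reported after S:

    @property
    def exon_end(self) -> int:
        # Last genomic position of the exon, derived from start and length.
        return self.exon_start + self.exon_length - 1


def parse_blocks(text: str) -> List[Block]:
    """Return one Block per line that matches the block-line pattern."""
    blocks = []
    for line in text.splitlines():
        m = BLOCK_RE.match(line)
        if m is None:
            continue  # header lines, alignment rows, blank lines
        n, es, el, ls, rs, ap, pp, al, g, s = m.groups()
        blocks.append(Block(int(n), int(es), int(el), ls, rs,
                            int(ap), int(pp), int(al), float(g), float(s)))
    return blocks


# Two block lines copied from Figure 1.
SAMPLE = """\
    1 E: 2146727      70 [ca GT] P: 2146727     263 L: 23, G: 101.574  S:14.75
    2 E: 2147573     107 [AG GT] P: 2147575     287 L: 35, G: 103.465, S:18.56
"""

for b in parse_blocks(SAMPLE):
    print(b.index, b.exon_start, b.exon_end, b.left_site, b.right_site, b.aa_length)
```

Under these assumptions, the derived end of block 1 (2146796) is the position just before the intron that the alignment in Figure 1 labels as starting at 2146797, which supports reading the second number after "E:" as the exon length in nucleotides.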

Table 1. Speed of processing sequences by Prot_Map, Fgenesh+ and GeneWise.

                                    Fgenesh+   Prot_map   GeneWise
88 sequences of genes < 20 kb       ~1 min     ~1 min     ~90 min
8 sequences of genes > 400000 kb    ~1 min     ~1 min     ~1200 min

Table 2. Accuracy of gene identification programs on a set of human genes, using mouse or Drosophila homologous proteins: ab initio prediction (Fgenesh) and prediction with protein support (Fgenesh+, GeneWise and Prot_Map). Sn ex, sensitivity at the exon level (exact exon predictions); Sno ex, sensitivity with exon overlap; Sp ex, specificity at the exon level; Sn nuc, sensitivity at the nucleotide level; Sp nuc, specificity at the nucleotide level; CC, correlation coefficient; %CG, percentage of genes predicted completely correctly (no missing or extra exons, and all exon boundaries predicted exactly).

Mouse homologs: 60% < similarity level < 80% - 1425 sequences

            Sn ex   Sno ex   Sp ex   Sn nuc   Sp nuc   CC       %CG
Fgenesh     83.4    90.9     86.8    93.2     94.9     0.937    30
GeneWise    88.1    96.5     90.5    97.8     99.2     0.984    43
Fgenesh+    93.9    97.9     94.9    98.4     99.3     0.988    65
Prot_map    87.0    96.5     86.6    97.0     98.5     0.976    40

Drosophila homologs: similarity level > 80% - 66 sequences.

            Sn ex   Sno ex   Sp ex   Sn nuc   Sp nuc   CC       %CG
Fgenesh     90.5    93.8     95.1    97.9     96.9     0.950    55
GeneWise    79.3    83.9     86.8    97.3     99.5     0.985    23
Fgenesh+    95.1    97.8     97.0    98.9     99.5     0.9914   70
Prot_map    86.4    95.3     88.1    97.6     99.0     0.982    41
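
The measures in Table 2 can be illustrated with a short calculation. The sketch below assumes the definitions commonly used when evaluating gene finders, which the text above names but does not spell out: sensitivity is the fraction of annotated items recovered, specificity is the fraction of predicted items that are correct, and CC is the nucleotide-level correlation between predicted and annotated coding positions. The function names and the toy exon coordinates are hypothetical and are not taken from the test sets behind Table 2.

```python
from math import sqrt


def covered(exons):
    """Set of positions covered by inclusive (start, end) exon intervals."""
    return {pos for start, end in exons for pos in range(start, end + 1)}


def exon_metrics(annotated, predicted):
    """Exon-level Sn, Sn with overlap, and Sp as fractions in [0, 1]."""
    ann, pred = set(annotated), set(predicted)
    exact = len(ann & pred)  # exons with both boundaries predicted exactly
    overlapped = sum(
        1 for a in ann if any(p[0] <= a[1] and a[0] <= p[1] for p in pred)
    )
    return {
        "sn_ex": exact / len(ann),
        "sno_ex": overlapped / len(ann),
        "sp_ex": exact / len(pred),
    }


def nucleotide_metrics(annotated, predicted, seq_length):
    """Nucleotide-level Sn, Sp and correlation coefficient."""
    ann, pred = covered(annotated), covered(predicted)
    tp = len(ann & pred)   # coding, predicted coding
    fp = len(pred - ann)   # non-coding, predicted coding
    fn = len(ann - pred)   # coding, predicted non-coding
    tn = seq_length - tp - fp - fn
    denom = sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    return {
        "sn_nuc": tp / (tp + fn),
        "sp_nuc": tp / (tp + fp),
        "cc": (tp * tn - fn * fp) / denom if denom else 0.0,
    }


def completely_correct(annotated, predicted):
    """True if no exon is missing or extra and all boundaries match."""
    return set(annotated) == set(predicted)


# Toy gene on a 1000-nt sequence: one exact exon, one overlapping, one wrong.
annotated = [(101, 200), (301, 400), (501, 600)]
predicted = [(101, 200), (301, 390), (701, 750)]

print(exon_metrics(annotated, predicted))
print(nucleotide_metrics(annotated, predicted, seq_length=1000))
print(completely_correct(annotated, predicted))
```

The functions return fractions, whereas Table 2 reports the sensitivity and specificity columns as percentages. %CG would be the share of genes in a test set for which `completely_correct` holds.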