Recognition of human Pol II promoter region and start of transcription

Method description:

The algorithm predicts potential transcription start positions using a linear discriminant function (LDF) that combines characteristics describing functional motifs and the oligonucleotide composition of these sites. TSSP uses a file of selected factor binding sites from the RegSite DB (Plants) developed by Softberry Inc.
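
As an illustration of the general idea only, a linear discriminant function is a weighted sum of feature values compared against a threshold. The sketch below is not the TSSP implementation: the function names, the feature values and the weights are hypothetical, and the rule "score at or above the threshold means predicted" is an assumption that is merely consistent with the sample output further down.

```python
from typing import Sequence


def ldf_score(features: Sequence[float], weights: Sequence[float]) -> float:
    """Generic linear discriminant function: a weighted sum of features.

    The features stand in for motif and oligonucleotide-composition
    characteristics; the actual TSSP features and weights are not
    given in this document, so both are placeholders.
    """
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return sum(w * x for w, x in zip(weights, features))


def is_predicted(score: float, threshold: float) -> bool:
    """Assumed decision rule: report a site when score >= threshold."""
    return score >= threshold
```

With the thresholds shown in the sample output (0.02 for TATA+ promoters), a site scoring 0.13 would pass this assumed rule.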

References:
1. Solovyev V.V., Salamov A.A. (1997)
The Gene-Finder computer tools for analysis of human and model organisms genome sequences.
In Proceedings of the Fifth International Conference on Intelligent Systems for Molecular Biology (eds. Rawling C., Clark D., Altman R., Hunter L., Lengauer T., Wodak S.), Halkidiki, Greece, AAAI Press, 294-302.

2. Solovyev V.V. (2001)
Statistical approaches in Eukaryotic gene prediction.
In Handbook of Statistical Genetics (eds. Balding D. et al.), John Wiley & Sons, Ltd., p. 83-127.

3. Solovyev V.V., Shahmuradov I.A. (2003)
PromH: Promoters identification using orthologous genomic sequences.
Nucleic Acids Res. 31(13):3540-3545.

TSSP output:

First line - name of your sequence;
Second and third lines - LDF thresholds and the length of the presented sequence;
Fourth line - the number of predicted promoter regions;
Next lines - positions of the predicted sites, their 'weights' (shown as LDF) and the TATA box position (if found).
The position given is that of the first nucleotide of the transcript (the TSS position).
After that, the functional motifs are listed for each predicted region; (+) or (-) indicates the direct or the complementary strand. A field such as "RSP00004 tagaCACGTaga" gives the identifier of a particular motif together with the similar sequence found, from the Softberry RegSite Plant database.

For example:


tssp  Wed Jul 10 02:52:32 EDT 2002
>gi|1902902|dbj|AB001920.1| Oryza sativa (japonica cultivar-group) gene for phos
 Length of sequence-      5871
 Thresholds for TATA+ promoters -  0.02, for TATA-/enhancers -  0.04
     2 promoter/enhancer(s) are predicted
 Promoter Pos:   1522 LDF-  0.13 TATA box at   1488    18.93
 Enhancer Pos:   1597 LDF-  0.12
 Transcription factor binding sites/RegSite DB:
for promoter at position -    1522
  1468 (-) RSP00004     tagaCACGTaga
  1459 (+) RSP00010     cACGTG
  1456 (+) RSP00011     ctccACGTGgt
  1461 (+) RSP00016     caTGCAC
  1468 (-) RSP00016     caTGCAC
  1256 (-) RSP00026     gcttttgaTGACtTcaaacac
  1460 (+) RSP00065     ACGTGgcgc
  1460 (+) RSP00066     ACGTGccgc
  1459 (+) RSP00069     tACGTG
  1341 (+) RSP00071     GACGTC
  1346 (-) RSP00071     GACGTC
  1452 (-) RSP00096     GGTTT
  1432 (+) RSP00129     CACGAC
  1281 (+) RSP00148     CGACG
  1284 (+) RSP00148     CGACG
  1315 (+) RSP00148     CGACG
  1335 (+) RSP00148     CGACG
  1340 (+) RSP00148     CGACG
  1365 (+) RSP00148     CGACG
  1434 (+) RSP00148     CGACG
  1458 (+) RSP00148     CGACG
  1347 (-) RSP00148     CGACG
  1474 (+) RSP00162     ACACccGagctaaccacaac
  1348 (+) RSP00241     CGGTCA
  1387 (+) RSP00339     RTTTTTR
  1264 (-) RSP00397     AGTGGCGG
  1268 (+) RSP00422     ACCGAC
  1459 (+) RSP00423     GACGTG
  1464 (-) RSP00424     CACGTC
  1369 (-) RSP00431     rdygRCRGTTRs
  1278 (-) RSP00432     cVacGGTaGGTgg
  1249 (-) RSP00436     TTGACT
  1260 (+) RSP00463     atttcatggCCGACctgcttttt
  1260 (+) RSP00464     acttgatggCCGACctctttttt
  1260 (+) RSP00465     aatatactaCCGACcatgagttct
  1265 (+) RSP00466     actaCCGACatgagttccaaaaagc
  1440 (+) RSP00469     GNGGTG
  1260 (-) RSP00469     GNGGTG
  1440 (+) RSP00470     GTGGNG
  1263 (-) RSP00470     GTGGNG
  1257 (-) RSP00470     GTGGNG
  1390 (+) RSP00477     TTTAA
  1385 (+) RSP00508     gcaTTTTTatca
  1502 (-) RSP00508     gcaTTTTTatca
  1469 (+) RSP00518     tccctACACgcGtcacaattc
  1465 (+) RSP00519     caattcaggACACgtGccctcttca
  1474 (+) RSP00521     ACACccG
  1474 (+) RSP00523     ACACgcG
  1474 (+) RSP00524     ACACgtG
for promoter at position -    1597
  1468 (-) RSP00004     tagaCACGTaga
  1459 (+) RSP00010     cACGTG
  1456 (+) RSP00011     ctccACGTGgt
  1461 (+) RSP00016     caTGCAC
  1468 (-) RSP00016     caTGCAC
  1460 (+) RSP00065     ACGTGgcgc
  1460 (+) RSP00066     ACGTGccgc
  1459 (+) RSP00069     tACGTG
  1341 (+) RSP00071     GACGTC
  1346 (-) RSP00071     GACGTC
  1452 (-) RSP00096     GGTTT
  1432 (+) RSP00129     CACGAC
  1315 (+) RSP00148     CGACG
  1335 (+) RSP00148     CGACG
  1340 (+) RSP00148     CGACG
  1365 (+) RSP00148     CGACG
  1434 (+) RSP00148     CGACG
  1458 (+) RSP00148     CGACG
  1347 (-) RSP00148     CGACG
  1474 (+) RSP00162     ACACccGagctaaccacaac


..............................
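
The listing above is regular enough to be read by a small script. The following is a minimal sketch, assuming the line layout is exactly as in this sample; the class and function names are hypothetical, and the second number on the TATA box line (18.93 in the sample) is stored without interpretation because its meaning is not described here.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class MotifHit:
    position: int   # position column of the motif line
    strand: str     # "+" or "-"
    motif_id: str   # e.g. "RSP00004"
    sequence: str   # e.g. "tagaCACGTaga"


@dataclass
class Prediction:
    kind: str                           # "Promoter" or "Enhancer"
    tss: int                            # predicted TSS position
    ldf: float                          # LDF value
    tata_pos: Optional[int] = None      # TATA box position, if reported
    tata_value: Optional[float] = None  # number after it (meaning assumed unknown)
    motifs: List[MotifHit] = field(default_factory=list)


# Patterns modelled on the sample output only (an assumption).
PRED_RE = re.compile(
    r"^\s*(Promoter|Enhancer)\s+Pos:\s*(\d+)\s+LDF-\s*([-\d.]+)"
    r"(?:\s+TATA box at\s+(\d+)\s+([-\d.]+))?"
)
SECTION_RE = re.compile(r"^\s*for promoter at position -\s*(\d+)")
MOTIF_RE = re.compile(r"^\s*(\d+)\s+\(([+-])\)\s+(\S+)\s+(\S+)\s*$")


def parse_tssp(text: str) -> List[Prediction]:
    """Collect predictions and attach motif hits to the region they follow."""
    predictions: List[Prediction] = []
    by_tss: Dict[int, Prediction] = {}
    current: Optional[Prediction] = None
    for line in text.splitlines():
        m = PRED_RE.match(line)
        if m:
            kind, tss, ldf, tata_pos, tata_value = m.groups()
            pred = Prediction(
                kind, int(tss), float(ldf),
                int(tata_pos) if tata_pos else None,
                float(tata_value) if tata_value else None,
            )
            predictions.append(pred)
            by_tss[pred.tss] = pred
            continue
        m = SECTION_RE.match(line)
        if m:
            current = by_tss.get(int(m.group(1)))
            continue
        m = MOTIF_RE.match(line)
        if m and current is not None:
            pos, strand, motif_id, seq = m.groups()
            current.motifs.append(MotifHit(int(pos), strand, motif_id, seq))
    return predictions
```

Calling parse_tssp on the saved output text returns one Prediction per "Promoter Pos" or "Enhancer Pos" line, each carrying the motif hits listed under the matching "for promoter at position" header. Lines that match none of the patterns (header, thresholds, dotted separator) are ignored.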

In the motif sequences of the TSSP output, lower-case letters denote non-conserved nucleotides in the site consensus.

Letters other than A, T, G and C denote ambiguous positions in a DNA sequence motif, where a single character may represent more than one nucleotide according to the standard IUPAC nucleotide code.

See TABLE at http://www.yeastract.com/help/help_searchbydnamotif.php#Ref1

IUPAC Code   Meaning            Origin of Description
G            G                  Guanine
A            A                  Adenine
T            T                  Thymine
C            C                  Cytosine
R            G or A             puRine
Y            T or C             pYrimidine
M            A or C             aMino
K            G or T             Ketone
S            G or C             Strong interaction
W            A or T             Weak interaction
H            A or C or T        not-G, H follows G in the alphabet
B            G or T or C        not-A, B follows A in the alphabet
V            G or C or A        not-T (not-U), V follows U in the alphabet
D            G or A or T        not-C, D follows C in the alphabet
N            G or A or T or C   aNy
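
The IUPAC codes in the table can be expanded mechanically, for instance to search a sequence for a motif such as RTTTTTR. The sketch below is one possible way to do this with regular expressions; the function names are hypothetical, and it treats lower-case (non-conserved) letters the same as upper-case ones, which is a simplifying assumption and not how TSSP scores similarity.

```python
import re
from typing import List

# Expansion of each IUPAC nucleotide code, following the table above.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT",
    "S": "CG", "W": "AT", "H": "ACT", "B": "CGT",
    "V": "ACG", "D": "AGT", "N": "ACGT",
}


def motif_to_regex(motif: str) -> str:
    """Translate an IUPAC motif into a regular-expression string."""
    parts = []
    for ch in motif.upper():
        bases = IUPAC[ch]  # raises KeyError for characters outside the table
        parts.append(bases if len(bases) == 1 else f"[{bases}]")
    return "".join(parts)


def find_motif(motif: str, sequence: str) -> List[int]:
    """Return 1-based start positions of all matches, including overlapping ones."""
    # A lookahead makes each match zero-width, so overlapping hits are kept.
    pattern = re.compile(f"(?=({motif_to_regex(motif)}))", re.IGNORECASE)
    return [m.start() + 1 for m in pattern.finditer(sequence)]
```

For example, motif_to_regex("RTTTTTR") yields "[AG]TTTTT[AG]". The search covers the given strand only; hits marked (-) in the TSSP output would need the complementary strand to be searched separately.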