Abstract
The promoter is a key region involved in the differential transcriptional regulation of protein-coding and RNA genes. The gene-specific architecture of promoter sequences makes it extremely difficult to devise a general strategy for their computational identification. Accurate prediction of promoters is fundamental for interpreting gene expression patterns and for constructing and understanding genetic regulatory networks. In the last decade, the genomes of many organisms have been sequenced, and their gene content has been identified mainly by computational means. Promoters and transcription start sites (TSSs), however, remain largely undetermined, and efficient software able to accurately predict promoters in newly sequenced genomes is not yet available in the public domain.
Most known promoters include multiple and, in some cases, mutually exclusive TSSs. Moreover, TSS selection depends on cell or tissue type, developmental stage, and environmental conditions. Such complex promoter structures make their computational identification notoriously difficult. In this work, we have developed tools for predicting promoters in different organisms. Instead of focusing on classification accuracy, we predict the exact positions of TSSs within genomic sequences by testing every possible location. Our methods significantly outperform previously developed promoter prediction programs, considerably reducing the number of false-positive predictions.
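To make the idea of testing every possible location concrete, the following is a minimal Python sketch of a position-by-position scan. It is not the method developed in this work: the function names (`scan_for_tss`, `toy_score`), the window sizes, the threshold, and the scoring rule are all hypothetical placeholders chosen for illustration. A real predictor would replace `toy_score` with a trained model.

```python
from typing import Callable, List, Tuple


def scan_for_tss(
    sequence: str,
    score_window: Callable[[str], float],
    upstream: int = 4,
    downstream: int = 2,
    threshold: float = 1.0,
) -> List[Tuple[int, float]]:
    """Score every candidate position at which a full window fits.

    Returns (position, score) pairs whose score reaches the threshold.
    Positions are 0-based indices into the sequence. All parameter
    values here are illustrative assumptions, not published settings.
    """
    seq = sequence.upper()
    hits = []
    # Visit each position that has `upstream` bases before it and
    # `downstream` bases from it onward.
    for pos in range(upstream, len(seq) - downstream + 1):
        window = seq[pos - upstream : pos + downstream]
        score = score_window(window)
        if score >= threshold:
            hits.append((pos, score))
    return hits


def toy_score(window: str) -> float:
    """Placeholder scorer (assumption): 1.0 if the window starts with TATA."""
    return 1.0 if window.startswith("TATA") else 0.0


if __name__ == "__main__":
    for position, score in scan_for_tss("GGTATAAGG", toy_score):
        print(position, score)
```

Because every position is scored independently, the number of false positives in such a scan depends directly on the scoring function and threshold, which is why evaluating predictions at exact positions is a stricter test than classifying pre-selected sequences.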