How to predict exact boundaries of genes?

All living organisms contain DNA, which provides the instructions for all of the varied activities their cells need to carry out. The totality of an organism’s DNA is called a genome, and distinct regions within it, called genes, encode all of the cell’s molecular machinery. When the first genome was sequenced, scientists relied on manual annotation of these gene regions, but as the genomic revolution of the mid-2000s unfolded, the rate of genome sequencing increased dramatically and the development of effective automated annotation took precedence.

Modern techniques can now identify where genes reside with high accuracy, but finding the exact boundaries of those regions remains a challenging problem. In particular, the starting points of genes are especially difficult to identify accurately. This is important because a gene’s genomic sequence determines its function, and the start often contains important regulatory instructions detailing, for example, the localization of the final product. While researchers can identify these boundaries manually using experimental approaches, the laborious and costly nature of these experiments means that, in practice, this can only be done for a handful of the many tens of thousands of genomes that have now been sequenced.

To bridge this widening gap, Adam Giess and Eivind Valen from the CBU developed a powerful machine learning approach to predict the exact boundaries of genes. This approach leverages data from a method known as ribo-seq, which captures the molecular machinery in the act of decoding genes. While working with bacterial ribo-seq datasets, Adam and Eivind noticed that a distinct pattern occurs when the machinery interacts with the start of a gene. They then went on to demonstrate how this pattern can be used, in a generalizable method, to accurately annotate gene boundaries in two well-known, highly studied bacterial species.

In this way, they established a methodology that is able to fix erroneous gene boundary assignments in existing genomes and provide highly accurate gene predictions in the ever-growing list of newly sequenced genomes, clearing an avenue for the study of a wide range of organisms at unprecedented levels of detail.