Hashing complements alignment-based methods for bacterial genome annotation

Transforming protein sequences into hash fingerprints to rapidly look up information from annotation databases. Credit: Oliver Schwengers

DNA sequencing has changed biology like nothing else since the origin of species concept. In particular, the way we study microbial life has fundamentally changed. Today, we are able to sequence DNA with unprecedented speed and resolution, so that we can even sequence genomes of microbes that have never been described or cultivated before. At the same time, whole-genome sequencing of known, mostly pathogenic, species has become a routine method carried out worldwide on a daily basis.

This, in turn, constantly increases the amount of publicly stored sequences, which are becoming a treasure trove and a hurdle at the same time. For many sequence-based computational analyses, comprehensive and thorough genome annotations play a vital role as a common starting point. And for a long time this has been perceived as a solved problem.
But the daily influx of new genome and gene sequences into public databases poses new issues for the rapid annotation of microbial genomes. In particular, the search for similar or identical protein-coding genes has become a large-scale bioinformatics search problem, like looking for a needle in a haystack, and an astonishingly large haystack these days.
In this context, we are facing two diametrically diverging trends. On the one hand, public databases are flooded with similar and near-identical protein sequences. These include sequences of utmost relevance, such as antimicrobial resistance genes and virulence factors, which can be cross-linked with a wealth of useful information from many public databases. On the other hand, countless new sequences emerge from metagenome projects sequencing what is often referred to as microbial dark matter. For many of these sequences, however, no further information is available at all.
Two distinct bioinformatic challenges arise from this situation: first, the exact identification of known sequences, and second, the functional description of rare or even unknown sequences, both on the order of hundreds of millions. To address these challenges, we tried a new approach: alignment-free protein sequence hashing coupled with two hierarchical sequence alignment steps. Our work was published in the journal Microbial Genomics.
To exactly identify known protein sequences, we used a hash function that maps input data of arbitrary length to fixed-size binary fingerprints. Such hash functions are well known from so-called checksum calculations because of an important characteristic: they are extremely fast to compute, much faster than traditional sequence alignments.
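As a rough illustration of the idea, and not Bakta's actual code, the following Python sketch turns protein sequences of any length into fingerprints of identical size. The choice of MD5 from the standard library, the `fingerprint` helper and the normalization step are assumptions made for this example; the article does not name the hash function used.

```python
import hashlib


def fingerprint(protein_seq: str) -> bytes:
    """Map a protein sequence of arbitrary length to a 16-byte fingerprint.

    MD5 is an assumption chosen for illustration; any fast hash with a
    fixed-size digest would demonstrate the same principle.
    """
    # Normalize so that case and surrounding whitespace do not change the hash.
    normalized = protein_seq.strip().upper()
    return hashlib.md5(normalized.encode("ascii")).digest()


# Toy sequences (placeholders, not real proteins).
short_fp = fingerprint("MKT")
long_fp = fingerprint("MKT" * 500)

# Both fingerprints have the same size, regardless of input length.
print(len(short_fp), len(long_fp))
```

Two identical sequences always yield the same fingerprint, so testing for exact identity reduces to comparing a few bytes instead of aligning two sequences.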

To take advantage of this, we created a compact, local database with hash fingerprints of more than 220 million protein sequences. In a second step, we pre-assigned high-quality annotations and cross-links to further external databases. Of note, these demanding large-scale computations are only required once, at the database compilation step, which we regularly conduct upon new releases. For the actual genome annotation process, we can use this dense information storage at runtime and thus achieve exact sequence identifications and ultra-fast lookups of related information.
We also reduced overall storage requirements to one third, even though rich additional annotation information is included, such as gene symbols, EC numbers, GO terms, protein products and external database accessions. This information is a valuable resource for connecting sequences at hand with related sequences stored in public databases.
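The split between a one-time compilation step and fast runtime lookups can be sketched as follows. This is a minimal example under stated assumptions: the SQLite table layout, the `build_db` and `lookup` helpers, the MD5 fingerprint and the toy records are all hypothetical and do not reflect the real Bakta database schema.

```python
import hashlib
import sqlite3


def fingerprint(protein_seq: str) -> bytes:
    # MD5 is an illustrative assumption; see the hedge in the text above.
    return hashlib.md5(protein_seq.strip().upper().encode("ascii")).digest()


def build_db(records):
    """Compile step, run once: store fingerprints with pre-assigned annotations.

    `records` is an iterable of (sequence, gene, product) tuples.
    Only the fingerprint is stored, not the sequence itself.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE annotation (hash BLOB PRIMARY KEY, gene TEXT, product TEXT)"
    )
    conn.executemany(
        "INSERT INTO annotation VALUES (?, ?, ?)",
        [(fingerprint(seq), gene, product) for seq, gene, product in records],
    )
    conn.commit()
    return conn


def lookup(conn, protein_seq):
    """Runtime step: exact identification via a single indexed key lookup."""
    row = conn.execute(
        "SELECT gene, product FROM annotation WHERE hash = ?",
        (fingerprint(protein_seq),),
    ).fetchone()
    if row is None:
        return None  # unidentified; would go on to the alignment fallback
    return {"gene": row[0], "product": row[1]}


# Toy placeholder records, not real annotations.
db = build_db([
    ("MKTAYIAKQR", "exA", "example protein A"),
    ("MSLNEQWERT", "exB", "example protein B"),
])
hit = lookup(db, "MKTAYIAKQR")
miss = lookup(db, "MAAAAAAAAA")
```

Because the table is keyed on the fingerprint, a lookup costs one index search no matter how long the query sequence is.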
Interestingly enough, this alignment-free approach also helped to avoid a significant share of the computationally expensive alignments that follow as a fallback search strategy for unidentified sequences. In a hierarchical two-step process, the remaining protein sequences were searched via traditional sequence alignments against protein cluster representative sequences. First, more than 99 million dense protein clusters were screened for matches, followed by a second search using more relaxed thresholds that screened more than 13 million wider clusters.
Potentially negative runtime effects of these huge protein cluster databases were mitigated by the described alignment-free sequence identification approach. Finally, all annotation information for identified protein sequences and related clusters was combined, giving specific information precedence over more general information.
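The overall control flow, exact lookup first, then two alignment searches of decreasing stringency, then a merge in which specific values override general ones, might be sketched like this. Everything here is an assumption for illustration: `difflib.SequenceMatcher` is only a stand-in for a real sequence aligner, the thresholds `0.9` and `0.5` are invented, a plain dictionary replaces the hash database, and the sketch assumes the second search only runs for sequences the first one missed.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    # Stand-in for a real alignment tool; NOT a biological alignment score.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def search_clusters(seq, clusters, min_similarity):
    """Return the annotation of the best representative above the threshold."""
    best, best_score = None, min_similarity
    for representative, annotation in clusters:
        score = similarity(seq, representative)
        if score >= best_score:
            best, best_score = annotation, score
    return best


def merge_annotations(layers):
    """Merge dicts ordered from general to specific; later non-empty values win."""
    merged = {}
    for layer in layers:
        for key, value in layer.items():
            if value is not None:
                merged[key] = value
    return merged


def annotate(seq, exact_db, dense, wide, dense_min=0.9, wide_min=0.5):
    # Step 1: exact identification (a dict stands in for the hash lookup).
    if seq in exact_db:
        hit = exact_db[seq]
        return merge_annotations([hit["cluster"], hit["sequence"]])
    # Step 2: strict search against dense cluster representatives.
    hit = search_clusters(seq, dense, dense_min)
    # Step 3: relaxed search against wider clusters, only if still unidentified.
    if hit is None:
        hit = search_clusters(seq, wide, wide_min)
    return hit


# Toy placeholder data, not real annotations.
exact_db = {
    "MKTAYIAKQR": {
        "cluster": {"product": "general family protein", "ec": "0.0.0.0"},
        "sequence": {"product": "specific protein", "gene": "exA", "ec": None},
    }
}
dense = [("MKTAYIAKQRQISFVKSHFS", {"product": "dense cluster protein"})]
wide = [("MKTAYIAKQRQISFVKSHFS", {"product": "wide cluster protein"})]

exact_result = annotate("MKTAYIAKQR", exact_db, dense, wide)
dense_result = annotate("MKTAYIAKQRQISFVKSHFA", exact_db, dense, wide)
wide_result = annotate("MKTAYIAKQRQIWWWWWWWW", exact_db, dense, wide)
no_result = annotate("WWWWWWWWWW", exact_db, dense, wide)
```

In the exact-hit case, the sequence-level product and gene override the cluster-level product, while the cluster's EC placeholder survives because the more specific layer has no value for it.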
This hierarchical approach is part of a larger annotation workflow that also comprises the annotation of non-coding RNA and DNA features, e.g., tRNAs, rRNAs, ncRNAs, CRISPR arrays, origins of replication and many more. Bakta is available as a command line tool and as a scalable web service at https://bakta.computational.bio
This story is part of Science X Dialog, where researchers can report findings from their published research articles.

More information:
Oliver Schwengers et al, Bakta: rapid and standardized annotation of bacterial genomes via alignment-free sequence identification, Microbial Genomics (2021). DOI: 10.1099/mgen.0.000685

Oliver Schwengers is a postdoctoral researcher in microbial bioinformatics at the Bioinformatics and Systems Biology department at JLU Giessen. His research activities focus on the analysis and characterization of bacterial genomes and plasmids based on whole-genome sequencing data, as well as the development of fully automated and scalable bioinformatics software tools. He likes to regularly collaborate with researchers from medical, environmental and space microbiology in an interdisciplinary manner.

Citation:
Hashing complements alignment-based methods for bacterial genome annotation (2022, December 13)
retrieved 13 December 2022
from https://phys.org/news/2022-12-hashing-complements-alignment-based-methods-bacterial.html
