Scalable Annotated Genome Graphs for Representing Sequence Data

Mikhail Karasikov

July 2023

Abstract

Advances in sequencing technologies over the last decades have led to continuous improvements in quality and ever-decreasing costs of sequencing. This has resulted in steady growth in the amount of biological sequences produced by medical institutions and the general scientific community. Yet the vast majority of this data is stored in repositories that provide no means for large-scale analysis and search. For example, the European Nucleotide Archive (ENA) and the NCBI Sequence Read Archive (SRA) currently store over 37 and 72 Petabases of sequences, respectively. However, answering even a question as simple as ‘has this sequence, variant, or pathogen been observed anywhere before?’ with a moderately large query would require extensive computations that cost over a thousand US dollars with a typical cloud computing provider.
In this dissertation, we consider the problem of indexing large collections of biological sequences. We design compressed data structures and apply them to build a tool called MetaGraph, which aggregates large volumes of sequence data and makes them searchable. As a result, life science researchers and other communities gain easy access to sequence data for investigation, which is essential for making discoveries.
To demonstrate the capacity of MetaGraph, we have indexed a significant portion of all publicly available sequencing samples from the SRA. We have also indexed a number of other diverse and biologically relevant data sets, from reference genomes to raw metagenomic reads. In total, we processed 4.6 Petabases of sequences, which far exceeds the pivotal figure of one Petabase and, at last, makes this data fully and efficiently searchable by sequence. The resulting indexes form a valuable community resource, as they succinctly summarize large raw-sequence data sets while supporting various queries against them. We provide these indexes as a public resource, with a subset of them hosted online as a service for interactive search. The size and diversity of the data we have processed demonstrate the feasibility of keeping all existing sequence archives indexed in a general manner and making them searchable, much as Google indexes web pages and the information extracted from them.

Type

Thesis

Publication

In ETH Zurich Research Collection

Mikhail Karasikov

ML Engineer, PhD

Machine learning researcher/engineer with a background in mathematics and computer science.
