Sparse non-negative matrix factorization for retrieving genomes across metagenomes

The development of massively parallel sequencing technologies makes it possible to sequence DNA at high throughput and low cost, fueling the rise of metagenomics, the study of complex microbial communities sequenced in their natural environment. A metagenomic dataset consists of billions of unordered small fragments of genomes (reads), originating from hundreds or thousands of different organisms. The de novo reconstruction of individual genomes from metagenomes is challenging in practice, both because of the complexity of the problem (sequence assembly is NP-hard) and because of the large data volumes. The clustering of sequences into biologically meaningful partitions (e.g. strains), known as binning, is a key step, and most computational tools perform read assembly as a pre-processing step. However, metagenome assembly (and cross-assembly even more so) is computationally intensive, requiring terabytes of memory; it is also error-prone (yielding artefacts such as chimeric contigs) and discards vast amounts of information in the form of unassembled reads (up to 50% for highly diverse metagenomes). Here we show how online learning methods for sparse non-negative matrix factorization can recover the relative abundances of genomes across multiple metagenomes and support assembly-free read binning by using abundance covariation signals derived from the occurrence of unique k-mers (subsequences of size k) across samples. The combinatorial explosion of k-mers is controlled by indexing them with locality-sensitive hashing, and sparse coding and dictionary learning techniques are used to decompose the k-mer abundance covariation signal into genome-resolved components in latent space.
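
The pipeline outlined in the abstract can be illustrated with a minimal sketch, which is not the authors' implementation. It makes several simplifying assumptions: plain CRC32 bucketing stands in for locality-sensitive hashing, batch multiplicative updates stand in for the online learning scheme, and the choice of which factor carries the L1 penalty is ours. All function names (`kmer_bucket_counts`, `sparse_nmf`, and the helpers) and parameter values are hypothetical.

```python
import random
import zlib


def kmer_bucket_counts(reads, k, n_buckets):
    """Count the k-mers of one sample into a fixed number of hash buckets."""
    counts = [0.0] * n_buckets
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Assumption: plain CRC32 bucketing stands in for the
            # locality-sensitive hashing mentioned in the abstract.
            counts[zlib.crc32(kmer.encode("ascii")) % n_buckets] += 1.0
    return counts


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(x, w, h):
    """Frobenius norm of the residual x - w h."""
    wh = matmul(w, h)
    return sum((p - q) ** 2 for rx, ry in zip(x, wh) for p, q in zip(rx, ry)) ** 0.5


def sparse_nmf(x, rank, l1=0.01, n_iter=300, seed=0, eps=1e-9):
    """Approximate x (buckets x samples) by w h, with w, h >= 0 and an L1 penalty on w."""
    rng = random.Random(seed)
    n, m = len(x), len(x[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(n_iter):
        # Multiplicative update of h (abundance profile of each component).
        wt = transpose(w)
        num, den = matmul(wt, x), matmul(matmul(wt, w), h)
        h = [[h[r][j] * num[r][j] / (den[r][j] + eps) for j in range(m)]
             for r in range(rank)]
        # Normalize each profile to unit sum and rescale w so that w h is
        # unchanged; this keeps the L1 penalty on w from being absorbed by h.
        for r in range(rank):
            s = sum(h[r]) + eps
            h[r] = [v / s for v in h[r]]
            for i in range(n):
                w[i][r] *= s
        # Multiplicative update of w; l1 in the denominator shrinks its entries.
        ht = transpose(h)
        num, den = matmul(x, ht), matmul(w, matmul(h, ht))
        w = [[w[i][r] * num[i][r] / (den[i][r] + l1 + eps) for r in range(rank)]
             for i in range(n)]
    return w, h


# Toy abundance matrix: 6 k-mer buckets, 4 samples, 2 hidden "genomes".
w_true = [[1, 0], [2, 0], [3, 0], [0, 1], [0, 2], [0, 3]]
h_true = [[5, 1, 0, 3], [0, 4, 2, 1]]
x = matmul(w_true, h_true)

w, h = sparse_nmf(x, rank=2)
# Assign each bucket to the component with the largest loading (a crude binning).
bins = [max(range(2), key=lambda r: row[r]) for row in w]
```

On real data, each column of `x` would be the output of `kmer_bucket_counts` for one sample, so that buckets whose counts covary across samples load on the same component. The rows of `h` then play the role of relative abundance profiles, and reads could be binned through the buckets of the k-mers they contain.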

Keywords

genomics; statistical analysis; DNA; metagenomics; big data; genome fragments; genome reconstruction; sequence assembly; clustering; metagenome assembly; machine learning; online learning; artificial intelligence; sparsity; non-negative matrix factorization; signal processing; binning; unique k-mers; indexing; sparse coding; dictionary learning

Domains

Data Analysis, Statistics and Probability [physics.data-an]
Artificial Intelligence [cs.AI]
Genomics [q-bio.GN]
Machine Learning [stat.ML]

Main file

article_VincentProst.pdf (1.09 MB)

Origin : Files produced by the author(s)

Contributor : Marie-France Robbe

https://hal.science/hal-04415393

Submitted on : Monday, April 22, 2024, 3:35:16 PM

Last modification on : Tuesday, April 23, 2024, 3:26:36 AM

Dates and versions

hal-04415393 , version 1 (22-04-2024)

Identifiers

HAL Id : hal-04415393 , version 1
DOI : 10.1007/978-3-030-46140-9_10

Cite

Vincent Prost, Stéphane Gazut, Thomas Brüls. Sparse non-negative matrix factorization for retrieving genomes across metagenomes. SimBig 2019 - 6th International Conference on Information Management and Big Data, Aug 2019, Lima, Peru. pp.97-105, ⟨10.1007/978-3-030-46140-9_10⟩. ⟨hal-04415393⟩

Collections

CEA CNRS UNIV-EVRY DRT CEA-UPSAY GENOMIQUE-METABOLIQUE UNIV-PARIS-SACLAY JACOB CEA-DRF LIST GENOSCOPE GS-COMPUTER-SCIENCE GS-BIOSPHERA GS-LIFE-SCIENCES-HEALTH GS-SPORT-HUMAN-MOVEMENT
