• EDEN: evolutionary dynamics within environments.

      Münch, Philipp C; Stecher, Bärbel; McHardy, Alice C; BRICS, Braunschweiger Zentrum für Systembiologie, Rebenring 56,38106 Braunschweig, Germany. (Oxford Academic, 2017-10-15)
      Metagenomics revolutionized the field of microbial ecology, giving access to Gb-sized datasets of microbial communities under natural conditions. This enables fine-grained analyses of the functions of community members, studies of their association with phenotypes and environments, as well as of their microevolution and adaptation to changing environmental conditions. However, phylogenetic methods for studying adaptation and evolutionary dynamics are not able to cope with big data. EDEN is the first software for the rapid detection of protein families and regions under positive selection, as well as their associated biological processes, from meta- and pangenome data. It provides an interactive result visualization for detailed comparative analyses. Availability and implementation: EDEN is available as a Docker installation under the GPL 3.0 license, allowing its use on common operating systems, at http://www.github.com/hzi-bifo/eden.
    • Eleven grand challenges in single-cell data science.

      Lähnemann, David; Köster, Johannes; Szczurek, Ewa; McCarthy, Davis J; Hicks, Stephanie C; Robinson, Mark D; Vallejos, Catalina A; Campbell, Kieran R; Beerenwinkel, Niko; Mahfouz, Ahmed; et al. (BMC, 2020-02-07)
      The recent boom in microfluidics and combinatorial indexing strategies, combined with low sequencing costs, has empowered single-cell sequencing technology. Thousands, or even millions, of cells analyzed in a single experiment amount to a data revolution in single-cell biology and pose unique data science problems. Here, we outline eleven challenges that will be central to bringing this emerging field of single-cell data science forward. For each challenge, we highlight motivating research questions, review prior work, and formulate open problems. This compendium is for established researchers, newcomers, and students alike, highlighting interesting and rewarding problems for the coming years.
    • Evaluating assembly and variant calling software for strain-resolved analysis of large DNA viruses

      Deng, Zhi-Luo; Dhingra, Akshay; Fritz, Adrian; Götting, Jasper; Münch, Philipp C; Steinbrück, Lars; Schulz, Thomas F; Ganzenmüller, Tina; McHardy, Alice C; BRICS, Braunschweiger Zentrum für Systembiologie, Rebenring 56,38106 Braunschweig, Germany. (Oxford University Press (OUP), 2020-07-07)
      Infection with human cytomegalovirus (HCMV) can cause severe complications in immunocompromised individuals and congenitally infected children. Characterizing heterogeneous viral populations and their evolution by high-throughput sequencing of clinical specimens requires the accurate assembly of individual strains or sequence variants and suitable variant calling methods. However, the performance of most methods has not been assessed for populations composed of low divergent viral strains with large genomes, such as HCMV. In an extensive benchmarking study, we evaluated 15 assemblers and 6 variant callers on 10 lab-generated benchmark data sets created with two different library preparation protocols, to identify best practices and challenges for analyzing such data. Most assemblers, especially metaSPAdes and IVA, performed well across a range of metrics in recovering abundant strains. However, only one, Savage, recovered low abundant strains and in a highly fragmented manner. Two variant callers, LoFreq and VarScan2, excelled across all strain abundances. Both shared a large fraction of false positive variant calls, which were strongly enriched in T to G changes in a ‘G.G’ context. The magnitude of this context-dependent systematic error is linked to the experimental protocol. We provide all benchmarking data, results and the entire benchmarking workflow named QuasiModo, Quasispecies Metric determination on omics, under the GNU General Public License v3.0 (https://github.com/hzi-bifo/Quasimodo), to enable full reproducibility and further benchmarking on these and other data.
    • Evolution of 2009 H1N1 influenza viruses during the pandemic correlates with increased viral pathogenicity and transmissibility in the ferret model.

      Otte, Anna; Marriott, Anthony C; Dreier, Carola; Dove, Brian; Mooren, Kyra; Klingen, Thorsten R; Sauter, Martina; Thompson, Katy-Anne; Bennett, Allan; Klingel, Karin; et al. (2016)
      There is increasing evidence that 2009 pandemic H1N1 influenza viruses have evolved after pandemic onset giving rise to severe epidemics in subsequent waves. However, it still remains unclear which viral determinants might have contributed to disease severity after pandemic initiation. Here, we show that distinct mutations in the 2009 pandemic H1N1 virus genome have occurred with increased frequency after pandemic declaration. Among those, a mutation in the viral hemagglutinin was identified that increases 2009 pandemic H1N1 virus binding to human-like α2,6-linked sialic acids. Moreover, these mutations conferred increased viral replication in the respiratory tract and elevated respiratory droplet transmission between ferrets. Thus, our data show that 2009 H1N1 influenza viruses have evolved after pandemic onset giving rise to novel virus variants that enhance viral replicative fitness and respiratory droplet transmission in a mammalian animal model. These findings might help to improve surveillance efforts to assess the pandemic risk by emerging influenza viruses.
    • Evolutionary model for the unequal segregation of high copy plasmids.

      Münch, Karin; Münch, Richard; Biedendieck, Rebekka; Jahn, Dieter; Müller, Johannes; BRICS, Braunschweiger Zentrum für Systembiologie, Rebenring 56,38106 Braunschweig, Germany. (PLOS, 2019-01-01)
      Plasmids are extrachromosomal DNA elements of microorganisms encoding beneficial genetic information. They were thought to be equally distributed to daughter cells during cell division. Here we use mathematical modeling to investigate the evolutionary stability of plasmid segregation for high-copy plasmids (plasmids that are present in up to several hundred copies per cell) carrying antibiotic resistance genes. Evolutionarily stable strategies (ESS) are determined by numerical analysis of a plasmid-load structured population model. The theory predicts that the evolutionarily stable segregation strategy of a cell depends on the plasmid copy number: for low and medium plasmid load, both daughters receive on average an equal share of plasmids, while in the case of high plasmid load, one daughter obtains distinctively and systematically more plasmids. These findings are in good agreement with recent experimental results. We discuss the interpretation and practical consequences.
    • Evolutionary Stabilization of Cooperative Toxin Production through a Bacterium-Plasmid-Phage Interplay.

      Spriewald, Stefanie; Stadler, Eva; Hense, Burkhard A; Münch, Philipp C; McHardy, Alice C; Weiss, Anna S; Obeng, Nancy; Müller, Johannes; Stecher, Bärbel; BRICS, Braunschweiger Zentrum für Systembiologie, Rebenring 56,38106 Braunschweig, Germany. (ASM, 2020-07-21)
      Colicins are toxins produced and released by Enterobacteriaceae to kill competitors in the gut. While group A colicins employ a division of labor strategy to liberate the toxin into the environment via colicin-specific lysis, group B colicin systems lack cognate lysis genes. In Salmonella enterica serovar Typhimurium (S. Tm), the group B colicin Ib (ColIb) is released by temperate phage-mediated bacteriolysis. Phage-mediated ColIb release promotes S. Tm fitness against competing Escherichia coli. It remained unclear how prophage-mediated lysis is realized in a clonal population of ColIb producers and if prophages contribute to evolutionary stability of toxin release in S. Tm. Here, we show that prophage-mediated lysis occurs in an S. Tm subpopulation only, thereby introducing phenotypic heterogeneity to the system. We established a mathematical model to study the dynamic interplay of S. Tm, ColIb, and a temperate phage in the presence of a competing species. Using this model, we studied long-term evolution of phage lysis rates in a fluctuating infection scenario. This revealed that phage lysis evolves as a bet-hedging strategy that maximizes phage spread, regardless of whether colicin is present or not. We conclude that the ColIb system, lacking its own lysis gene, is making use of the evolutionarily stable phage strategy to be released. Prophage lysis genes are highly prevalent in nontyphoidal Salmonella genomes. This suggests that the release of ColIb by temperate phages is widespread. In conclusion, our findings shed new light on the evolution and ecology of group B colicin systems.
      IMPORTANCE: Bacteria are excellent model organisms to study mechanisms of social evolution. The production of public goods, e.g., toxin release by cell lysis in clonal bacterial populations, is a frequently studied example of cooperative behavior. Here, we analyze evolutionary stabilization of toxin release by the enteric pathogen Salmonella. The release of colicin Ib (ColIb), which is used by Salmonella to gain an edge against competing microbiota following infection, is coupled to bacterial lysis mediated by temperate phages. Here, we show that phage-dependent lysis and subsequent release of colicin and phage particles occurs only in part of the ColIb-expressing Salmonella population. This phenotypic heterogeneity in lysis, which represents an essential step in the temperate phage life cycle, has evolved as a bet-hedging strategy under fluctuating environments such as the gastrointestinal tract. Our findings suggest that prophages can thereby evolutionarily stabilize costly toxin release in bacterial populations.
    • A Fréchet tree distance measure to compare phylogeographic spread paths across trees.

      Reimering, Susanne; Muñoz, Sebastian; McHardy, Alice C; BRICS, Braunschweiger Zentrum für Systembiologie, Rebenring 56,38106 Braunschweig, Germany. (Nature publishing group, 2018-11-19)
      Phylogeographic methods reconstruct the origin and spread of taxa by inferring locations for internal nodes of the phylogenetic tree from sampling locations of genetic sequences. This is commonly applied to study pathogen outbreaks and spread. To evaluate such reconstructions, the inferred spread paths from root to leaf nodes should be compared to other methods or references. Usually, ancestral state reconstructions are evaluated by node-wise comparisons, therefore requiring the same tree topology, which is usually unknown. Here, we present a method for comparing phylogeographies across different trees inferred from the same taxa. We compare paths of locations by calculating discrete Fréchet distances. By correcting the distances by the number of paths going through a node, we define the Fréchet tree distance as a distance measure between phylogeographies. As an application, we compare phylogeographic spread patterns on trees inferred with different methods from hemagglutinin sequences of H5N1 influenza viruses, finding that both tree inference and ancestral reconstruction cause variation in phylogeographic spread that is not directly reflected by topological differences. The method is suitable for comparing phylogeographies inferred with different tree or phylogeographic inference methods to each other or to a known ground truth, thus enabling a quality assessment of such techniques.
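The abstract above compares paths of locations using discrete Fréchet distances. As a minimal sketch of that building block only, the following computes the standard discrete Fréchet distance between two location paths by dynamic programming. It is not the authors' implementation, and it omits the correction by the number of paths through each node that defines the Fréchet tree distance. The function names, the planar coordinates, and the Euclidean metric are illustrative assumptions; real sampling locations would more likely need a geographic distance.

```python
from math import hypot


def discrete_frechet(p, q, dist):
    """Discrete Fréchet distance between two non-empty point sequences.

    ca[i][j] holds the coupling distance between the prefixes
    p[:i+1] and q[:j+1]; the final cell is the distance of the full paths.
    """
    n, m = len(p), len(q)
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = dist(p[i], q[j])
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                # Only q can advance along the first row.
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                # Only p can advance along the first column.
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                # Best of advancing p, advancing q, or advancing both.
                best_prev = min(ca[i - 1][j], ca[i][j - 1], ca[i - 1][j - 1])
                ca[i][j] = max(best_prev, d)
    return ca[n - 1][m - 1]


def euclidean(a, b):
    """Planar distance; an assumption made for this sketch only."""
    return hypot(a[0] - b[0], a[1] - b[1])


# Hypothetical root-to-leaf spread paths as (x, y) locations.
path_a = [(0, 0), (1, 0), (2, 0)]
path_b = [(0, 1), (1, 1), (2, 1)]
print(discrete_frechet(path_a, path_b, euclidean))
```

Because the two paths may have different lengths, the measure does not require both trees to have the same number of internal nodes between root and leaf, which is what makes a path-based comparison across topologies possible.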
    • From Genomes to Phenotypes: Traitar, the Microbial Trait Analyzer.

      Weimann, Aaron; Mooren, Kyra; Frank, Jeremy; Pope, Phillip B; Bremges, Andreas; McHardy, Alice C; BRICS, Braunschweiger Zentrum für Systembiologie, Rebenring 56, 38106 Braunschweig, Germany. (2017-01-31)
      The number of sequenced genomes is growing exponentially, profoundly shifting the bottleneck from data generation to genome interpretation. Traits are often used to characterize and distinguish bacteria and are likely a driving factor in microbial community composition, yet little is known about the traits of most microbes. We describe Traitar, the microbial trait analyzer, which is a fully automated software package for deriving phenotypes from a genome sequence. Traitar provides phenotype classifiers to predict 67 traits related to the use of various substrates as carbon and energy sources, oxygen requirement, morphology, antibiotic susceptibility, proteolysis, and enzymatic activities. Furthermore, it suggests protein families associated with the presence of particular phenotypes. Our method uses L1-regularized L2-loss support vector machines for phenotype assignments based on phyletic patterns of protein families and their evolutionary histories across a diverse set of microbial species. We demonstrate reliable phenotype assignment for Traitar to bacterial genomes from 572 species of eight phyla, also based on incomplete single-cell genomes and simulated draft genomes. We also showcase its application in metagenomics by verifying and complementing a manual metabolic reconstruction of two novel Clostridiales species based on draft genomes recovered from commercial biogas reactors. Traitar is available at https://github.com/hzi-bifo/traitar.
      IMPORTANCE: Bacteria are ubiquitous in our ecosystem and have a major impact on human health, e.g., by supporting digestion in the human gut. Bacterial communities can also aid in biotechnological processes such as wastewater treatment or decontamination of polluted soils. Diverse bacteria contribute with their unique capabilities to the functioning of such ecosystems, but lab experiments to investigate those capabilities are labor-intensive. Major advances in sequencing techniques open up the opportunity to study bacteria by their genome sequences. For this purpose, we have developed Traitar, software that predicts traits of bacteria on the basis of their genomes. It is applicable to studies with tens or hundreds of bacterial genomes. Traitar may help researchers in microbiology to pinpoint the traits of interest, reducing the amount of wet lab work required.
    • Functional omics analyses reveal only minor effects of microRNAs on human somatic stem cell differentiation.

      Schira-Heinen, Jessica; Czapla, Agathe; Hendricks, Marion; Kloetgen, Andreas; Wruck, Wasco; Adjaye, James; Kögler, Gesine; Werner Müller, Hans; Stühler, Kai; Trompeter, Hans-Ingo; et al. (NPG, 2020-02-24)
      The contribution of microRNA-mediated posttranscriptional regulation on the final proteome in differentiating cells remains elusive. Here, we evaluated the impact of microRNAs (miRNAs) on the proteome of human umbilical cord blood-derived unrestricted somatic stem cells (USSC) during retinoic acid (RA) differentiation by a systemic approach using next generation sequencing analysing mRNA and miRNA expression and quantitative mass spectrometry-based proteome analyses. Interestingly, regulation of mRNAs and their dedicated proteins highly correlated during RA-incubation. Additionally, RA-induced USSC demonstrated a clear separation from native USSC thereby shifting from a proliferating to a metabolic phenotype. Bioinformatic integration of up- and downregulated miRNAs and proteins initially implied a strong impact of the miRNome on the XXL-USSC proteome. However, quantitative proteome analysis of the miRNA contribution on the final proteome after ectopic overexpression of downregulated miR-27a-5p and miR-221-5p or inhibition of upregulated miR-34a-5p, respectively, followed by RA-induction revealed only minor proportions of differentially abundant proteins. In addition, only small overlaps of these regulated proteins with inversely abundant proteins in non-transfected RA-treated USSC were observed. Hence, mRNA transcription rather than miRNA-mediated regulation is the driving force for protein regulation upon RA-incubation, strongly suggesting that miRNAs are fine-tuning regulators rather than active primary switches during RA-induction of USSC.
    • Genome-guided design of a defined mouse microbiota that confers colonization resistance against Salmonella enterica serovar Typhimurium.

      Brugiroux, Sandrine; Beutler, Markus; Pfann, Carina; Garzetti, Debora; Ruscheweyh, Hans-Joachim; Ring, Diana; Diehl, Manuel; Herp, Simone; Lötscher, Yvonne; Hussain, Saib; et al. (2016-11-21)
      Protection against enteric infections, also termed colonization resistance, results from mutualistic interactions of the host and its indigenous microbes. The gut microbiota of humans and mice is highly diverse and it is therefore challenging to assign specific properties to its individual members. Here, we have used a collection of murine bacterial strains and a modular design approach to create a minimal bacterial community that, once established in germ-free mice, provided colonization resistance against the human enteric pathogen Salmonella enterica serovar Typhimurium (S. Tm). Initially, a community of 12 strains, termed Oligo-Mouse-Microbiota (Oligo-MM
    • Genomic variation and strain-specific functional adaptation in the human gut microbiome during early life.

      Vatanen, Tommi; Plichta, Damian R; Somani, Juhi; Münch, Philipp C; Arthur, Timothy D; Hall, Andrew Brantley; Rudolf, Sabine; Oakeley, Edward J; Ke, Xiaobo; Young, Rachel A; et al. (Springer-Nature, 2019-01-01)
      The human gut microbiome matures towards the adult composition during the first years of life and is implicated in early immune development. Here, we investigate the effects of microbial genomic diversity on gut microbiome development using integrated early childhood data sets collected in the DIABIMMUNE study in Finland, Estonia and Russian Karelia. We show that gut microbial diversity is associated with household location and linear growth of children. Single nucleotide polymorphism- and metagenomic assembly-based strain tracking revealed large and highly dynamic microbial pangenomes, especially in the genus Bacteroides, in which we identified evidence of variability deriving from Bacteroides-targeting bacteriophages. Our analyses revealed functional consequences of strain diversity; only 10% of Finnish infants harboured Bifidobacterium longum subsp. infantis, a subspecies specialized in human milk metabolism, whereas Russian infants commonly maintained a probiotic Bifidobacterium bifidum strain in infancy. Groups of bacteria contributing to diverse, characterized metabolic pathways converged to highly subject-specific configurations over the first two years of life. This longitudinal study extends the current view of early gut microbial community assembly based on strain-level genomic variation.
    • Genomics and prevalence of bacterial and archaeal isolates from biogas-producing microbiomes.

      Maus, Irena; Bremges, Andreas; Stolze, Yvonne; Hahnke, Sarah; Cibis, Katharina G; Koeck, Daniela E; Kim, Yong S; Kreubel, Jana; Hassa, Julia; Wibberg, Daniel; et al. (2017)
      To elucidate biogas microbial communities and processes, the application of high-throughput DNA analysis approaches is becoming increasingly important. Unfortunately, the generated data can only be interpreted partially and rudimentarily, since databases lack reference sequences.
    • Hepatitis C reference viruses highlight potent antibody responses and diverse viral functional interactions with neutralising antibodies.

      Bankwitz, Dorothea; Bahai, Akash; Labuhn, Maurice; Doepke, Mandy; Ginkel, Corinne; Khera, Tanvi; Todt, Daniel; Ströh, Luisa J; Dold, Leona; Klein, Florian; et al. (BMJ Publishing Group, 2020-12-15)
      Community-acquired pneumonia by primary or superinfections with Streptococcus pneumoniae can lead to acute respiratory distress requiring mechanical ventilation. The pore-forming toxin pneumolysin alters the alveolar-capillary barrier and causes extravasation of protein-rich fluid into the interstitial pulmonary tissue, which impairs gas exchange. Platelets usually prevent endothelial leakage in inflamed pulmonary tissue by sealing inflammation-induced endothelial gaps. We not only confirm that S pneumoniae induces CD62P expression in platelets, but we also show that, in the presence of pneumolysin, CD62P expression is not associated with platelet activation. Pneumolysin induces pores in the platelet membrane, which allow anti-CD62P antibodies to stain the intracellular CD62P without platelet activation. Pneumolysin treatment also results in calcium efflux, increase in light transmission by platelet lysis (not aggregation), loss of platelet thrombus formation in the flow chamber, and loss of pore-sealing capacity of platelets in the Boyden chamber. Specific anti-pneumolysin monoclonal and polyclonal antibodies inhibit these effects of pneumolysin on platelets as do polyvalent human immunoglobulins. In a post hoc analysis of the prospective randomized phase 2 CIGMA trial, we show that administration of a polyvalent immunoglobulin preparation was associated with a nominally higher platelet count and nominally improved survival in patients with severe S pneumoniae-related community-acquired pneumonia. Although, due to the low number of patients, no definitive conclusion can be made, our findings provide a rationale for investigation of pharmacologic immunoglobulin preparations to target pneumolysin by polyvalent immunoglobulin preparations in severe community-acquired pneumococcal pneumonia, to counteract the risk of these patients becoming ventilation dependent. This trial was registered at www.clinicaltrials.gov as #NCT01420744.
    • The homeobox transcription factor HB9 induces senescence and blocks differentiation in hematopoietic stem and progenitor cells.

      Ingenhag, Deborah; Reister, Sven; Auer, Franziska; Bhatia, Sanil; Wildenhain, Sarah; Picard, Daniel; Remke, Marc; Hoell, Jessica I; Kloetgen, Andreas; Sohn, Dennis; et al. (Ferrata Storti Foundation, 2019-01-01)
      The homeobox gene
    • How to Grow a Computational Biology Lab.

      McHardy, Alice Carolyn; Helmholtz Centre for infection research, Inhoffenstr. 7, 38124 Braunschweig, Germany. (2015-09)
    • Improved metagenome assemblies and taxonomic binning using long-read circular consensus sequence data.

      Frank, J A; Pan, Y; Tooming-Klunderud, A; Eijsink, V G H; McHardy, A C; Nederbragt, A J; Pope, P B; Department of Chemistry, Biotechnology and Food Science, Norwegian University of Life Sciences, Ås, 1432 Norway. (2016)
      DNA assembly is a core methodological step in metagenomic pipelines used to study the structure and function within microbial communities. Here we investigate the utility of Pacific Biosciences long and high accuracy circular consensus sequencing (CCS) reads for metagenomic projects. We compared the application and performance of both PacBio CCS and Illumina HiSeq data with assembly and taxonomic binning algorithms using metagenomic samples representing a complex microbial community. Eight SMRT cells produced approximately 94 Mb of CCS reads from a biogas reactor microbiome sample that averaged 1319 nt in length and 99.7% accuracy. CCS data assembly generated a comparable number of large contigs greater than 1 kb to those assembled from a ~190x larger HiSeq dataset (~18 Gb) produced from the same sample (i.e., approximately 62% of total contigs). Hybrid assemblies using PacBio CCS and HiSeq contigs produced improvements in assembly statistics, including an increase in the average contig length and number of large contigs. The incorporation of CCS data produced significant enhancements in taxonomic binning and genome reconstruction of two dominant phylotypes, which assembled and binned poorly using HiSeq data alone. Collectively, these results illustrate the value of PacBio CCS reads in certain metagenomics applications.
    • In Silico Vaccine Strain Prediction for Human Influenza Viruses.

      Klingen, Thorsten R; Reimering, Susanne; Guzmán, Carlos A; McHardy, Alice C; Braunschweiger Zentrum für Systembiologie, Rebenring 56,38106 Braunschweig, Germany. (2017-10-09)
      Vaccines preventing seasonal influenza infections save many lives every year; however, due to rapid viral evolution, they have to be updated frequently to remain effective. To identify appropriate vaccine strains, the World Health Organization (WHO) operates a global program that continually generates and interprets surveillance data. Over the past decade, sophisticated computational techniques, drawing from multiple theoretical disciplines, have been developed that predict viral lineages rising to predominance, assess their suitability as vaccine strains, link genetic to antigenic alterations, as well as integrate and visualize genetic, epidemiological, structural, and antigenic data. These could form the basis of an objective and reproducible vaccine strain-selection procedure utilizing the complex, large-scale data types from surveillance. To this end, computational techniques should already be incorporated into the vaccine-selection process in an independent, parallel track, and their performance continuously evaluated.
    • An Integrated Metagenome Catalog Reveals New Insights into the Murine Gut Microbiome.

      Lesker, Till R; Durairaj, Abilash C; Gálvez, Eric J C; Lagkouvardos, Ilias; Baines, John F; Clavel, Thomas; Sczyrba, Alexander; McHardy, Alice C; Strowig, Till; HZI,Helmholtz-Zentrum für Infektionsforschung GmbH, Inhoffenstr. 7,38124 Braunschweig, Germany.
      The complexity of host-associated microbial ecosystems requires host-specific reference catalogs to survey the functions and diversity of these communities. We generate a comprehensive resource, the integrated mouse gut metagenome catalog (iMGMC), comprising 4.6 million unique genes and 660 metagenome-assembled genomes (MAGs), many (485 MAGs, 73%) of which are linked to reconstructed full-length 16S rRNA gene sequences. iMGMC enables unprecedented coverage and taxonomic resolution of the mouse gut microbiota; i.e., more than 92% of MAGs lack species-level representatives in public repositories (<95% ANI match). The integration of MAGs and 16S rRNA gene data allows more accurate prediction of functional profiles of communities than predictions based on 16S rRNA amplicons alone. Accompanying iMGMC, we provide a set of MAGs representing 1,296 gut bacteria obtained through complementary assembly strategies. We envision that integrated resources such as iMGMC, together with MAG collections, will enhance the resolution of numerous existing and future sequencing-based studies.
    • Investigation of different nitrogen reduction routes and their key microbial players in wood chip-driven denitrification beds.

      Grießmeier, Victoria; Bremges, Andreas; McHardy, Alice Carolyn; Gescher, Johannes; BRICS, Braunschweiger Zentrum für Systembiologie, Rebenring 56, 38106 Braunschweig, Germany. (2017-12-05)
      Field denitrification beds containing polymeric plant material are increasingly used to eliminate nitrate from agricultural drainage water. They mirror a number of anoxic ecosystems. However, knowledge of the microbial composition, the interaction of microbial species, and the carbon degradation processes within these denitrification systems is sparse. This study revealed several new aspects of the carbon and nitrogen cycle, and these findings can be correlated with the dynamics of the microbial community composition and the activity of key species. Members of the order Pseudomonadales seem to be important players in denitrification at low nitrate concentrations, while a switch to higher nitrate concentrations seems to select for members of the orders Rhodocyclales and Rhizobiales. We observed that high nitrate loading rates lead to an unpredictable transition of the community's activity from denitrification to dissimilatory reduction of nitrate to ammonium (DNRA). This transition is mirrored by an increase in transcripts of the nitrite reductase gene nrfAH and the increase correlates with the activity of members of the order Ignavibacteriales. Denitrification reactors sustained the development of an archaeal community consisting of members of the Bathyarchaeota and methanogens belonging to the Euryarchaeota. Unexpectedly, the activity of the methanogens positively correlated with the nitrate loading rates.
    • MicroPheno: predicting environments and host phenotypes from 16S rRNA gene sequencing using a k-mer based representation of shallow sub-samples.

      Asgari, Ehsaneddin; Garakani, Kiavash; McHardy, Alice C; Mofrad, Mohammad R K; BRICS, Braunschweiger Zentrum für Systembiologie, Rebenring 56,38106 Braunschweig, Germany. (Oxford University Press, 2018-07-01)
      Microbial communities play important roles in the function and maintenance of various biosystems, ranging from the human body to the environment. A major challenge in microbiome research is the classification of microbial communities of different environments or host phenotypes. The most common and cost-effective approach for such studies to date is 16S rRNA gene sequencing. Recent falls in sequencing costs have increased the demand for simple, efficient and accurate methods for rapid detection or diagnosis with proved applications in medicine, agriculture and forensic science. We describe a reference- and alignment-free approach for predicting environments and host phenotypes from 16S rRNA gene sequencing based on k-mer representations that benefits from a bootstrapping framework for investigating the sufficiency of shallow sub-samples. Deep learning methods as well as classical approaches were explored for predicting environments and host phenotypes. A k-mer distribution of shallow sub-samples outperformed Operational Taxonomic Unit (OTU) features in the tasks of body-site identification and Crohn's disease prediction. Aside from being more accurate, using k-mer features in shallow sub-samples (i) allows skipping the computationally costly sequence alignments required in OTU-picking and (ii) provides a proof of concept for the sufficiency of shallow and short-length 16S rRNA sequencing for phenotype prediction. In addition, k-mer features predicted representative 16S rRNA gene sequences of 18 ecological environments and 5 organismal environments with high macro-F1 scores of 0.88 and 0.87, respectively. For large datasets, deep learning outperformed classical methods such as Random Forest and Support Vector Machine. The software and datasets are available at https://llp.berkeley.edu/micropheno. Supplementary data are available at Bioinformatics online.
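The k-mer representation of shallow sub-samples described above can be illustrated with a small sketch. This is not the MicroPheno code: the function names, the default k, the fixed random seed, and the toy reads are assumptions made here for illustration. It draws a shallow sub-sample of reads and turns it into a normalized k-mer frequency vector, the kind of alignment-free feature vector a classifier could then consume.

```python
import random
from collections import Counter
from itertools import product


def shallow_subsample(reads, n, seed=0):
    """Draw at most n reads without replacement (seed fixed for repeatability)."""
    rng = random.Random(seed)
    return rng.sample(reads, min(n, len(reads)))


def kmer_profile(reads, k=3, alphabet="ACGT"):
    """Normalized k-mer frequencies over all reads, in lexicographic k-mer order.

    K-mers containing characters outside the alphabet (e.g. N) are skipped.
    """
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if all(c in alphabet for c in kmer):
                counts[kmer] += 1
    total = sum(counts.values())
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    return [counts[km] / total if total else 0.0 for km in kmers]


# Hypothetical toy reads standing in for 16S rRNA gene sequences.
reads = ["ACGTACGT", "TTGACCGA", "ACGNACGT", "GGGTTTAA"]
sample = shallow_subsample(reads, n=2)
features = kmer_profile(sample, k=3)
print(len(features))  # one entry per possible 3-mer over ACGT
```

The vector length depends only on k and the alphabet, not on the number or length of reads, so samples sequenced at different depths map to features of the same dimension without any alignment or OTU-picking step.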