• YBX1 Indirectly Targets Heterochromatin-Repressed Inflammatory Response-Related Apoptosis Genes through Regulating CBX5 mRNA.

      Kloetgen, Andreas; Duggimpudi, Sujitha; Schuschel, Konstantin; Hezaveh, Kebria; Picard, Daniel; Schaal, Heiner; Remke, Marc; Klusmann, Jan-Henning; Borkhardt, Arndt; McHardy, Alice C; et al. (MDPI, 2020-06-23)
      Medulloblastomas arise from undifferentiated precursor cells in the cerebellum and account for about 20% of all solid brain tumors during childhood; standard therapies include radiation and chemotherapy, which oftentimes come with severe impairment of the cognitive development of the young patients. Here, we show that the posttranscriptional regulator Y-box binding protein 1 (YBX1), a DNA- and RNA-binding protein, acts as an oncogene in medulloblastomas by regulating cellular survival and apoptosis. We observed different cellular responses upon YBX1 knockdown in several medulloblastoma cell lines, with significantly altered transcription and subsequent apoptosis rates. Mechanistically, PAR-CLIP for YBX1 and integration with RNA-Seq data uncovered direct posttranscriptional control of the heterochromatin-associated gene CBX5; upon YBX1 knockdown and subsequent CBX5 mRNA instability, heterochromatin-regulated genes involved in inflammatory response, apoptosis and death receptor signaling were de-repressed. Thus, YBX1 acts as an oncogene in medulloblastoma through indirect transcriptional regulation of inflammatory genes regulating apoptosis and represents a promising novel therapeutic target in this tumor entity.
    • Functional omics analyses reveal only minor effects of microRNAs on human somatic stem cell differentiation.

      Schira-Heinen, Jessica; Czapla, Agathe; Hendricks, Marion; Kloetgen, Andreas; Wruck, Wasco; Adjaye, James; Kögler, Gesine; Werner Müller, Hans; Stühler, Kai; Trompeter, Hans-Ingo; et al. (NPG, 2020-02-24)
      The contribution of microRNA-mediated posttranscriptional regulation on the final proteome in differentiating cells remains elusive. Here, we evaluated the impact of microRNAs (miRNAs) on the proteome of human umbilical cord blood-derived unrestricted somatic stem cells (USSC) during retinoic acid (RA) differentiation by a systemic approach using next generation sequencing analysing mRNA and miRNA expression and quantitative mass spectrometry-based proteome analyses. Interestingly, regulation of mRNAs and their dedicated proteins highly correlated during RA-incubation. Additionally, RA-induced USSC demonstrated a clear separation from native USSC thereby shifting from a proliferating to a metabolic phenotype. Bioinformatic integration of up- and downregulated miRNAs and proteins initially implied a strong impact of the miRNome on the XXL-USSC proteome. However, quantitative proteome analysis of the miRNA contribution on the final proteome after ectopic overexpression of downregulated miR-27a-5p and miR-221-5p or inhibition of upregulated miR-34a-5p, respectively, followed by RA-induction revealed only minor proportions of differentially abundant proteins. In addition, only small overlaps of these regulated proteins with inversely abundant proteins in non-transfected RA-treated USSC were observed. Hence, mRNA transcription rather than miRNA-mediated regulation is the driving force for protein regulation upon RA-incubation, strongly suggesting that miRNAs are fine-tuning regulators rather than active primary switches during RA-induction of USSC.
    • Eleven grand challenges in single-cell data science.

      Lähnemann, David; Köster, Johannes; Szczurek, Ewa; McCarthy, Davis J; Hicks, Stephanie C; Robinson, Mark D; Vallejos, Catalina A; Campbell, Kieran R; Beerenwinkel, Niko; Mahfouz, Ahmed; et al. (BMC, 2020-02-07)
      The recent boom in microfluidics and combinatorial indexing strategies, combined with low sequencing costs, has empowered single-cell sequencing technology. Thousands, or even millions, of cells analyzed in a single experiment amount to a data revolution in single-cell biology and pose unique data science problems. Here, we outline eleven challenges that will be central to bringing this emerging field of single-cell data science forward. For each challenge, we highlight motivating research questions, review prior work, and formulate open problems. This compendium is for established researchers, newcomers, and students alike, highlighting interesting and rewarding problems for the coming years.
    • Phylogeographic reconstruction using air transportation data and its application to the 2009 H1N1 influenza A pandemic.

      Reimering, Susanne; Muñoz, Sebastian; McHardy, Alice C; BRICS, Braunschweiger Zentrum für Systembiologie, Rebenring 56,38106 Braunschweig, Germany. (PLOS, 2020-02-01)
      Influenza A viruses cause seasonal epidemics and occasional pandemics in the human population. While the worldwide circulation of seasonal influenza is at least partly understood, the exact migration patterns between countries, states or cities are not well studied. Here, we use the Sankoff algorithm for parsimonious phylogeographic reconstruction together with effective distances based on a worldwide air transportation network. By first simulating geographic spread and then phylogenetic trees and genetic sequences, we confirmed that reconstructions with effective distances inferred phylogeographic spread more accurately than reconstructions with geographic distances and Bayesian reconstructions with BEAST that do not use any distance information, and led to results comparable to the Bayesian reconstruction using distance information via a generalized linear model. Our method extends Bayesian methods that estimate rates from the data by using fine-grained locations like airports and inferring intermediate locations not observed among sampled isolates. When applied to sequence data of the pandemic H1N1 influenza A virus in 2009, our approach correctly inferred the origin and proposed airports mainly involved in the spread of the virus. In the case of a novel outbreak, this approach allows rapid analysis of sequence data and inference of origin and spread routes to improve disease surveillance and control.
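The Sankoff step named in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the tree encoding (nested tuples with string leaves), the three locations `A`, `B`, `C`, and the cost values standing in for effective distances are all hypothetical and chosen only for illustration.

```python
def sankoff(tree, leaf_state, states, cost):
    """Bottom-up pass of Sankoff parsimony.

    Returns {state: minimal total cost of the subtree if its root is
    assigned that state}. Leaves are strings, internal nodes are tuples.
    """
    if isinstance(tree, str):
        # A leaf is fixed to its sampled location; other states are impossible.
        return {s: (0.0 if s == leaf_state[tree] else float("inf")) for s in states}
    child_tables = [sankoff(child, leaf_state, states, cost) for child in tree]
    # For each candidate state, add the cheapest transition into every child.
    return {
        s: sum(min(cost[s][t] + table[t] for t in states) for table in child_tables)
        for s in states
    }


# Hypothetical symmetric costs between three locations (assumed values).
states = ["A", "B", "C"]
cost = {
    "A": {"A": 0, "B": 1, "C": 2},
    "B": {"A": 1, "B": 0, "C": 1},
    "C": {"A": 2, "B": 1, "C": 0},
}
tree = (("x", "y"), "z")                   # hypothetical tree with three isolates
leaf_state = {"x": "A", "y": "B", "z": "A"}

root_table = sankoff(tree, leaf_state, states, cost)
best_root = min(root_table, key=root_table.get)
print(root_table, best_root)
```

Because every state is scored at every internal node, a location that was never sampled (here `C`) can still be assigned to an ancestor if the cost matrix favors it. A full reconstruction would add a top-down traceback to assign a location to each internal node.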
    • CAMITAX: Taxon labels for microbial genomes.

      Bremges, Andreas; Fritz, Adrian; McHardy, Alice C; BRICS, Braunschweiger Zentrum für Systembiologie, Rebenring 56,38106 Braunschweig, Germany. (Oxford Academic, 2020-01-01)
      BACKGROUND: The number of microbial genome sequences is increasing exponentially, especially thanks to recent advances in recovering complete or near-complete genomes from metagenomes and single cells. Assigning reliable taxon labels to genomes is key and often a prerequisite for downstream analyses. FINDINGS: We introduce CAMITAX, a scalable and reproducible workflow for the taxonomic labelling of microbial genomes recovered from isolates, single cells, and metagenomes. CAMITAX combines genome distance-, 16S ribosomal RNA gene-, and gene homology-based taxonomic assignments with phylogenetic placement. It uses Nextflow to orchestrate reference databases and software containers and thus combines ease of installation and use with computational reproducibility. We evaluated the method on several hundred metagenome-assembled genomes with high-quality taxonomic annotations from the TARA Oceans project, and we show that the ensemble classification method in CAMITAX improved on all individual methods across tested ranks. CONCLUSIONS: While we initially developed CAMITAX to aid the Critical Assessment of Metagenome Interpretation (CAMI) initiative, it evolved into a comprehensive software package to reliably assign taxon labels to microbial genomes. CAMITAX is available under Apache License 2.0 at https://github.com/CAMI-challenge/CAMITAX.
    • The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens.

      Zhou, Naihui; Jiang, Yuxiang; Bergquist, Timothy R; Lee, Alexandra J; Kacsoh, Balint Z; Crocker, Alex W; Lewis, Kimberley A; Georghiou, George; Nguyen, Huy N; Hamid, Md Nafiz; et al. (BMC, 2019-11-19)
      BACKGROUND: The Critical Assessment of Functional Annotation (CAFA) is an ongoing, global, community-driven effort to evaluate and improve the computational annotation of protein function. RESULTS: Here, we report on the results of the third CAFA challenge, CAFA3, that featured an expanded analysis over the previous CAFA rounds, both in terms of volume of data analyzed and the types of analysis performed. In a major new development, computational predictions and assessment goals drove some of the experimental assays, resulting in new functional annotations for more than 1000 genes. Specifically, we performed experimental whole-genome mutation screening in Candida albicans and Pseudomonas aeruginosa genomes, which provided us with genome-wide experimental data for genes associated with biofilm formation and motility. We further performed targeted assays on selected genes in Drosophila melanogaster, which we suspected of being involved in long-term memory. CONCLUSION: We conclude that while predictions of the molecular function and biological process annotations have slightly improved over time, those of the cellular component have not. Term-centric prediction of experimental annotations remains equally challenging; although the performance of the top methods is significantly better than the expectations set by baseline methods in C. albicans and D. melanogaster, it leaves considerable room and need for improvement. Finally, we report that the CAFA community now involves a broad range of participants with expertise in bioinformatics, biological experimentation, biocuration, and bio-ontologies, working together to improve functional annotation, computational function prediction, and our ability to manage big data in the era of large experimental screens.
    • Pediatric ALL relapses after allo-SCT show high individuality, clonal dynamics, selective pressure, and druggable targets.

      Hoell, Jessica I; Ginzel, Sebastian; Kuhlen, Michaela; Kloetgen, Andreas; Gombert, Michael; Fischer, Ute; Hein, Daniel; Demir, Salih; Stanulla, Martin; Schrappe, Martin; et al. (American Society of Hematology, 2019-10-22)
      Survival of patients with pediatric acute lymphoblastic leukemia (ALL) after allogeneic hematopoietic stem cell transplantation (allo-SCT) is mainly compromised by leukemia relapse, carrying dismal prognosis. As novel individualized therapeutic approaches are urgently needed, we performed whole-exome sequencing of leukemic blasts of 10 children with post-allo-SCT relapses with the aim of thoroughly characterizing the mutational landscape and identifying druggable mutations. We found that post-allo-SCT ALL relapses display highly diverse and mostly patient-individual genetic lesions. Moreover, mutational cluster analysis showed substantial clonal dynamics during leukemia progression from initial diagnosis to relapse after allo-SCT. Only very few alterations stayed constant over time. This dynamic clonality was exemplified by the detection of thiopurine resistance-mediating mutations in the nucleotidase NT5C2 in 3 patients' first relapses, which disappeared in the post-allo-SCT relapses on relief of selective pressure of maintenance chemotherapy. Moreover, we identified TP53 mutations in 4 of 10 patients after allo-SCT, reflecting acquired chemoresistance associated with selective pressure of prior antineoplastic treatment. Finally, in 9 of 10 children's post-allo-SCT relapse, we found alterations in genes for which targeted therapies with novel agents are readily available. We could show efficient targeting of leukemic blasts by APR-246 in 2 patients carrying TP53 mutations. Our findings shed light on the genetic basis of post-allo-SCT relapse and may pave the way for unraveling novel therapeutic strategies in this challenging situation.
    • Structures and functions linked to genome-wide adaptation of human influenza A viruses.

      Klingen, Thorsten R; Loers, Jens; Stanelle-Bertram, Stephanie; Gabriel, Gülsah; McHardy, Alice C; BRICS, Braunschweiger Zentrum für Systembiologie, Rebenring 56,38106 Braunschweig, Germany. (Springer-Nature, 2019-04-18)
      Human influenza A viruses elicit short-term respiratory infections with considerable mortality and morbidity. While H3N2 viruses have circulated for more than 50 years, the recent introduction of pH1N1 viruses presents an excellent opportunity for a comparative analysis of the genome-wide evolutionary forces acting on both subtypes. Here, we inferred patches of sites relevant for adaptation, i.e. being under positive selection, on eleven viral protein structures, from all available data since 1968 and correlated these with known functional properties. Overall, pH1N1 have more patches than H3N2 viruses, especially in the viral polymerase complex, while antigenic evolution is more apparent for H3N2 viruses. In both subtypes, NS1 has the highest patch and patch site frequency, indicating that NS1-mediated viral attenuation of host inflammatory responses is a continuously intensifying process, elevated even in the longtime-circulating subtype H3N2. We confirmed the resistance-causing effects of two pH1N1 changes against oseltamivir in NA activity assays, demonstrating the value of the resource for discovering functionally relevant changes. Our results represent an atlas of protein regions and sites with links to host adaptation, antiviral drug resistance and immune evasion for both subtypes for further study.
    • Probabilistic variable-length segmentation of protein sequences for discriminative motif discovery (DiMotif) and sequence embedding (ProtVecX).

      Asgari, Ehsaneddin; McHardy, Alice C; Mofrad, Mohammad R K; BRICS, Braunschweiger Zentrum für Systembiologie, Rebenring 56,38106 Braunschweig, Germany. (Springer Nature, 2019-03-05)
    • Assessing taxonomic metagenome profilers with OPAL.

      Meyer, Fernando; Bremges, Andreas; Belmann, Peter; Janssen, Stefan; McHardy, Alice C; Koslicki, David; BRICS, Braunschweiger Zentrum für Systembiologie, Rebenring 56,38106 Braunschweig, Germany. (BioMedCentral, 2019-03-04)
      The explosive growth in taxonomic metagenome profiling methods over the past years has created a need for systematic comparisons using relevant performance criteria. The Open-community Profiling Assessment tooL (OPAL) implements commonly used performance metrics, including those of the first challenge of the initiative for the Critical Assessment of Metagenome Interpretation (CAMI), together with convenient visualizations. In addition, we perform in-depth performance comparisons with seven profilers on datasets of CAMI and the Human Microbiome Project. OPAL is freely available at https://github.com/CAMI-challenge/OPAL.
    • CAMISIM: simulating metagenomes and microbial communities.

      Fritz, Adrian; Hofmann, Peter; Majda, Stephan; Dahms, Eik; Dröge, Johannes; Fiedler, Jessika; Lesker, Till R; Belmann, Peter; DeMaere, Matthew Z; Darling, Aaron E; et al. (BioMedCentral, 2019-02-08)
      Shotgun metagenome data sets of microbial communities are highly diverse, not only due to the natural variation of the underlying biological systems, but also due to differences in laboratory protocols, replicate numbers, and sequencing technologies. Accordingly, to effectively assess the performance of metagenomic analysis software, a wide range of benchmark data sets are required. We describe the CAMISIM microbial community and metagenome simulator. The software can model different microbial abundance profiles, multi-sample time series, and differential abundance studies, includes real and simulated strain-level diversity, and generates second- and third-generation sequencing data from taxonomic profiles or de novo. Gold standards are created for sequence assembly, genome binning, taxonomic binning, and taxonomic profiling. CAMISIM generated the benchmark data sets of the first CAMI challenge. For two simulated multi-sample data sets of the human and mouse gut microbiomes, we observed high functional congruence to the real data. As further applications, we investigated the effect of varying evolutionary genome divergence, sequencing depth, and read error profiles on two popular metagenome assemblers, MEGAHIT and metaSPAdes, on several thousand small data sets generated with CAMISIM. CAMISIM can simulate a wide variety of microbial communities and metagenome data sets together with standards of truth for method evaluation. All data sets and the software are freely available at https://github.com/CAMI-challenge/CAMISIM.
    • Toward unrestricted use of public genomic data.

      Amann, Rudolf I; Baichoo, Shakuntala; Blencowe, Benjamin J; Bork, Peer; Borodovsky, Mark; Brooksbank, Cath; Chain, Patrick S G; Colwell, Rita R; Daffonchio, Daniele G; Danchin, Antoine; et al. (AAAS, 2019-01-25)
      Despite some notable progress in data sharing policies and practices, restrictions are still often placed on the open and unconditional use of various genomic data after they have received official approval for release to the public domain or to public databases. These restrictions, which often conflict with the terms and conditions of the funding bodies who supported the release of those data for the benefit of the scientific community and society, are perpetuated by the lack of clear guiding rules for data usage. Existing guidelines for data released to the public domain recognize but fail to resolve tensions between the importance of free and unconditional use of these data and the “right” of the data producers to the first publication. This self-contradiction has resulted in a loophole that allows different interpretations and a continuous debate between data producers and data users on the use of public data. We argue that the publicly available data should be treated as open data, a shared resource with unrestricted use for analysis, interpretation, and publication.
    • Genomic variation and strain-specific functional adaptation in the human gut microbiome during early life.

      Vatanen, Tommi; Plichta, Damian R; Somani, Juhi; Münch, Philipp C; Arthur, Timothy D; Hall, Andrew Brantley; Rudolf, Sabine; Oakeley, Edward J; Ke, Xiaobo; Young, Rachel A; et al. (Springer-Nature, 2019-01-01)
      The human gut microbiome matures towards the adult composition during the first years of life and is implicated in early immune development. Here, we investigate the effects of microbial genomic diversity on gut microbiome development using integrated early childhood data sets collected in the DIABIMMUNE study in Finland, Estonia and Russian Karelia. We show that gut microbial diversity is associated with household location and linear growth of children. Single nucleotide polymorphism- and metagenomic assembly-based strain tracking revealed large and highly dynamic microbial pangenomes, especially in the genus Bacteroides, in which we identified evidence of variability deriving from Bacteroides-targeting bacteriophages. Our analyses revealed functional consequences of strain diversity; only 10% of Finnish infants harboured Bifidobacterium longum subsp. infantis, a subspecies specialized in human milk metabolism, whereas Russian infants commonly maintained a probiotic Bifidobacterium bifidum strain in infancy. Groups of bacteria contributing to diverse, characterized metabolic pathways converged to highly subject-specific configurations over the first two years of life. This longitudinal study extends the current view of early gut microbial community assembly based on strain-level genomic variation.
    • The homeobox transcription factor HB9 induces senescence and blocks differentiation in hematopoietic stem and progenitor cells.

      Ingenhag, Deborah; Reister, Sven; Auer, Franziska; Bhatia, Sanil; Wildenhain, Sarah; Picard, Daniel; Remke, Marc; Hoell, Jessica I; Kloetgen, Andreas; Sohn, Dennis; et al. (Ferrata Storti Foundation, 2019-01-01)
      The homeobox gene
    • Evolutionary model for the unequal segregation of high copy plasmids.

      Münch, Karin; Münch, Richard; Biedendieck, Rebekka; Jahn, Dieter; Müller, Johannes; BRICS, Braunschweiger Zentrum für Systembiologie, Rebenring 56,38106 Braunschweig, Germany. (PLOS, 2019-01-01)
      Plasmids are extrachromosomal DNA elements of microorganisms encoding beneficial genetic information. They were thought to be equally distributed to daughter cells during cell division. Here we use mathematical modeling to investigate the evolutionary stability of plasmid segregation for high-copy plasmids (plasmids that are present in up to several hundred copies per cell) carrying antibiotic resistance genes. Evolutionarily stable strategies (ESS) are determined by numerical analysis of a plasmid-load structured population model. The theory predicts that the evolutionarily stable segregation strategy of a cell depends on the plasmid copy number: for low and medium plasmid load, both daughters receive on average an equal share of plasmids, while in the case of high plasmid load, one daughter obtains distinctively and systematically more plasmids. These findings are in good agreement with recent experimental results. We discuss the interpretation and practical consequences.
    • Reproducible Colonization of Germ-Free Mice With the Oligo-Mouse-Microbiota in Different Animal Facilities.

      Eberl, Claudia; Ring, Diana; Münch, Philipp C; Beutler, Markus; Basic, Marijana; Slack, Emma Caroline; Schwarzer, Martin; Srutkova, Dagmar; Lange, Anna; Frick, Julia S; et al. (Frontiers, 2019-01-01)
      The Oligo-Mouse-Microbiota (OMM12) is a recently developed synthetic bacterial community for functional microbiome research in mouse models (Brugiroux et al., 2016). To date, the OMM12 model has been established in several germ-free mouse facilities worldwide and is employed to address a growing variety of research questions related to infection biology, mucosal immunology, microbial ecology and host-microbiome metabolic cross-talk. The OMM12 consists of 12 sequenced and publicly available strains isolated from mice, representing five bacterial phyla that are naturally abundant in the murine gastrointestinal tract (Lagkouvardos et al., 2016). Under germ-free conditions, the OMM12 colonizes mice stably over multiple generations. Here, we investigated whether stably colonized OMM12 mouse lines could be reproducibly established in different animal facilities. Germ-free C57Bl/6J mice were inoculated with a frozen mixture of the OMM12 strains. Within 2 weeks after application, the OMM12 community reached the same stable composition in all facilities, as determined by fecal microbiome analysis. We show that a second application of the OMM12 strains after 72 h leads to a more stable community composition than a single application. The availability of such protocols for reliable de novo generation of gnotobiotic rodents will certainly contribute to increasing experimental reproducibility in biomedical research.
    • A Fréchet tree distance measure to compare phylogeographic spread paths across trees.

      Reimering, Susanne; Muñoz, Sebastian; McHardy, Alice C; BRICS, Braunschweiger Zentrum für Systembiologie, Rebenring 56,38106 Braunschweig, Germany. (Nature publishing group, 2018-11-19)
      Phylogeographic methods reconstruct the origin and spread of taxa by inferring locations for internal nodes of the phylogenetic tree from sampling locations of genetic sequences. This is commonly applied to study pathogen outbreaks and spread. To evaluate such reconstructions, the inferred spread paths from root to leaf nodes should be compared to other methods or references. Usually, ancestral state reconstructions are evaluated by node-wise comparisons, therefore requiring the same tree topology, which is usually unknown. Here, we present a method for comparing phylogeographies across different trees inferred from the same taxa. We compare paths of locations by calculating discrete Fréchet distances. By correcting the distances by the number of paths going through a node, we define the Fréchet tree distance as a distance measure between phylogeographies. As an application, we compare phylogeographic spread patterns on trees inferred with different methods from hemagglutinin sequences of H5N1 influenza viruses, finding that both tree inference and ancestral reconstruction cause variation in phylogeographic spread that is not directly reflected by topological differences. The method is suitable for comparing phylogeographies inferred with different tree or phylogeographic inference methods to each other or to a known ground truth, thus enabling a quality assessment of such techniques.
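The discrete Fréchet distance that the abstract above uses to compare paths of locations can be sketched with the standard dynamic-programming recurrence. This is an illustrative sketch only: it covers the path-to-path distance, not the correction by the number of paths through a node that defines the Fréchet tree distance, and the planar coordinates and Euclidean `euclid` helper are assumptions made for the example.

```python
from math import hypot


def discrete_frechet(path_a, path_b, dist):
    """Discrete Fréchet distance between two non-empty sequences of points."""
    n, m = len(path_a), len(path_b)
    # ca[i][j]: coupling distance between the prefixes path_a[:i+1] and path_b[:j+1].
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = dist(path_a[i], path_b[j])
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                # Advance on either path or on both, keeping the smallest bottleneck.
                ca[i][j] = max(min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1]), d)
    return ca[n - 1][m - 1]


def euclid(p, q):
    """Euclidean distance between two 2-D points (assumed location encoding)."""
    return hypot(p[0] - q[0], p[1] - q[1])


# Two hypothetical root-to-leaf location paths; the second visits an extra stop.
direct = [(0, 0), (2, 0)]
via_stop = [(0, 0), (1, 0), (2, 0)]
print(discrete_frechet(direct, via_stop, euclid))
```

Any distance between locations can be passed as `dist`, so geographic or network-based distances could be substituted without changing the recurrence.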
    • Transcriptome-wide analysis uncovers the targets of the RNA-binding protein MSI2 and effects of MSI2's RNA-binding activity on IL-6 signaling.

      Duggimpudi, Sujitha; Kloetgen, Andreas; Maney, Sathish Kumar; Münch, Philipp C; Hezaveh, Kebria; Shaykhalishahi, Hamed; Hoyer, Wolfgang; McHardy, Alice C; Lang, Philipp A; Borkhardt, Arndt; et al. (American Society for Biochemistry and Molecular Biology, 2018-08-20)
      The RNA-binding protein Musashi 2 (MSI2) has emerged as an important regulator in cancer initiation, progression, and drug resistance. Translocations and deregulation of the MSI2 gene are diagnostic of certain cancers, including chronic myeloid leukemia (CML) with translocation t(7;17), acute myeloid leukemia (AML) with translocation t(10;17), and some cases of B-precursor acute lymphoblastic leukemia (pB-ALL). To better understand the function of MSI2 in leukemia, the mRNA targets that are bound and regulated by MSI2 and their MSI2-binding motifs need to be identified. To this end, using photoactivatable ribonucleoside cross-linking and immunoprecipitation (PAR-CLIP) and the multiple EM for motif elicitation (MEME) analysis tool, here we identified MSI2's mRNA targets and the consensus RNA-recognition element (RRE) motif recognized by MSI2 (UUAG). Of note, MSI2 knockdown altered the expression of several genes with roles in eukaryotic initiation factor 2 (eIF2), hepatocyte growth factor (HGF), and epidermal growth factor (EGF) signaling pathways. We also show that MSI2 regulates classic interleukin-6 (IL-6) signaling by promoting the degradation of the mRNA of IL-6 signal transducer (IL6ST or GP130), which, in turn, affected the phosphorylation statuses of signal transducer and activator of transcription 3 (STAT3) and the mitogen-activated protein kinase ERK. In summary, we have identified multiple MSI2-regulated mRNAs and provided evidence that MSI2 controls IL6ST activity that controls oncogenic signaling networks. Our findings may help inform strategies for unraveling the role of MSI2 in leukemia to pave the way for the development of targeted therapies.
    • Modular Traits of the Rhizobiales Root Microbiota and Their Evolutionary Relationship with Symbiotic Rhizobia.

      Garrido-Oter, Ruben; Nakano, Ryohei Thomas; Dombrowski, Nina; Ma, Ka-Wai; McHardy, Alice C; Schulze-Lefert, Paul; BRICS, Braunschweiger Zentrum für Systembiologie, Rebenring 56,38106 Braunschweig, Germany. (Elsevier, 2018-07-11)
      Animal-microbe facultative symbioses play a fundamental role in ecosystem and organismal health. Yet, due to the flexible nature of their association, the selection pressures that act on animals and their facultative symbionts remain elusive. Here we apply experimental evolution to Drosophila melanogaster associated with its growth-promoting symbiont Lactobacillus plantarum, representing a well-established model of facultative symbiosis. We find that the diet of the host, rather than the host itself, is a predominant driving force in the evolution of this symbiosis. Furthermore, we identify a mechanism resulting from the bacterium's adaptation to the diet, which confers growth benefits to the colonized host. Our study reveals that bacterial adaptation to the host's diet may be the foremost step in determining the evolutionary course of a facultative animal-microbe symbiosis.
    • MicroPheno: predicting environments and host phenotypes from 16S rRNA gene sequencing using a k-mer based representation of shallow sub-samples.

      Asgari, Ehsaneddin; Garakani, Kiavash; McHardy, Alice C; Mofrad, Mohammad R K; BRICS, Braunschweiger Zentrum für Systembiologie, Rebenring 56,38106 Braunschweig, Germany. (Oxford University Press, 2018-07-01)
      Microbial communities play important roles in the function and maintenance of various biosystems, ranging from the human body to the environment. A major challenge in microbiome research is the classification of microbial communities of different environments or host phenotypes. The most common and cost-effective approach for such studies to date is 16S rRNA gene sequencing. Recent falls in sequencing costs have increased the demand for simple, efficient and accurate methods for rapid detection or diagnosis with proven applications in medicine, agriculture and forensic science. We describe a reference- and alignment-free approach for predicting environments and host phenotypes from 16S rRNA gene sequencing based on k-mer representations that benefits from a bootstrapping framework for investigating the sufficiency of shallow sub-samples. Deep learning methods as well as classical approaches were explored for predicting environments and host phenotypes. A k-mer distribution of shallow sub-samples outperformed Operational Taxonomic Unit (OTU) features in the tasks of body-site identification and Crohn's disease prediction. Aside from being more accurate, using k-mer features in shallow sub-samples (i) allows skipping computationally costly sequence alignments required in OTU-picking and (ii) provides a proof of concept for the sufficiency of shallow and short-length 16S rRNA sequencing for phenotype prediction. In addition, k-mer features predicted representative 16S rRNA gene sequences of 18 ecological environments, and 5 organismal environments with high macro-F1 scores of 0.88 and 0.87. For large datasets, deep learning outperformed classical methods such as Random Forest and Support Vector Machine. The software and datasets are available at https://llp.berkeley.edu/micropheno. Supplementary data are available at Bioinformatics online.
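The k-mer representation of a shallow sub-sample described in the abstract above might look roughly like the following. This is a minimal sketch under stated assumptions, not the MicroPheno code: the function names `kmer_profile` and `shallow_subsample`, the fixed `ACGT` alphabet, the skipping of k-mers with ambiguous bases, and the toy reads are all hypothetical.

```python
import random
from collections import Counter
from itertools import product


def shallow_subsample(sequences, n, seed=0):
    """Draw at most n sequences without replacement (a 'shallow' sub-sample)."""
    rng = random.Random(seed)
    return rng.sample(sequences, min(n, len(sequences)))


def kmer_profile(sequences, k):
    """Normalized k-mer frequency vector over all 4**k DNA k-mers."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip k-mers containing ambiguous bases such as N (an assumption).
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    total = sum(counts.values())
    vocabulary = ["".join(p) for p in product("ACGT", repeat=k)]
    return [counts[kmer] / total if total else 0.0 for kmer in vocabulary]


# Toy reads standing in for 16S rRNA gene sequences of one sample.
reads = ["ACGTACGT", "TTGACCGA", "ACGTNACG", "GGCATTAC"]
sample = shallow_subsample(reads, n=2)
features = kmer_profile(sample, k=3)
print(len(features), round(sum(features), 6))
```

The resulting fixed-length vector needs no reference database or alignment and can be passed to any classifier; repeating the sub-sampling with different seeds is one way to probe how stable the representation is at a given depth.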