A beginner’s guide to integrating multi-omics data from microbial communities

Heintz-Buschart, Anna; Westerhuis, Johan A.

doi:10.1042/bio_2022_100

Microbial communities are immensely important and occur nearly everywhere, but their inner workings are still being discovered. The early years of microbiome research have been dominated by cataloguing the sheer diversity of microbes in these communities. Now, more and more studies try to understand connections between the microbes, between the way communities are built and how they function, and between their activity and the effects on their surroundings, including host organisms like humans. Omics measurements, or meta-omics as they are called when multiple organisms are measured at the same time, are a cornerstone in this endeavour. Here, we will discuss why their integration is important, how it can be achieved, what pitfalls may be avoided and which approaches are taken by integrative studies.

Omics analyses of microbiomes

Microbiomes are diverse communities of microorganisms that are responsible for essential environmental and host-associated processes. Many of their important functions are biochemical, involving primary or secondary metabolism. These functions can be studied by omics analyses of the informative molecules of the central dogma of molecular biology, i.e., DNA, RNA, proteins and their metabolites. In this beginner’s guide, we use the general term ‘analyte’ for any of these molecules, while we stick to DNA, RNA, protein and metabolite for specific cases. Technological advances in DNA and RNA sequencing and mass spectrometry of proteins and metabolites have driven considerable progress in cataloguing and understanding the molecular make-up and functioning of microbial communities in recent decades.

Omics technologies aim to measure as many analytes as possible within a system, e.g., all genes in a genome or all transcripts in a transcriptome (see Figure 1). The prefix ‘meta-’ indicates that the system under study comprises multiple species: in the case of a microbiome, this can consist of hundreds or thousands of different microbial taxa. Finally, multi-omics means the system-wide analysis of multiple analyte pools, e.g., the metagenome and metaproteome. The term is sometimes extended to multiple connected systems, e.g., the microbial metagenome and the host metabolome. Omics technologies generate large data volumes, whose processing and decoding have high computational demands. Nevertheless, these analyses are often less laborious than traditional, culture-based microbiological methods – in some cases, they are also the only option to learn about microbiomes when members of the microbiome are not culturable in isolation. Moreover, the phenotypes of isolates may not reflect their activity in a microbiome and in association with a host, because of interactions such as cross-feeding, inter-species signalling, chemical inhibition of competitors and immune responses.

Why integrate?

Each ‘meta-omics level’ is a proxy for the functions of the microbiome system, but each level provides information on only a part of this system. Integrating multiple omics levels can give more insight into the functioning of the whole system, e.g., to answer questions on the production of a metabolite that may be beneficial or detrimental to a host (e.g., a short-chain fatty acid produced by a gut bacterium). Measuring the metabolome may be a faithful proxy for the metabolite level, but the metabolome is a community measure. If the level of a metabolite drops, the metabolome alone does not enable us to distinguish between possible mechanisms: producers disappear or become inactive; producers invest in alternative metabolic pathways; or other community members metabolize our product of interest. Therefore, integration of, e.g., metagenomics, metatranscriptomics and metabolomics, including lipidomics, provides a better understanding of the community state or dynamics. This can lead to mechanistic hypotheses, generalizable observations or indications of functionally important community members.

Summary

Microbial communities or microbiomes are made up of many different, interacting microbial species.
To understand how microbiomes function, they are commonly studied by directly measuring ‘meta-omes’, e.g., the metagenome (the genomes of all present microbes), metatranscriptome (all transcripts of all microbes), metaproteome (all proteins), or metabolome (the metabolites of all microbes).
Because every meta-omics data set provides only parts of the picture, integration of multiple omics data is key to mechanistic insights into microbial communities.
Multi-omics study design and analysis makes provisions for biological and technical differences in the meta-omics data sets.
Depending on the system under study and the research question, integration can make use of:
1. common sequence information;
2. data fusion methods to identify common patterns or interactions;
3. prior knowledge of genetic or metabolic functions.

Integration can also improve the detection of analytes in one data set by borrowing information from another: the measurements of the second level can be adjusted (e.g., choice of instrumentation or sampling depth); or the data mining can be adapted (e.g., by setting the search space in metaproteomics). Integration can serve to validate the results of models that are based on observations of one level. Or it can be part of an exploratory study of a system that is not yet well described and where information is lacking at all omics levels (as was the case in the recent description of the Asgard archaea). If the state of the microbiome is of interest for classification purposes, e.g., as a diagnostic tool, the integration of several omics levels can improve the sensitivity and/or specificity and suggest biomarker panels made up of different kinds of analytes. Several approaches for integration of multi-omics data exist. They can be roughly divided into three groups (Figure 2): (a) the flow of genetic information through omics levels; (b) data fusion models that identify common patterns or interactions; and (c) prior knowledge on functional units, e.g., metabolic pathways. Combinations of the three approaches can be employed in the same study. The development of strategies to combine and complement these approaches is an active field of research.

What to consider when planning a multi-omics study

Which omics levels contain the most important information for the question at hand? How much knowledge is there on the system at that level, and can this be taken advantage of? For metagenomics or metatranscriptomics data, curated collections of genomes from isolates or metagenomes exist, including gene annotations and gene function predictions. The tens or hundreds of thousands of genomes in these collections far outnumber the species recorded in phenotype databases. Most of the millions of recognized gene families or orthologous groups are also not well described. Therefore, databases that link gene identities or orthologous groups to metabolic reactions, pathways, kinetics or models can only provide a means to integrate a limited proportion of sequence-based information with metabolomes.

How reliable are the different omics data sets? Each omics data set is affected by its technology’s limitations. For example, metaproteomics can only reach a shallower sampling depth than sequencing-based analyses and often does not reach the same phylogenetic resolution. The time scales in which the analytes can be measured accurately also vary: DNA can be very stable, but it can also represent dead and dormant microorganisms. RNA can be very unstable and does not always yield reliable results when long experimental handling times are required. Proteins and metabolites have specific lifetimes, with some being very labile and others outlasting their producers in the environment.

Should more omics or more samples be measured? There are usually many more analytes than samples, which aggravates problematic characteristics of omics data sets. The uncertainty in the identification of the analytes may also affect the observed abundance, including not reporting an analyte. Most omics data sets are not truly quantitative, e.g., the number of reads linked to one taxon does not specify a cell count, and the peak intensities of different metabolites cannot be compared, as each metabolite has its own response factor which depends on the type of metabolite. Frequently, the technological sampling or measurement depth is not adequate to capture all analytes and very high and very low levels cannot be measured with the desired accuracy.

How related are the omics data sets? Preferably, the different omics levels are measured from the same samples or subjects, to reduce the effects of individual variation on the integration. There are also technical differences in the data sets: the total counts in metagenomics and metatranscriptomics data are capped by the sequencing technology (and are, therefore, compositional), while for metabolites the abundance of one metabolite does not per se affect the detectable amount of all others (but there can be specific effects on detectability, as in ion suppression). Metatranscriptomics functional data has a high proportion of zeroes, which mainly occur due to under-sampling, while the high (strain-level) resolution of metagenomics yields taxa which are absent in most samples. It is important to take these differences into account when designing studies, when choosing omics levels and in the multi-omics integration.
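To illustrate what ‘compositional’ means in practice, the following minimal sketch applies a centred log-ratio (CLR) transform, one commonly used option for sequencing counts, to the read counts of a single sample. The function name, the example counts and the pseudocount are assumptions made for illustration; the guide itself does not prescribe a particular transformation.

```python
import math


def clr_transform(counts, pseudocount=0.5):
    """Centred log-ratio transform of one sample's read counts.

    The pseudocount is an illustrative assumption for handling zeros;
    with pseudocount=0 the input must not contain zero counts.
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    # Each value is expressed relative to the sample's geometric mean.
    return [value - mean_log for value in logs]


# Hypothetical counts: the same community sequenced at two depths.
shallow = [10, 20, 70]
deep = [100, 200, 700]

clr_shallow = clr_transform(shallow, pseudocount=0)
clr_deep = clr_transform(deep, pseudocount=0)
print(clr_shallow)
print(clr_deep)
```

Both samples give the same transformed values, because only the ratios between taxa carry information once the total count is fixed by the sequencing run.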

Applications

After a few pioneering multi-omics studies of human-associated and soil microbiomes, multi-omics investigations of microbiomes have become more frequent in the last 5 years – contributing to human, animal, plant, environmental, and biotechnological research.

Sequence- and genome-centric integration

As mentioned earlier, metagenomics, metatranscriptomics and metaproteomics lend themselves to integration because of their shared sequence information: metatranscriptomics reads can be mapped onto assembled metagenomes or can be co-assembled with metagenomic reads (Figure 3). This increases the detectability of highly expressed genes in low-abundance taxa. Proteins must be identified based on protein or peptide databases, and it has been demonstrated that this process is aided by the use of metagenomic information from the same sample. Integration has been used to determine the correlation between transcript and protein abundance, the level of variation of the different omics levels and hence the potential to find mechanistically important players in either data set.

An important concept in integrated meta-omics is genome-centric analyses: genomes are reconstructed from metagenomics data and the other omes are mapped to them. The genome, therefore, gives context to the observations (e.g., other genetic functions in the same genome, abundances in different samples). Based on phylogenetic relationships, genomes can be linked to prior phenotypic or biochemical knowledge. Examples of applications include functional analyses of human gut, soil, ground- and wastewater microbiomes. An advantage of these methods is that they are applicable to both described and completely unknown organisms. Genome-centric approaches to metabolomics integration based on reference strain metabolite profiles and co-cultures are currently being developed.
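At its simplest, a genome-centric summary amounts to a join on gene identifiers. The sketch below assumes that genes have already been assigned to reconstructed genomes and that per-gene read counts are available for each omics level; all identifiers, counts and function names are hypothetical, and raw read counts are used only to keep the example short.

```python
from collections import defaultdict


def summarise_per_genome(gene_to_genome, *count_tables):
    """Sum per-gene counts from each omics table within each genome."""
    summary = defaultdict(lambda: [0] * len(count_tables))
    for level, table in enumerate(count_tables):
        for gene, count in table.items():
            genome = gene_to_genome.get(gene)
            if genome is None:
                # Gene was not assigned to any reconstructed genome.
                continue
            summary[genome][level] += count
    return dict(summary)


# Hypothetical gene-to-genome assignment and read counts per gene.
gene_to_genome = {"gene_1": "genome_A", "gene_2": "genome_A", "gene_3": "genome_B"}
metagenome_counts = {"gene_1": 40, "gene_2": 60, "gene_3": 500}
metatranscriptome_counts = {"gene_1": 300, "gene_2": 100, "gene_3": 20, "gene_4": 5}

per_genome = summarise_per_genome(
    gene_to_genome, metagenome_counts, metatranscriptome_counts
)
for genome, (dna, rna) in sorted(per_genome.items()):
    print(genome, "DNA reads:", dna, "RNA reads:", rna)
```

In this toy example the less abundant genome_A accounts for most of the transcript reads, a contrast that neither omics level would reveal on its own.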

Data fusion

Probably the most common combination of omes is metagenomics and metabolomics: it has been studied in most microbiomes, from various human niches, through model and non-model terrestrial and marine animals, to plant rhizospheres, soil and biotechnological mixed communities. However, this integration is challenging because there is no sequence identity to rely on. Due to the absence of biologically meaningful links, the applicable methods are called ‘data fusion’ methods, as opposed to integration methods. In the simplest case, fusion is attempted by pair-wise correlations of, e.g., all microbial taxa with all metabolites, where correlations above a certain threshold are represented in a ‘correlation network graph’. However, with this approach, many real processes are likely to go unobserved, except where metabolites can only be produced and metabolized by single taxa or by sets of highly correlating taxa. More advanced methods estimate multivariate correlations between data sets by calculating linear combinations (components) of the analytes in one data set that co-vary highly with linear combinations of the analytes in the other data set(s) (see Figure 2b). These co-varying components are said to describe the ‘common’ or ‘joint’ information. Most of these methods assume analyte levels to be symmetrically distributed, but new methods are being developed that take the zero-inflated structure of microbiome data into account. Note that when interpreting correlations, it is essential to consider over which samples they are calculated, as correlations can change with changing experimental conditions (Figure 4).
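The simplest case described above, pair-wise correlation followed by thresholding, can be sketched as follows. The taxon and metabolite names, the values and the threshold of 0.8 are hypothetical, and Pearson correlation is used here only for brevity; rank-based correlations are an equally plausible choice.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long lists; 0.0 if either is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    denom = math.sqrt(sum(d * d for d in dx) * sum(d * d for d in dy))
    if denom == 0:
        return 0.0
    return sum(a * b for a, b in zip(dx, dy)) / denom


def correlation_edges(taxa, metabolites, threshold=0.8):
    """Return (taxon, metabolite, r) for every pair with |r| >= threshold."""
    edges = []
    for taxon, abundances in taxa.items():
        for metabolite, levels in metabolites.items():
            r = pearson(abundances, levels)
            if abs(r) >= threshold:
                edges.append((taxon, metabolite, round(r, 3)))
    return edges


# Hypothetical measurements over the same five samples.
taxa = {"taxon_1": [1, 2, 3, 4, 5], "taxon_2": [3, 5, 1, 4, 2]}
metabolites = {"butyrate": [2, 4, 6, 8, 10], "acetate": [3, 1, 2, 3, 1]}

edges = correlation_edges(taxa, metabolites)
print(edges)
```

Each returned tuple corresponds to one edge of the correlation network graph; pairs below the threshold are simply not drawn, which is why weaker but real relationships go unobserved.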

All data fusion methods have in common that biological interpretation is done afterwards. This is problematic: the omics way of working that revolutionized biological research suffers from the curse of dimensionality, as more and more analytes are measured in an untargeted approach for a small set of samples. To model such data in a meaningful manner, thousands of samples would be needed to find the relevant analytes amid the noise. If studies have a limited number of samples, it is essential that the biological function of each feature is known and used to link analytes within and between data sets.
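A small simulation can make the curse of dimensionality tangible. The sketch below draws purely random ‘analytes’ and reports the strongest correlation that any of them shows with an equally random variable of interest; the sample sizes, the number of analytes and the function names are arbitrary assumptions chosen for illustration.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equally long lists; 0.0 if either is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [v - mean_x for v in x]
    dy = [v - mean_y for v in y]
    denom = math.sqrt(sum(d * d for d in dx) * sum(d * d for d in dy))
    if denom == 0:
        return 0.0
    return sum(a * b for a, b in zip(dx, dy)) / denom


def max_chance_correlation(n_samples, n_analytes, seed=0):
    """Largest |r| between a random variable and purely random analytes."""
    rng = random.Random(seed)
    phenotype = [rng.gauss(0, 1) for _ in range(n_samples)]
    best = 0.0
    for _ in range(n_analytes):
        # Every analyte is noise: any correlation found is spurious.
        analyte = [rng.gauss(0, 1) for _ in range(n_samples)]
        best = max(best, abs(pearson(phenotype, analyte)))
    return best


few = max_chance_correlation(n_samples=5, n_analytes=2000)
many = max_chance_correlation(n_samples=50, n_analytes=2000)
print(f"5 samples:  strongest chance correlation = {few:.2f}")
print(f"50 samples: strongest chance correlation = {many:.2f}")
```

With only a handful of samples, some noise analytes correlate almost perfectly with the variable of interest by chance alone, which is why prior biological knowledge is needed to decide which associations deserve attention.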

Integration with prior knowledge

Knowledge-based analyses of metabolic networks, which represent both sequence-based omics and metabolomics, have been applied in soil, wastewater treatment and human microbiomes. Here, the omics levels are summarized as functions of the whole community. Because they are additionally linked by known metabolic pathways, these approaches are successful in providing insights into metabolic pathways that respond to changing conditions (e.g., drought stress, temperature, or nutrition).
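Summarizing an omics level as functions of the whole community can be sketched as an aggregation over a prior-knowledge mapping. The example below assumes a hypothetical table that links gene families to pathways; the identifiers, counts and function name are invented for illustration, and counting a gene family fully towards every pathway it belongs to is a deliberate simplification.

```python
from collections import defaultdict


def pathway_abundance(function_counts, function_to_pathways):
    """Sum gene-family counts over the pathways they are annotated to.

    Returns the per-pathway totals and the total count of gene families
    without any pathway annotation.
    """
    totals = defaultdict(float)
    unmapped = 0.0
    for function, count in function_counts.items():
        pathways = function_to_pathways.get(function, [])
        if not pathways:
            unmapped += count
            continue
        for pathway in pathways:
            totals[pathway] += count
    return dict(totals), unmapped


# Hypothetical community-level counts per gene family.
function_counts = {"family_1": 120, "family_2": 30, "family_3": 50, "family_4": 400}
# Hypothetical prior knowledge linking gene families to pathways.
function_to_pathways = {
    "family_1": ["butyrate_synthesis"],
    "family_2": ["butyrate_synthesis", "acetate_synthesis"],
    "family_3": ["acetate_synthesis"],
}

totals, unmapped = pathway_abundance(function_counts, function_to_pathways)
print(totals)
print("counts without pathway annotation:", unmapped)
```

The pathway totals can then be compared with the measured levels of the corresponding metabolites, while the unannotated share is a reminder that only part of the sequence-based information can be integrated in this way.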

The microbial metagenome has become an omics level that is included in multi-omics studies of complex organisms such as plants (e.g., Brassica rapa), mice, cows and humans – especially in the context of metabolic and inflammatory diseases. Systematic references for mechanistic links between host and microbiome are not yet well developed, especially outside of human physiology. Hence, the associations between microbiome taxa or functions and host gene expression, epigenetics, metabolism and phenotype must be established by data fusion methods. Another trend in host-focused studies is to attempt classification of individuals (e.g., as a diagnostic tool for colorectal carcinomas) based on multi-omics biomarker panels, where supervised data fusion approaches are applied.

Outlook

Integration strategies adapt to and are facilitated by technological advances: for instance, recent research in a biogas reactor community has demonstrated several ways in which quantitative measurements at multiple meta-omics levels provide better functional explanations of the community phenotype. High-quality multi-species metabolic models, methods for metagenomics-based construction of metabolic models and the integration of multi-omics measurements into such models are important research fields. Measurements and integration of the spatial structure of microbial communities will play a bigger role in the future. Large, openly accessible multi-omics data sets, databases with genetic and metabolite information and data standards (Figure 5) are developed, maintained and grown thanks to individual and community efforts. A key challenge for meta-omics integration will be the development of methods that combine the approaches described earlier to make sound use of data and meaningful knowledge – and to use the gained information to develop new research questions.

Author information

Anna Heintz-Buschart is an assistant professor for Microbial Metagenomics at the Swammerdam Institute for Life Sciences. She earned a wet-lab PhD in molecular microbiology and still believes that our world belongs to the microbes. Because of that, she develops bioinformatics methods to analyse large-scale meta-omics data and integrate multiple omics levels, to facilitate biological interpretations and predictions. She has not quite decided what her favourite microbe-related system is: the human microbiome, biotechnology, soil ecology or biodiversity research. Email: a.u.s.heintzbuschart@uva.nl. Twitter handle: @_a_h_b_.

Johan A. Westerhuis is an assistant professor for Biosystems Data Analysis at the Swammerdam Institute for Life Sciences. He obtained his PhD at the University Centre for Pharmacy at the University of Groningen on 'Multivariate statistical modeling of the pharmaceutical process of wet granulation and tableting' using multiblock (path-) models. After a postdoc at McMaster University, Hamilton, ON, on batch process monitoring and multiblock methods, he joined the Biosystems Data Analysis group at the Universiteit van Amsterdam. He teaches statistics and biochemical data analysis at Bachelor’s and Master’s levels and supervises PhD students and postdocs in metabolomics and microbiome data analysis. Email: j.a.westerhuis@uva.nl.

2022

Published by Portland Press Limited under the Creative Commons Attribution License 4.0 (CC BY-NC-ND)
