Bisulfite sequencing is a powerful technique to detect 5-methylcytosine in DNA that has immensely contributed to our understanding of epigenetic regulation in plants and animals. Meanwhile, research on other base modifications, including 6-methyladenine and 4-methylcytosine that are frequent in prokaryotes, has been impeded by the lack of a comparable technique. Bisulfite sequencing also suffers from a number of drawbacks that are difficult to surmount, including DNA degradation, lack of specificity, and short reads with low sequence diversity. In this review, we explore the recent refinements to bisulfite sequencing protocols that enable targeting genomic regions of interest, detecting derivatives of 5-methylcytosine, and mapping single-cell methylomes. We then present the unique advantage of long-read sequencing in detecting base modifications in native DNA and highlight the respective strengths and weaknesses of PacBio and Nanopore sequencing for this application. Although analysing epigenetic data from long-read platforms remains challenging, the ability to detect various modified bases from a universal sample preparation, in addition to the mapping and phasing advantages of the longer read lengths, provides long-read sequencing with a decisive edge over short-read bisulfite sequencing for an expanding number of applications across kingdoms.
Genomic DNA is composed of the canonical DNA bases A, T, C, and G. Modified DNA bases do not change the underlying sequence, but instead carry an extra layer of information that often dictates how that DNA sequence is utilised: for example, identifying sequences as endogenous or modulating transcription [1–3]. The DNAmod database catalogs 43 DNA modifications encountered in natural DNA, some with regulatory roles but the majority resulting from DNA damage.
N6-methyladenine (6mA), 4-methylcytosine (4mC), and 5-methylcytosine (5mC) are frequent in bacteria and have roles not only in cellular defence but also in the regulation of gene expression, with effects on virulence and physiology. 5mC is also the most frequent, most studied and best understood modification in plants and animals [6–8]. This is in part due to the accurate bisulfite-based short-read sequencing techniques available to measure 5mC [9,10]. However, bisulfite sequencing suffers from a number of limitations and is not easily applicable to the detection of other base modifications.
On the other hand, emerging long-read sequencing techniques offer exciting possibilities to study a wide range of modifications, with the advantages inherent to long reads and single-molecule sequencing. In this review, we will discuss the current gold standard in detection of 5mC and its oxidised derivatives and compare it to the current and future possibilities offered by long-read sequencing from Pacific Biosciences and Oxford Nanopore Technologies. We note, however, that there are many alternative methods available to measure 5mC that we will not discuss here, each of which may be useful in specific applications and warrants consideration (reviewed in [11–13]).
Sequencing of bisulfite-converted DNA
DNA methylation marks remain intact upon DNA extraction. Treatment of genomic DNA with sodium bisulfite results in deamination of unmethylated cytosines to uracil, leaving methylated cytosines intact (Figure 1). Treated DNA is subsequently PCR-amplified with a uracil-tolerant polymerase to provide sufficient template for analysis, during which uracil is replaced by thymine. DNA methylation can therefore be read directly by traditional Sanger or Illumina short-read sequencing through comparison to a reference or untreated sequence, providing a readout that is highly quantitative and has base-pair resolution.
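The conversion logic described above can be illustrated with a minimal Python sketch. It assumes a forward-strand read that is already aligned to its reference without gaps. The function names are hypothetical, and real bisulfite pipelines additionally handle both strands, indels and the alignment itself.

```python
def bisulfite_convert(seq, methylated_positions=frozenset()):
    """Simulate bisulfite treatment followed by PCR on the forward strand.

    Unmethylated C is read as T after conversion and amplification;
    C at the 0-based positions in methylated_positions is protected.
    """
    return "".join(
        "T" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(seq)
    )


def call_methylation(reference, read):
    """Compare an ungapped forward-strand read to the reference.

    Returns {position: bool} for each reference cytosine:
    True if the read still shows C (protected, i.e. methylated),
    False if it shows T (converted, i.e. unmethylated).
    Any other base at a C position is skipped as uninformative.
    """
    calls = {}
    for i, (ref_base, read_base) in enumerate(zip(reference, read)):
        if ref_base != "C":
            continue
        if read_base == "C":
            calls[i] = True
        elif read_base == "T":
            calls[i] = False
    return calls


reference = "ACGTTCGACCA"
read = bisulfite_convert(reference, methylated_positions={1, 5})
print(read)                               # ACGTTCGATTA
print(call_methylation(reference, read))  # {1: True, 5: True, 8: False, 9: False}
```

Averaging such calls over all reads covering a cytosine gives the quantitative methylation level referred to in the text.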
In animals, 5mC is most often enriched in the CpG dinucleotide context [7,15,16]; however, 5mC also appears in the CHG and CHH contexts (where H is either A, C or T) where studies also suggest possible functional roles [17–20]. In plants, methylation is abundant in all CG, CHG and CHH contexts although with extensive variation across species. The fact that bisulfite sequencing informs on all cytosines regardless of context can therefore be highly beneficial. These properties have made sequencing of bisulfite-converted DNA the gold standard of 5mC detection techniques; however, there are some drawbacks to consider. Firstly, the bisulfite treatment is very harsh, leading to degradation and to DNA that is harder to amplify by PCR; large amounts of input DNA are therefore often required. Recently, New England Biolabs reported a technique called Enzymatic Methyl-seq (EM-seq) that uses enzymatic deamination of cytosine by APOBEC, thereby producing sequence identical to that obtained after bisulfite treatment, but avoiding the need for the harsh chemical treatment. Additionally, bisulfite sequence data require more sophisticated bioinformatic analysis techniques than are required for unconverted DNA, as sequence must be compared to a bisulfite-converted reference genome before methylation calls can be inferred. There are a number of excellent tools designed specifically to process bisulfite sequence data (reviewed in ). Illumina sequencing of bisulfite-converted DNA also suffers from the same problems as all short-read data, particularly difficulty mapping to repetitive or low-complexity regions, including regions of pertinence to 5mC such as GC-rich regulatory regions and repetitive DNA (Figure 1). These issues are further compounded by the loss of sequence diversity due to the bisulfite conversion [22–24]. Additionally, short reads are difficult to haplotype, as a short read is less likely to contain an informative single-nucleotide polymorphism (SNP) (Figure 1). These limitations of short-read sequencing are overcome by long-read sequencing, discussed in detail below.
Illumina sequencing of bisulfite-converted total genomic DNA is known as whole-genome bisulfite sequencing (WGBS) and provides the most comprehensive and unbiased survey of 5mC currently possible [9,10]; however, obtaining such comprehensive data requires a high number of reads, with a coverage of 5–15× recommended (∼160–480M 100-bp reads for a haploid human genome, pooling both DNA strands).
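The read numbers quoted above follow from simple arithmetic, sketched below in Python. The haploid human genome size of 3.2 Gb is our assumption for illustration, and the function name is hypothetical.

```python
import math


def reads_for_coverage(genome_size_bp, coverage, read_length_bp):
    """Number of reads needed so that reads * length = coverage * genome size."""
    return math.ceil(genome_size_bp * coverage / read_length_bp)


HUMAN_HAPLOID_BP = 3_200_000_000  # assumed haploid genome size (~3.2 Gb)

for cov in (5, 15):
    print(cov, reads_for_coverage(HUMAN_HAPLOID_BP, cov, 100))
# 5 160000000
# 15 480000000
```

The calculation ignores duplicates, unmappable reads and uneven coverage, so practical experiments usually need more reads than this lower bound.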
Enrichment techniques for bisulfite sequencing
In order to mitigate the high costs associated with obtaining the necessary reads for quality WGBS data in large genomes, techniques that enrich for regions of interest have been developed. For a small number of genomic loci (≤ 20), amplicon sequencing is straightforward and cost-effective. DNA is first bisulfite-treated before amplification by specific primers and barcoding, then sequenced as a multiplex. For larger numbers of regions, capture sequencing avoids the labour-intensive design of primer pairs; however, it requires the synthesis of a probe panel. Capture by hybridisation to specific probes can be performed either before (Agilent Sure-Select Methyl-Seq, TruSeq Methyl Capture) or after bisulfite conversion (Roche SeqCap Epi, [30–32]). In the latter case, there is a risk that preferential binding of probes to certain methylation states of the target fragments could introduce biases in the quantitation of methylation. Custom panels can be expensive for one-off applications, but there are commercially available panels that perform well for the human genome.
In mammals, reduced representation bisulfite sequencing (RRBS, Figure 1) offers a cost-efficient solution [34,35], enabling enrichment of regions where regulation by 5mC is more likely, such as CpG islands: regions of high CpG density known to frequently correspond to differentially methylated gene regulatory regions such as enhancers and promoters. By utilising MspI, a 5mC-agnostic restriction enzyme that cuts at CCGG motifs, RRBS has been found to be informative for 85% of CpG islands, representing < 3% of the genome and therefore greatly reducing sequencing costs [37,38].
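The principle of MspI-based enrichment can be sketched as an in silico digest. The Python example below is a simplified illustration: it cuts at C^CGG, keeps only fragments flanked by two cut sites, and applies a size window of 40–220 bp, which is our assumption rather than a value taken from the protocols cited here. The function name is hypothetical.

```python
import re


def mspi_fragments(seq, min_len=40, max_len=220):
    """Return (start, end) coordinates of size-selected MspI fragments.

    MspI cuts C^CGG, so each cut falls one base into a CCGG site.
    Only fragments bounded by two cut sites are considered; the size
    window (min_len, max_len) is an illustrative assumption.
    """
    seq = seq.upper()
    cuts = [m.start() + 1 for m in re.finditer("CCGG", seq)]
    return [
        (start, end)
        for start, end in zip(cuts, cuts[1:])
        if min_len <= end - start <= max_len
    ]


# Toy sequence: two close CCGG sites, then a distant third site.
toy = "AACCGG" + "A" * 50 + "CCGG" + "T" * 300 + "CCGGAA"
print(mspi_fragments(toy))  # [(3, 57)]
```

In this toy sequence the 54-bp fragment between the first two sites is retained, while the 304-bp fragment to the third site falls outside the assumed size window.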
The obvious drawback to RRBS is that it is by design limited to loci containing MspI cut sites. Regions of moderate CpG density that flank CpG islands, known as CpG island shores, are also found to be frequently differentially methylated [39,40] and these can be captured by sequencing the longer restriction fragments, in a technique known as enhanced RRBS. Another consideration with RRBS is that the MspI cut creates a lack of diversity at the start of sequencing reads, which can interfere with calibration and cluster detection on the latest Illumina sequencers. However, this can be overcome by masking the first bases of each read from the sequencer (known as dark sequencing), using adapters that contain diversity bases, or spiking in libraries of high diversity.
Identification of oxidised forms of methylation by bisulfite sequencing
In mammals, active DNA demethylation is achieved through the oxidation of 5mC to 5-hydroxymethylcytosine (5hmC), 5-formylcytosine (5fC) and 5-carboxylcytosine (5caC) by the TET family of dioxygenases, with demethylation being completed by the excision of 5fC or 5caC by the DNA glycosylase TDG (reviewed in [43,44]). Like 5mC, 5hmC is protected from bisulfite-induced deamination and therefore bisulfite sequencing is unable to differentiate the two forms, while 5fC and 5caC are vulnerable to deamination and thus read as unmethylated cytosine. This is a major caveat to be considered for any bisulfite sequencing experiment; however, adapted bisulfite sequencing methods have been developed to distinguish these oxidised forms. Oxidative bisulfite sequencing (OxBS-seq) utilises specific chemical oxidation of 5hmC to 5fC, such that only 5mC remains protected from bisulfite-induced deamination. Therefore, OxBS-seq informs specifically on 5mC without confounding 5hmC. Positions of 5hmC can be determined by subtracting the OxBS-seq signal from that of unoxidised bisulfite sequencing. TET-assisted bisulfite sequencing (TAB-seq) uses the TET1 enzyme to oxidise 5mC to 5caC while protecting 5hmC from oxidation through the addition of a glucose, thus rendering positions of 5mC, but not 5hmC, susceptible to bisulfite-induced deamination. Therefore, TAB-seq provides a direct measurement of 5hmC, with positions of 5mC able to be determined by subtracting TAB-seq signal from standard bisulfite sequencing. Positions of 5fC can be identified by bisulfite sequencing either by protecting 5fC from bisulfite-induced deamination chemically (fCAB-seq) or by its selective reduction to 5hmC (redBS-Seq). Carboxylcytosine can be protected from deamination by chemical modifications, rendering it also detectable by bisulfite sequencing (CAB-seq).
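The subtraction logic for OxBS-seq and TAB-seq can be written out explicitly. The Python sketch below works on per-site methylation levels (fraction of reads showing C) and uses hypothetical function names. Clamping negative differences to zero is our simplifying assumption; sampling noise can make the raw difference negative and dedicated statistical models are preferable in practice.

```python
def levels_from_oxbs(bs_level, oxbs_level):
    """Standard BS reports 5mC + 5hmC; OxBS reports 5mC only.

    5hmC is therefore estimated as BS minus OxBS (clamped at 0).
    """
    return {"5mC": oxbs_level, "5hmC": max(0.0, bs_level - oxbs_level)}


def levels_from_tab(bs_level, tab_level):
    """Standard BS reports 5mC + 5hmC; TAB-seq reports 5hmC only.

    5mC is therefore estimated as BS minus TAB (clamped at 0).
    """
    return {"5mC": max(0.0, bs_level - tab_level), "5hmC": tab_level}


print(levels_from_oxbs(bs_level=0.75, oxbs_level=0.5))  # {'5mC': 0.5, '5hmC': 0.25}
print(levels_from_tab(bs_level=0.75, tab_level=0.25))   # {'5mC': 0.5, '5hmC': 0.25}
```

Both routes should agree for the same site, which makes the two protocols useful as mutual controls.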
Single-cell bisulfite sequencing
Until recently, large input requirements have meant genomic assays, including bisulfite sequencing, could only provide average measurements across bulk cell populations. Now single-cell genomics has begun to offer unprecedented insight into cell-to-cell variation. Here, bisulfite sequencing has major advantages over other methods of 5mC detection. Firstly, changing the order of adapter ligation to after bisulfite treatment (known as post-bisulfite adapter tagging or PBAT) vastly reduced the input requirement to that of a single nucleus [51,52], although adapter ligation is not free from biases and chimeric reads tend to be formed. Secondly, whereas techniques that utilise read count-based statistics as measurements suffer from the low coverage typical of single-cell sequencing, bisulfite sequence data contain the measurement within the read, meaning every mappable read is informative. Techniques are available for both single-cell WGBS and single-cell RRBS (Figure 1). Excitingly, RNA-seq from the cytoplasmic fraction is being combined with bisulfite sequencing of the nucleus from within the same single cell, giving unprecedented comparison of the functional link between the epigenome and the transcriptome. Known as single-cell multi-omics, methods currently exist to measure combinations of 5mC, RNA, copy number and nucleosome positioning [55–57]. Detailed discussions of multi-omic methods are available in [58–61].
Detecting DNA methylation through single molecule long-read sequencing
Long-read sequencing offers a solution to the mappability problem of short reads. Two long-read sequencing technologies are currently available: nanopore sequencing from Oxford Nanopore Technologies (ONT) and single molecule real-time (SMRT) sequencing from Pacific Biosciences (PacBio) (Figure 1). Both strategies can be applied to bisulfite-treated and amplified DNA to provide a readout similar to short-read bisulfite sequencing, but their main advantage lies in their ability to sequence native DNA and infer base modifications from their impact on the raw sequencing signal. DNA degradation due to bisulfite conversion is then avoided, as are amplification biases. More generally there is no longer a need for enzymatic or chemical treatments specific to each base modification of interest, opening up the range of assayable modifications and reducing experimental complexity. The downside of this feature is that the DNA cannot be amplified; therefore, input amounts can be limiting (200 fmol for Nanopore, or about 1 μg of 8 kb fragments; 5 μg for a typical PacBio library, although 100 ng can be sufficient). In cases where only small DNA amounts are available because of experimental design (e.g. single small organism, microdissected tissue, single cells) or because of the preciousness of the original tissue (e.g. biopsies or paleogenomic samples), Nanopore and PacBio native DNA sequencing will not be feasible. These approaches are therefore best suited to bulk samples, constraining the granularity of the analyses and making cell-to-cell variation difficult to evaluate.
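The equivalence between 200 fmol and about 1 μg of 8 kb fragments can be checked with a short calculation. The Python sketch below assumes an average mass of 650 g/mol per base pair of double-stranded DNA, a commonly used approximation that is not stated in the text. The function name is hypothetical.

```python
AVG_BP_MASS_G_PER_MOL = 650.0  # assumed average mass of one dsDNA base pair


def dsdna_mass_ng(amount_fmol, fragment_length_bp):
    """Convert a molar amount of dsDNA fragments to a mass in nanograms.

    1 fmol * 1 g/mol = 1e-15 g = 1e-6 ng, hence the division by 1e6.
    """
    return amount_fmol * fragment_length_bp * AVG_BP_MASS_G_PER_MOL / 1e6


print(dsdna_mass_ng(200, 8000))  # 1040.0 ng, i.e. about 1 microgram
```

Because the required amount is molar, shorter fragments need less mass: the same 200 fmol corresponds to roughly 130 ng for 1 kb fragments under the same assumption.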
When only parts of the genome are of interest, PCR-free, CRISPR-based enrichment techniques are available for both PacBio and ONT. Hundred- or thousand-fold enrichments are achievable [64–66], providing cost- and sequencing-effective targeted genetic and epigenetic assays.
As opposed to short-read bisulfite sequencing, calling base modifications at a single-base, single-read level from long-read sequencing is not accurate. Accurate estimates are thus derived by aggregating statistics over multiple passes (PacBio only), or summarising at genomic positions (requiring sufficient sequencing depth), regions or motifs when applicable. PacBio and ONT have distinct strengths and limitations that influence their respective use cases (Figure 1).
PacBio’s SMRT sequencing relies on sequencing-by-synthesis, where the sequence of a circular DNA template is determined from the succession of fluorescence pulses, each resulting from the addition of one labelled nucleotide by a polymerase fixed to the bottom of a well. Base modifications thus do not affect the basecalled sequence, but they affect the kinetics of the polymerase. By considering the inter-pulse duration, base modifications can be inferred from the comparison of a modified template to an in silico model or an unmodified template (Figure 1). For example, presence of a 6mA in the template strand tends to delay the incorporation of the complementary T by the polymerase. The patterns of kinetic perturbations can be more complex and context-dependent, while the magnitude of the perturbations also depends on the base modification.
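The comparison of inter-pulse durations (IPDs) between a native and an unmodified template can be sketched as a per-position ratio of means. The Python example below is a deliberately simplified illustration with hypothetical function names. The fixed ratio threshold of 2.0 is an arbitrary assumption, whereas production tools rely on statistical tests and sequence-context models.

```python
from statistics import mean


def ipd_ratios(native_ipds, control_ipds):
    """Ratio of mean IPD (native / control) for each shared position.

    Both arguments map a template position to a list of IPD values
    observed across passes or molecules.
    """
    ratios = {}
    for pos, native in native_ipds.items():
        control = control_ipds.get(pos)
        if not native or not control:
            continue  # position not observed in both samples
        ratios[pos] = mean(native) / mean(control)
    return ratios


def flag_candidates(ratios, threshold=2.0):
    """Positions whose IPD ratio reaches an (assumed) threshold."""
    return sorted(pos for pos, ratio in ratios.items() if ratio >= threshold)


native = {10: [1.0, 1.0], 11: [3.0, 5.0]}
control = {10: [1.0, 1.0], 11: [1.0, 1.0]}
ratios = ipd_ratios(native, control)
print(ratios)                   # {10: 1.0, 11: 4.0}
print(flag_candidates(ratios))  # [11]
```

Here position 11 shows a fourfold slower incorporation in the native sample, the kind of delay expected opposite a strongly perturbing base such as 6mA.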
Because the signal-to-noise ratio is low, the detection of base modifications in single molecules is inaccurate and often requires summarising at the level of genomic positions. As 6mA and 4mC produce strong kinetic signatures, a coverage of 25× per strand is recommended. However, the subtle effects of 5mC and 5hmC increase the requirements to 250×, unless they are enriched or modified to produce a larger kinetic effect by glycosylation or TET-conversion to 5-carboxylcytosine [68,69]. PacBio sequencing therefore only achieves single-molecule resolution for certain marks and on relatively short fragments (≤ 2 kb) that can be read a large number of times by the polymerase. With longer fragments, cell-to-cell variability cannot be investigated in detail. PacBio’s price per Gb, sensitivity to particular modifications, resolution and high coverage requirements make this technology particularly suited to bacterial genomes, where 6mA and 4mC are frequent and often concentrated on specific motifs (Figure 1). The use of SMRT sequencing since 2012 has greatly expanded the number of known methyltransferases [71,72]. SMRT sequencing has also been applied to the detection of base J (β-d-glucosyl-hydroxymethyluracil) in Leishmania, and has the potential to discover unknown modifications [74,75]. In 2019, the introduction of PacBio’s Sequel II sequencer and v2 chemistry greatly improved the throughput and affordability of SMRT sequencing, generating up to 160 Gb per SMRT cell.
ONT’s nanopore sequencing measures the variation in ionic current through a biological nanopore as a single-stranded nucleic acid is ratcheted through. Neural networks translate the current trace into nucleotides in a process named basecalling. Base modifications on the DNA introduce deviations in the raw signal, making them detectable (Figure 1).
Detection of base modifications in ONT data commonly involves three steps: (1) basecalling with canonical bases (e.g. with ONT’s Guppy basecaller), (2) anchoring the raw signal to a genomic reference, and (3) weighing the evidence that a base is modified. Nanopolish is a popular software tool to detect 5mCG with a pre-trained algorithm, showing good correlation with bisulfite data on human and mouse genomes [62,77,78]. Because Nanopolish incorporates a model for 5mCG, there is no need to sequence a PCR-amplified, unmodified control in addition to the sample of interest. Nanopolish outputs the probability that a base is modified at a single-read, almost single-nucleotide (actually single k-mer) resolution. Other available tools differ in the underlying algorithms and the modifications they are trained to detect: signalAlign demonstrated detection of 6mA, 5mC and 5hmC; mCaller, DeepSignal, DeepMod and Megalodon detect 6mA and 5mCG [80–83]; and D-Nascent and RepNano are tailored towards BrdU detection [84,85]. ONT-developed Tombo provides a model for 5mC and 6mA. Detection of 6mA is generally less accurate than for 5mCG, although gains may still be obtained from improved algorithms and training data [79,80]. Similarly to the principles of base modification detection in PacBio, when no pre-trained algorithms are available for the base modification of interest, it can be inferred by comparison to an in silico reference signal or, more effectively, to a PCR-amplified control devoid of modifications (Tombo, NanoMod).
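Per-read modification probabilities are typically summarised per genomic position before interpretation. The Python sketch below shows one simple way to do this and does not reproduce the output format or scoring of any specific tool. The input tuples, the function name and the confidence cut-offs (0.8 and 0.2) are assumptions made for illustration.

```python
from collections import defaultdict


def methylation_frequency(read_calls, high=0.8, low=0.2):
    """Aggregate per-read probabilities into per-site frequencies.

    read_calls: iterable of (chrom, pos, probability_modified).
    Calls with probability >= high count as modified, <= low as
    unmodified; ambiguous calls in between are ignored.
    Returns {(chrom, pos): (frequency, n_confident_reads)}.
    """
    counts = defaultdict(lambda: [0, 0])  # [modified, confident total]
    for chrom, pos, prob in read_calls:
        if prob >= high:
            counts[(chrom, pos)][0] += 1
            counts[(chrom, pos)][1] += 1
        elif prob <= low:
            counts[(chrom, pos)][1] += 1
    return {site: (mod / total, total) for site, (mod, total) in counts.items()}


calls = [
    ("chr1", 100, 0.95),
    ("chr1", 100, 0.90),
    ("chr1", 100, 0.10),
    ("chr1", 100, 0.50),  # ambiguous, ignored
    ("chr1", 200, 0.05),
]
for site, (freq, depth) in methylation_frequency(calls).items():
    print(site, round(freq, 2), depth)
```

Discarding ambiguous reads trades depth for accuracy, which is one reason why sufficient sequencing depth is needed for reliable position-level estimates.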
Only very recently has it become possible to directly basecall modifications from the raw signal, without genomic anchoring (from Guppy v3.2.1). This technique is very promising, limiting the need for computationally intensive downstream analysis; however, it is for the moment not benchmarked and restricted to 5mC in the CG and CC(A/T)GG contexts and 6mA in the GATC context.
The performance of methods that use prior knowledge about the expected deviations in signal depends notably on the training data used, which is typically composed of a fully unmodified sample (PCR-amplified or synthesised) and a fully modified sample (synthesised or modified in vitro by enzymes). Motifs that are not represented in the training set or that contain mixtures of modified and unmodified bases lead to suboptimal performance. For example, on a motif such as CGCGT, Nanopolish only reports a likelihood that the whole group is methylated, rather than probabilities for individual cytosines.
The price of whole-genome nanopore sequencing is comparable to that of whole-genome bisulfite sequencing. Detection of base modifications by nanopore sequencing is still an area of active development. We do not yet know the full extent of modifications that can be distinguished or what the limits to sensitivity are, and we lack generalised algorithms able to call many modifications at the same time. Independent benchmarks of established and emerging tools are needed to understand the stability of performance across species and sequencing batches. Every time ONT upgrades the pore chemistry, the raw signal changes and the algorithms have to be trained again. Fortunately, many tools offer the possibility for users to train the algorithms on their own data, either at the basecalling stage (e.g. Taiyaki, Chiron) or post-alignment (Nanopolish, DeepSignal, mCaller, Tombo). Recent advances in basecalling hold the promise that genetic and epigenetic information will soon come directly out of the sequencer without the need for extra processing.
Bisulfite sequencing provides a quantitative and sensitive assay for DNA methylation at base resolution (Figure 1). The high cost of deep coverage can be avoided by targeted sequencing techniques. However, standard bisulfite conversion cannot distinguish between 5mC and 5hmC, and specialised protocols are required to specifically detect each of these marks. This is also true for 5caC and 5fC, and library preparation and sequencing have to be performed separately for each mark of interest.
By contrast, SMRT and nanopore long-read sequencing have the capacity to detect a range of base modifications simultaneously, without additional sample preparation. SMRT most sensitively detects 4mC and 6mA, particularly relevant to bacterial epigenomics. Nanopore sequencing so far performs best at detecting 5mCG, although accuracy on 6mA is comparable with PacBio, and the lower cost per Gb compared to SMRT makes it suitable for larger genomes. Both technologies are compatible with hypothesis-free testing for new base modifications. The accuracy of the long-read technologies in detecting 5mC remains lower than that of bisulfite sequencing, a trend likely to hold true for other marks. This becomes particularly problematic for rare modifications, outside of motifs, where a high false positive rate may hide the signal in background noise. Orthogonal validation is therefore recommended. Focusing on specific motifs where the mark of interest is abundant is a successful strategy to increase the signal-to-noise ratio. Improving the accuracy and range of detectable modifications depends on the generation of appropriate training data, in which known base modifications are present at known positions, in all biologically relevant motifs. Unfortunately, our ability to synthesise DNA trails our capacity to sequence it. However, increases in long-read sequencing throughput and progress in applying machine learning suggest that there are still accuracy gains to be made, both for PacBio and ONT sequencing. Nanopore basecallers that include modified bases are starting to emerge [76,85], foreshadowing a near future where base modifications are a standard component of DNA and RNA sequencing.
Optimisation of bisulfite sequencing for low input requirements has made it suitable for single-cell sequencing (Figure 1). While SMRT and nanopore sequencing are single-molecule techniques, they are not single-cell techniques and currently require in excess of 100 ng of DNA. The loss of base modifications during PCR is a major hurdle to adapting long-read sequencing to single-cell epigenomics.
In addition to the detection of base modifications not amenable to bisulfite sequencing, a major advantage of long-read sequencing is the ability to phase epigenetic and genetic information, providing allele-specific 5mC patterns that allow insight into the effect of mutations, structural variants, or parental origin on gene regulation [62,65,66,78]. Long-read sequencing also provides genetic and epigenetic information over repeat-rich regions that are refractory to short-read sequencing. There are a number of human diseases linked to repeat expansions and failed epigenetic regulation, which have been difficult to study with short reads. Long-read sequencing is expected to greatly contribute to the diagnosis and molecular understanding of these conditions.
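Phasing methylation with genetic variation amounts to grouping reads by the allele they carry before summarising their methylation calls. The Python sketch below illustrates the idea for a single heterozygous SNP. The input structure and function name are hypothetical, and the binary per-CpG calls are assumed to have been made beforehand.

```python
from collections import defaultdict


def allele_specific_methylation(reads):
    """Mean methylation per allele.

    reads: iterable of (allele, calls), where allele is the base a read
    carries at a heterozygous SNP and calls is a list of 0/1 methylation
    calls for the CpGs covered by that same read.
    """
    totals = defaultdict(lambda: [0, 0])  # [methylated calls, all calls]
    for allele, calls in reads:
        totals[allele][0] += sum(calls)
        totals[allele][1] += len(calls)
    return {allele: m / n for allele, (m, n) in totals.items() if n}


reads = [
    ("A", [1, 1, 1, 0]),
    ("A", [1, 1]),
    ("G", [0, 0, 0, 1]),
]
print(allele_specific_methylation(reads)["G"])  # 0.25
```

The longer the read, the more likely it is to span both an informative SNP and the CpGs of interest, which is why this grouping is far more effective with long reads than with short reads.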
Rapid iterations over ONT nanopore chemistry, protocols and software are both exciting and challenging. Contrary to bisulfite sequencing, there are few established analytical pipelines for long-read epigenetics. Benchmarking efforts are crucial to evaluate the performance of the available tools. While we are still a long way from reading out the 43 base modifications listed in DNAmod, long-read sequencing brings us closer to obtaining full-length, complete, and phased epigenomes.
Variations of short-read bisulfite sequencing allow the mapping of 5mC, 5hmC, 5fC, and 5caC at base resolution.
Single-cell methylomes can be obtained with bisulfite sequencing and combined with other omics to study epigenetic regulation at single-cell resolution.
PacBio most sensitively detects 6mA and 4mC and is most adapted to bacterial epigenomics.
Nanopore sequencing can resolve 6mA, 5mC, 5hmC, and BrdU in single molecules and is under active development.
Long reads improve phasing, genomic coverage and completeness of epigenomes without specialised chemistries, compared to bisulfite sequencing.
The authors declare that there are no competing interests associated with the manuscript.
Q.G. and A.K. contributed equally to the design and writing of the review.
A.K. receives funding from the National Health and Medical Research Council [grant number 1140976].
ONT: Oxford Nanopore Technologies
RRBS: reduced representation bisulfite sequencing
SMRT: single molecule real time
WGBS: whole-genome bisulfite sequencing