Abstract
Recombinant proteins have been extensively employed as therapeutics for the treatment of various critical and life-threatening diseases and as industrial enzymes in high-value industrial processes. Advances in genetic engineering and synthetic biology have broadened the horizon of heterologous protein production using multiple expression platforms. Selection of a suitable expression system depends on a variety of factors ranging from the physicochemical properties of the target protein to economic considerations. For more than 40 years, Escherichia coli has been an established organism of choice for protein production. This review aims to provide a stepwise approach for any researcher embarking on the journey of recombinant protein production in E. coli. We present an overview of the challenges associated with heterologous protein expression, fundamental considerations connected to the protein of interest (POI) and designing expression constructs, as well as insights into recently developed technologies that have contributed to this ever-growing field.
Introduction
Ever since the Food and Drug Administration (FDA) approved the first recombinant protein for therapeutic use in 1982, Escherichia coli has been a workhorse for recombinant protein production in both academia and industry. Despite huge advances in other expression systems, the production of heterologous recombinant proteins in microbial expression systems remains simpler and less expensive than in alternative systems such as mammalian cell culture [1]. E. coli offers various advantages such as comparatively easy genetic manipulation, simple growth media, rapid cell growth, simple fermentation processes, virus-free product, high product yields, and cost-effective production [1]. The science behind recombinant protein production seems straightforward; in practice, however, multiple factors can impose hurdles. As Sun Tzu says in The Art of War, ‘know the enemy and know yourself’; if you do not, there is a high chance of failure. Hence, the starting point for any expression project should be to know your protein.
The protein and its properties
This review will focus on the production of soluble proteins or soluble fragments of transmembrane (TM) or membrane-associated proteins. For additional issues connected with the production of TM proteins, see [2–4]. Often the protein of interest (POI) is a eukaryotic protein. This can cause additional problems including codon usage, post-translational modifications (PTMs) and issues related to protein folding.
For an overview of the full workflow, see Figure 1. The starting point for any protein expression is to define the protein you wish to make, taking into account possible splice variants, signal sequences, TM helices, and PTMs found in the natural protein. While protein databases such as UniProt [5] are an excellent starting point for looking at these, it is always worthwhile doing additional bioinformatics analysis (Table 1).
Figure 1. Overall workflow
Bioinformatics analysis . | Examples . | Comments . |
---|---|---|
Signal sequences | SignalP [70] | Should not be included in the protein sequence you want to express in E. coli. If you want to target the protein to the periplasm, use a cleavable E. coli signal sequence. See [71] for a recent review. |
TM helices | TOPCONS [72] | TM helices should not be included in the protein sequence if you want to obtain soluble protein. |
Glycosylation | Reviewed in [73] | Our rule of thumb is that if the protein has more than one N-glycosylation site per 100 amino acids, it may not be expressed solubly, as glycans enhance solubility and E. coli does not naturally glycosylate proteins. |
Disulfide bonds | UniProt or scientific literature | Most proteins that contain structural disulfides require them to be natively formed to allow soluble protein production. In our experience, disulfide bond prediction can be poor except by homology. A good rule of thumb is that proteins that enter the secretory pathway and contain cysteines are likely to contain disulfide bonds. |
Other PTMs, e.g. sulfation, phosphorylation | Sulfinator [74], NetPhos [75] | Our rule of thumb is that while these may modulate the function of the protein, their absence does not affect soluble protein production. |
Biophysical properties | Protparam [76] | The pI and molecular weight of the protein are useful for confirming expression and for rational protein purification, e.g. the calculated pI helps predict column types and pH for ion-exchange chromatography. |
Complex formation | UniProt or scientific literature | Obligate protein complexes usually require most/all the proteins in the complex to be co-expressed to be able to obtain folded soluble protein. |
For other analyses, see for example ExPASy [66]. Abbreviation: pI, isoelectric point.
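The glycosylation rule of thumb in Table 1 can be checked quickly from the sequence alone. The sketch below is a minimal illustration, not a replacement for the predictors cited in the table: it counts the canonical N-X-S/T sequon (X is any residue except proline) and reports the density per 100 residues. The function names and the default threshold of one site per 100 residues are our own illustrative choices, and a sequon only marks a potential site, not a confirmed glycan.

```python
import re

# Canonical N-glycosylation sequon: Asn, any residue except Pro, then Ser or Thr.
# The lookahead lets overlapping sequons (e.g. "NNSS") each be counted.
SEQUON = re.compile(r"(?=N[^P][ST])")


def sequon_positions(seq: str) -> list[int]:
    """Return 1-based positions of Asn residues that start an N-X-S/T sequon."""
    return [m.start() + 1 for m in SEQUON.finditer(seq.upper())]


def sequons_per_100(seq: str) -> float:
    """Potential N-glycosylation sites per 100 amino acids."""
    if not seq:
        return 0.0
    return 100.0 * len(sequon_positions(seq)) / len(seq)


def exceeds_rule_of_thumb(seq: str, threshold: float = 1.0) -> bool:
    """True if the sequon density is above the rule-of-thumb threshold."""
    return sequons_per_100(seq) > threshold
```

A construct flagged by `exceeds_rule_of_thumb` is a candidate for closer inspection, for example by checking annotated glycosylation sites in UniProt, before deciding on a solubilization strategy.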
While bioinformatics approaches are powerful, they are only predictions, so gathering a consensus from multiple independent bioinformatics approaches or looking for validation through experimental means (e.g. from published literature) is always worthwhile. For example, human cytotoxic T-lymphocyte antigen 4 (CTLA-4) is an obligate dimer and requires N-glycosylation of Asn78 and Asn110 for dimerization [6]. This PTM cannot be made in E. coli, so spending a little time getting to know your protein can save a lot of heartache later on. In essence, without the use of synthetic biology approaches (see below), the only eukaryotic-like PTM that E. coli performs is disulfide bond formation in the periplasm [7].
It is also often worthwhile using bioinformatics approaches, e.g. JPRED [8], to look for domain boundaries and to predict intrinsically disordered protein (IDP) regions. Expressing a construct that is too short and misses an essential part of a domain, e.g. a β-strand, is always going to result in failure, while expressing a construct that is too long and includes flexible regions prone to proteolysis is likely to result in either heterogeneity or loss of a purification tag. Proteins with large IDP regions are often problematic to make as they are prone to degradation. However, many IDP regions may gain structure upon interaction with other molecules, e.g. upon protein complex formation (such as ACTR with the nuclear co-activator binding domain (NCBD)) [9], so co-expression of a partner may help considerably in obtaining the protein in a stable and soluble form.
Before cloning the gene for the protein you want, it is worth considering how you are going to subsequently purify it, as this may affect the construct you want to express. The most powerful first step in the purification of soluble proteins is affinity chromatography (if possible). This can exploit either the endogenous properties of the protein, e.g. immobilized-ligand or substrate mimic chromatography (e.g. Cibacron Blue F3GA [10] or cyclic peptide-based ligands [11]), or an added purification tag, e.g. a maltose-binding protein (MBP)-tag, a glutathione-S-transferase (GST)-tag or, most commonly, a hexahistidine tag (His-tag) allowing the use of immobilized metal affinity chromatography (IMAC). For an overview of possible affinity tags, refer to [12]. If the structure of your protein or something closely related is available, it is worthwhile looking at the accessibility of the N- and C-termini to see whether an added tag is likely to be disruptive to the structure, e.g. if the protein termini are buried. Alternatively, structure prediction programs such as Phyre2 [13] could be used. While very useful and widely used, N-terminal His-tags may increase the heterogeneity of your final product due to variable (phospho)gluconylation of the N-terminus [14].
Depending on the end use of the protein, you may want to be able to remove the affinity tag after purification by proteolysis. Enzymes with broad specificity can sometimes be used, e.g. trypsin can be used to remove both an N-terminal tag and the C-peptide from insulin derivatives [15], but usually removal of affinity tags is mediated by more highly specific proteases such as TEV (consensus site ENLYFQ↓G/S) and Factor Xa (consensus site IE/DGR) [12]. Care should be taken over the source of the protease; for example, recombinant bovine Factor Xa is reported to have a different specificity than recombinant human Factor Xa [16,17]. See also the MEROPS database for other proteases [18]. Most proteases have specificity for sequences both before and after the site of cleavage, so often one or more amino acids from the cleavage site are left on the mature protein. In addition, proteases cannot access buried cleavage sites, so the cleavage site is often put into a flexible linker region (usually glycine/serine-rich), which may add more residues to the mature protein.
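To see which residues a tag-removal step leaves behind, the cleavage can be simulated on the construct sequence before any cloning is done. The sketch below is a minimal illustration using only the TEV consensus quoted above (ENLYFQ↓G/S); the function name and the example construct are hypothetical, and real protease behaviour also depends on site accessibility, which a sequence search cannot capture.

```python
import re

# TEV consensus from the text: cleavage occurs after Gln in ENLYFQ when the
# next residue is Gly or Ser. The lookahead keeps that G/S on the C-terminal
# fragment, i.e. on the mature protein.
TEV_SITE = re.compile(r"ENLYFQ(?=[GS])")


def tev_fragments(seq: str) -> list[str]:
    """Split a protein sequence at every ENLYFQ|G/S site and return the fragments."""
    seq = seq.upper()
    fragments = []
    start = 0
    for match in TEV_SITE.finditer(seq):
        fragments.append(seq[start:match.end()])  # fragment ends with ...ENLYFQ
        start = match.end()
    fragments.append(seq[start:])  # remainder, starting with G or S if cleaved
    return fragments


# Hypothetical His-tagged construct: tag, linker, TEV site, then the POI.
tag_part, mature = tev_fragments("MHHHHHHGSENLYFQGAMKT")
```

In this example the mature fragment keeps the glycine from the cleavage site at its N-terminus, which is the kind of leftover residue the paragraph above warns about.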
Fusion proteins can be used not only to aid purification but also to add solubilization tags. Such tags, which are often small, highly soluble, and stable proteins, can aid not only in the solubilization of the final product but also in the solubilization of folding intermediates. If a eukaryotic protein has more than one N-glycan per 100 amino acids, a solubilization tag may be essential to produce it in a soluble form in E. coli. Commonly used solubilization tags include MBP (which doubles as an affinity purification tag), thioredoxin, Sumo, or Fh8. For solubilization tags, there needs to be a balance: if they help too little, soluble protein may not be achieved. Conversely, if they solubilize too much, false positives may be obtained where the final product is soluble despite the POI not being correctly folded. This balance often has to be achieved by trial and error.
Even with careful selection of domain boundaries and possible solubilization tags, not all eukaryotic proteins can fold to a native state in E. coli. This is linked to issues of protein folding, PTMs, and/or the protein being part of an unknown obligate complex. E. coli contains a wide range of molecular chaperones (e.g. GroEL/ES, DnaK, Skp) and ten peptidyl-prolyl cis-trans isomerases, so issues related to protein folding are usually linked with one of the following: (i) translation rates (see below); (ii) oxidative folding, i.e. the formation of disulfide bonds; (iii) the protein having an essential PTM which E. coli cannot perform; (iv) the protein having a buried prosthetic group which wildtype E. coli cannot make or which becomes limiting (in some cases this can be solved by the addition of the moiety to the growth media); (v) rare cases where a specialized folding factor is involved in folding the protein, e.g. to express a hyperthermophilic α-amylase from Pyrococcus furiosus (a hyperthermophilic archaeon) in E. coli, the co-expression of a small heat shock protein (sHSP) or chaperonin (HSP60) from P. furiosus was found to be essential [19]. For an overview of alternate expression platforms and genetic engineering approaches available to carry out PTMs in heterologous proteins, refer to [20].
Native disulfide bond formation is the most common issue. There are three approaches to deal with it. Firstly, the protein could be allowed to form aggregates, or inclusion bodies, of misfolded/unfolded protein. Inclusion bodies are relatively easy to purify, and the protein can then be refolded in vitro [21,22]. Secondly, the protein could be targeted to the periplasm via the addition of an N-terminal periplasmic signal sequence. Here there is machinery for native disulfide formation [7]. While this is a powerful technique, both the Sec secretion system and the folding apparatus in the periplasm can easily be overwhelmed, so (extreme) care must be taken [23]. Thirdly, an engineered strain could be used that removes disulfide bond reducing pathways from the cytoplasm [24,25], or adds oxidative folding catalysts, reviewed in [26]. This can be combined with the TAT-secretion system for exporting folded proteins to the periplasm, e.g. [27,28]. Similar synthetic biology approaches also allow other PTMs to be made in the cytoplasm, for example mucin-type O-glycosylation in E. coli [29].
Finally, it should be remembered that the cytoplasm of E. coli contains methionine aminopeptidase, which can remove the initiating methionine [30], depending on the subsequent amino acids (e.g. serine, alanine, cysteine, proline, or glycine at P1′ preferred, Pro at P2′ inhibits), with engineered systems extending the list, e.g. [31]. This also combines with the N-end rule for protein clearance from a cell. For E. coli, proteins with an N-terminal Arg, Lys, Leu, Phe, Tyr, or Trp can be rapidly degraded [32], but this depends on the context of the N-terminal and subsequent amino acids [33,34].
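These two N-terminal rules are easy to apply to a construct sequence. The following sketch is a minimal illustration that encodes only the residue lists given above (Ser, Ala, Cys, Pro, or Gly at P1′ favour methionine removal, Pro at P2′ inhibits it, and N-terminal Arg, Lys, Leu, Phe, Tyr, or Trp is destabilizing). The function names are our own, and the sequence context effects mentioned in [33,34] are deliberately ignored.

```python
# Residue sets taken from the paragraph above (one-letter codes).
MAP_FAVOURED_P1_PRIME = set("SACPG")   # residue directly after the initiating Met
N_END_DESTABILISING = set("RKLFYW")    # N-terminal residues linked to rapid degradation


def mature_after_map(seq: str) -> str:
    """Return the sequence after likely methionine aminopeptidase processing.

    The initiating Met is treated as removed when P1' is favoured and P2' is not Pro.
    """
    seq = seq.upper()
    if len(seq) < 2 or seq[0] != "M":
        raise ValueError("expected a sequence of at least two residues starting with Met")
    p1_ok = seq[1] in MAP_FAVOURED_P1_PRIME
    p2_inhibits = len(seq) > 2 and seq[2] == "P"
    return seq[1:] if p1_ok and not p2_inhibits else seq


def is_n_end_destabilising(mature_seq: str) -> bool:
    """True if the first residue of a mature sequence is on the destabilising list."""
    return bool(mature_seq) and mature_seq[0].upper() in N_END_DESTABILISING
```

The second check is most relevant for N-termini created after translation, for example the fragment left by proteolytic tag removal, since a retained initiating methionine is not on the destabilizing list.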
After all these considerations, if no purified protein is obtained, a simple troubleshooting sodium dodecyl sulfate/polyacrylamide gel electrophoresis (SDS/PAGE) analysis may quickly help elucidate the possible issues (Figure 2). SDS/PAGE analysis can be complemented by other techniques including mass spectrometry, Western blotting, activity assays for the POI etc.
Figure 2. SDS/PAGE gel troubleshooting
The gene and its properties
Once details of the protein construct are finalized, it is time to turn your attention to the gene. Just as much care must be taken with the gene as with the protein construct, or yields may be low. One important concept that is often forgotten in protein expression is cellular homeostasis, i.e. everything in balance. Too often a high-copy number plasmid is used with a strong promoter, but this will invariably result in less protein than could be produced, as too many cellular resources are put into making plasmid deoxyribonucleic acid (DNA) and messenger RNA (mRNA), and the amount of mRNA produced far exceeds the capacity of the translation apparatus (Figure 3).
Figure 3. Proteostasis and the balance between gene copy number, promoter strength and recombinant protein expression levels
A multitude of genetic engineering strategies have been developed over the years to enable efficient cloning of protein expression constructs [35,36]. While industry often integrates genes into the bacterial chromosome to avoid the problem of plasmid loss during large scale fermentation, the academic approach more usually uses plasmids for expression as they are faster and cheaper to use. Plasmid selection for protein production is based on (i) copy number, which depends on the origin of replication of the plasmid (Table 2); (ii) promoter (Table 3); (iii) selection marker (Table 4). There is a balance between plasmid copy number and promoter strength (Figure 3) to maximize cellular resources going into protein production. This balance also depends on the media, with chemically defined minimal media being more sensitive to alterations in copy number and promoter strength, in particular when either is excessively high. Recent advancements in synthetic biology have led to growth-decoupled recombinant protein production through the co-expression of a bacteriophage-derived E. coli ribonucleic acid (RNA) polymerase inhibitor peptide called Gp2 [37]. This approach allows the modulation of metabolic resources so that they are exclusively utilized to produce the POI.
Ori . | Typical copy number . | Example vector . | References . |
---|---|---|---|
pMB1 | ∼15–20 | pBR322 | [77] |
pMB1 (derivative) | ∼500–700 | pUC/pGEM | [78] |
pBR322 | ∼15–20 | pET/pGEX | [79] |
ColE1 | ∼15–25 | pColE1 | [80] |
ColE1 (derivative) | ∼300–500 | pBluescript | [81] |
p15A | ∼10 | pACYC | [82] |
R6K | ∼10–15 | pR6K | [83] |
pSC101 | ∼5 | pSC101 | [84,85] |
pSC101 or R6K are compatible with p15A and with one origin from the set pMB1/pBR322/ColE1, i.e. they can coexist in the same cell on different plasmids; other combinations are not compatible. R6K and pSC101 require the pir and dnaA genes, respectively, for replication.
Promoter . | Comments . |
---|---|
Lac | Relatively low constitutive expression in the absence of the lacI repressor. Inducible by IPTG and allolactose (formed from lactose by the action of LacZ). Repressed by glucose. |
LacUV5 | Similar to the lac promoter but stronger due to more efficient recruitment of RNA polymerase. |
T7 | Based on the T7 bacteriophage system, which promotes high levels of transcription. This promoter cannot be recognized by the host polymerase, so requires T7 polymerase, often chromosomally integrated under the control of a lacUV5 promoter. |
T5 | Based on T5 bacteriophage early promoter and the lac-operon. It contains three LacI binding sites and is strongly repressed in LacIq strains. Inducible by IPTG and lactose. |
Tac | A hybrid of the lacUV5 and trp promoters. Higher expression levels than either, with tight regulation. Inducible by IPTG and allolactose and repressed by LacI and glucose. |
araBAD | Tunable induction by L-arabinose. Tightly regulated independent of the presence of other carbon sources. Depends on Ara status of the host cell. |
rhaBAD | Tunable induction by L-rhamnose. Low basal expression. Tightly regulated independent of the presence of other carbon sources. |
proU | Promoter from an osmoregulated operon. Induction by higher osmolarity, e.g. increased [NaCl] in the media. |
Antibiotic based

Name . | Mechanism of action and inactivation . | References . |
---|---|---|
Ampicillin | Acts as an inhibitor of transpeptidase and causes cell lysis. The ampr gene encodes β-lactamase, which catalyzes the hydrolysis of the β-lactam ring of ampicillin. | [86] |
Chloramphenicol | Acts to inhibit protein synthesis by the ribosome and hence is bacteriostatic. The camr gene encodes an acetyltransferase that catalyzes the formation of inactive hydroxyl acetoxy derivatives. | [87] |
Kanamycin | Binds to the 30S ribosome subunit and causes misreading of mRNA. The kanr gene encodes an enzyme that phosphorylates kanamycin, thereby inactivating it. | [88] |
Tetracycline | Tetracycline blocks the A site of the ribosome, preventing entry by tRNAs. The tetr gene encodes an efflux protein transporting tetracycline out of the cytosol. | [89] |
Streptomycin | Streptomycin binds to 16S ribosomal RNA, inhibiting protein synthesis. The strepr gene encodes aminoglycoside modifying enzymes such as nucleotidyltransferases or phosphotransferases, which inactivate streptomycin. | [90] |
FabV-Triclosan | Plasmid system expressing FabV which protects against deleterious effects of Triclosan added to the growth media. | [91] |
Non-antibiotic based

Name . | Mechanism . | References . |
---|---|---|
Gene KO | Plasmid carries the wildtype gene to complement the auxotrophy in a knockout E. coli strain, e.g. ΔProBA, ΔTpiA, ΔglyA, ΔQAPRTase. | [92–95] |
lac-DapD | Plasmid-mediated repressor titration: the engineered host strain contains dapD under control of the lac operator/promoter (lacO/P). A plasmid containing lacO releases repression of dapD by titration of LacI. | [96] |
ColE3-Amn | The vector contains the C-terminal ribonuclease domain of colicin E3 with amber stop codon(s) at the 5′ terminus. Allows propagation of the vector in E. coli cells without amber suppressor activity. | [97] |
The plasmid is not the only decision to make. The source of the gene is also important. For decades, the normal source of the gene for the POI was the original organism, e.g. a complementary DNA (cDNA) library obtained by reverse transcription-polymerase chain reaction (RT-PCR) from an mRNA pool (to avoid introns). While this can be fast, cheap and efficient, it can give rise to problems connected with differences in translation initiation and codon usage between prokaryotes and eukaryotes.
While eukaryotic ribosomes bind to the cap at the 5′ end of the mRNA and then move down the mRNA until they initiate translation from the first AUG codon with a Kozak sequence in front of it, prokaryotic ribosomes bind to a sequence on the mRNA known as the Shine–Dalgarno (SD) sequence or ribosome-binding site (rbs; Figure 4). The rbs is usually 5–13 base pairs [38] upstream of the initiating AUG (optimal distance 5–6 base pairs [39]) and is complementary to the 3′ end of the 16S ribosomal RNA. In E. coli, this sequence is AGGAGGU [40]. The requirement for a distinct rbs has two consequences for eukaryotic protein expression in E. coli. Firstly, an rbs must be present before the initiating AUG. This may be present in the plasmid outside the multicloning site, but care should be taken that it is within the correct distance and that there are no other AUG trinucleotides from which translation could initiate. Secondly, this nucleotide sequence should not appear inside the gene of interest. An internal rbs will either result in the generation of a second protein (if there is an AUG at the correct distance from it) or will result in translation stalling as a ribosome binds to this site and prevents translation through it. For this reason, care must be taken with the codons used for Gly–Gly pairs (i.e. not GGA–GGU), Arg–Arg pairs (i.e. not AGG–AGG), and sequences around Glu (GAG), including Glu–Glu pairs (GAG–GAG). AGG and GGA codons are rarely used by E. coli (see below), so care with codon optimization to avoid internal rbs mostly relates to sequences around Glu (Q/K/E-E or E-V).
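A coding sequence can be screened for internal SD-like motifs with a simple string search. The sketch below is a minimal illustration under stated assumptions: it looks for exact matches to AGGAGG (the first six nucleotides of the E. coli sequence given above) and reports whether an AUG starts 5–13 nucleotides after the motif. The function name, the choice of core motif, and the way the spacing is counted are our own; partial matches to the SD sequence may also bind ribosomes, so a shorter core can be passed via `motif`.

```python
def find_sd_like_sites(mrna: str, motif: str = "AGGAGG",
                       min_gap: int = 5, max_gap: int = 13) -> list[tuple[int, bool]]:
    """Locate SD-like motifs in an mRNA (or DNA coding strand) sequence.

    Returns (position, aug_nearby) tuples, where position is the 0-based index
    of the motif and aug_nearby is True if an AUG begins min_gap..max_gap
    nucleotides after the end of the motif.
    """
    mrna = mrna.upper().replace("T", "U")  # accept DNA input as well
    motif = motif.upper().replace("T", "U")
    hits = []
    start = mrna.find(motif)
    while start != -1:
        end = start + len(motif)
        aug_nearby = any(
            mrna[end + gap:end + gap + 3] == "AUG"
            for gap in range(min_gap, max_gap + 1)
        )
        hits.append((start, aug_nearby))
        start = mrna.find(motif, start + 1)  # continue, allowing overlapping hits
    return hits
```

A hit with `aug_nearby` set corresponds to the risk of a second, internally initiated protein, while a hit without a nearby AUG corresponds to the stalling risk; either can be removed by choosing synonymous codons.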
Figure 4. Schematic representation of initiation of translation in prokaryotes and eukaryotes
Codon usage is not equally distributed among the available codons, varies considerably between organisms (Table 5), and correlates with the corresponding transfer RNA (tRNA) levels [41]. mRNA that contains multiple rare codons can exhibit translation stalling and mRNA degradation, reviewed in [42]. Codon usage issues can be examined by bioinformatic approaches, e.g. Graphical Codon Usage Analyzer [43]. One method to prevent this problem is the overexpression of rare tRNAs [44,45], e.g. from pLysSRARE [46]. For more detailed insights into codon usage, refer to [47]. The more usual approach now is the use of synthetic genes that can be codon optimized for the expression host, while simultaneously avoiding internal rbs, internal restriction sites, and factors that influence mRNA structure and stability [48,49]. As prices have rapidly dropped, a synthetic gene can cost less than the labor and material costs associated with cloning a gene from a cDNA library.
Codon . | Amino acid . | Expected usage . | E. coli (W3110) usage . | E. coli (W3110) ratio . | Homo sapiens usage . | Homo sapiens ratio . |
---|---|---|---|---|---|---|
AGG | Arg | 0.17 | 0.02 | 7.2 | 0.21 | 0.8 |
CUA | Leu | 0.17 | 0.04 | 4.7 | 0.07 | 2.4 |
AGA | Arg | 0.17 | 0.04 | 4.0 | 0.22 | 0.8 |
UAG | Stop | 0.33 | 0.06 | 5.2 | 0.24 | 1.4 |
AUA | Ile | 0.25 | 0.07 | 3.6 | 0.17 | 1.5 |
CUC | Leu | 0.17 | 0.10 | 1.6 | 0.20 | 0.8 |
GGA | Gly | 0.25 | 0.11 | 2.3 | 0.25 | 1.0 |
CCC | Pro | 0.25 | 0.12 | 2.0 | 0.32 | 0.8 |
ACA | Thr | 0.25 | 0.13 | 1.9 | 0.28 | 0.9 |
GGG | Gly | 0.25 | 0.15 | 1.7 | 0.25 | 1.0 |
CCU | Pro | 0.25 | 0.16 | 1.6 | 0.29 | 0.9 |
GCU | Ala | 0.25 | 0.16 | 1.6 | 0.27 | 0.9 |
AAG | Lys | 0.5 | 0.23 | 2.1 | 0.57 | 0.9 |
GAG | Glu | 0.5 | 0.31 | 1.6 | 0.58 | 0.9 |
The expected usage for each codon is based on the number of codons encoding that amino acid. The ratio of expected usage to actual usage in an organism shows the relative underuse of the codon. Codon usage in E. coli and human genes is quite different, with only the CUA codon being relatively underused in both organisms. Codon data taken from [69].
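The E. coli column of Table 5 can be used directly to flag rare codons in a gene taken from a cDNA source. The sketch below is a minimal illustration: the dictionary holds the sense codons and usage fractions listed in Table 5 (the UAG stop codon is left out), while the function name and the default cut-off of 0.10 are our own arbitrary choices rather than values from the review. Dedicated tools such as those cited in the text use complete codon tables.

```python
# E. coli (W3110) usage fractions for the sense codons listed in Table 5.
ECOLI_USAGE = {
    "AGG": 0.02, "CUA": 0.04, "AGA": 0.04, "AUA": 0.07, "CUC": 0.10,
    "GGA": 0.11, "CCC": 0.12, "ACA": 0.13, "GGG": 0.15, "CCU": 0.16,
    "GCU": 0.16, "AAG": 0.23, "GAG": 0.31,
}


def rare_codons(cds: str, threshold: float = 0.10) -> list[tuple[int, str, float]]:
    """List codons in a coding sequence whose Table 5 usage is at or below threshold.

    Returns (codon_number, codon, usage) tuples with 1-based codon numbering.
    Codons that are not in the Table 5 subset are ignored.
    """
    rna = cds.upper().replace("T", "U")  # accept DNA or RNA input
    if len(rna) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of three")
    hits = []
    for i in range(0, len(rna), 3):
        codon = rna[i:i + 3]
        usage = ECOLI_USAGE.get(codon)
        if usage is not None and usage <= threshold:
            hits.append((i // 3 + 1, codon, usage))
    return hits
```

Clusters of flagged codons, rather than isolated ones, are the more likely cause of the stalling described above, so the positions in the output are worth inspecting together.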
Synthetic genes can also help mitigate the potentially deleterious effects of one other difference between eukaryotic and prokaryotic protein translation: translation rates. In prokaryotes such as E. coli, transcription and translation are coupled, with transcription rates of approx. 50 nucleotides/s and translation rates of approx. 16 amino acids/s [50]. In contrast, translation in eukaryotes is slower, with a rate of approx. 3 amino acids/s [51]. Protein folding has evolved in parallel with these translation rates, so when a eukaryotic protein is expressed in E. coli, the rate of translation may be faster than the rate of folding. For multidomain proteins, this can be a serious issue (Figure 5). It can be mitigated by modulation of translation rate [52], codon usage harmonization [53], or the use of rarer codons just after domain boundaries to cause ribosome stalling [54].
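The size of this mismatch is easier to appreciate with numbers. The short sketch below simply divides chain length by the approximate rates quoted above (16 amino acids/s for E. coli and 3 amino acids/s for eukaryotes); the constant and function names and the 480-residue example protein are hypothetical, and real rates vary along a transcript.

```python
# Approximate average elongation rates quoted in the text (amino acids per second).
ECOLI_AA_PER_S = 16
EUKARYOTE_AA_PER_S = 3


def translation_time_s(n_residues: int, rate_aa_per_s: float) -> float:
    """Time in seconds to translate n_residues at a constant elongation rate."""
    if rate_aa_per_s <= 0:
        raise ValueError("rate must be positive")
    return n_residues / rate_aa_per_s


# Hypothetical 480-residue protein.
ecoli_time = translation_time_s(480, ECOLI_AA_PER_S)          # 30 s
eukaryote_time = translation_time_s(480, EUKARYOTE_AA_PER_S)  # 160 s
```

Under these assumptions a domain that has a couple of minutes to fold co-translationally in its native host has about half a minute in E. coli, which is why slowing translation at domain boundaries can help.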
Figure 5. The influence of translation rate on protein folding efficiency
A specialized ribosome system aimed specifically at the expression of the POI in E. coli by modifying the SD sequence of the mRNA and corresponding anti-SD sequence of the 16S rRNA was first reported by Hui and De Boer in 1987 [55]. Alternative ribosome systems such as the orthogonal riboswitch system [56], the RiboTite system [57], and the Ribo-T system [58] have been reported since. The riboswitch system allows tunable co-expression of multiple genes in a dose-dependent response to small synthetic molecules while the RiboTite system, which builds on the riboswitch technology, has been shown to harmonize protein translation rates with protein secretion [59]. The Ribo-T system employs an engineered hybrid rRNA composed of both small and large subunit rRNA sequences, in which short RNA linkers covalently link the subunits into a single translating unit [58]. This orthogonal ribosome–mRNA system is capable of supporting bacterial growth even in the absence of wildtype ribosomes and its improved tethered version has been reported recently [60].
Another difference between eukaryotic and prokaryotic protein translation can be an advantage for recombinant protein production. Many prokaryotic genes are expressed in operons, where a single promoter results in the production of multiple proteins from a single mRNA that has an rbs before the initiating AUG of each (Figure 4). This allows both the co-expression of subunits that form complexes and the co-expression of ancillary factors that may be required for the protein to reach the native conformation.
Strains and media for small-scale expression screening
Once a suitable construct for protein expression has been generated, the next step is to express the protein, which again requires rational choices. E. coli is a remarkably diverse bacterial species, with only approx. 20% of the genome common to all strains [61]. It can be broadly split into four subgroupings based on initial isolation: K-12, B, C, and W strains [61]. Many K-12 and B-strains are used for recombinant protein production (Table 6). Some POIs show strong strain dependence, often for unclear reasons, so we routinely test any new protein in at least one K-12 and one B-strain. Similarly, there is a wide variety of media choices, which can be broadly split into rich media (which contain yeast extract and/or another mixed source of peptides such as tryptone) and chemically defined or minimal media (which often contain only 1–3 carbon sources and a single nitrogen source). Again, some POIs show strong media dependence for production, so we routinely test any new protein in at least one rich medium and one chemically defined medium. While Luria–Bertani (LB) medium used to be the default for academic protein production, it has been largely superseded by media that support higher-density cultures, as higher cell mass usually results in higher protein yields. In particular, auto-induction media, e.g. [62], both facilitate the screening of multiple POIs and allow culture densities typically 10× higher than LB. Additionally, an alternative growth medium for recombinant protein production in E. coli, which allows the controlled release of substrates and thereby mimics fed-batch process conditions at a small scale, has been reported [63].
Strain | Comments | References |
---|---|---|
BL21 | B-derived strain widely used for production of recombinant proteins. Deficient in Lon and OmpT proteases. | [98] |
BL21 (DE3) | Derived from BL21; routinely used for protein expression under the control of a T7 promoter regulated by T7 RNA polymerase carried by the DE3 prophage (chromosomal integration under the control of a lacUV5 promoter). | [98] |
C41 (DE3)/C43 (DE3) | Derived from BL21(DE3) with uncharacterized mutations that allow them to produce some toxic and membrane proteins. | [99] |
MG1655 | Well characterized K-12 derived strain widely used for recombinant protein production. Higher stress resistance allows high-cell density fermentation. | [100] |
W3110 | Closely related to MG1655. Stress resilience and membrane stiffness allow high-cell density fermentations for heterologous protein production. | [100] |
RV308 | A K-12 derived strain mutated for industrial protein production; offers increased protein yields and low acetate production. | [101] |
HMS174(DE3) | K-12 derived strain with a recA mutation. These strains stabilize certain target genes whose products may cause the loss of the DE3 prophage and allow heterologous protein production under the control of a T7 promoter. | [101] |
For genotypes of these and other strains, see for example https://openwetware.org/wiki/E._coli_genotypes.
In addition to strain and media, the temperature of the culture post-induction can play a key role in the yield of folded protein. This effect probably arises both from the change in relative hydrophobicity with temperature and from the slower rate of protein translation [64], which helps to keep protein synthesis within the capacity of the folding machinery. If you choose to use a non-autoinducing medium, the concentration of inducer (e.g. isopropyl β-d-1-thiogalactopyranoside (IPTG)) and the timing and length of induction can also significantly influence the yield of folded protein and may need optimization.
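A small-scale screen over these variables is a full factorial design, which can be enumerated as in the sketch below. The strains are one B-derived and one K-12-derived entry from the strain table (both carry DE3, so a T7 promoter construct is assumed); the temperatures and IPTG concentrations are illustrative assumptions, not recommendations from the source.

```python
from itertools import product

# One B-derived and one K-12-derived strain from the strain table.
strains = ["BL21(DE3)", "HMS174(DE3)"]
# One rich and one chemically defined medium, as suggested in the text.
media = ["rich", "chemically defined"]
# Illustrative values only (assumptions, not taken from the source).
temperatures_c = [37, 30, 18]
iptg_mm = [0.1, 1.0]

# Every combination of strain, medium, temperature, and inducer concentration.
conditions = [
    {"strain": s, "medium": m, "temp_c": t, "iptg_mM": c}
    for s, m, t, c in product(strains, media, temperatures_c, iptg_mm)
]

print(f"{len(conditions)} conditions")
for i, cond in enumerate(conditions[:3], start=1):
    print(i, cond)
```

Even this modest grid yields 24 cultures per construct, which is one reason auto-induction media, which remove the inducer concentration and induction timing from the grid, are attractive for screening.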
Once small-scale screening experiments have concluded positively and you have chosen your expression construct and strain, you may want to scale up the production and purification of your protein, depending on the end use. For an extensive overview of upstream and downstream process development strategies for the production of heterologous proteins in E. coli, refer to [1,65].
Summary
E. coli is an excellent host for recombinant protein production in both academia and industry.
A rational approach is required for successful protein production. Understanding, or predicting with bioinformatics tools, the biophysical characteristics of the protein is essential.
Correct identification of domain boundaries, signal sequences, TM regions, obligate oligomeric complex formation, and PTMs is critical.
It is equally important to consider genetic and translation factors, such as codon usage, the nature and position of the rbs and differences between prokaryotic and eukaryotic translation rates.
Other factors such as the strain and media used also impact protein yield, but they cannot compensate for poor planning.
Competing Interests
The authors declare that there are no competing interests associated with the manuscript.
Funding
This work was supported by the European Union’s Horizon 2020 Research and Innovation Programme under Marie Sklodowska-Curie [grant number 642937].
Author Contribution
L.W.R. conceived the article. All authors contributed to the writing.
Abbreviations
- cDNA: complementary DNA
- DNA: deoxyribonucleic acid
- IDP: intrinsically disordered protein
- LB: Luria–Bertani
- MBP: maltose-binding protein
- mRNA: messenger RNA
- POI: protein of interest
- PTM: post-translational modification
- rbs: ribosome-binding site
- SD: Shine–Dalgarno
- SDS/PAGE: sodium dodecyl sulfate/polyacrylamide gel electrophoresis
- TM: transmembrane
- tRNA: transfer RNA
References
Author notes
These authors contributed equally to this work.