Uncovering protein structure

Stollar, Elliott J; Smith, David P

doi:10.1042/EBC20190042

Abstract

Structural biology is the study of the molecular arrangement and dynamics of biological macromolecules, particularly proteins. The resulting structures are then used to help explain how proteins function. This article gives the reader an insight into protein structure and the underlying chemistry and physics that are used to uncover it. We start with the chemistry of amino acids and how they interact within and between proteins. We also explore the four levels of protein structure and how proteins fold into discrete domains. We consider the thermodynamics of protein folding and why proteins misfold. We look at protein dynamics and how proteins can take on a range of conformations and states. In the second part of this review, we describe the variety of methods biochemists use to uncover the structure and properties of proteins that were described in the first part. Protein structural biology is a relatively new and exciting field that promises to provide atomic-level detail for more and more of the molecules that are fundamental to life processes.

Introduction

Proteins are one of the most important classes of molecules for life and underpin the field of biochemistry. To fully understand their role, it is essential to explore both their structure and their function; this review focuses on how we uncover protein structure. To understand structure, we explore the chemical nature of amino acids, which are the building blocks of proteins. We consider how interactions between amino acids help proteins fold and fluctuate as they adopt a variety of structures. Furthermore, to understand how we experimentally study protein structure, we explore fundamental concepts in physics and associated computational methods. This topic is truly interdisciplinary and, in addition to biochemistry, spans the fields of biophysics, structural biology and computational biology.

We start by describing the four levels of protein structure and the variety of protein domains and architectures that exist. Proteins are biological molecules produced in living cells, so we must also consider how a long chain of amino acids produced by the ribosome can transition to the folded structure that is central to the protein’s function. As such, we consider protein folding thermodynamics and what happens when proteins misfold inside a cell. We also explore other universal properties of proteins, including their ability to change shape, known as conformational change. In particular, although proteins usually exist in one dominant conformation, we discuss how they actually exist as a population (ensemble) of rapidly interconverting conformations, which allows them to be flexible and to adapt their shape as required for function. We then discuss in detail the primary techniques used to study protein structure and dynamics that have provided these insights. Given the interdisciplinary nature of this topic, we have provided some stand-alone boxes along the way to give more detail about the fundamental science behind these concepts.

Part 1: The structural properties of proteins

Proteins

Proteins are one of the four major classes of molecules that direct life, alongside nucleic acids (deoxyribonucleic acid (DNA) and ribonucleic acid (RNA)), lipids (fats) and polysaccharides (sugars). All of these large ‘macromolecules’ are carbon-based covalent compounds that use weak, reversible non-covalent interactions to fold and interact with their targets, giving the molecules and their complexes distinct shapes and dynamics. Proteins are polymers of typically hundreds of amino acids joined together by peptide bonds, whereas shorter chains (fewer than 30 amino acids) are typically referred to as peptides. Each amino acid has a common structure containing a central α carbon atom (C_α) that is joined to an amino group (–NH₂) and a carboxylic acid group (–COOH), both of which are used to form peptide bonds. Most interestingly, for 19 of the 20 different amino acids the C_α atom is also bonded to a different R group, giving each amino acid its unique ‘side chain’. The side chain gives the amino acid distinctive structural and chemical properties, as side chains differ in size, shape, polarity, charge and hydrophobicity (Figure 1). Amino acids are also chiral and can be configured in two possible mirror images (stereoisomers) because the C_α atom is bonded to four different groups and so forms a chiral centre. As mirror images, stereoisomers cannot be superimposed, in the same way that your hands are mirror images and cannot be rotated to match. The two stereoisomers of each of the 19 chiral amino acids are denoted d and l; however, only the l-stereoisomer is used in nature to construct proteins (glycine has a hydrogen atom for a side chain and is not chiral).

Figure 1. l-Amino acids

CATH https://www.cathdb.info/	SCOP http://scop.mrc-lmb.cam.ac.uk
• Class: Structures are classified according to their secondary structure composition (mostly α, mostly β, mixed α/β or few secondary structures).	• Class: Structures are classified according to their secondary structure composition (mostly α, mostly β, mixed α/β or few secondary structures).
• Architecture: Structures are classified according to their overall shape as determined by the orientations of the secondary structures in 3D space, ignoring the connectivity between them.	• Fold: Domains are grouped on the basis of the global structural features shared by the majority of their members.
• Topology (fold family): Structures are grouped into fold groups at this level depending on both the overall shape and connectivity of the secondary structures.	• Superfamily: The domains in a fold are grouped into superfamilies, which have at least a distant common ancestor.
• Homologous superfamily: This level groups together protein domains which are thought to share a common ancestor.	• Family: The domains in a superfamily are grouped into families, which have a more recent common ancestor.

Motion	Distance moved (Å)	Time taken (s)	Energy source
Atomic or molecular vibrations	∼0.01 to 1	∼10⁻¹⁵ to 10⁻¹¹	Thermal energy
Collective motions, from fast (e.g. amino acid side-chain movements such as ring flips) to slow (e.g. domain shifts)	∼0.01 to >5	∼10⁻¹² to 10⁻³	Thermal energy
Binding induced conformational changes	∼0.5 to >10	∼10⁻⁹ to 10³	Binding interactions

Scientific concept	What does it tell you?	How long does it take and how many samples can be analysed?
Small Angle X-Ray Scattering (SAXS)
Diffraction by non-crystalline samples, such as powders or solutions in which the molecules are randomly oriented, is usually called scattering. The diffraction pattern is averaged spherically, in all directions, because the X-ray beam encounters all the possible orientations of the molecules in the sample. The pattern still contains information about how the electron density varies with distance from the centre of the molecules that make up the sample.	Analysing the scattered intensities at different (small) angles provides a distance distribution function, which gives the frequencies of all possible intramolecular distances in a protein. From this you can model the overall protein shape and generate a simple protein envelope. Since samples are in solution, you can readily detect dynamics, binding and conformational changes. The data also allow you to calculate a radius of gyration (a measure of how far the mass is spread from the centre of the molecule).	The data can be recorded relatively quickly, within 1 day, and only require well-matched buffer solutions so that the scattering contribution from the buffer can be subtracted. Some facilities can now analyse multiple samples within a 96-well plate, but most commonly only one sample can be measured at a time.
Isothermal Titration Calorimetry
Measuring heat changes when adding molecules to protein solutions.	It tells you how well the molecule binds and the enthalpy, entropy and free energy of the interaction.	One titration for one interaction takes approximately 2 h.
Native Mass Spectrometry
Electrospray ionisation works by applying a high voltage to a sample dissolved in a volatile solvent. This causes protein complexes to become ionised and move into the gas phase. The mass-to-charge ratio of each ion, and from it the molecular mass, can be calculated from how long the ion takes to travel a set distance. This is called time of flight (TOF).	The molecular weight of proteins and complexes can be determined in the gas phase. Changes in mass can tell you whether the protein has bound to another molecule, for example, a metal ion or drug.	Each spectrum can be acquired in a few seconds, so many samples can be measured in a day, but data analysis can take many more hours.
General Fluorescence
Fluorescence involves using a beam of light that excites the electrons in molecules of certain compounds and causes them to emit light of a longer wavelength.	Different fluorophores absorb and emit light at different wavelengths, depending on their local environment. For example, the molecule 8-anilino-1-naphthalenesulphonic acid (ANS) is an extensively used fluorescent probe for the characterisation of proteins and binding sites, as it only fluoresces when bound to hydrophobic patches on a protein.	The process is very rapid, occurring within milliseconds. With multiwell plate readers, many hundreds of measurements can be recorded within a few minutes.
Differential Scanning Fluorimetry
When proteins are folded they bury their hydrophobic core and cannot bind a fluorescent dye. As the sample is heated during a temperature ramp, the protein unfolds and binds the dye, and the fluorescence of the dye increases.	It tells you the temperature at which half the protein is unfolded, also known as the T_m. If you add a drug molecule that binds to and stabilises the protein, the T_m increases, and the size of the shift can tell you how well it binds.	You can measure 96 samples in just 1 h. It is very popular for screening many binding partners and buffer conditions.
Intrinsic Tryptophan Fluorescence
Within proteins, the amino acid side chain of tryptophan is fluorescent. The wavelength of light emitted ranges from ∼300 nm in non-polar environments, such as the inside of a protein, to 350 nm in aqueous polar environments found on the surface.	As the peak wavelength of light emitted depends on the environment around the amino acid side chain, fluorescence can be used as a very sensitive measurement of the conformational state of individual tryptophan residues. If the emitted light is nearer to 300 nm, the tryptophan is in a non-polar environment; if it is nearer to 350 nm, it is in an aqueous polar environment.	As with general fluorescence, the process is very rapid. Typically, emission spectra can be acquired in less than a minute, meaning many samples can be analysed quickly.
Chemical Denaturation followed by Intrinsic Tryptophan Fluorescence
Folded proteins usually have tryptophans buried in the core, and these fluoresce differently once they become exposed when the protein is unfolded by a strong chemical denaturant such as urea or guanidinium. These chemicals are titrated into a folded protein solution and the fluorescence is measured at each point.	The data are plotted to produce a denaturation curve that tells you the concentration of denaturant at which half of the protein is unfolded. The slope of the transition also tells you how sensitive the protein is to the denaturant. Together these values allow you to calculate a free energy change for unfolding, which is an absolute measure of the protein’s stability. If drugs or ligands are also included in a separate experiment, then a binding constant can be calculated. Proteins can also be suddenly induced to fold or unfold, and the change in fluorescence can be measured in real time to understand protein folding kinetics.	These assays are typically run in a 96-well plate, using small volumes and low concentrations of protein. A titration of a whole plate can take approximately 6 h and requires more data analysis than Differential Scanning Fluorimetry, but gives more accurate and quantitative data. Protein unfolding kinetics can also be measured in a plate, but direct folding kinetics usually require a stopped-flow spectrometer and are lower throughput.
Fluorescence Resonance Energy Transfer
Fluorescence Resonance Energy Transfer (FRET) is a distance-dependent physical process by which energy is transferred between two fluorophores. Light is absorbed by a donor fluorophore at one wavelength (excitation). Instead of being emitted as light, the excitation energy is transferred non-radiatively to an adjacent acceptor fluorophore, which then emits light of a longer wavelength that is detected. Ideally the emission spectrum of the donor should partially overlap the excitation spectrum of the acceptor. The FRET pair can be small molecules such as rhodamine and fluorescein that are cross-linked directly to the protein. Alternatively, fluorescent proteins such as Green or Blue Fluorescent Protein (GFP/BFP) can be fused directly to the termini of the two proteins under study.	It can be used as a molecular ruler to determine how close together two molecules are. One protein is tagged with a donor fluorophore and the second is tagged with an acceptor fluorophore. If they are within a few nanometres, then energy transfer occurs. It is used to measure dynamics and protein interactions.	Acquiring data for FRET is rapid once the proteins are labelled. However, attaching fluorescent probes to a protein can take many hours or days.
Protein Computational Biology
Computational Biology can be used to predict the structure and dynamics of proteins. Many powerful algorithms have been developed that consider chemical properties of the amino acid sequence to characterise proteins. Homology modelling aligns the sequence of a protein of unknown structure with that of a protein of known structure, usually from a related family (see SCOP and CATH), and uses the known structure as a template to model the unknown one. This method is an active area of research. Another important area is molecular dynamics simulations, which apply to proteins the rules of chemistry and physics that govern how molecules behave in aqueous environments. System-wide analysis of proteins uses an organism’s proteome, which is all of its protein sequences determined from genome sequencing.	These methods generate several important protein databases of predicted structures, interactions and evolutionary relationships, which in turn generate hypotheses that are testable in the lab. Molecular dynamics simulations generate movies of protein motions that provide new information about how proteins behave that could not be seen using traditional experimental techniques. A new field, systems biology, tries to combine all the information available about proteins in an organism to simulate how they all work together in a cell, or complete organism, to carry out life functions.	Many of the databases make predictions about proteins automatically and the information is readily available to anybody. For example, the UniProt database is a comprehensive, high-quality and freely accessible resource of protein sequence and functional information. Molecular dynamics simulations can take many days to run and analyse, and we are currently limited to viewing only microseconds of motion. Systems biology approaches are also very time-consuming at this time. Computational methods, however, can save a lot of time and effort, as computers are cheap yet powerful tools to use in combination with experimental methods.
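
Several of the methods in the table above reduce to simple quantitative relationships. The Python sketch below is illustrative only and is not taken from the article. It assumes equal-mass points for the radius of gyration, the standard relations ΔG = RT ln K_d and ΔG = ΔH − TΔS for calorimetry, a two-state unfolding model in which ΔG depends linearly on denaturant concentration, and the Förster relation for FRET efficiency. All function names, parameter names and default values are hypothetical.

```python
import math

R = 8.314  # gas constant in J mol^-1 K^-1


def radius_of_gyration(coords):
    """SAXS: root-mean-square distance of points from their centroid.

    Assumes every point (e.g. atom) carries the same mass.
    """
    n = len(coords)
    centroid = [sum(p[i] for p in coords) / n for i in range(3)]
    mean_sq = sum(
        sum((p[i] - centroid[i]) ** 2 for i in range(3)) for p in coords
    ) / n
    return math.sqrt(mean_sq)


def binding_thermodynamics(kd_molar, delta_h, temperature=298.15):
    """ITC: return (dG in J/mol, dS in J/mol/K) from Kd (M) and dH (J/mol)."""
    delta_g = R * temperature * math.log(kd_molar)  # dG = RT ln(Kd), 1 M standard state
    delta_s = (delta_h - delta_g) / temperature     # rearranged from dG = dH - T*dS
    return delta_g, delta_s


def fraction_unfolded(denaturant, dg_water, m_value, temperature=298.15):
    """Chemical denaturation: two-state model with linear extrapolation.

    Assumes dG_unfold = dg_water - m_value * [denaturant], where dg_water is
    the unfolding free energy without denaturant (J/mol) and m_value is the
    slope (J/mol per M of denaturant).
    """
    dg = dg_water - m_value * denaturant
    k_unfold = math.exp(-dg / (R * temperature))  # equilibrium constant [U]/[F]
    return k_unfold / (1.0 + k_unfold)


def fret_efficiency(distance, r0):
    """FRET: Förster relation, efficiency falls with the sixth power of distance.

    distance and r0 (the Förster distance) must be in the same units.
    """
    return 1.0 / (1.0 + (distance / r0) ** 6)
```

For example, `fret_efficiency(5.0, 5.0)` gives 0.5, because the Förster distance is by definition the separation at which half of the energy is transferred. Likewise, `fraction_unfolded` returns 0.5 at the denaturant concentration equal to `dg_water / m_value`, which corresponds to the midpoint of the denaturation curve described in the table.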

