Testing for adaptive signatures of amino acid alphabet evolution using chemistry space
© Ilardo and Freeland; licensee Chemistry Central Ltd. 2014
Received: 15 June 2013
Accepted: 2 August 2013
Published: 21 January 2014
Multidisciplinary consensus indicates that half of the genetically encoded amino acids are likely to have been available on the prebiotic earth, which implies certain adaptive expectations for the relationship between those amino acids and later additions to the genetic code. Chemistry space, a concept that translates molecules to corresponding points in multidimensional space, provides a framework for investigating these relationships. We therefore developed three tests to explore these implications, using chemistry space to quantify otherwise qualitative questions.
All three of our tests individually, as well as combined, provide quantitative evidence to support an adaptive expansion of the genetically encoded amino acid alphabet from 10 prebiotically plausible (“early”) amino acids to the full set of 20 amino acids found within the standard genetic code.
We present three logically independent, novel tests of the adaptive growth of the amino acid alphabet from a smaller, functionally cohesive alphabet of only 10 amino acids to the 20 amino acids of the standard genetic code. While similar tests in the past have compared the genetically encoded amino acids to an external context of amino acids that were not incorporated into the genetic code, our tests focus on the internal context of the 20 genetically encoded amino acids and find strong support. Of particular note, one of these tests moves, for the first time, beyond consideration of amino acids as monomers and begins to explore polypeptides by considering the chemistry space of amino acid dimers.
Keywords: Amino acid alphabet, Genetic code, Chemistry space, Natural selection
In recent years, chemistry space has revolutionized the search for pharmaceutically relevant molecules (e.g. [1–4]). The utility of chemistry space as a concept, however, extends far beyond the pharmaceutical industry and is here used to investigate adaptive properties of the genetically encoded amino acids. The fundamental point of chemistry space is to assign molecules numeric values that define specific aspects of their physical and/or chemical attributes. This simple step transforms a collection of unique molecules into a set of points in multi-dimensional space, which are therefore amenable to powerful visualization and quantitative analysis. Key to this transformation is replacing conceptual properties (such as “hydrophobicity”) with precisely defined, measurable molecular descriptors that quantify them (e.g. LogP).
(i) If the concept of early amino acids is correct, then a subset of 10 genetically encoded amino acids formed a functionally cohesive protein-building alphabet of some earlier genetic code. We might expect this particular subset to exhibit similar adaptive qualities to those reported previously for the entire set of 20 genetically encoded amino acids.
(ii) If the amino acid alphabet grew from this smaller subset to its current size by adaptively adding new members (the lates), then we might anticipate that the late amino acids contribute the sort of literal expansion of chemistry space implied by previous qualitative statements such as “The driving force [in the growth of the amino acid alphabet] is the possibility to produce fitter proteins when the repertoire of amino acids is enlarged”.
(iii) If the concepts of early and late amino acids are correct, then the growth of the amino acid alphabet tested in (i) and (ii) should make adaptive sense when considering the use of amino acids in polymers. In particular, a smaller, early alphabet of 10 amino acids implies a library of 55 different amino acid dimers that were available for use in early proteins. An adaptive interpretation would predict that the addition of late amino acids should not overlap with this already-populated chemistry space. That is, the late amino acids would be expected to populate empty regions of chemistry space, and therefore fill functional roles that neither the early amino acids nor their dimers could perform.
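The figure of 55 dimers is consistent with counting unordered pairs drawn from 10 amino acids with repetition allowed (10 homodimers plus 45 heterodimers). The short Python sketch below illustrates that arithmetic. It assumes that sequence order within a dimer is ignored, and the single-letter labels are hypothetical placeholders rather than the authors' list of early amino acids.

```python
from itertools import combinations_with_replacement


def unordered_dimers(alphabet):
    """Return every unordered pair (repeats allowed) drawn from the alphabet."""
    return list(combinations_with_replacement(sorted(alphabet), 2))


# Hypothetical placeholder labels for a 10-member alphabet.
early_alphabet = list("ABCDEFGHIJ")
dimers = unordered_dimers(early_alphabet)

# 10 homodimers + 45 heterodimers
print(len(dimers))
```

Under the same counting assumption, an alphabet of 20 members would give 210 dimers.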
[Table: Raw data: molecular descriptor values for each of the 20 genetically encoded amino acids (early amino acids; late amino acids)]
[Table: Raw data: molecular descriptor values for early amino acid dimers]
Testing for ‘internal’ adaptive properties of the genetically encoded amino acids involves two steps. First, any such test must define an appropriate chemistry space for the amino acids. This requires careful selection of molecular descriptors that accurately depict amino acids in terms of their relationships with one another and their roles within proteins. Given an appropriately defined chemistry space, it becomes possible to perform quantitative tests for the adaptive logic of the genetically encoded amino acids. For this second step, we test the three hypotheses outlined in the introduction. The first test probes the concept of early amino acids, asking whether this specific subset of 10 amino acids distinguishes itself relative to other possible subsets of the standard amino acid alphabet. The second test complements the first by turning to assess the idea of late amino acids, asking whether this subset amounts to a literal expansion of chemistry space. The third and final test places these two investigations into a broader, unified context by asking whether the early and late amino acids make adaptive sense when considered alongside a third component: dimers constructed from the early amino acids.
Defining amino acid chemistry space
Defining an appropriate chemistry space of amino acids is essential for any quantitative analysis of the amino acid alphabet. This conceptually simple step is rendered challenging by the vast array of amino acid molecular properties that have been measured. For example, the Amino Acid Index (or AAindex) comprises an extensive collection of such measures for the genetically encoded amino acids drawn from the scientific literature. Currently, the database lists over 600 molecular descriptors. Though few of these descriptors are entirely independent of one another, the question remains: which subset best reflects relevance to the role of building proteins? To address this challenge, we start by noting that three key properties are commonly acknowledged to dominate amino acids’ biochemical roles within protein structure and function: size, hydrophobicity, and charge [13–15].
Each property contributes to the biochemical interactions of amino acids in unique and essential ways. The size or bulk of amino acids’ sidechains has long been recognized as an important factor in defining amino acid similarity; hydrophobicity is likewise widely acknowledged as a fundamental determinant of folding pathways of nascent peptides and has been previously linked to the genetic code through the adaptive hypothesis [16, 17]; and electrostatic interactions between amino acids have been shown to play a crucial role in inter- and intra-protein molecular interactions [15, 18]. For these reasons, and because extensive tests have probed the reliability of various measures of these properties, especially in relation to the genetically encoded amino acids, these are the three dimensions we chose to investigate.
Having selected the properties of size, hydrophobicity, and charge, it remains to choose specific molecular descriptors with which to measure each of these conceptual properties. Size is the most straightforward of the three, owing to strong agreement across a variety of different descriptors. We elected to use ACD Molar Volume because it is freely available for a wide range of molecules, including amino acid dimers (see Test 3), via ChemSpider (http://www.chemspider.com). To represent hydrophobicity, we selected ACD LogP, also freely available through ChemSpider. LogP represents a subtly different, related property of lipophilicity, which is essentially hydrophobicity with the added consideration of polarity. Specifically, LogP is the logarithm of the partition coefficient, which represents the ratio of a compound’s concentration in the organic phase to its concentration in the aqueous phase of a two-compartment system (i.e. a measure of the molecule’s relative solubility in each of the two solvents). We chose to represent the charge (or electrostatic interactions) of a compound using Kowin isoelectric point (pI). Whereas other measures of charge are highly dependent on the pH at which the measurement is taken, pI records the pH at which the concentrations of the anionic and cationic forms of an amino acid are equal. It can be derived theoretically from the pKa values (the negative logarithms of the acid dissociation constants) for the ionized states of the amino acid that exist one positive and one negative charge away from the neutral state of the amino acid.
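The two definitions above can be made concrete with a minimal Python sketch. It shows only the textbook definitions (LogP as the base-10 logarithm of a concentration ratio, and pI as the mean of the two pKa values that bracket the neutral species); it is not how ACD or ChemSpider compute their predicted descriptor values. The function names are hypothetical, and the glycine pKa values used in the example (about 2.34 and 9.60) are approximate textbook figures rather than data from this study.

```python
import math


def log_p(conc_organic, conc_aqueous):
    """Base-10 logarithm of the organic/aqueous partition coefficient."""
    if conc_organic <= 0 or conc_aqueous <= 0:
        raise ValueError("concentrations must be positive")
    return math.log10(conc_organic / conc_aqueous)


def isoelectric_point(pka_below, pka_above):
    """Mean of the two pKa values that bracket the neutral species."""
    return (pka_below + pka_above) / 2.0


# A compound 100 times more concentrated in water than in the organic phase
# has a negative LogP (it is hydrophilic).
print(log_p(1.0, 100.0))

# Approximate textbook pKa values for glycine (assumed, not from the paper).
print(round(isoelectric_point(2.34, 9.60), 2))
```

A negative LogP indicates a preference for the aqueous phase, while a positive value indicates a preference for the organic phase.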
Test 1: are the early amino acids an adaptive subset?
A simple way to test the adaptive value of the early amino acids is to take the 20 amino acids and ask: if we were to select 10 amino acids at random and measure their coverage in size (Molar Volume), hydrophobicity (ACD LogP) and charge (isoelectric point), then how often would these random subsets exhibit a better coverage of amino acid chemistry space than that of the “earlies”? We therefore wrote a script to measure the coverage of the 10 early amino acids and of 1 million random samples of 10 amino acids drawn from the 20 encoded amino acids. This script was run separately for each dimension of amino acid chemistry space and then for each possible combination of 2 and 3 dimensions simultaneously.
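The sampling procedure described above might be sketched as follows. This is an illustrative Python outline, not the authors' Java implementation. The coverage measure used here (the bounding-box volume over the chosen dimensions) is an assumption standing in for whatever measure the original script used, the descriptor values are random placeholders rather than real data, and the "candidate" subset is an arbitrary choice of 10 indices.

```python
import random
from itertools import combinations


def coverage(points, dims):
    """Bounding-box volume over the chosen dimensions (assumed metric)."""
    volume = 1.0
    for d in dims:
        values = [p[d] for p in points]
        volume *= max(values) - min(values)
    return volume


def fraction_better(points, subset_idx, dims, n_samples=10000, seed=0):
    """Fraction of random equal-sized subsets with larger coverage."""
    rng = random.Random(seed)
    observed = coverage([points[i] for i in subset_idx], dims)
    k = len(subset_idx)
    better = 0
    for _ in range(n_samples):
        sample = rng.sample(range(len(points)), k)
        if coverage([points[i] for i in sample], dims) > observed:
            better += 1
    return better / n_samples


# Placeholder data: 20 random points in a 3-D space (NOT real descriptors).
data_rng = random.Random(42)
placeholder = [tuple(data_rng.random() for _ in range(3)) for _ in range(20)]
candidate = list(range(10))  # hypothetical stand-in for the early subset

# Every single dimension, then every pair, then all three together.
for size in (1, 2, 3):
    for dims in combinations(range(3), size):
        print(dims, fraction_better(placeholder, candidate, dims, 2000))
```

A small returned fraction would indicate that few random subsets cover the space better than the candidate subset. In practice the descriptors would presumably need to be rescaled to comparable units before combining dimensions.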
Test 2: do the late amino acids expand the chemistry space of the early amino acids?
If the “late” amino acids were added to expand the universe of genetically encoded proteins, then we might expect them to be associated with some measurable expansion of amino acid chemistry space. In other words, we may predict that the late amino acids lie further from the earlies than would be expected for an arbitrary division of the amino acids.
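One way to frame this prediction quantitatively is as a permutation test over random divisions of the alphabet into two halves. The Python sketch below is an assumption-laden illustration rather than the authors' method: the separation statistic (the mean Euclidean distance from each "late" point to the centroid of the "early" points) is a hypothetical choice, and the coordinates are toy values.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_distance_to_centroid(group, reference):
    """Mean distance from each point in group to the centroid of reference."""
    dims = len(reference[0])
    centroid = [sum(p[d] for p in reference) / len(reference)
                for d in range(dims)]
    return sum(euclidean(p, centroid) for p in group) / len(group)


def fraction_less_expanded(points, late_idx, n_samples=10000, seed=0):
    """Fraction of random divisions whose 'late' half lies closer to its
    'early' half than the proposed late set does."""
    rng = random.Random(seed)
    indices = list(range(len(points)))
    late_set = set(late_idx)
    late = [points[i] for i in indices if i in late_set]
    early = [points[i] for i in indices if i not in late_set]
    observed = mean_distance_to_centroid(late, early)
    k = len(late)
    count = 0
    for _ in range(n_samples):
        rng.shuffle(indices)
        rand_late = [points[i] for i in indices[:k]]
        rand_early = [points[i] for i in indices[k:]]
        if mean_distance_to_centroid(rand_late, rand_early) < observed:
            count += 1
    return count / n_samples


# Toy coordinates (hypothetical): four clustered points and four distant ones.
toy = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6)]
print(fraction_less_expanded(toy, late_idx=[4, 5, 6, 7], n_samples=2000))
```

A fraction close to 1 would mean that the proposed late set lies further from the early set than nearly all arbitrary divisions.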
Test 3: do the late amino acids fill an adaptive gap?
Our third test evaluates whether “late” amino acids fill a gap by expanding the chemistry space of monomeric “early” amino acids without occupying an area of chemistry space already filled by dimers made from early amino acids. In the 2-dimensional chemistry space of size and hydrophobicity, we found that the “late” amino acids and “early” dimers are more separated (exhibit less overlap) than random designations of these molecules 99.77% of the time.
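The comparison against random designations could be sketched as below. This Python outline is a hedged illustration, not the authors' code: the separation measure (the mean distance from each "late" point to its nearest "dimer" point in a 2-D space) is an assumed stand-in for the overlap measure actually used, and the coordinates are toy values rather than molar volume and LogP data.

```python
import math
import random


def mean_nearest_distance(group, others):
    """Mean 2-D distance from each point in group to its nearest point in others."""
    total = 0.0
    for a in group:
        total += min(math.hypot(a[0] - b[0], a[1] - b[1]) for b in others)
    return total / len(group)


def fraction_less_separated(group, others, n_samples=10000, seed=0):
    """Fraction of random relabellings of the pooled molecules that are
    less separated than the observed labelling."""
    rng = random.Random(seed)
    observed = mean_nearest_distance(group, others)
    pool = list(group) + list(others)
    k = len(group)
    count = 0
    for _ in range(n_samples):
        rng.shuffle(pool)
        if mean_nearest_distance(pool[:k], pool[k:]) < observed:
            count += 1
    return count / n_samples


# Toy 2-D coordinates (hypothetical, not real descriptor values).
toy_lates = [(10, 10), (11, 10), (10, 11)]
toy_dimers = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 0), (0, 2)]
print(fraction_less_separated(toy_lates, toy_dimers, n_samples=2000))
```

In the real analysis the pool would presumably contain the 10 late amino acids and the 55 early dimers, and a value such as the reported 99.77% would correspond to the fraction of random designations that are less separated than the true one.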
Our aim here was to apply the concept of chemistry space in order to corroborate or refute previous claims for adaptive properties of the set of 20 genetically encoded amino acids. Whereas previous claims tested the genetically encoded amino acids against a background of plausible alternatives that were never, as far as we can tell, incorporated into genetic coding, our focus here was on the internal logic of the 20 genetically encoded amino acids, building from the premise of early versus late amino acids.
Our first test asked whether the 10 amino acids that are thought to have formed a simpler, earlier genetic code exhibit, as a set, adaptive properties similar to those recorded for the full amino acid alphabet. Our results show a weak adaptive signal for any individual dimension of chemistry space. However, if we accept that amino acid chemistry space becomes more meaningful for protein structure and function as it is measured in two or three dimensions, then our results provide strong support for this notion of an internal adaptive logic to the early amino acids. In terms of their coverage of chemistry space, the early amino acids prove to be a highly unusual subset of the twenty genetically encoded amino acids. This is consistent with the inference that, at some point before the emergence of the standard genetic code, they could have constituted a cohesive and functional alphabet. In the same way as the standard 20 were shown to be adaptively advantageous compared to a broader pool of alternatives, the “early” amino acids also appear to cover chemistry space exceptionally well.
Our findings complement a body of recent research spanning multiple approaches that suggests the early amino acids contain enough chemical information to form a coherent functional set. This includes a study which, by reducing the amino acid composition of a protein while maintaining its foldability, suggests that the set of early amino acids is sufficient to enable protein folding. We draw attention to the unusual terminology of this paper, which is consistent with similar efforts by others but misleading to those outside the field. The bold claim that a functional protein has been made entirely from early amino acids actually refers to a protein sequence that comprises 80% early amino acids, with the remaining 20% of the sequence drawing from the full alphabet of 20 amino acids. More straightforwardly, our results also agree with a study that used phylogenetic analysis of amino acid compositional bias in ribosomal proteins to conclude that “at a more primitive state, the code would still contain a similar diversity of physiochemical properties”. A third study, which examined modern protein sequences deficient in late amino acids of functional significance (the basic amino acids Lysine and Arginine), also supported the idea that “small proteins without basic amino acids performed important functions in the prebiotic chemistry of early Earth”.
Test 2 asked whether the 10 amino acids that are thought to have been later additions to the genetic code represent a quantifiable expansion of amino acid chemistry space. Here we see a weak signal that the late amino acids lie further in chemistry space from the earlies than expected by chance. It appears, however, that most of this effect comes from the dimension of size. In other words, the later additions to the standard amino acid alphabet differ from the early amino acids primarily in that they are larger. Considering that the early amino acids include the smallest L-alpha amino acids that are chemically possible, this is perhaps not surprising. Nevertheless, so long as we accept that size is an important component of an adaptive chemistry space, these results support an adaptive explanation: the late amino acids show a literal expansion of two- and three-dimensional chemistry space. This assumption is bolstered by the results of the same test performed with Glycine and Alanine removed, where we still find that, in almost 90% of trials, the true late amino acids show a greater expansion in chemistry space than would be expected by chance. Our findings again are in general agreement with a previous inference drawn from a different approach, where the late amino acids are believed to have been advantageous precisely because of their “unique specialized” structure.
Our third test offers clarification for the otherwise somewhat ambiguous results of Test 2. If the concept of early versus late amino acids is correct, then it implies that all dimers comprising two early amino acids were present before any late amino acid was incorporated into the genetic code. We therefore tested whether the expansion of amino acid chemistry space brought about by the addition of late amino acids was steered by an additional consideration: avoiding overlap with regions already occupied by dimers made from the earlies. In accordance with this adaptive hypothesis, we find a strong signal that the lates indeed filled a gap in chemistry space that was not already occupied by the early amino acids or their dimers.
This third test represents a qualitative expansion of thinking about adaptive amino acid chemistry space in that it is the first time molecules larger than monomers have been considered. Indeed, it is largely thanks to the availability of a free database, ChemSpider, which includes relevant molecules and their key descriptors, that these measurements were made possible. In this context, it is unfortunate that the molecular descriptor isoelectric point is not readily available for dimers, but this implies a simple, logical future step to verify or undermine our current interpretation.
Here we present three logically independent tests that each generate quantitative evidence to corroborate or challenge previous claims for adaptive properties of the standard amino acid alphabet. Each test operates in terms of a simple, 3-dimensional chemistry space built from amino acid charge (pI), size (ACD Molar Volume), and hydrophobicity (ACD LogP). Whereas previous claims focus on the external chemistry space of amino acids that are not part of the genetic code, we turn to consider the internal logic of the 20 genetically encoded amino acids. In particular, we consider the surprisingly strong, multidisciplinary consensus that has emerged in recent years to suggest that the genetic code may have begun with only half the 20 amino acids currently found in the standard genetic code. This division of the standard amino acid alphabet into “early” and “late” amino acids allows us to make three predictions based upon an adaptive hypothesis: (i) the “early” amino acids should form a cohesive sub-set with similar properties to the final, full-sized amino acid alphabet; (ii) the “late” amino acids should demonstrate quantifiable expansion of amino acid chemistry space in terms of dimensions that define protein-building potential; and (iii) the expansion of “late” amino acids should populate regions of chemistry space that were not already available to a genetic code that builds dimers from the early amino acids. Taken together, our results provide strong support for these predictions.
In order to implement our methods, we wrote our source code in Java version 6. We then ran it on Mac OS X Version 10.6.8. Source code is available upon request to the corresponding author.
This material is based upon work supported by the National Aeronautics and Space Administration through the NASA Astrobiology Institute under Cooperative Agreement No. NNA09DA77A issued through the Office of Space Science. We thank James Stephenson for helpful and stimulating discussion.
- Dobson CM: Chemical space and biology. Nature 2004, 432(7019):824–828. doi:10.1038/nature03192
- Barker A, Kettle JG, Nowak T, Pease JE: Expanding medicinal chemistry space. Drug Discov Today 2012, 18(5–6):298–304.
- Lloyd DG, Golfis G, Knox AJ, Fayne D, Meegan MJ, Oprea TI: Oncology exploration: charting cancer medicinal chemistry space. Drug Discov Today 2006, 11(3):149–159.
- Reymond JL, Mahendra A: Exploring chemical space for drug discovery using the chemical universe database. ACS Chem Neurosci 2012, 3(9):649–657. doi:10.1021/cn3000422
- Philip GK, Freeland SJ: Did evolution select a nonrandom “alphabet” of amino acids? Astrobiology 2011, 11(3):235–240.
- Wong JT, Bronskill PM: Inadequacy of prebiotic synthesis as origin of proteinous amino acids. J Mol Evol 1979, 13(2):115–125.
- Trifonov EN: Consensus temporal order of amino acids and evolution of the triplet code. Gene 2000, 261(1):139–151.
- Higgs PG, Pudritz RE: A thermodynamic basis for prebiotic amino acid synthesis and the nature of the first genetic code. Astrobiology 2009, 9(5):483–490.
- Cleaves HJ: The origin of the biologically coded amino acids. J Theor Biol 2010, 263(4):490–498.
- Longo LM, Lee J, Blaber M: Simplified protein design biased for prebiotic amino acids yields a foldable, halophilic protein. Proc Natl Acad Sci 2013, 110(6):2135–2139.
- Weberndorfer G, Hofacker IL, Stadler PF: On the evolution of primitive genetic codes. Orig Life Evol Biosph 2003, 33(4–5):491–514.
- Kawashima S, Ogata H, Kanehisa M: AAindex: amino acid index database. Nucleic Acids Res 1999, 27(1):368–369. doi:10.1093/nar/27.1.368
- Grantham R: Amino acid difference formula to help explain protein evolution. Science 1974, 185(4154):862. doi:10.1126/science.185.4154.862
- Ladunga I, Smith RF: Amino acid substitutions preserve protein folding by conserving steric and hydrophobicity properties. Protein Eng 1997, 10(3):187–196. doi:10.1093/protein/10.3.187
- Müller-Späth S: Charge interactions can dominate the dimensions of intrinsically disordered proteins. Proc Natl Acad Sci 2010, 107(33):14609–14614. doi:10.1073/pnas.1001743107
- Kauzmann W: Of protein denaturation. Adv Protein Chem 1959, 14:1.
- Baussand J, Deremble C, Carbone A: Periodic distributions of hydrophobic amino acids allows the definition of fundamental building blocks to align distantly related proteins. Proteins 2007, 67(3):695–708.
- Gilson MK, Honig BH: Calculation of electrostatic potentials in an enzyme active site. Nature 1987, 330(6143):84–86. doi:10.1038/330084a0
- Lu Y, Freeland SJ: On the evolution of the standard amino-acid alphabet. Genome Biol 2006, 7(1):102. doi:10.1186/gb-2006-7-1-102
- Van der Waterbeemd H, Karajiannis H, Tayar NE: Lipophilicity of amino acids. Amino Acids 1994, 7(2):129–145. doi:10.1007/BF00814156
- Fournier GP, Gogarten JP: Rooting the ribosomal tree of life. Mol Biol Evol 2010, 27(8):1792–1801. doi:10.1093/molbev/msq057
- McDonald GD, Storrie-Lombardi MC: Biochemical constraints in a protobiotic earth devoid of basic amino acids: the “BAA (−) World”. Astrobiology 2010, 10(10):989–1000. doi:10.1089/ast.2010.0484
- White HB III: Coenzymes as fossils of an earlier metabolic state. J Mol Evol 1976, 7(2):101–104. doi:10.1007/BF01732468
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.