Testing for ‘internal’ adaptive properties of the genetically encoded amino acids involves two steps. First, any such test must define an appropriate chemistry space for the amino acids. This requires careful selection of molecular descriptors that accurately depict amino acids in terms of their relationships with one another and their roles within proteins. Given an appropriately defined chemistry space, it becomes possible to perform quantitative tests for the adaptive logic of the genetically encoded amino acids. For this second step, we test the three hypotheses outlined in the introduction. The first test probes the concept of early amino acids, asking whether this specific subset of 10 amino acids distinguishes itself relative to other possible subsets of the standard amino acid alphabet. The second test complements the first by turning to assess the idea of late amino acids, asking whether this subset amounts to a literal expansion of chemistry space. The third and final test places these two investigations into a broader, unified context by asking whether the early and late amino acids make adaptive sense when considered alongside a third component: dimers constructed from the early amino acids.
Defining amino acid chemistry space
Defining an appropriate chemistry space of amino acids is essential for any quantitative analysis of the amino acid alphabet. This conceptually simple step is rendered challenging by the vast array of amino acid molecular properties that have been measured. For example, the Amino Acid Index (or AAindex) comprises an extensive collection of such measures for the genetically encoded amino acids drawn from the scientific literature [12]. Currently, the database lists over 600 molecular descriptors. Though few of these descriptors are entirely independent of one another, the question remains: which subset best reflects relevance to the role of building proteins? To address this challenge, we start by noting that three key properties are commonly acknowledged to dominate amino acids’ biochemical roles within protein structure and function: size, hydrophobicity, and charge [13–15].
Each property contributes to the biochemical interactions of amino acids in unique and essential ways. The size or bulk of amino acids’ sidechains has long been recognized as an important factor in defining amino acid similarity [13]; hydrophobicity is likewise widely acknowledged as a fundamental determinant of folding pathways of nascent peptides and has been previously linked to the genetic code through the adaptive hypothesis [16, 17]; and electrostatic interactions between amino acids have been shown to play a crucial role in inter- and intra-protein molecular interactions [15, 18]. For these reasons, and because extensive tests have probed the reliability of various measures of these properties [19], especially in relation to the genetically encoded amino acids [5], these are the three dimensions we chose to investigate.
Having selected the properties of size, hydrophobicity, and charge, it remains to choose specific molecular descriptors with which to measure each of these conceptual properties. Size is the most straightforward of the three, owing to strong agreement across a variety of different descriptors. We elected to use ACD Molar Volume because it is freely available for a wide range of molecules, including amino acid dimers (see Test 3), via ChemSpider (http://www.chemspider.com). To represent hydrophobicity, we selected ACD LogP, also freely available through ChemSpider. Strictly, LogP represents the subtly different but related property of lipophilicity, which is essentially hydrophobicity with the added consideration of polarity [20]. Specifically, LogP is the logarithm of the partition coefficient, which is the ratio of a compound’s concentrations in the organic and aqueous phases of a two-compartment system (i.e. a measure of the molecule’s relative solubility in each of the two solvents). We chose to represent the charge (or electrostatic interactions) of a compound using the Kowin isoelectric point (pI). Whereas other measures of charge are highly dependent on the pH at which the measurement is taken, pI records the pH at which the concentrations of the anionic and cationic forms of an amino acid are equal. It can be derived theoretically from the pKa values (the negative logarithms of the acid dissociation constants) for the ionized states of the amino acid that exist one positive and one negative charge away from the neutral state of the amino acid [19].
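The two derived descriptors described above can be illustrated with a minimal sketch. It assumes the standard textbook relations: LogP as the base-10 logarithm of the organic-to-aqueous concentration ratio, and pI as the mean of the two pKa values that flank the neutral species. The function names are hypothetical, and the glycine pKa values in the comments are common textbook figures used only for illustration; they are not ACD or Kowin outputs.

```python
import math


def log_p(conc_organic: float, conc_aqueous: float) -> float:
    """Base-10 logarithm of the partition coefficient (organic / aqueous)."""
    if conc_organic <= 0 or conc_aqueous <= 0:
        raise ValueError("concentrations must be positive")
    return math.log10(conc_organic / conc_aqueous)


def isoelectric_point(pka_below: float, pka_above: float) -> float:
    """Mean of the two pKa values on either side of the neutral species.

    pka_below: pKa of the equilibrium between the +1 form and the neutral form.
    pka_above: pKa of the equilibrium between the neutral form and the -1 form.
    """
    return (pka_below + pka_above) / 2.0


# Illustration with textbook glycine values (assumed, not taken from the study):
# carboxyl pKa of about 2.34 and amino pKa of about 9.60.
glycine_pi = isoelectric_point(2.34, 9.60)
```

A positive LogP indicates a compound that partitions preferentially into the organic phase, while a negative value indicates preference for the aqueous phase.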
Test 1: are the early amino acids an adaptive subset?
Our hypothesis predicts that if the 10 amino acids designated as “early” were indeed used by some earlier genetic code as a protein-building alphabet, they should exhibit unusual qualities as a subset that are analogous to the unusual properties detected for the entire set of 20 genetically encoded amino acids. In particular, the concept of amino acid coverage was used previously to measure the adaptive value of an amino acid alphabet in terms of its protein-building potential [5]. Coverage considers both the range of values covered within a particular descriptor and how evenly these values are distributed within that range (Figure 1). An adaptive set of amino acids comprises members evenly distributed across a broad range for the key physicochemical properties of size (molar volume), charge (isoelectric point) and hydrophobicity (LogP). Such a set of amino acids can combine within an evolving protein to approximate any suite of properties required by shifting environmental conditions.
A simple way to test the adaptive value of the early amino acids is therefore to take the 20 amino acids and ask: if we were to select 10 amino acids at random and measure their coverage in size (ACD Molar Volume), hydrophobicity (ACD LogP) and charge (isoelectric point), how often would these random subsets exhibit better coverage of amino acid chemistry space than that of the “earlies”? We therefore wrote a script to measure the coverage of the 10 early amino acids and of 1 million random subsets of 10 amino acids drawn from the 20 genetically encoded amino acids. This script was run separately for each dimension of amino acid chemistry space and then for each possible combination of 2 and 3 dimensions simultaneously.
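The sampling procedure for a single descriptor might be sketched as follows. This is not the script used in the study. The coverage criterion here is an assumption: a random subset counts as at least as good when its range is no smaller and the variance of the gaps between its sorted values is no larger than for the early set. All function names, amino acid labels, and descriptor values are hypothetical placeholders.

```python
import random
from statistics import pvariance


def value_range(values):
    """Breadth of the descriptor values spanned by a subset."""
    return max(values) - min(values)


def gap_variance(values):
    """Variance of gaps between sorted neighbours (lower means more even)."""
    ordered = sorted(values)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    return pvariance(gaps)


def fraction_at_least_as_good(descriptor, early, n_samples=10_000, seed=0):
    """Fraction of random subsets matching or beating the early set.

    ASSUMPTION: "matching or beating" means range >= observed range AND
    gap variance <= observed gap variance. The study may combine the two
    components of coverage differently.
    """
    rng = random.Random(seed)
    names = sorted(descriptor)
    observed = [descriptor[name] for name in early]
    obs_range, obs_var = value_range(observed), gap_variance(observed)
    hits = 0
    for _ in range(n_samples):
        sample = [descriptor[name] for name in rng.sample(names, len(early))]
        if value_range(sample) >= obs_range and gap_variance(sample) <= obs_var:
            hits += 1
    return hits / n_samples


if __name__ == "__main__":
    # Synthetic placeholder values, NOT real amino acid measurements.
    demo_rng = random.Random(1)
    descriptor = {f"aa{i:02d}": demo_rng.uniform(0.0, 100.0) for i in range(20)}
    early = [f"aa{i:02d}" for i in range(10)]  # hypothetical "early" labels
    print(fraction_at_least_as_good(descriptor, early, n_samples=1_000))
```

A small returned fraction would indicate that few random subsets cover the descriptor as well as the designated early set.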
Test 2: do the late amino acids expand the chemistry space of the early amino acids?
If the “late” amino acids were added to expand the universe of genetically encoded proteins, then we might expect them to be associated with some measurable expansion of amino acid chemistry space. In other words, we may predict that the late amino acids lie further from the earlies than would be expected for an arbitrary division of the amino acids.
To test this idea, we wrote a further script that first calculated the mean of the cluster of early amino acids for a given molecular descriptor and then measured the distance between this mean and each of the lates as a summed total distance (Figure 2). It then replicated this measurement 1 million times, each time randomly designating 10 of the 20 amino acids as early and 10 as late in order to record how often the true late amino acids show greater expansion from the earlies than occurs by chance. As with Test 1 above, this second test was performed for each individual dimension of amino acid chemistry space, and for combinations of 2 and 3 dimensions simultaneously. As an additional test of the robustness of our results, we repeated all calculations having first removed Glycine and Alanine, which, owing to their unique structural simplicity, can arguably be considered outliers rather than meaningful degrees of freedom of amino acid possibility space.
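The randomization described above can be sketched as a label-permutation test. This is an illustration, not the study's script. It assumes Euclidean distance in whatever dimensions are supplied and applies no rescaling between descriptors, since the text does not state how dimensions with different units were combined. Function names and labels are hypothetical.

```python
import math
import random


def summed_distance(points, early, late):
    """Sum of distances from each late point to the mean of the early points."""
    dims = len(points[early[0]])
    centroid = [
        sum(points[name][d] for name in early) / len(early) for d in range(dims)
    ]
    return sum(math.dist(points[name], centroid) for name in late)


def expansion_test(points, early, n_perm=10_000, seed=0):
    """Return the observed summed distance and the fraction of random
    early/late designations whose summed distance is at least as large.

    ASSUMPTION: raw Euclidean distance with no scaling across descriptors.
    """
    rng = random.Random(seed)
    names = sorted(points)
    early = list(early)
    early_set = set(early)
    late = [name for name in names if name not in early_set]
    observed = summed_distance(points, early, late)
    hits = 0
    for _ in range(n_perm):
        shuffled = names[:]
        rng.shuffle(shuffled)
        rand_early = shuffled[: len(early)]
        rand_late = shuffled[len(early):]
        if summed_distance(points, rand_early, rand_late) >= observed:
            hits += 1
    return observed, hits / n_perm
```

Passing one-element tuples tests a single descriptor, and two- or three-element tuples test combined dimensions. The robustness check without Glycine and Alanine corresponds to removing those two entries from `points` and `early` before calling the function.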
Test 3: do the late amino acids fill an adaptive gap?
Our general adaptive hypothesis is that the late amino acids were added to the genetic code by natural selection because they expanded the protein-building chemistry space available to some simpler precursor of the standard genetic code. Test 2 therefore considers whether the late amino acids populate new points in chemistry space that were unavailable to the early amino acids. However, amino acids do not act in isolation: they are polymerized to form proteins. This implies that the biologically relevant chemistry space of the early amino acids also includes the chemistry space of their dimers, trimers, etc. For the late amino acids to be adaptively advantageous, they must not only expand the chemistry space of the early amino acids, but also do so in such a way that they are performing a novel role; that is, one not already fulfilled by existing amino acids or their oligomers. Within one dimension, early amino acid dimers and late monomers overlap considerably in range for size and hydrophobicity (Figure 3). Any adaptive separation must therefore be found in combinations of these properties.
In order to test this, we first defined an area around every point in chemistry space (late amino acid monomers and early amino acid dimers) to represent the region of chemistry space populated by each molecule. This area was calculated as a circle centered on each point with a radius equal to the average distance between all points (see Figure 4). Using the true designation of late amino acids and early dimers, we measured the frequency with which dimers overlapped with the chemistry space of late amino acids. We then randomized the designation of late amino acids and early dimers and measured, in the same way, how often dimers shared chemistry space with late amino acids. This allowed us to compare the observed overlap in chemistry space between these two sets of molecules with what could be expected by chance, and thereby to quantify the novelty contributed by the late amino acids.
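One possible reading of this overlap test is sketched below. The overlap criterion is an assumption: a dimer is counted as sharing chemistry space with the late amino acids when its point lies within one radius of at least one late monomer. The study may instead have compared circle against circle. The radius is taken as the mean pairwise distance over the pooled points, as described above. All function names are hypothetical.

```python
import math
import random
from itertools import combinations


def mean_pairwise_distance(points):
    """Average Euclidean distance over all unordered pairs of points."""
    pairs = list(combinations(points, 2))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)


def overlap_fraction(late, dimers, radius):
    """Fraction of dimers lying within `radius` of at least one late monomer.

    ASSUMPTION: point-in-circle overlap; other criteria are possible.
    """
    hits = sum(
        1 for dimer in dimers
        if any(math.dist(dimer, monomer) <= radius for monomer in late)
    )
    return hits / len(dimers)


def overlap_test(late, dimers, n_perm=10_000, seed=0):
    """Return the observed overlap fraction and the fraction of random
    relabelings whose overlap is no greater than the observed value."""
    rng = random.Random(seed)
    pooled = list(late) + list(dimers)
    radius = mean_pairwise_distance(pooled)
    observed = overlap_fraction(late, dimers, radius)
    as_low = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        rand_late = pooled[: len(late)]
        rand_dimers = pooled[len(late):]
        if overlap_fraction(rand_late, rand_dimers, radius) <= observed:
            as_low += 1
    return observed, as_low / n_perm
```

Under this reading, a small second return value would mean that random designations rarely produce as little overlap as the true one, which is what the hypothesis of novel late amino acids predicts.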