Main

Investigations into whether the LUCA was a hyperthermophilic (optimal growth temperature (OGT) ≥80 °C), thermophilic (OGT 50–80 °C), or mesophilic (OGT ≤50 °C) organism have relied on correlations between the species’ OGT and the composition of their macromolecular sequences. In extant prokaryotic species, the G+C content of rRNA stems (that is, double-stranded parts) has been shown to correlate with OGT12. Exploiting this correlation, support was obtained for a non-hyperthermophilic LUCA2. In contrast, studies based on correlations between the composition of the LUCA’s proteins and OGT concluded in favour of a hyperthermophilic LUCA13,14 and of hyperthermophilic ancestors for both Archaea and Bacteria. The discrepancy between these results could come from some unexplained incongruence between rRNA and proteins, or, as we shall see, from differences between evolutionary models used.

These previous investigations2,13,14 based their conclusions on comparisons of reconstructed ancestral sequence compositions with extant ones. Accurate modelling of the evolution of compositions is therefore crucial for such approaches. Two of these studies13,14 relied on homogeneous models of evolution which make the simplifying hypothesis that substitutions occur with constant probabilities over time and across all lineages. If genomes and proteins had evolved according to a homogeneous model, they would all share the same base and amino acid compositions. Clearly, rRNA12 and protein sequences15 do not. Another approach2 has been to use a branch-heterogeneous model of RNA sequence evolution. Branch-heterogeneous models are computationally more challenging, but more realistic as they allow replacement or substitution probabilities to vary between lineages, and thus explicitly account for compositional drifts2,6,7,16,17. Accordingly, they have been shown to accurately reconstruct ancestral sequence compositions7.
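The claim that a homogeneous model implies shared compositions can be illustrated with a minimal sketch. It assumes a two-state process (G+C versus A+U) with made-up substitution probabilities and starting compositions; none of these values come from the analyses reported here, and the function name `evolve_gc` is hypothetical.

```python
def evolve_gc(gc_start, p_at_to_gc, p_gc_to_at, steps):
    """Iterate a two-state substitution process and return the final G+C fraction."""
    gc = gc_start
    for _ in range(steps):
        # Sites that stay G+C, plus A+U sites that become G+C.
        gc = gc * (1.0 - p_gc_to_at) + (1.0 - gc) * p_at_to_gc
    return gc

# Homogeneous case: both lineages share the same (hypothetical) probabilities,
# so they forget their starting compositions and converge to the same value.
lineage_a = evolve_gc(0.30, 0.02, 0.03, 2000)
lineage_b = evolve_gc(0.70, 0.02, 0.03, 2000)

# Branch-heterogeneous case: a lineage with its own probabilities
# drifts towards a different equilibrium composition.
lineage_c = evolve_gc(0.30, 0.03, 0.02, 2000)
```

In this toy setting the equilibrium G+C fraction is p_at_to_gc / (p_at_to_gc + p_gc_to_at), so only lineages with different substitution probabilities can end up with different compositions.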

We recently developed nhPhyML7, an efficient program for the branch-heterogeneous modelling of nucleotide sequence evolution in the maximum likelihood framework, and nhPhyloBayes6, which implements a site- and branch-heterogeneous Bayesian model of protein sequence evolution. The latter combines the break-point approach17 to model variations of amino acid replacement rates along branches and the CAT18 mixture model to account for site-wise variations of these rates. These models have been shown to describe the evolution of real sequences more faithfully than homogeneous ones6,17, although neither homogeneous nor heterogeneous models ensure that inferred ancestral sequences are biologically functional. Using nhPhyML and nhPhyloBayes, we can reconstruct ancestral sequences of both rRNAs and proteins with branch-heterogeneous models, and estimate sequence compositions of all nodes of the tree of life, including the LUCA and its descendants. These compositions can be translated into approximate OGTs using the OGT/composition correlations observed in extant sequences12,15.

A nucleotide data set of concatenated small- and large-subunit rRNAs—restricted to double-stranded regions—from 456 organisms (1,043 sites), and an amino acid data set of 56 concatenated nearly universal proteins from 30 organisms (3,336 sites), were assembled, each data set sampling all forms of cellular life. Correspondence analyses of the protein data set show that eukaryotes and prokaryotes markedly differ in amino acid compositions and that an effect of temperature on proteomes is detectable only among prokaryotic species (Supplementary Figs 4 and 6b). Similarly, the correlation between rRNA G+C content and OGT has only been documented in prokaryotes12. The ability to infer ancestral OGTs from rRNA and protein compositions therefore applies only to prokaryotes. However, eukaryotic sequences were kept in the subsequent analyses because they are part of the tree of life and as such provide useful phylogenetic information for ancestral sequence inferences.

The effect of temperature on prokaryotic proteomes is independent of genomic G+C content15, and was summarized in terms of average content in the amino acids I, V, Y, W, R, E and L (hereafter referred to as IVYWREL). Accordingly, our correspondence analysis identifies two independent factors accounting for most of the variance in amino acid compositions of prokaryotic proteins (Supplementary Fig. 5). The first factor (45.4% of the variance) correlates strongly with genome G+C content (r = 0.81); the second (13.8% of the variance) correlates strongly with OGT (r = 0.83) and with IVYWREL content (r = 0.73, Supplementary Fig. 6). The second factor was therefore used here as a molecular thermometer. The rRNA-based and the protein-based thermometers are thus independent, both because they come from distinct genome parts and because they exploit different effects of temperature on sequence composition. Furthermore, the correlation between rRNA G+C content and OGT is not expected to vary during evolutionary time because it stems from the different thermal stabilities of G–C and A–U RNA base pairs12. Thus, assuming that the relationship between temperature and amino acid composition of prokaryotes has also not varied since LUCA, the estimates of rRNA G+C content and amino acid compositions obtained through branch-heterogeneous models provide two independent means to analyse the evolution of thermophily.
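As a small illustration of the IVYWREL summary statistic, the sketch below computes the fraction of I, V, Y, W, R, E and L residues in a protein sequence. The function name and the example peptide are hypothetical, and ignoring gap characters is an assumption made for this sketch.

```python
IVYWREL = frozenset("IVYWREL")

def ivywrel_fraction(protein):
    """Return the fraction of residues that are I, V, Y, W, R, E or L.

    Non-letter characters (for example alignment gaps) are ignored.
    """
    residues = [aa for aa in protein.upper() if aa.isalpha()]
    if not residues:
        raise ValueError("sequence contains no residues")
    return sum(aa in IVYWREL for aa in residues) / len(residues)

# Made-up peptide, for illustration only.
example_fraction = ivywrel_fraction("MIVLERKW-A")
```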

For each data set, a phylogenetic tree was inferred and rooted on the branch separating Bacteria from Archaea and Eukaryota (Supplementary Figs 7 and 8). Because the location of the root in the universal tree remains uncertain19, the alternative rooting on the eukaryotic branch was also considered. Correlations between G+C content and OGT (Fig. 1a), and between the second axis of the amino acid correspondence analysis and OGT (Fig. 1b), were used to estimate OGTs for the LUCA and its descendants (Fig. 2).
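The idea of reading an ancestral OGT off a composition/OGT correlation can be sketched as follows. This assumes a simple least-squares line, which may differ from the exact fitting procedure used for Fig. 1; all data points and names (`fit_line`, `predict_ogt`) are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical extant prokaryotes: rRNA stem G+C fraction and OGT in °C.
extant_gc = [0.55, 0.60, 0.65, 0.70, 0.75]
extant_ogt = [25.0, 37.0, 55.0, 70.0, 88.0]

slope, intercept = fit_line(extant_gc, extant_ogt)

def predict_ogt(gc):
    """Translate a (reconstructed) G+C fraction into an approximate OGT."""
    return slope * gc + intercept

# Hypothetical reconstructed ancestral G+C content.
ancestral_ogt = predict_ogt(0.66)
```

The same pattern would apply to the protein thermometer, with the second correspondence-analysis factor taking the place of G+C content.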

Figure 1: Correlations between sequence compositions and OGT, and estimates of key ancestral compositions.

Black dots indicate extant prokaryotes positioned according to their sequence composition and OGT. Dashed coloured lines indicate predicted OGTs for various ancestors. a, Correlation between rRNA G+C content and OGT. The vertical coloured bars indicate most likely nhPhyML estimates of ancestral G+C contents with their 95% confidence intervals. b, Correlation between the second factor of the correspondence analysis on amino acid compositions and OGT. The vertical coloured bars indicate median ancestral compositions inferred by nhPhyloBayes with their 95% confidence intervals. The LUCA is significantly less thermophilic than its direct descendants (P ≤ 0.005).

Figure 2: Evolution of thermophily over the tree of life.

Protein-derived nhPhyloBayes OGT estimates (and their 95% confidence intervals for key ancestors) for prokaryotic organisms are colour-coded from blue to red for low to high temperatures. Colours were interpolated between temperatures estimated at nodes. The eukaryotic domain, in which OGT cannot be estimated, has been shaded. The colour scale is in °C; the branch length scale is in substitutions per site. A, archaeal; B, bacterial; E, eukaryotic domains. Ac, Actinobacteria; Aq, Aquificae; Ba, Bacteroidetes; C, Cyanobacteria; Cf, Chloroflexi; Ch, Chlamydiae; Cr, Crenarchaeota; DT, Deinococcus/Thermus; Eu, Euryarchaeota; F, Firmicutes; P, Proteobacteria; Pl, Planctomycetes; T, Thermotogae.

Proteins and rRNAs support similar patterns of OGT changes for prokaryotes, so the discrepancy between previous rRNA- and protein-based investigations2,13,14 was not a result of incongruence between these molecules. Protein-derived temperature estimates are generally lower than those based on rRNAs (Fig. 1), although some protein and rRNA-based OGT estimates overlap if confidence intervals of ancestral compositions are taken into account (Supplementary Table 3). Both types of data support key conclusions (Fig. 1). First, the LUCA is predicted to be a non-hyperthermophilic organism, as previously reported2. Second, both archaeal and bacterial ancestors, as well as the common ancestor of Archaea and Eukaryota, are estimated to have been thermophilic to hyperthermophilic (Fig. 2). This result is in line with previous studies3,5. Third, within the bacterial phylogenetic tree, tolerance to heat decreased (Fig. 2). This last result is congruent with recent estimates of the evolution of OGTs in the bacterial domain based on ancestral reconstructions and characterizations of elongation factor Tu proteins4.

Support for the hypothesis of a non-hyperthermophilic LUCA and of subsequent parallel adaptations to high temperatures partly rests on a protein content depleted in IVYWREL for the LUCA and subsequently enriched in these amino acids. This is consistent with a recent report that amino acids IVYEW might be under-represented in LUCA’s proteins20. This finding has been interpreted as evidence that these five amino acids were a late addition to the genetic code, and that the proteome of the LUCA had not yet reached compositional equilibrium. Although such an interpretation in terms of early genetic code evolution is possible, our hypothesis of parallel adaptations to high temperatures has the advantage of explaining the patterns observed with both rRNAs and proteins.

Additional experiments suggest that the present analyses of rRNA and protein sequences with branch-heterogeneous models of evolution uncover genuine signals of ancient temperature preferences and are not affected by systematic biases.

First, these results are robust to changes in the topology chosen for inference because analyses with alternative topologies yielded virtually identical OGT estimates (Supplementary Fig. 10). Moreover, phylogenetic trees rooted on the eukaryotic branch also suggest that OGT increased between the universal ancestor and the divergence of Archaea and Bacteria (Supplementary Figs 13–15).

Second, taxonomic sampling does not strongly affect these results. With rRNA and protein data sets in which eukaryotic sequences were removed, the signal for OGT increases between the LUCA and the domain ancestors was essentially unchanged (Supplementary Fig. 36). Moreover, both for rRNAs and proteins, two artificially biased data sets containing sequences from either thermophilic or mesophilic prokaryotes were assembled (see Supplementary Information). The signal for parallel increases in OGT is confirmed in all but one of these four data sets: the mesophilic rRNA data set. However, the longer of the two mesophilic alignments, the protein data set, supports the same pattern of OGT changes as the complete data sets (Supplementary Figs 16 and 17). Notably, analysis of the protein mesophilic data set shows that this pattern is independent of the debated position of hyperthermophilic organisms in the tree of life. Furthermore, with all rRNA and protein data sets, even with the sampling limited to thermophilic prokaryotes, the LUCA remains predicted as a non-hyperthermophilic organism (Supplementary Figs 18 and 19).

Third, dependence of the results on models used for ancestral reconstruction was investigated. Additional branch-heterogeneous evolutionary models were applied, two to the rRNA data set, and one to the protein data set (see Supplementary Information). All these alternative branch-heterogeneous models confirm our results (Supplementary Figs 21–23, 29 and 30). Compositional analyses were also conducted using branch-homogeneous models of evolution: GTR21 for rRNA and proteins, and CAT18 for proteins. All these models tend to predict parallel adaptations to higher temperatures from the LUCA to its descendants, suggesting the existence of a genuine signal for such a pattern in the data (Supplementary Figs 24, 26 and 28). However, only when models are realistic enough is the LUCA predicted as significantly less thermophilic than its two descendants. For instance, ancestral protein compositions predicted by the GTR model for the LUCA and its two descendants strongly overlap, which may explain previously published results13, whereas the CAT model better separates these ancestral node distributions, although less clearly than does the CAT–BP branch-heterogeneous model (Supplementary Figs 26, 28 and 29). These experiments show that as the evolutionary process is more accurately modelled, the support for parallel increases in OGT from the LUCA to its offspring is strengthened.

Fourth, it is known that the base compositions of fast and slowly evolving sites and, particularly, of single- and double-stranded regions of rRNA molecules differ and that this may bias ancestral sequence estimates16. To minimize this bias, only double-stranded rRNA regions have been analysed here. Moreover, if fast-evolving sites are removed, estimates still support parallel adaptations to high temperatures (Supplementary Fig. 33).

Fifth, it has been shown that some ancestral reconstruction methods might improperly estimate the frequencies of rare amino acids22. To control for that potential bias, the two rarest amino acids, cysteine and tryptophan, were discarded from estimated ancestral sequences: this had essentially no impact on results (Supplementary Fig. 34).

Sixth, the sensitivity of the OGT estimates at the tree root to the prior distribution of ancestral amino acid compositions used for Bayesian analyses was investigated (Supplementary Fig. 35). This prior distribution induces a flat, uninformative distribution over OGTs, whereas the posterior distributions estimated for LUCA and the bacterial ancestor have small variance, and thus reflect a genuine signal in the data, rather than a bias from the prior. Moreover, even with a strongly informative prior distribution that is biased towards high temperature amino acid distributions, the posterior distribution of the LUCA’s amino acid composition, although altered, is centred at lower temperatures than that of the bacterial ancestor.

The present use of molecular thermometers requires that evolution of the data sets under analysis can be modelled by a tree structure as far as reconstruction of ancestral compositions is concerned. We emphasize that our protein analyses are based on 56 genes that did not undergo between-domain transfers (see Methods), so ancestral sequence reconstructions cannot be confounded by such gene exchanges. We do not exclude within-domain lateral transfers of these genes; however, the robustness of the inferred ancestral compositions to alternative domain phylogenies4,7 (see also Supplementary Figs 10 and 20) suggests that these potential transfers do not fundamentally affect the results for domain ancestors. Finally, because molecular thermometers measure the average environmental temperature of the hosts of ancestral genes, they apply even if ancestral genes of extant prokaryotes originate from diverse organisms19.

Thus, all our analyses support the hypothesis of a non-hyperthermophilic LUCA and of transitions to higher environmental temperatures for its descendants. Although these organisms have not yet been anchored in time23, a few geological and biological factors may explain observed changes in temperature preferences. It has already been observed4 that the general trend of decreasing OGTs from the bacterial ancestor to extant species strikingly parallels recent geological estimates of the progressive cooling down of oceans shifting from about 70 °C 3.5 billion years ago to approximately 10 °C at present24. The evolution of thermophily in the bacterial domain might therefore stem from the continuous adjustment of Bacteria to ocean temperatures, although the evidence for a hot Archaean climate remains debated25. A similar conclusion may apply to Archaea as well, but would require confirmation with additional genome sequences from mesophilic Archaea. A hot Archaean ocean may preclude the existence of a cool ‘little pond’ where the LUCA could have evolved. Therefore, a non-hyperthermophilic LUCA would suggest that moderate temperatures existed earlier in the history of the Earth.

Geological data about palaeoclimates that old are very scarce. However, some models of Hadean and early Archaean climates (3.5–4.2 billion years ago) suggest that the Earth might have been colder than it is today, possibly covered with frozen oceans1,26. Moreover, a hypothesis of abrupt temperature changes involving meteoritic impacts that boiled the oceans, and therefore nearly annihilated all life forms but the most heat-resistant ones, has been proposed1,8,9. Huge meteorites probably impacted the Earth at least as late as 3.8–4 billion years ago, most notably during the late heavy bombardment27, and created a series of brief but very hot climates on Earth1. As life may have originated more than 3.7 billion years ago28, it is possible that early organisms, namely the LUCA’s offspring, experienced such bottlenecks.

Alternatively, under the hypothesis that life originated extra-terrestrially, the transfer of life to the Earth from another planet in ejecta created by meteorite impacts would have also entailed selection of heat-resistant cells1. Overall, geological knowledge provides several frames that might fit the predictions of our biological thermometers.

A biological hypothesis could provide an internal mechanism to explain the observed pattern. It posits that the LUCA had an RNA genome, and that its offspring lineages independently evolved the ability to use DNA for genome encoding10, possibly by co-opting it from viruses11. Although our results do not bring direct evidence in support of this hypothesis, they are compatible with it and could even help explain such independent acquisitions of DNA in adaptive terms, as DNA is much more thermostable than RNA29.

Great care is necessary when attempting a reconstruction of events that took place more than three billion years ago. However, the strong agreement between results obtained using two types of data (proteins and rRNAs), two independent temperature proxies (protein amino acid composition and rRNA G+C content), and independently developed statistical models, is remarkable. This suggests that a similar approach could successfully be used to gain insight into other ecological features of early life. For example, it has been shown that aerobic and anaerobic bacteria differ in the amino acid composition of their proteome30; future ancestral sequence reconstructions could reveal the evolution of aerobiosis along the tree of life in relation to the geological record of atmospheric oxygen concentration.

Methods Summary

Ribosomal RNA sequences were aligned according to their shared secondary structure. Sites belonging to double-stranded stems were selected to obtain an alignment of 1,043 stem sites for 456 organisms. Protein families with wide species coverage and no or very low redundancy in all species were selected from the HOGENOM database of families of homologous genes. Only sites showing less than 5% gaps were kept, giving an alignment of 3,336 positions for 30 organisms. Phylogenetic trees were inferred using Bayesian or maximum likelihood techniques. Ancestral nucleotide and amino acid compositions were inferred for all tree nodes using the programs nhPhyML7 and nhPhyloBayes6, respectively. The G+C contents of ancestral rRNA sequences were compared to extant rRNA base compositions. The second factor of the correspondence analysis of amino acid compositions of extant prokaryotic proteins was used to estimate ancestral environmental temperatures by adding ancestral amino acid compositions as supplementary rows to the correspondence analysis. These two procedures allowed us to estimate ancestral environmental temperatures with the rRNA and the protein data sets, respectively. Confidence intervals for the estimated environmental temperatures were as follows: in the case of rRNAs, they contained 95% of the distribution obtained by a bootstrap procedure (200 replicates); for Bayesian analyses, regular 95% credibility intervals were computed from a sample of 2,000 points drawn from the posterior distribution.
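The interval construction described above can be sketched with a simple percentile rule applied to a list of temperature estimates, whether these come from bootstrap replicates or from posterior samples. The index rule below is one of several reasonable conventions and is an assumption of this sketch, as are the function name and the toy data.

```python
def percentile_interval(samples, coverage=0.95):
    """Return (low, high) bounds enclosing roughly the central `coverage` share of samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    tail = (1.0 - coverage) / 2.0
    # Number of values trimmed from each end of the sorted sample.
    low_index = int(tail * (len(ordered) - 1))
    high_index = len(ordered) - 1 - low_index
    return ordered[low_index], ordered[high_index]

# Hypothetical OGT estimates (°C) from 200 bootstrap replicates.
replicates = [60.0 + 0.1 * i for i in range(200)]
low, high = percentile_interval(replicates)
```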

Online Methods

rRNA data set

Prokaryotic small (SSU) and large (LSU) subunit rRNAs were retrieved in January 2007 from complete genomes available at the National Center for Biotechnology Information (NCBI). SSU and LSU rRNA sequences from ongoing genome projects or from large genomic fragments of important or poorly represented groups (for example, Archaea or hyperthermophilic bacteria) were added in June 2007. Eukaryotic SSU and LSU rRNA sequences were provided by D. Moreira; 65 slowly evolving sequences were selected from this data set31. Sequences were aligned using MUSCLE32. Resulting alignments were concatenated and manually improved using the MUST package33. Regions of doubtful alignment were removed using the MUST package; 2,239 sites were kept. A distance phylogenetic tree was computed using dnadist (Jukes and Cantor model) and neighbour from the PHYLIP package34. The final data set contained 65 eukaryotic, 60 archaeal and 331 bacterial sequences representative of the molecular diversity in each domain. An additional data set of 60 sequences sampling the diversity of the full data set was used in Bayesian analyses. Secondary structure predictions were downloaded from the rRNA database35. Sites that were predicted as double-stranded stems in Saccharomyces cerevisiae, Escherichia coli and Archaeoglobus fulgidus were selected to give an alignment of 1,043 sites.

Protein data set

Nearly universal protein families with one member per genome were used to avoid ill-defined orthology. Protein families from the HOGENOM database of families of homologous genes (release 03, October 2005, S. Penel and L. Duret, personal communication; http://pbil.univ-lyon1.fr/databases/hogenom3.html) that displayed a wide species coverage with no or very low redundancy in all species were selected. Additional sequences from other genomes whose phylogenetic position was interesting were considered. These were downloaded from the Joint Genome Institute (Desulfuromonas acetoxidans), The Institute for Genomic Research (Giardia lamblia, Tetrahymena thermophila, Trichomonas vaginalis) or the NCBI (Kuenenia stuttgartiensis), and were searched for homologous genes using BLAST36; only the best hit was retrieved. The protein families were subsequently aligned using MUSCLE32 and submitted to phylogenetic analysis using the NJ algorithm37 with Poisson distances with Phylo_Win38. Proteins from mitochondrial or chloroplastic symbioses and families in which horizontal transfers between Bacteria and Archaea may have occurred were discarded, and so were aminoacyl-tRNA synthetases prone to transfers39. In the rare families with two sequences from the same species, the sequence showing the longest terminal branch or whose position was most at odds with the biological classification was discarded. This provided 56 protein families (Supplementary Table 2) for 115 species, which were concatenated using ScaFos40. From the 9,218 concatenated sites, 3,336 positions with less than 5% gaps were retained. The whole data set was used to compute the correspondence analysis and correlations between amino-acid composition and optimal growth temperature. For Bayesian analyses, 30 of the 115 species were selected to sample the diversity of cellular life (Supplementary Table 1).
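The gap-based site selection can be sketched as a column filter over an alignment. This is a minimal illustration, not the procedure actually run on the concatenated data set; the function name, the choice of gap characters and the toy alignment are assumptions.

```python
def keep_low_gap_columns(alignment, max_gap_fraction=0.05, gap_chars="-?"):
    """Keep alignment columns whose gap fraction is strictly below the threshold."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    kept = [
        col for col in range(length)
        if sum(row[col] in gap_chars for row in alignment) / len(alignment)
        < max_gap_fraction
    ]
    return ["".join(row[col] for col in kept) for row in alignment]

# Toy alignment of three sequences; columns 1 and 2 each contain one gap.
toy = ["MK-LV", "MKALV", "M-ALV"]
filtered = keep_low_gap_columns(toy)
```

With only three sequences, a single gap already exceeds the 5% threshold, so only the gap-free columns survive in this toy case.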

Multivariate data analyses

Correspondence analysis41 was performed on the amino-acid compositions of the protein data set, using the ade4 package42 of the R environment for statistical computing.

Phylogenetic tree construction

An rRNA phylogenetic tree was built from the 456-sequence alignment with both stems and loops with PhyML_aLRT43,44 with the GTR model, a gamma law with eight categories and an estimated proportion of invariant sites. The tree for the 60-sequence data set was obtained in the same manner. The phylogenetic trees for the three protein data sets (Supplementary Table 1) were obtained using MrBayes 3.1.1 (ref. 45), using the GTR substitution model and a gamma law with four categories for rates across sites. Chains were run for 1,000,000 generations and samples were collected every 100 generations; a burn-in of 1,000 samples was discarded. The majority rule consensus was computed from the 9,000 remaining samples.

Identification of fast-evolving rRNA sites

Posterior probabilities for gamma law rate categories were predicted for each site with PhyML_aLRT. Site evolutionary rates were obtained by averaging gamma law rate categories weighted by their posterior probabilities. Sites whose evolutionary rate was above the arbitrarily chosen threshold of 2.0 (Supplementary Fig. 2) were discarded, which left 940 sites.
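The rate computation and filtering step can be sketched as a posterior-weighted mean followed by a threshold. The rate values and posterior probabilities below are hypothetical, and four categories are used for brevity instead of the eight used for the rRNA tree; the function names are also assumptions.

```python
def site_rate(category_rates, posteriors):
    """Posterior-weighted mean of the gamma rate categories for one site."""
    if len(category_rates) != len(posteriors):
        raise ValueError("one posterior probability is needed per rate category")
    total = sum(posteriors)  # renormalise in case of rounding in the input
    return sum(r * p for r, p in zip(category_rates, posteriors)) / total

def slow_sites(category_rates, site_posteriors, threshold=2.0):
    """Indices of sites whose mean rate does not exceed the threshold."""
    return [
        i for i, post in enumerate(site_posteriors)
        if site_rate(category_rates, post) <= threshold
    ]

# Hypothetical values: four rate categories, three sites.
rates = [0.1, 0.5, 1.2, 3.0]
posteriors = [
    [0.70, 0.20, 0.10, 0.00],  # mostly slow categories
    [0.00, 0.10, 0.30, 0.60],  # mostly fast categories
    [0.10, 0.40, 0.40, 0.10],  # intermediate
]
kept = slow_sites(rates, posteriors)
```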

Estimation of ancestral compositions

For the maximum likelihood approach, nhPhyML7 was applied to the rRNA stem sites alignment and the phylogenetic tree described above, and used to estimate all evolutionary parameter values, except tree topology, which was fixed. Site-specific ancestral nucleotide compositions at tree root and at internal node j descendant of node i were computed by:

$$P(x \text{ at root}) = \frac{\omega(x)\, L_{\mathrm{low}}(x \text{ at root})}{L}, \qquad P(x \text{ at } j) = \frac{\sum_{y} L_{\mathrm{upp}}(y \text{ at } i)\, p_{yx}\, L_{\mathrm{low}}(x \text{ at } j)}{L}$$

where x and y are in {A, C, G, T}, $L$ is the total tree likelihood at this site, $L_{\mathrm{low}}$ and $L_{\mathrm{upp}}$ are site lower and upper conditional likelihoods, respectively7, ω is the maximum likelihood estimate of root G+C content (with ω(x) denoting the corresponding root frequency of state x), and $p_{yx}$ is the probability of the y to x substitution on the i to j branch. For Bayesian analyses, nhPhyloBayes6 was applied to trees described above. Ancestral sequence reconstruction started, for each site, by drawing a state x at the root with probability proportional to $\omega(x)\, L_{\mathrm{low}}(x \text{ at root})$, where ω was the Markov Chain Monte Carlo45 (MCMC) estimate of root amino acid or nucleotide frequencies. Then, states x were recursively drawn at each node j with probability proportional to $p_{yx}\, L_{\mathrm{low}}(x \text{ at } j)$, where y was the parental node state. Given a realization of the model, this permitted the reconstruction of ancestral sequences at all nodes. Posterior distributions were sampled by 2 (for proteins) or 4 (for rRNA) independent MCMC chains, each with 1,000 to 2,000 realizations. Posterior distributions of sequence compositions combined all realizations of all chains. Protein ancestral compositions were projected on the second axis of the correspondence analysis, and rRNA ancestral compositions were summed up as G+C contents.
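The root step of this reconstruction can be sketched for a single site. The sketch assumes that the lower conditional likelihoods at the root have already been computed by a separate pruning pass; the numerical values, the dictionary layout and the function names are hypothetical and do not reflect the internals of nhPhyML or nhPhyloBayes.

```python
import random

NUCLEOTIDES = "ACGT"

def root_posterior(root_freqs, lower_likelihoods):
    """Normalise omega(x) * L_low(x at root) into posterior state probabilities."""
    weights = {x: root_freqs[x] * lower_likelihoods[x] for x in NUCLEOTIDES}
    site_likelihood = sum(weights.values())  # total likelihood L at this site
    return {x: w / site_likelihood for x, w in weights.items()}

def draw_state(posterior, rng):
    """Draw one state in proportion to its posterior probability."""
    states = list(posterior)
    return rng.choices(states, weights=[posterior[s] for s in states], k=1)[0]

# Hypothetical root frequencies and lower conditional likelihoods for one site.
omega = {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}
l_low = {"A": 0.001, "C": 0.004, "G": 0.010, "T": 0.001}

posterior = root_posterior(omega, l_low)
# Expected G+C contribution of this site at the root.
gc_expectation = posterior["G"] + posterior["C"]
# One sampled realization, with a fixed seed for reproducibility.
sampled = draw_state(posterior, random.Random(0))
```

Averaging `gc_expectation` over sites would give an ancestral G+C content in the maximum likelihood setting, whereas repeated calls to `draw_state` correspond to the sampling of realizations in the Bayesian setting.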

Statistical tests

In bootstrap analyses, all parameters but topology and branch lengths were estimated under the maximum likelihood criterion for each replicate. In tests of whether the LUCA is less thermophilic than one of its descendants, P values were the fraction of cases where the temperature estimate for LUCA in a bootstrap replicate or in an iteration of an MCMC chain was above the estimate obtained for its descendant.
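The P value defined above is a simple proportion over paired replicates, which can be sketched as follows. The temperature values are made up for illustration and the function name is hypothetical.

```python
def p_value_luca_warmer(luca_estimates, descendant_estimates):
    """Fraction of paired replicates in which the LUCA estimate exceeds the descendant's."""
    if len(luca_estimates) != len(descendant_estimates) or not luca_estimates:
        raise ValueError("need two non-empty, equal-length lists of paired estimates")
    above = sum(l > d for l, d in zip(luca_estimates, descendant_estimates))
    return above / len(luca_estimates)

# Hypothetical OGT estimates (°C) from five paired replicates.
luca = [48.0, 52.0, 61.0, 45.0, 50.0]
bacterial_ancestor = [70.0, 68.0, 60.0, 72.0, 66.0]
p = p_value_luca_warmer(luca, bacterial_ancestor)
```

Each pair must come from the same bootstrap replicate or the same MCMC iteration, so that the comparison is made within a single reconstruction.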