Main

For our evolution experiments, we chose the well-characterized group I RNA enzyme (ribozyme) derived from the Azoarcus pre-tRNAIle. This ribozyme retains activity at unusually high temperatures (80 °C), or in the presence of high concentrations of denaturants (7.5 M urea)12. Thus, we expected this robust phenotype to tolerate mutations without losing function, making it an ideal candidate for the experimental study of cryptic variation. In addition, detailed functional and structural data on the Azoarcus ribozyme provide guidance for designing and interpreting in vitro evolution experiments13,14,15,16. We used an in vitro procedure for the evolution of a group I intron17,18. In this procedure, populations of RNA molecules evolve on the basis of Darwinian principles, through cycles of random mutagenesis and selection based on the preservation of sequences that successfully perform a catalytic task (Supplementary Fig. 1).

To produce cryptic variation, we exposed ribozyme populations to mutagenesis followed by purifying selection for the ‘native’ activity of RNA phosphate bond cleavage (Supplementary Fig. 1b). We introduced mutations at an average rate of one mutation per individual per generation (Methods). We expected that evolving populations would accumulate mutations while maintaining the native function. We carried out two independent evolution experiments, and called them “line A” and “line B”. They were identical except for the addition of a denaturant (5 M formamide) to line B, which in terms of structural stability is analogous to increasing the temperature by 13.5 °C (ref. 19; Supplementary Fig. 2).

We monitored the native activity of the two lines over ten generations of purifying selection while introducing mutations at every generation (Fig. 1a). We found no significant difference in activity (t-test, 95% confidence interval) between the initial and final rounds of selection in either line. This demonstrates that the procedure maintained the native activity of the population despite mutagenesis, and supports the cryptic nature of any accumulated mutations.

Figure 1: Evolution during selection for the native activity.

a, Activity (fraction of ribozyme reacted) at each generation under conditions used during selection for the native activity (RNA oligonucleotide cleavage) over 10 generations (Methods). Error bars correspond to standard errors of three measurements. b, c, Histograms, from each generation of line A and B, showing the frequency (percent of sample) of individuals with a given number of nucleotide differences from the wild-type sequence (distance). Frequencies from generations 1 (G1) and 10 (G10) are shown as solid lines, and intervening generations are shown as dotted grey lines.

We confirmed mutation accumulation through DNA sequencing of 2,748 ± 770 molecules sampled from each generation (Methods). We determined the mutational distance between individuals in the populations and the original wild-type sequence, which confirmed an increase in population distance from the wild-type sequence over time (Fig. 1b, c). By generation 10 less than 1% of the sampled individuals had no mutations (Supplementary Fig. 3). Mutation accumulation varied at each position in the ribozyme sequence; 35 positions in line A and 19 positions in line B were “mutable”, with a rate of accumulation significantly greater than zero (Methods). Of these positions, 15 were common to both lines.

Although the accumulated mutations did not affect the phenotype of the population under ‘native’ conditions, we proposed that they could facilitate evolutionary adaptation to a new chemical environment. Thus, we challenged the resulting populations to adapt to a non-native function by changing the substrate in the selection procedure. We chose an RNA substrate with identical sequence, but with a phosphorothioate replacing the scissile phosphate. This chemical change represents a “promiscuous activity”20 of the Azoarcus ribozyme with a 200-fold decreased catalytic efficiency (kcat/Km), by mostly affecting kcat (ref. 13). We started new evolution lines from $10^{13}$ (20 pmol) RNA molecules taken from the last generation of line A and line B. We called these lines New-A and New-B, respectively. In addition, we started another new line from a sample of $10^{13}$ RNA molecules taken from the original initial population. We called this line New-WT. In this phase of the experiment we wanted to analyse the effect of previously accumulated mutations on evolutionary adaptation to the new substrate. Thus, we used the same reaction conditions for all three lines, and reduced the mutation rate to 0.16 per individual per generation, by replacing the mutagenic PCR step of our selection procedure with a standard PCR (Methods).

We selected for activity with the new substrate during eight generations, and measured the activity of each population at each generation (Fig. 2a). In each line the activity increased significantly between the first and last generations. However, lines New-A and New-B showed a much faster rate of adaptation than line New-WT. We calculated the rate of adaptation by dividing the percent increase in the fraction of ribozyme reacted by time (generations). The greatest difference in rate is found at generation 5, where the rates of adaptation for lines New-A, New-B and New-WT were 19.5, 15.5 and 2.5, respectively, corresponding to an eightfold faster rate of evolutionary adaptation for line New-A, and a sixfold faster rate for line New-B, relative to line New-WT.

Figure 2: Evolution during selection for the new activity.

a, Activity (fraction of ribozyme reacted) at each generation under conditions used during selection for phosphorothioate bond cleavage, with standard error based on three measurements. b, Frequency of genotypes (percent of sample) over time (generations), and their corresponding relative fitness w. c, Comparison of kinetic parameters for the Azo* and wild-type ribozymes. d, Intermolecular activity of the AzoΔ ribozyme, under the same conditions as during selection (Methods): 200 pmol phosphorothioate substrate, 20 pmol 5′-[32P]-labelled AzoΔ. In addition, lanes 3 and 4 contained 40 pmol wild-type and Azo*, respectively. The negative control ‘No S’ contained no substrate.

Next we identified genotypes (combinations of mutations) that were potentially contributing to the increasing activities of the evolving populations on the basis of their rapid frequency increase within the population (Methods). Two important genotypes stand out (Fig. 2b). In line New-A, the most rapidly increasing genotype, termed AzoΔ, includes deletions at positions 47–53 combined with seven point mutations (G31U, G35U, G70U, G121A, C141U, A144G and G183C). By generation 8, this genotype represented 31% of the population. In line New-B, the most rapidly increasing genotype, termed Azo*, is composed of four point mutations (G32U, G53A, C89U and G179C). By generation 8, sequences containing all four mutations accounted for 23% of the population, and various subsets of these four mutations accounted for 78% of the population. All pairs of mutations of each genotype showed significant correlation coefficients21 (P < 0.05, chi-squared), supporting the conclusion that each combination of mutations usually occurs together in the same individual (Supplementary Table 1).

To confirm the selective advantage of these genotypes, we synthesized clonal transcripts of the AzoΔ, Azo* and wild-type ribozymes for kinetic analysis (Methods). For the Azo* ribozyme, we found an increased activity with a phosphorothioate substrate as compared to the wild-type Azoarcus ribozyme (Fig. 2c). The four mutations of Azo* increase the observed rate constant (kobs) by 131%, and also increase the extent of ribozyme reacted by 76%. Thus, the presence of this genotype in the population accounts for much of the increasing activity of line New-B.

Surprisingly, the clonal preparation of the AzoΔ ribozyme showed no activity towards the phosphorothioate substrate. We proposed that this sequence lacked the ability to fold into the native state individually, but could form an active complex in conjunction with other active ribozymes. Such an intermolecular partnership has been observed in several other ribozyme experiments22,23,24. To test this hypothesis, we assayed the AzoΔ ribozyme for activity with a phosphorothioate substrate alone, or in the presence of either the wild-type or the Azo* ribozymes (Fig. 2d). In these experiments, only the AzoΔ ribozyme was 5′-radiolabelled with 32P, so that only the activity of this ribozyme was observable on a denaturing polyacrylamide gel. The results confirm that although the AzoΔ ribozyme is inactive individually, it regains activity upon addition of either active variant.

We also looked for the presence of the highly active Azo* genotype in lines New-A and New-WT. In line New-A, Azo* is present and increases in frequency from 1.1% to 8.0% over eight generations (Fig. 2b). The lower fitness of Azo* in this line is presumably due to the presence of individually inactive, yet highly fit AzoΔ variants. Because the increase in the frequency of Azo* is modest in this line, linear regression did not identify the four mutations as individually significant, which demonstrates a limitation of the regression approach.

Although the Azo* genotype showed increased activity with the phosphorothioate substrate, this genotype did not rise to high frequency in line New-WT, which had not acquired cryptic variation. This genotype did not appear in the first three generations, and we found only three individuals in generation 8 (0.2% of sample). Analysis of correlation coefficients in line New-WT confirms that Azo* mutations rarely occur together in the same individual (Supplementary Table 1).

Although the Azo* genotype has an increased activity in the new environment, our data indicate that the mutations that comprise this genotype had no advantage in the native environment. First, the composite activities of the populations of lines A and B did not increase during selection for the native activity, indicating that these lines had not yet discovered higher fitness genotypes. Also, the individual mutations of the Azo* genotype do not dramatically increase in frequency during selection for the native activity (Supplementary Figs 4 and 5), and the Azo* genotype (all four mutations) was not detected in line A or B.

To confirm the cryptic nature of the Azo* mutations, we engineered these mutations individually into the Azoarcus ribozyme. We then determined the activity of these variants under the conditions used during selection for the native activity (no formamide) and compared them to the wild-type activity (Supplementary Fig. 6). Only G179C causes an increased mean activity (14%), but that is not significantly different from the activity of the wild-type ribozyme (P = 0.11, t-test). Mutation C32U causes a significantly decreased activity (−33%, P = 0.03). Mutations G53A and C89U both cause non-significant decreased activities (−28%, P = 0.07 and −17%, P = 0.10, respectively). We conclude that the individual mutations of the Azo* genotype had no fitness benefit during selection for the native activity. However, because three of these individual mutations, and several combinations (Supplementary Fig. 6), show no significant difference from the wild type, they can remain in the population despite purifying selection for the native activity. This is consistent with the observations that 10% of sampled individuals in generation 10 of lines A and B had at least one of the Azo* mutations, but none of the mutations showed a marked increase in frequency.

Next we turn to a more detailed visual analysis of sequence space to help us understand why cryptic variation allowed faster adaptation. This space is very high-dimensional and cannot be visualized directly. However, we can study lower-dimensional projections of this space using principal component analysis of aligned sequence data sampled from evolving populations. Figure 3 shows such an analysis based on sequences isolated from three generations of the New lines. It shows that, first, lines New-A and New-B are more diverse during all generations, compared to line New-WT. Second, this analysis confirms the existence of two subpopulations of line New-A, where two clearly discernible clouds of sequences are visible at all times, one contains Azo* and the other contains AzoΔ. Third, it illustrates the high fitness of AzoΔ and Azo* in that the number of sequences belonging to these genotypes increases over generational time. Importantly, it shows that many of the sequences in generation 1 of lines New-A and New-B are close in genotype space to Azo*. Over time, the genotypes become more concentrated around the Azo* genotype. In contrast, in generation 1 of line New-WT, sequences are tightly clustered and distant from Azo*. Over time, this population becomes more diverse, and moves towards the region of space occupied by the Azo* individuals.

Figure 3: Evolution in genotype space.

a, Principal component analysis of pooled sequence data from New-A, New-B and New-WT populations. The first two principal components are shown (‘PC1’ and ‘PC2’). Nodes represent individual sequences. The distance between nodes is proportional to the number of nucleotide differences, but may appear decreased due to the compression of multiple dimensions. The region on the graphs occupied by the AzoΔ sequences is indicated by a grey ellipse. b–d, Frequency of sequences with a given number of the Azo* mutations in generation 10 of line B (B10, b), and the first generation of line New-WT (New-WT1, c) and line New-B (New-B1, d). Frequencies are presented as percentage (left y-axis) and total number (right y-axis).

A candidate explanation for the advantage of cryptic variation that emerges from the previous analyses is that lines A and B had the opportunity to expand in sequence space, such that their sequences came close to regions where advantageous mutations could occur in lines New-A and New-B. Line New-WT did not have this opportunity, and thus adapts more slowly. Thus, although the genetic variation acquired during purifying selection did not affect the population activity on the native substrate, it allowed for rapid adaptation after the environmental perturbation. This rapid adaptation coincides with the rise of Azo*, a variant with increased activity. The proximity of line B individuals to Azo*, relative to line New-WT individuals, is supported by analysing the positions in the ribozyme sequence where the Azo* mutations occur: 32, 53, 89 and 179 (Fig. 3b–d). The results show that many sequences in generation 10 of line B (B10) already possess two or three of the four Azo* mutations. No individual in the first generation of line New-WT (New-WT1) possesses three Azo* mutations, and only a fraction of a percent possess two. The proximity of line B individuals is also supported by a clustering analysis, which shows that sequences that cluster near Azo* are present in B10 but not in New-WT1 (Supplementary Fig. 7). Thus, the cryptic diversity acquired during purifying selection for the native activity moved some of the population to regions of genotype space that happen to be proximal to a genotype with high fitness for the new substrate.

Our observations demonstrate that cryptic variation can facilitate adaptation, and why. Populations under purifying selection for a trait can evolve genotypic diversity, if there are many different genotypes with the same or similarly well-adapted phenotype. Some of this diversity is fortuitously pre-adapted to new environments, which aids the population’s evolutionary adaptation after environmental change. We note that this genotypic diversity is a signature of extensive epistasis. Indeed, such epistasis has recently been demonstrated in protein and RNA phenotypes25,26,27. Epistasis is important in our system, because several individual mutations do not provide a large fitness advantage alone, but do so in combination. The ability to explore such combinations of mutations cryptically is especially important in cases where high-fitness genotypes require several interacting mutations. These observations support theoretical work which demonstrates that the release of hidden variation after perturbation is a general property of genetic systems near mutation–selection balance with epistatic or gene–environment interactions28.

The phenotype of our study system is much simpler than complex traits of higher organisms. However, with this system we can monitor population-wide genotypic change over multiple generations and study the relationship, in genotype space, between standing variation and high fitness genotypes. Our results suggest that we may understand the role of cryptic variation in complex organismal traits to the extent that we can analyse their evolution in an underlying genotype space.

Methods Summary

The double-stranded DNA template for the Azoarcus ribozyme was produced from a two-step PCR-based assembly of synthetic oligonucleotides29. Ribozyme populations were prepared from in vitro transcription (T7 RNA polymerase) and purified for length homogeneity by denaturing PAGE (6% polyacrylamide/8 M urea). Mutagenesis was achieved by a mutagenic PCR procedure24, and to a lesser extent by the inherent mutation rates of the polymerase enzymes of the selection procedure. Substrate oligonucleotides were produced by solid phase synthesis and purified by PAGE (Microsynth). Selection reactions and activity assays contained 20 pmol ribozymes, and either 100 pmol RNA oligonucleotide substrate, or 200 pmol phosphorothioate-containing substrate (equal mixture Rp/Sp). Negative controls for the selection protocol were carried out for every generation by skipping the reverse transcription step, but keeping the remainder of the protocol identical, and were monitored at both PCR steps by agarose gel electrophoresis. No band was ever observed in a negative control. Kinetic parameters were determined by nonlinear curve fitting of time course data (Methods). Complementary DNA samples from each generation were appended with a primer sequence unique to that generation via a PCR reaction. Samples from all generations were combined and sequenced on a single picotitre plate with a GS-FLX system (Roche/454 Life Sciences) at the Functional Genomics Facility, Zurich. P-values from linear regression were adjusted for multiple testing using the Benjamini–Hochberg procedure30. Principal component analysis was performed using the princomp function in Matlab on multiple sequence alignments of data from pooled generations.

Online Methods

RNA preparation

The dsDNA templates containing variants of the Azoarcus ribozyme were constructed by a two-step PCR-based assembly from six synthetic DNA oligonucleotides29. The templates contain 197 nucleotides of the Azoarcus group I intron, excluding the first eight nucleotides but including the nucleophilic terminal guanosine (G205), all preceded by the T7 promoter sequence to allow in vitro transcription. Transcription reactions (200 μl) contained 50 mM Tris pH 7.5, 15 mM MgCl2, 5 mM DTT, 2 mM spermidine, approximately 160 ng dsDNA template, T7 RNA polymerase (unknown concentration), and RNase-free water (Ambion), and were incubated at 37 °C for at least 4 h. Reactions were then DNase treated to remove the DNA template at 37 °C for 30 min with 10 U RNase-free DNase I (Promega). Reactions were stopped with the addition of 15 mM EDTA, and extracted two times with phenol:chloroform (5:1, pH 4.5, Ambion) to remove protein enzymes and remaining DNA template. Reactions were ethanol-precipitated and rehydrated in equal volumes RNase-free water (Ambion) and a formamide loading dye. RNA was purified for length homogeneity by denaturing polyacrylamide gel electrophoresis (PAGE, 6% polyacrylamide, 8 M urea). Purified RNA was visualized by ultraviolet light, excised from the gel, and eluted into 0.3 M sodium acetate (pH 5.5) by diffusion. Eluted RNA was passed through a 0.2 μm spin column filter (VWR), ethanol-precipitated, and rehydrated to a desired concentration in RNase-free water (Ambion). Substrate oligonucleotides GGCAU(AAAU)4A and GGCAUs(AAAU)4A (s = phosphorothioate bond) were produced by solid phase synthesis and purified by denaturing PAGE (Microsynth). Concentrations were determined by ultraviolet-absorbance on an ND-1000 spectrophotometer (NanoDrop Technologies).

Mutation rates and diversity calculations

The per generation mutation rates of our selection procedure were estimated on the basis of previously reported mutation rates from a publication that quantified the number of mutations resulting from reverse transcription and PCR of a group I ribozyme31. The authors calculated a mutation rate for both a mutagenic PCR procedure and a standard “non-mutagenic” PCR procedure. For the mutagenic PCR procedure, the authors reported a mutation rate of 0.0066 ± 0.0013 per nucleotide per PCR (95% confidence interval, n = 16,591), with very low mutational bias, by including the following in the reaction mixture: manganese (0.5 mM), a biased nucleotide pool (5:5:1:1 ratio of dCTP:dTTP:dATP:dGTP), and elevated levels of magnesium (7 mM) as well as Taq polymerase (5 U, NEB) relative to a standard PCR. We used these mutagenic PCR conditions to generate the initial populations used to start lines A, B and New-WT, and to introduce mutations at each generation of lines A and B. For the standard PCR protocol, the authors reported a mutation rate of 0.001 per nucleotide per PCR. Our standard PCR conditions were very similar, and we used this mutation rate as an estimate of our mutation rate when these conditions were used instead of the mutagenic PCR, that is, in the New lines. The lower mutation rate is consistent with other calculations of PCR mutation rates under ‘standard’ conditions that are similar to those used during our New selection lines32. Using the average of the eight reported values of per nucleotide per cycle mutation rates (p), our per PCR mutation rate (f) can be calculated as $f = np/2$, where n is the number of doublings observed in our evolution procedure (n = 27, from $6 \times 10^{8}$ fold amplification over two PCRs). Using this formula we calculate an expected mutation rate of 0.00116 per nucleotide per PCR for our non-mutagenic procedure.

We calculated the average number of mutations per individual in our population as the per nucleotide per PCR mutation rate times the length of the mutable region of our ribozyme sequence (inside the primer binding sites). This calculation gives an average number of mutations per individual of 1.05 under mutagenic conditions and 0.159 under our ‘non-mutagenic’ conditions. The expected composition of populations produced from the given error rate was also calculated using binomial statistics33. The probability P of a molecule having k mutations in a population with a mutagenized region of length l produced with a mutation rate of m was calculated as: $P(k,l,m) = \frac{l!}{(l-k)!\,k!}\, m^{k} (1-m)^{l-k}$.
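
As an illustration only, the binomial expectation for the number of mutations per molecule can be sketched in Python. The region length l = 140 is a hypothetical value, because the length of the mutable region is not stated here; m is then chosen so that the mean l × m matches the 1.05 mutations per individual reported for mutagenic conditions.

```python
from math import comb

def mutation_count_probability(k, l, m):
    """Binomial probability that a molecule carries exactly k mutations,
    given l mutable positions and a per-position mutation rate m."""
    return comb(l, k) * m**k * (1 - m)**(l - k)

# Hypothetical region length (an assumption, not a value from the text).
l = 140
# Per-position rate chosen so that the mean l * m is 1.05 mutations.
m = 1.05 / l

# Full distribution over k = 0 .. l mutations per molecule.
distribution = [mutation_count_probability(k, l, m) for k in range(l + 1)]
mean_mutations = sum(k * p for k, p in enumerate(distribution))

print(f"P(no mutation) = {distribution[0]:.3f}")
print(f"mean mutations per molecule = {mean_mutations:.2f}")
```

Under these assumed values, roughly a third of molecules are expected to remain unmutated after a single round, which shows why repeated rounds are needed before unmutated sequences become rare.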

Selection procedure

Active variants were selected based on a reverse splicing reaction containing 20 pmol ribozyme population, 30 mM EPPS (pH 7.5), 25 mM MgCl2, and either 100 pmol RNA oligonucleotide substrate, or 200 pmol phosphorothioate-bond-containing substrate (equal mixture Rp/Sp). Reactions were incubated at 37 °C for 1 h. A sample of this reaction (20%) was then directly subjected to a reverse transcription reaction containing 1 mM dNTPs, 5 units AMV-RT (Fermentas), the commercially supplied buffer, and 200 pmol of a DNA primer (5′-TATTTATTTATTTATTTCC-3′) complementary to the 3′ end of the substrate, and to the final two nucleotides of the ribozyme. A portion of the resulting cDNA (5%) was used as a template in a ‘selective’ PCR containing 2 mM dNTPs, 1.5 mM MgCl2, 10 pmol of the reverse transcription primer, and 10 pmol of a primer complementary to the 5′-end of the ribozyme (5′-CCGGTTTGTGTGACTTTGCC-3′). Approximately 0.1 fmol of the resulting DNA was subjected to a ‘regenerative’ PCR reaction, with primers that restore the 3′ end of the ribozyme sequence, and that append a T7 promoter sequence to the 5′ end. In lines A and B, the regenerative PCR was done under mutagenic conditions, as described above. In New lines, it was performed under standard ‘non-mutagenic’ conditions. For all PCR reactions we used a standard Taq polymerase (NEB), as well as standard Taq buffer: 10 mM Tris, 50 mM KCl, 1.5 mM MgCl2, pH 8.3 (NEB). Selection-negative controls were conducted to control for amplification of unreacted ribozymes and contaminating DNA by skipping the reverse transcription step, while keeping the rest of the protocol identical. These controls were monitored at both PCR steps by agarose gel electrophoresis. No bands were observed in the negative controls at any generation.

Kinetic experiments

Kinetic experiments were carried out under the same conditions as the selection reactions (above). The Azoarcus ribozyme variants were prepared side by side to minimize sample to sample variation. Reactions were performed in 40 μl volumes, and 5 μl aliquots were removed at six time points. The fraction of ribozyme converted to a 3′-modified species by ligation of a portion of the substrate was calculated from the relative fluorescent intensity of bands after separation by denaturing PAGE (6% acrylamide, 1,000 Vh) and staining with GelRed (Biotium). Reactions were carried out in triplicate and fit to the equation $F(t) = A(1 - e^{-kt})$, where F is the fraction reacted at time t, and A and k are nonlinear fitting parameters used to estimate the extent of reaction and the observed rate, respectively. Fitting was performed with Gnuplot (http://www.gnuplot.info).
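
A single-exponential fit of this kind can be sketched in Python as follows. This is not the Gnuplot routine used for the reported values: as a simpler stand-in, the sketch solves for the best A analytically at each trial k and then searches over k by golden-section search, which assumes the squared error has a single minimum within the search bounds. The time points, search bounds and simulated data are hypothetical.

```python
import math

def fit_saturating_exponential(times, fractions,
                               k_low=1e-4, k_high=10.0, iterations=100):
    """Least-squares fit of F(t) = A * (1 - exp(-k * t)); returns (A, k)."""

    def amplitude_for(k):
        # For a fixed k the model is linear in A, so A has a closed form.
        g = [1.0 - math.exp(-k * t) for t in times]
        return sum(f * gi for f, gi in zip(fractions, g)) / sum(gi * gi for gi in g)

    def squared_error(log_k):
        k = math.exp(log_k)
        a = amplitude_for(k)
        return sum((f - a * (1.0 - math.exp(-k * t))) ** 2
                   for f, t in zip(fractions, times))

    # Golden-section search over log(k) between the assumed bounds.
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    low, high = math.log(k_low), math.log(k_high)
    for _ in range(iterations):
        left = high - ratio * (high - low)
        right = low + ratio * (high - low)
        if squared_error(left) < squared_error(right):
            high = right
        else:
            low = left
    k = math.exp((low + high) / 2.0)
    return amplitude_for(k), k

# Hypothetical, noise-free time course with six time points (minutes).
times = [1, 2, 5, 10, 20, 60]
fractions = [0.6 * (1.0 - math.exp(-0.1 * t)) for t in times]

extent, rate = fit_saturating_exponential(times, fractions)
print(f"A = {extent:.3f}, k = {rate:.3f} per minute")
```

With triplicate reactions, each replicate could be fitted separately and the resulting A and k values averaged, or all points could be pooled into one fit; the text does not say which was done.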

Ultra high-throughput (UHT) sequencing

Approximately 0.03 pmol of selective PCR product was used as input into an additional PCR to prepare samples for UHT-sequencing. Reactions (50 μl) contained 1.25 units Taq polymerase (NEB), standard Taq buffer (NEB), 2 mM dNTPs, and 15 pmol primers with extensions that allow compatibility with a GS FLX System (Roche, 454 technology). In addition, one of the primers contained a unique 6 nucleotide sequence. Each unique sequence differed from every other by at least two nucleotides to prevent confusion by sequencing errors. A different unique sequence was used for each generation in the study. PCR products were quantified on agarose gels stained with GelRed (Biotium), and adjusted to approximately equal concentrations. A portion (3 μl) of each resulting PCR product was pooled, and the pool was sequenced on a single picotitre plate. After sequencing, the unique primer sequence was used to sort sequences by line and generation. Sequences shorter than 95% of the wild-type sequence or with average quality scores less than 35 were discarded.
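
The sorting and filtering steps can be sketched in Python. The barcodes, sample labels and toy reads below are hypothetical, and the sketch assumes that the 6 nucleotide tag sits at the start of each read, that it is matched exactly, and that the length criterion applies to the read after the tag is removed; none of these details is specified above.

```python
from itertools import combinations

BARCODE_LENGTH = 6  # length of the unique per-generation sequence

def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def barcodes_are_distinct(barcodes, min_distance=2):
    """True if every pair of barcodes differs at min_distance or more positions."""
    return all(hamming(a, b) >= min_distance
               for a, b in combinations(barcodes, 2))

def demultiplex(reads, barcode_to_sample, wild_type_length,
                min_length_fraction=0.95, min_mean_quality=35):
    """Sort (sequence, qualities) reads by barcode and apply both filters."""
    sorted_reads = {sample: [] for sample in barcode_to_sample.values()}
    for sequence, qualities in reads:
        sample = barcode_to_sample.get(sequence[:BARCODE_LENGTH])
        insert = sequence[BARCODE_LENGTH:]
        if sample is None or not qualities:
            continue  # unknown barcode, or no quality scores available
        if len(insert) < min_length_fraction * wild_type_length:
            continue  # shorter than 95% of the wild-type length
        if sum(qualities) / len(qualities) < min_mean_quality:
            continue  # average quality score below the threshold
        sorted_reads[sample].append(insert)
    return sorted_reads

# Hypothetical barcodes and toy reads for a wild-type length of 20.
barcode_to_sample = {"ACGTAC": "A-G1", "TGCATG": "A-G2"}
reads = [
    ("ACGTAC" + "G" * 20, [38] * 26),  # kept, line A generation 1
    ("TGCATG" + "G" * 18, [38] * 24),  # too short
    ("ACGTAC" + "G" * 20, [30] * 26),  # low quality
    ("TTTTTT" + "G" * 20, [38] * 26),  # unknown barcode
    ("TGCATG" + "G" * 19, [36] * 25),  # kept, line A generation 2
]

result = demultiplex(reads, barcode_to_sample, wild_type_length=20)
print({sample: len(kept) for sample, kept in result.items()})
```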

Linear regression and false discovery rate control

A multiple sequence alignment was constructed separately for sequences determined in each population and each generation using the Needleman–Wunsch algorithm34. For each position in each alignment, the frequency of mutation was calculated. The regression coefficient (slope) was determined for each position by linear regression of mutational frequency over generation time. A t-test was performed and raw P-values were calculated for each coefficient to determine significance relative to either zero (lines A and B) or the expected mutation rate in that line (see above). Raw P-values were rank-ordered, and converted to false-discovery-rate-controlled P-values by the Benjamini and Hochberg procedure30.
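
Two steps of this analysis, the per-position regression slope and the Benjamini–Hochberg adjustment, can be sketched in Python. The t-test that produces the raw P-values is omitted, and the frequencies and raw P-values in the example are hypothetical.

```python
def regression_slope(generations, frequencies):
    """Ordinary least-squares slope of mutation frequency over generation."""
    n = len(generations)
    mean_x = sum(generations) / n
    mean_y = sum(frequencies) / n
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(generations, frequencies))
    sxx = sum((x - mean_x) ** 2 for x in generations)
    return sxy / sxx

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest raw P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted

# Hypothetical mutation frequencies at one position over four generations.
slope = regression_slope([1, 2, 3, 4], [0.01, 0.02, 0.03, 0.04])

# Hypothetical raw P-values for four alignment positions.
raw_p = [0.01, 0.04, 0.03, 0.005]
adjusted_p = benjamini_hochberg(raw_p)

print(f"slope = {slope:.3f} per generation")
print([round(p, 3) for p in adjusted_p])
```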

Genotype identification

We determined the population frequency of different combinations of mutations, and identified genotypes that increased most rapidly in their frequency with respect to generation time. We limited our search to genotypes that were comprised of combinations of mutations that individually showed a significant increase in frequency with time. The significance of increase for individual mutations was determined by linear regression with correction for multiple testing (Supplementary Fig. 4). It should be pointed out that the increase in the frequency of genotypes that are under positive selection is expected to be exponential, not linear, under standard population genetic models21. Linear regression penalizes nonlinearity, and thus our determination of significance of individual mutations is a conservative approach.

Mutational co-occurrence

For each pair of mutations deemed significant by linear regression, the frequency with which each mutation occurred without the other was determined and termed $P_{Ab}$ and $P_{aB}$. The frequency with which each pair of mutations occurred together was also determined and termed $P_{AB}$, and the frequency with which neither occurred was determined and termed $P_{ab}$. We first determined co-occurrence of two mutations by calculating standard linkage disequilibrium (D) by the formula: $D = (P_{AB} \times P_{ab}) - (P_{Ab} \times P_{aB})$. A normalized linkage disequilibrium (D′) was calculated by dividing positive D values by a theoretical maximum, and negative D values by a theoretical minimum21. As our ultimate measure of correlation, $r^{2}$ (the square of the correlation coefficient) was calculated by the equation $r^{2} = D^{2}/(P_{a} P_{b} P_{A} P_{B})$, where $P_{A}$ and $P_{B}$ are the frequencies of the two mutations, and $P_{a}$ and $P_{b}$ are the frequencies of all other bases at the respective nucleotide positions. The value of $r^{2}$ multiplied by the number of sequences analysed is numerically equivalent to the value of $\chi^{2}$, which was used to determine statistical significance21. We note that molecules in our populations are not explicitly subject to recombination35, such that one would not expect correlations between mutations to decay significantly in the relatively short time scales of our experiments.
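
These co-occurrence statistics can be sketched in Python from the four haplotype counts. The counts in the example are hypothetical. The bounds used to normalize D are the standard Lewontin definitions, which is an assumption because the exact normalization is not spelled out above, and the function assumes both positions are polymorphic so that no denominator is zero.

```python
def linkage_statistics(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, D', r^2, chi^2) for one pair of mutations.

    n_AB: sequences with both mutations; n_Ab and n_aB: sequences with
    only one of them; n_ab: sequences with neither.
    """
    n = n_AB + n_Ab + n_aB + n_ab
    p_AB, p_Ab, p_aB, p_ab = (count / n for count in (n_AB, n_Ab, n_aB, n_ab))
    p_A, p_B = p_AB + p_Ab, p_AB + p_aB   # frequency of each mutation
    p_a, p_b = 1.0 - p_A, 1.0 - p_B       # frequency of all other bases

    d = p_AB * p_ab - p_Ab * p_aB
    if d >= 0:
        bound = min(p_A * p_b, p_a * p_B)    # theoretical maximum of D
    else:
        bound = -min(p_A * p_B, p_a * p_b)   # theoretical minimum of D
    d_prime = d / bound

    r_squared = d ** 2 / (p_a * p_b * p_A * p_B)
    chi_squared = r_squared * n              # one degree of freedom
    return d, d_prime, r_squared, chi_squared

# Hypothetical counts from 100 sequences.
d, d_prime, r_squared, chi_squared = linkage_statistics(40, 10, 10, 40)

# 3.841 is the chi-squared critical value for P = 0.05 at one degree of freedom.
significant = chi_squared > 3.841
print(d, d_prime, r_squared, chi_squared, significant)
```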

Principal component analysis

Sequences from corresponding generations (1st, 4th and 8th) of lines New-A, New-B and New-WT were combined and aligned in a single multiple sequence alignment. Alignment sites containing more than 95% gaps were removed. Multiple sequence alignments were represented numerically (that is, gap:0; A:1; C:2; U:3; G:4), so a single sequence can be interpreted as a vector of n variables, each of which corresponds to a nucleotide site along the sequence. Principal component analysis was carried out using the princomp function in Matlab (The MathWorks).
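
The gap filtering and numerical encoding that precede the principal component analysis can be sketched in Python. The toy alignment is hypothetical, and the sketch assumes the alignment contains only the five symbols listed above; the principal component step itself is not reproduced here.

```python
# Encoding stated in the text: gap:0; A:1; C:2; U:3; G:4.
ENCODING = {"-": 0, "A": 1, "C": 2, "U": 3, "G": 4}

def encode_alignment(aligned_sequences, max_gap_fraction=0.95):
    """Drop sites with more than max_gap_fraction gaps, then encode each
    sequence as a list of integers (one variable per remaining site)."""
    n_sequences = len(aligned_sequences)
    n_sites = len(aligned_sequences[0])
    kept_sites = [
        site for site in range(n_sites)
        if sum(seq[site] == "-" for seq in aligned_sequences) / n_sequences
        <= max_gap_fraction
    ]
    return [[ENCODING[seq[site]] for site in kept_sites]
            for seq in aligned_sequences]

# Hypothetical three-sequence alignment; the second site is all gaps.
alignment = ["A-CU-", "A-GU-", "C-GUA"]
matrix = encode_alignment(alignment)
for row in matrix:
    print(row)
```

Each row of the resulting matrix is one sequence and each column one alignment site, which is the layout a principal component routine expects for observations and variables.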

Network graph

Sequences from generation 10 of line B, and the first generations of lines New-B and New-WT were combined and aligned as described above. The number of nucleotide differences between every pair of sequences in the alignment was counted (all against all distances). It was necessary to control for the number of nodes and edges to allow visualization of the full data set. Thus, sequences were clustered according to percentage identity using the cd-hit algorithm36. The representative sequence for each cluster was defined as the sequence within that cluster with the lowest average distance to all other sequences in the cluster. Network graphs were constructed with clusters represented as nodes, and edges connecting clusters containing sequences that differ by fewer than 10 nucleotides. Network representation was accomplished using Cytoscape v2.7.0 (http://www.cytoscape.org).
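
The choice of representative sequences and the construction of edges can be sketched in Python. The clustering itself (cd-hit) is not reproduced, so clusters are taken as given. The toy sequences are hypothetical, and the sketch assumes that edges are decided by comparing cluster representatives, which is one possible reading of the description above.

```python
from itertools import combinations

def hamming_distance(a, b):
    """Number of nucleotide differences between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b))

def representative(cluster):
    """Sequence with the lowest total (hence average) distance to the
    other members of its cluster."""
    return min(cluster,
               key=lambda s: sum(hamming_distance(s, other) for other in cluster))

def network_edges(representatives, max_differences=10):
    """Index pairs of clusters whose representatives differ by fewer than
    max_differences nucleotides."""
    return [(i, j)
            for (i, a), (j, b) in combinations(enumerate(representatives), 2)
            if hamming_distance(a, b) < max_differences]

# Hypothetical toy clusters of aligned four-nucleotide sequences.
clusters = [["AAAA", "AAAU", "AAUU"], ["AAAC"], ["GGGG"]]
nodes = [representative(cluster) for cluster in clusters]

# A threshold of 2 is used only because the toy sequences are short.
edges = network_edges(nodes, max_differences=2)
print(nodes, edges)
```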

Phylogeny

Sequences that clustered (96% identity) with the Azo* sequences from generations B10 and New-B1, as well as the representative sequences from generation New-WT1 (less than 96% identity with Azo*), were used to construct an unrooted phylogenetic tree using maximum likelihood, assuming an HKY85 model, with the PhyML software37. The sequence data were used to determine base frequencies and to estimate the ratio of transitions to transversions.