Comparisons of bacterial and archaeal genomes reveal dynamic evolution and suggest several general evolutionary principles1–4. First, there is a core of about 100 genes that are nearly universal in cellular life forms. Second, 10–30% of the genes in most microbial genomes are ‘ORFans’5,6 represented only in a few closely related organisms. Third, the remaining genes (a substantial majority in any particular genome) show intermediate evolutionary conservation, often with a patchy phyletic distribution suggestive of a complex history of gene losses and gains. Fourth, gene order is poorly conserved beyond the operon level. Finally, the conservation of the gene content and gene order decays monotonically with the sequence similarity of universal genes, indicative of a steady evolution process subject to universal constraints over long timescales.

We sought to investigate the simplest possible model of genome evolution that would account for the distinct, tripartite distribution of the gene frequencies in genomes, with the (nearly) universal, partially conserved and non-conserved (ORFans) classes of genes4. In particular, we addressed the following questions: (1) Are ORFans simply fast-evolving genes or are they qualitatively different from the conserved genes, with respect to biological functionality and the evolutionary regime? (2) Could the evolution of gene content and gene order in microbial genomes be described within the same modelling framework? To test the models, we used a previously described data set of 166 complete archaeal genomes that are organized into clusters of orthologous genes (arCOGs) and the archaeal species tree constructed from concatenated alignments of panarchaeal ribosomal proteins8.

Genome evolution was modelled as follows (Fig. 1; see Methods for details). A circular genome of 2,000 genes evolves via two types of elementary event: gene replacement and genomic segment inversion (the minimal unit of evolution here is a gene). To model a replacement, a gene in a genome is replaced by another gene, which is drawn from a large pool. To model an inversion, two pairs of adjacent genes are selected in the genome, both pairs are broken, and the segment between the breakpoints is inverted (Fig. 1). The relative rates at which genes or gene pairs are selected for replacement and inversion are determined by two gene-specific parameters, ‘T(Turnover)-value’ and ‘S(Shuffle)-value’. The T- and S-values of the new gene are drawn from the respective distributions (see section ‘Rate distributions’). The evolution of the genomes in the model follows the archaeal species tree, with the number of events simulated on each branch being proportional to its length. Generically, the model shows the expected simple behaviour, namely a pair of genomes diverges monotonically with time (phylogenetic distance between genomes), saturating at zero similarity (Figs 2 and 3 and Supplementary Fig. 1).

Figure 1: Architecture of the genome evolution model.

Abbreviations: p_i, probability of selecting the ith gene for replacement; p_{i,j}, probability of selecting a pair of adjacent genes as an inversion breakpoint; T_i, T-value of the ith gene; S_i, S-value of the ith gene. Coloured open circles denote genomes at different nodes of the tree, and coloured filled circles show individual genes.

Figure 2: Gene content similarity between archaea versus distance along the tree.

Each data point represents a pair of archaeal genomes. Designations: data, empirical distances for archaeal genomes; inst+TP, corr shuffle, prediction of the model with an instantaneously replaced gene class, a truncated power distribution of the replacement rates in the second gene class and correlated rates of gene replacement and shuffling (model parameters indicated on the plot); loess, local polynomial regression fit.

Figure 3: Gene order similarity between archaea versus distance along the tree.

Each data point represents a pair of archaeal genomes. Designations: data, empirical distances for archaeal genomes; inst+TP, corr shuffle, prediction of the model with an instantaneously replaced gene class, a truncated power distribution of the replacement rates in the second gene class and correlated rates of gene replacement and shuffling (model parameters indicated on the plot); loess, local polynomial regression fit.

The following distributions of the T- and S-values for genes in an evolving genome were tested: (1) a distribution that consists of one or two classes of genes with the same values within a class, (2) the truncated power (TP) distribution, limited from above by the neutral evolution rate, and (3) the gamma distribution, with shape parameter α and rate parameter β = α (see Methods for details).

As shown previously, a neutral evolutionary model, under which all genes are equally likely to be replaced9,10, poorly fits the observed gene frequency distributions11–13. Nevertheless, we ran this model, with all T-values set to 1 and the shuffling rate to 0. Here, the only adjustable parameter is the gene turnover rate. A clear maximum of the log-likelihood (LL) function was identified, which provided a reference point for the likelihood analysis of more complex genome evolution models (Supplementary Fig. 2 and Table 1).

Implementation of the model with smoothly distributed T-values dramatically improves the likelihood (by >16,000 log-likelihood units, which translates into almost 7,000 decimal orders of magnitude) compared to the neutral model. On the surface defined by the relative turnover rate rT and the distribution shape parameter aT, there exists a likelihood ridge that continues to rise towards increasing rT and decreasing aT (Table 1 and Supplementary Fig. 3). The model with gamma-distributed T-values behaves similarly (Table 1 and Supplementary Fig. 4).
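As a quick check of the conversion quoted above, and assuming the log-likelihood is measured in natural-log units (an assumption; the text does not state the base), a difference of 16,000 units corresponds to:

```latex
\frac{\Delta \mathrm{LL}}{\ln 10} \approx \frac{16{,}000}{2.3026} \approx 6{,}949
\quad \text{decimal orders of magnitude}
```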

Table 1 Evolutionary model parameters and their optimal values.

At these very low values of aT (≤0.06), the distribution of the T-values becomes extremely skewed. Models with smoothly distributed T-values show a trend towards a distribution where most replacements occur in only a few sites, whereas the rest of the genes are replaced orders of magnitude more slowly. This observation prompted us to modify the model by introducing the fast-evolving genes explicitly as a separate class (‘inst+neutral’ model). In this version, some randomly selected sites contain genes belonging to an ‘instantaneously replaced’ class, whereas genes in the rest of the sites have T-values of 1. All the ‘instantaneously replaced’ genes are replaced by genes drawn from the pool at each step of evolution, ensuring that these genes are not shared by the genomes at the end of the simulation.

The value of fINST (fraction of instantaneously replaced genes) becomes the second free parameter of the model, in addition to rT. Implementation of the two-class model leads to another dramatic improvement of the fit with the optimal value of fINST ≈ 12% (Table 1 and Supplementary Figs 5 and 6). Combining a distinct class of instantaneously replaced genes with a smooth T-value distribution among the genes with the finite turnover rates (‘inst+TP’ model) does not further improve the fit of the model for the gene content similarity. The likelihood surface exhibits a plateau that reaches comparable likelihood values at fINST ≈ 12% and aT > 0.7 (Table 1, Fig. 2 and Supplementary Figs 7–9). With a → ∞, the TP distribution degenerates to the delta function and, therefore, the model becomes equivalent to the two-class (‘inst+neutral’) model.

The model in which 12% of the genes belong to the instantaneously replaced class implies that, upon separation, two diverging lineages of microbes almost immediately lose a considerable fraction of their gene content similarity. Indeed, among the archaeal genome pairs separated by no more than 0.018 tree distance units (approximately 98% sequence identity between ribosomal proteins, which corresponds to 1% of the average tree height), the median observed similarity of gene content is 0.92, indicating that about 8% of the genes have already been replaced.

Gene replacement not only changes the gene content but also depletes the common set of adjacent gene pairs, affecting the gene order similarity. To test whether this process alone is sufficient to explain the observed differences in gene order, or gene shuffling needs to be introduced as a separate, independent process, we compared the similarity of gene pairs modelled with rS set to zero with that in the real genomes (‘inst+TP no shuffle’ model). This model predicts a much higher similarity of gene orders than observed (for 100 ‘mid-tree’ genome pairs, separated by the distance corresponding to half of the average tree height, the median predicted gene order similarity is 0.37 compared to the observed value of 0.094; Table 1 and Supplementary Fig. 10). Thus, gene replacement alone is insufficient to account for the evolution of the gene order in archaea.

Including a non-zero segment inversion rate in the two-class (‘inst+neutral’) model results in a much better description of gene order evolution. With the fraction of instantaneously replaced genes and the turnover rate fixed to the best fit values, explicit modelling of gene shuffling improves the gene pair similarity likelihood by >4,000 decimal orders of magnitude (Table 1 and Supplementary Figs. 11 and 12). The model with a class of instantaneously replaced genes, smoothly distributed T-values and an independent distribution of S-values (‘inst+TP, ind shuffle’) fails to improve the fit, producing essentially the same likelihood (Table 1 and Supplementary Fig. 13).

The T-value of a gene determines its relative propensity for replacement and is therefore inversely proportional to its evolutionary conservation. Similarly, the S-value of a gene is the inverse of its propensity to form stable pairs. In real genomes, many functionally important genes form evolutionarily conserved operons; more generally, different measures of gene conservation are correlated14,15. Could a unified measure of conservation suffice to explain the gene content and gene order similarity in archaea? We examined a model in which the relative segment inversion rate for a gene pair is determined by the product of their T-values rather than independent S-values (‘inst+TP, corr shuffle’). This dependence between gene shuffling and replacement reduces gene order similarity and yields a likelihood peak that is absent when the turnover and inversion rates are independent. When the same genes that have slower turnover rates also form slow-inverting pairs, as in the ‘inst+TP, corr shuffle’ model, the likelihood of the gene order similarity improves significantly over the simple ‘inst+neutral’ model (Table 1, Fig. 3 and Supplementary Figs 14 and 15). This observation implies that the variation of turnover rates within the class of more slowly replaced genes is a salient feature of archaeal evolution. Under the optimal parameters of the model, for 88% (1 − fINST) of the archaeal genes, evolution since the last archaeal common ancestor involved, on average, 0.33 replacements and 0.60 segmental inversions per gene.

Dagan and Martin16 estimated a horizontal gene transfer (HGT) rate of 5.3 events per gene family and two to three times as many losses in a tree of 190 species. Although not directly commensurate with our estimate of gene turnover rates, an order of magnitude comparison is possible. In our data set of 166 genomes, the archaeal tree has a total branch length that is 35 times the average tree height, which is equivalent to 35 independent lines of descent from root to tips. Thus, an ancestral gene family from the slow-turnover class would be involved in 0.33 × 35 ≈ 12 events, a number comparable to the estimates of Dagan and Martin16.

We show here that evolution of gene content and gene order in archaea can be described, with high accuracy, by an unexpectedly simple model of gene replacement and shuffling with parameters that remain constant in time and between clades. The model includes only four parameters, namely the rates of gene turnover and gene shuffling, the fraction of genes with instantaneous turnover, and the shape parameter of the turnover rate distribution for the rest of the genes.

Although gene turnover alone is insufficient to explain the evolution of gene order, the turnover and shuffling rates do not have to be independent, with the latter being proportional to the former. Conceivably, for the majority of the genes in any evolving genome, about 88% according to the best estimate obtained here, both the turnover and shuffling rates are determined by the same integral characteristic of a gene that can be defined as its ‘status’ in the gene ensemble15. The ‘high status’ genes, typically, are involved in central, essential biological processes (primarily those of information transfer), are expressed at high levels, and participate in numerous interactions. Accordingly, they evolve slowly at the levels of sequence change, gene turnover and operon shuffling15,17,18. At least at longer timescales, the selective constraints on operonic organization, embodied in the variation of S-values, appear to explain the evolution of archaeal gene order, without invoking the ‘selfish operon’ hypothesis7,19 or any other complex scenario for the evolution of gene order.

The central result of this work is that evolutionary models, in which the high-turnover genes are included as a distinct class with instantaneous replacement, produce a far better fit to the empirical gene frequency distributions than models with a single smooth turnover rate distribution. The optimal size of the instantaneous replacement class, about 12% of the genes, matches the typical fraction of ORFans in archaeal as well as bacterial genomes4,20–22. The present results imply that the ORFans are qualitatively different from the rest of the genes and most likely evolve neutrally such that the fitness effects of gene acquisition and loss are negligible (‘zero status’ genes), so that the replacement rate of these genes is determined by the underlying mechanisms of DNA uptake, recombination and loss. The characteristic rates of these processes are orders of magnitude higher than the replacement rates for genes subject to purifying selection, hence the model with two gene classes. These findings are compatible with the hypothesis that the ORFans primarily originate from mobile genetic elements21–24 and, on average, do not measurably contribute to the fitness of the organism. The ORFans can thus be viewed as ‘passively selfish’ genes that are devoid of a cellular function and persist in the biosphere simply because the combination of the rates of horizontal transfer and loss of DNA is conducive to their survival25. This apparent selfishness of the ORFans does not preclude their occasional recruitment for various biological functions, which can result in the transition of a gene from the instantaneously replaced class to the slowly evolving one.

The ORFans represent the bulk of the genes in the microbial genomic universe. Considering their extremely rapid turnover revealed here, taking into account the available estimates of the total number of microbial species and extrapolating the present findings to bacteria, we roughly estimate the size of this universe. Conservatively assuming $10^{6}$ to $10^{7}$ microbial species26 (some recent estimates are as high as $10^{11}$ species27), an average genome size of 2,000 genes and 10% of ORFans without overlap between species, the genetic universe can be estimated to consist of at least $2 \times 10^{8}$ to $2 \times 10^{9}$ unique genes. This estimate is at least broadly compatible with the implications of other approaches, in particular the fact that many microbial species have ‘open’ pangenomes, as indicated by the lack of apparent saturation of the union of distinct genes with the addition of new genomes9,28–30.
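The lower bound quoted above follows from a simple product of the three stated assumptions (number of species, genes per genome, non-overlapping ORFan fraction). The symbol names below are introduced here for illustration only:

```latex
N_{\text{genes}} \ge N_{\text{species}} \times G \times f_{\text{ORFan}}
  = 10^{6} \times 2{,}000 \times 0.1 = 2 \times 10^{8},
\qquad
10^{7} \times 2{,}000 \times 0.1 = 2 \times 10^{9}
```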

In contrast to the ORFans, the more slowly evolving genes are best described by a distribution of turnover rates with a defined, finite mean and a variance of comparable magnitude. These genes, on average 88% in each genome but only a minuscule fraction of most supergenomes, let alone the entire genomic universe, apparently evolve under a range of selective constraints that make their replacement much less likely to be fixed in the evolving population compared to the genes of the instantaneous-turnover class. However, the special class of genes with an effectively infinite replacement rate is by far the most important feature of our model of microbial genome evolution. Indeed, a simplistic two-class model in which all genes in the second class have the same, finite replacement rate performs nearly as well as the model with a distributed replacement rate, although biologically, only the latter can be realistic.

Methods

Archaeal genome data set, orthologous gene sets and phylogeny

The set of 166 archaeal genomes (168 genomes from ref. 8 with the two Nanoarchaeota excluded due to the extreme genome reduction) was downloaded from the NCBI FTP site (ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/) in February 2014. All genes in these genomes were classified as either arCOG (archaeal cluster of orthologous genes) members or singletons8. The set of 13,433 arCOGs covers, on average, 90% of archaeal proteins.

The phylogeny of this set of genomes was assumed to follow the tree that was reconstructed from the concatenated alignment of 56 ribosomal proteins universal for Archaea and rooted between Euryarchaeota and the TACK superphylum8. To account for the unequal sampling of genomes in the set, genome weights were calculated from this tree as described previously8. Briefly, the total weight of the tree was distributed between its subtrees proportionally to the total branch length in each subtree; the process was recursively repeated towards the leaves.
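The recursive weighting described above can be sketched as follows. This is only an illustration, not the authors' code: the nested-dictionary tree format is hypothetical, and the sketch assumes that the "total branch length in each subtree" includes the branch leading to that subtree (otherwise a leaf attached directly to a node would receive zero weight).

```python
def total_length(node):
    """Branch length of the stem leading to `node` plus all branches below it."""
    return node["length"] + sum(total_length(c) for c in node["children"])


def genome_weights(node, weight=1.0, out=None):
    """Distribute `weight` among leaves, proportionally to subtree branch length."""
    if out is None:
        out = {}
    if not node["children"]:
        out[node["name"]] = weight
        return out
    lengths = [total_length(c) for c in node["children"]]
    total = sum(lengths)
    for child, length in zip(node["children"], lengths):
        # Assumption: split equally if all subtree lengths are zero.
        share = length / total if total > 0 else 1.0 / len(lengths)
        genome_weights(child, weight * share, out)
    return out
```

For a root with a leaf A on a branch of length 2 and a clade on a branch of length 1 containing leaves B and C (each of length 1), the subtree lengths are 2 and 3, so A receives 0.4 and B and C receive 0.3 each.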

Each genome was represented as an unordered set of unique individual orthologous genes (identified by arCOG assignments for arCOG members or by unique gene IDs for singletons) or as a set of unique unordered gene pairs (identified similarly). For the purpose of counting the gene pairs, all genome partitions were assumed to be circular and pairs of identical arCOGs were omitted (empirically, the majority of the runs of the same domains represent fragmented genes).

Similarity in gene content or gene order between two genomes A and B, represented by the sets of genes or gene pairs A and B, respectively, was calculated as the size of the intersection between the sets, normalized by the size of the smaller of the two sets:

$$S_{A,B} = \frac{|A \cap B|}{\min(|A|, |B|)} \tag{1}$$
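A minimal sketch of the similarity measure in equation (1), together with the extraction of unordered adjacent gene pairs from a circular genome. The function names and the representation of a genome as a list of gene labels are assumptions made for illustration.

```python
def similarity(a, b):
    """Size of the intersection normalized by the size of the smaller set."""
    a, b = set(a), set(b)
    smaller = min(len(a), len(b))
    if smaller == 0:
        raise ValueError("both sets must be non-empty")
    return len(a & b) / smaller


def gene_pairs(genome):
    """Unordered adjacent gene pairs of a circular genome.

    Pairs of identical labels are omitted, as described in the text.
    """
    n = len(genome)
    pairs = set()
    for i in range(n):
        x, y = genome[i], genome[(i + 1) % n]
        if x != y:
            pairs.add(frozenset((x, y)))
    return pairs
```

For example, the circular genomes a-b-c-d and a-c-b-d have identical gene content (similarity 1.0) but share only two of four adjacent pairs (similarity 0.5).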

Genomes and genome dynamics events

For evolution modelling, a genome is represented as a circular array of genes. A genome size of 2,000 genes (similar to the weighted average genome size of the archaeal set, 2,130 protein-coding genes) was used throughout this work. Each gene has a label and two intrinsic parameters, referred to as ‘T (Turnover)-value’ and ‘S (Shuffle)-value’, which remain constant for any given gene.

The model includes two types of elementary event: gene replacement (turnover) and segment inversion. When a replacement occurs, a gene in a genome is selected with a probability proportional to its T-value. This gene is replaced by another gene, which is drawn from a gene pool. The gene pool is effectively infinite (implemented as containing $2 \times 10^{9}$ different genes, so the probability of drawing a gene that is already present in a genome is $10^{-6}$), and the T- and S-values of the new gene are drawn from the corresponding distributions (see section ‘Rate distributions’). When an inversion occurs, two adjacent gene pairs are selected in the genome, each with a probability proportional to the product of the S-values of the two genes. The segment between the second gene of the first pair and the first gene of the second pair is then inverted.
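The two elementary events can be sketched as below. This is an illustrative sketch rather than the published implementation: genes are represented as hypothetical dictionaries with "T" and "S" keys, the genome is a Python list treated as circular, and the sketch assumes that the two breakpoint pairs must be distinct (the text does not say how a repeated draw is handled).

```python
import random


def replace_gene(genome, new_gene, rng):
    """Replace one gene, chosen with probability proportional to its T-value."""
    weights = [gene["T"] for gene in genome]
    index = rng.choices(range(len(genome)), weights=weights, k=1)[0]
    genome[index] = new_gene
    return index


def invert_segment(genome, rng):
    """Pick two breakpoint pairs and invert the genes between them."""
    n = len(genome)
    # Pair i joins gene i and gene i+1 (wrapping around the circle);
    # its weight is the product of the S-values of the two genes.
    weights = [genome[i]["S"] * genome[(i + 1) % n]["S"] for i in range(n)]
    first = rng.choices(range(n), weights=weights, k=1)[0]
    weights[first] = 0.0  # assumption: the second pair differs from the first
    second = rng.choices(range(n), weights=weights, k=1)[0]
    i, j = sorted((first, second))
    # Invert from the second gene of pair i to the first gene of pair j.
    genome[i + 1 : j + 1] = genome[i + 1 : j + 1][::-1]
    return i, j
```

Gene orientation is not tracked here, so an inversion only changes the order of the genes, which is all that the gene pair similarity depends on.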

The model genome evolves along the tree, diverging at the bifurcations, with the number of events determined by the tree branch length and the two distinct rates rT and rS for the gene turnover and gene shuffling. At the end of each run of the model, the similarities between the gene content and gene order of the genomes at the tree leaves were calculated as described above (equation (1)). For each set of model parameters, multiple independent runs were performed, resulting in as many similarity values for each pair of tree leaves. This array of similarities was compared with the observed similarity for the corresponding pair of genomes, and the log-likelihood of the observations, given the model, was calculated (see equations (2) to (4)).

Rate distributions

The T- and S-values of a gene determine the relative rate of gene replacement and segment inversion. Because genes with higher replacement rates, by definition, are preferentially selected for replacement, the distribution of replacement rates in a genome shifts relative to the distribution of replacement rates in the gene pool towards genes with lower rates. It has been shown31 that if the distribution of rates in the genome arrives at equilibrium, the probability density of the rate distribution in the genome, fG(x), and the probability density of the rate distribution in the gene pool, fP(x), are related to each other as fP(x) = C x fG(x), where C is a normalization constant.

It should be noted that the two distributions, fG(x) and fP(x), differ both in shape and in their mean values. In the case of the T-values, the difference in the means is irrelevant, because T-values are not rates themselves, but rather relative rates; that is, they do not directly determine at what rate a replacement of a particular gene will occur, but the probability that a particular gene will be selected for replacement if a replacement is occurring in the genome.

Traditionally, the distribution of the evolution rates is modelled using the gamma distribution with a mean of one32. Under this assumption, in our case extended to gene replacement and shuffling, the rates could be arbitrarily high, but the fraction of genes that evolve much faster than the mean drops exponentially. Neutralist reasoning assumes that the selection coefficients are distributed in the range from minus infinity to zero33 (and that the positively selected changes are so rare that they can be safely omitted from the model). This assumption implies a maximum evolutionary rate that corresponds to a zero selection coefficient; the genes that are subject to selective constraints evolve more slowly, so the rates are distributed in the range of 0 to rMAX.

Two alternative distributions were used to model the T- and S-values. One is based on the neutralist perspective and implies that the rates are limited from above. The gene replacement rates in a genome at equilibrium were taken to follow a truncated power (TP) distribution with the probability density $f_G(x) = a x^{a-1}$, $x \in [0, 1]$, with the parameter a (a > 0) determining the distribution shape. The T-values in the gene pool then follow a distribution with probability density $f_P(x) = (a + 1) x^{a}$. Alternatively, the traditional gamma distribution ($x \in [0, \infty)$) could be assumed for the genes in the genome, with the shape parameter α = a (a > 0) and the rate parameter β = α = a. In this case, the T-values in the gene pool also follow a gamma distribution with shape parameter α = a + 1.
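Both pairs of distributions are straightforward to sample. The sketch below uses inverse-transform sampling for the TP case (the genome and pool densities have cumulative distributions x^a and x^(a+1) on [0, 1]) and the standard library gamma sampler otherwise. The function names are hypothetical, and keeping the rate parameter of the pool gamma distribution equal to a is an inference from the relation fP(x) ∝ x fG(x), not something the text states explicitly.

```python
import random


def tp_genome(a, rng):
    """T-value from the genome TP density a * x**(a - 1) on [0, 1]."""
    return rng.random() ** (1.0 / a)


def tp_pool(a, rng):
    """T-value from the pool TP density (a + 1) * x**a on [0, 1]."""
    return rng.random() ** (1.0 / (a + 1.0))


def gamma_genome(a, rng):
    """Gamma with shape a and rate a (mean 1); gammavariate takes a scale."""
    return rng.gammavariate(a, 1.0 / a)


def gamma_pool(a, rng):
    """Gamma with shape a + 1; rate a is assumed to be unchanged."""
    return rng.gammavariate(a + 1.0, 1.0 / a)
```

Under these definitions the mean T-value is a/(a + 1) in the genome and (a + 1)/(a + 2) in the pool for the TP case, which illustrates the shift of the pool distribution towards higher rates.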

Evolution along a tree

Evolution of a model genome follows a tree. The phylogenetic tree of the 166 species of archaea described in the section ‘Archaeal genome data set, orthologous gene sets and phylogeny’ was used. With each run of the model, at the root of the tree the genome was initialized with M = 2,000 genes, with the T- and S-values following the assumed genomic distributions. At each branching point in the tree, two identical copies of the genome at the end of the parent branch were placed at the beginnings of the daughter branches. The number of events that occur along the branch was drawn at random from a Poisson distribution with the expectation of t × r × M, where t is the branch length and r is the rate per gene per unit of evolutionary time.
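The traversal described above can be sketched as a recursion over the tree. This is an illustration under stated assumptions: the nested-dictionary tree format, the `apply_event` callback and the function names are hypothetical, and the Poisson sampler (counting unit-rate exponential arrivals) is chosen here only because it needs nothing beyond the standard library.

```python
import random


def poisson(lam, rng):
    """Poisson draw: count unit-rate exponential arrivals in [0, lam)."""
    count = 0
    t = rng.expovariate(1.0)
    while t < lam:
        count += 1
        t += rng.expovariate(1.0)
    return count


def evolve(node, genome, rate, rng, apply_event, leaves):
    """Evolve `genome` along the branch to `node`, then recurse into children."""
    # Expected number of events on the branch: t * r * M.
    n_events = poisson(node["length"] * rate * len(genome), rng)
    for _ in range(n_events):
        apply_event(genome, rng)
    if not node["children"]:
        leaves[node["name"]] = genome
    for child in node["children"]:
        # Each daughter branch starts from its own copy of the parent genome.
        evolve(child, list(genome), rate, rng, apply_event, leaves)
```

With a rate of zero no events occur, so every leaf ends with a copy of the root genome; this is a convenient sanity check before plugging in real event functions.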

Gene replacements were modelled one by one, with each new gene extracted from the gene pool with its T-value drawn from the corresponding pool-specific distribution and S-value drawn from the common distribution for the pool and the genome. Selection of a gene for the next replacement was performed taking into account the T-value of the newly acquired gene. Segment inversions were performed after the replacements.

Genes that belong to the instantaneously replaced class were modelled by setting their T- and S-values to 0. This excluded them from the explicitly modelled replacement and shuffling. At each leaf of the tree, all such genes were replaced by genes drawn from the pool. This effectively guaranteed that none of these genes is shared between the genomes at the end of the simulation and therefore that they form no shared gene pairs.

Obviously, the performance of the model hinges on the correct representation of the relationship between the similarity of gene content or gene order and the phylogenetic distances separating the respective genomes. The former results directly from a comparison between extant genomes, whereas the latter also depends on the inferred phylogeny of the respective species. Therefore, in principle, the results of the analysis are tree-dependent. At least in the case of prokaryotes, the more distantly related the organisms are, the less confident we can be about the topology of the phylogenetic tree34. To assess the extent of the tree-dependence of the model, we explored the effect of perturbations in the deep relationships between the lineages on the connection between the phylogenetic distance and genome similarity. Scrambling the tree within a distance of 0.33 from the root (roughly preserving most of the within-family branching order but randomizing the tree topology above the family level) preserves about 78% of the relationship between the gene content similarity and the phylogenetic distance, and about 87% of that for the gene order similarity (Supplementary Table 1). Thus, there is only a limited effect of the topology in the deep, least reliable portion of the tree on the dependence between phylogenetic distance and genome similarity, with the implication that the fit of the evolutionary model to this empirical dependence is largely tree-independent.

Likelihood of the data given the model

At the end of each run of the model, the similarities between the gene content and gene order of the genomes at the tree leaves were calculated as described using equation (1) and recorded. For each set of the model parameters, 100 or 1,000 independent runs were performed, resulting in 100 (1,000) similarity values for each pair of tree leaves. This array of similarities si,j (similarity between the ith and the jth genomes), obtained by simulated evolution, was compared with the observed similarity for the corresponding pair of genomes Si,j:

$$L(S_{i,j}, s_{i,j}) = \frac{1}{100} \sum_{k} N(S_{i,j}, s^{k}_{i,j}, \sigma^{2}) \tag{2}$$

where L(Si,j, si,j) is the likelihood of the observation Si,j given the simulated array si,j, $s^{k}_{i,j}$ is the similarity between the ith and the jth genome in the kth run, and N(·) is a Gaussian kernel function, that is, the normal probability density of a distribution with μ = $s^{k}_{i,j}$ and σ² = Var si,j, evaluated at the point Si,j.
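Equation (2) amounts to a Gaussian kernel density estimate evaluated at the observed similarity. A minimal sketch follows; the function name is hypothetical, the average is taken over however many runs are supplied (the equation is written for 100), and the use of the population variance is an assumption, since the text does not say which variance estimator was used.

```python
import math
from statistics import pvariance


def kernel_likelihood(observed, simulated):
    """Mean of normal densities centred on each simulated value."""
    var = pvariance(simulated)  # assumption: population variance of the runs
    if var <= 0.0:
        raise ValueError("simulated similarities must not all be identical")
    norm = 1.0 / math.sqrt(2.0 * math.pi * var)
    total = sum(
        norm * math.exp(-((observed - s) ** 2) / (2.0 * var)) for s in simulated
    )
    return total / len(simulated)
```

An observation lying within the cloud of simulated values yields a much higher likelihood than one lying far outside it, which is what drives the comparison of models.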

Individual genome comparisons are strongly non-independent of each other because genomes share a common evolutionary history (the closer the genomes the greater part of their history is shared). Therefore, to calculate the likelihood of the entire data set, genome weights should be taken into account. Specifically, the log-likelihood for the model was calculated as

$$\mathrm{LL}(\text{observations} \mid \text{model}) = \sum_{i,j} W_{i,j} \log L(S_{i,j}, s_{i,j}) \tag{3}$$

where L(Si,j, si,j) is calculated as described above and Wi,j is the weight for the comparison of the ith and the jth genomes.

For a pair of genomes, the weight of the comparison was calculated as follows. Consider the last common ancestor of the ith and the jth genome; by definition, the two genomes descend from the two different branches X and Y that descend from this ancestor. Branches X and Y define two sets of genomes, X and Y, which descend from them. The weight for the comparison of the ith and the jth genome is defined as

$$W_{i,j} = \frac{w_i w_j}{\sum_{k \in X,\, l \in Y} w_k w_l} \tag{4}$$

where wi is the individual weight of the ith genome. This weighting scheme normalizes comparisons that involve all genomes descending from sister branches to the sum weight of 1, regardless of their number, ensuring equal contributions from each internal node to the total likelihood.
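The normalization in equation (4) can be illustrated with a short sketch. The function name and the dictionary of individual genome weights are hypothetical; the two arguments `set_x` and `set_y` stand for the genomes descending from the two sister branches.

```python
def comparison_weights(w, set_x, set_y):
    """Weights W[i, j] for all comparisons across two sister clades."""
    total = sum(w[k] * w[l] for k in set_x for l in set_y)
    return {(i, j): w[i] * w[j] / total for i in set_x for j in set_y}
```

By construction, the weights of all comparisons spanning one internal node sum to 1, regardless of how many genomes descend from each branch.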

Under this model, genome rearrangement via segment inversion occurs independent of gene replacement and does not affect the gene content, whereas gene replacement affects gene order conservation by breaking up ancestral gene clusters. Empirically, the gene pair similarity data are much ‘noisier’ than the data on gene content similarity (for 100 genome pairs separated by tree distances that are close to the median across all pairs, the standard deviation of the gene pair similarity is twice as large as that for the gene content). This difference between the two rate distributions implies that there is considerably more transient, clade-specific variation in the shuffling rate than in the turnover rate. Biologically, evolution of the gene complement is a much more important process, as the set of genes defines the capabilities of an organism, while the preservation of the proper gene order (and, therefore, quantitative characteristics of the gene shuffling process) clearly is of secondary importance. We therefore chose to optimize the model parameters for gene replacement first (with shuffling rate set to 0) and then optimize the shuffling rate with the turnover parameters fixed.

Likelihood peaks

LL(observations | model) values are subject to fluctuations because they are produced using a finite number of simulations. The model producing the best match to the observed data thus cannot be found simply by taking the set of parameters with the highest LL value. When a more careful comparison of models producing similar likelihoods was required, the following procedure was used.

The observed LL values near the peak were approximated as one- or two-dimensional quadratic polynomials with the log of the parameters as the arguments. The location and LL value of the peak were estimated from the coefficients of the polynomial. The root mean square of the residuals around the peak (11 values surrounding the observation closest to the peak for the one-dimensional fit, and 9 values for the two-dimensional fit) was used to estimate the standard error of the peak LL value. The range of the parameters corresponding to the height of the log-likelihood surface up to two standard errors below the peak was used to approximate the 95% confidence interval for the parameter values.
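For the one-dimensional case, the peak estimate can be sketched as an ordinary least-squares fit of a parabola in the log of the parameter. This is an illustrative sketch, not the authors' code: the function names are hypothetical and the normal equations are solved directly to keep the example free of external libraries.

```python
import math


def solve3(m, v):
    """Solve a 3x3 linear system by Gaussian elimination with pivoting."""
    a = [row[:] + [rhs] for row, rhs in zip(m, v)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, 3):
            f = a[r][col] / a[col][col]
            for c in range(col, 4):
                a[r][c] -= f * a[col][c]
    x = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        tail = sum(a[r][c] * x[c] for c in range(r + 1, 3))
        x[r] = (a[r][3] - tail) / a[r][r]
    return x


def fit_peak(params, ll_values):
    """Fit LL ~ c0 + c1*x + c2*x**2 with x = log(parameter); return the peak."""
    xs = [math.log(p) for p in params]
    sums = [sum(x ** k for x in xs) for k in range(5)]
    m = [[sums[i + j] for j in range(3)] for i in range(3)]
    v = [sum(y * x ** i for x, y in zip(xs, ll_values)) for i in range(3)]
    c0, c1, c2 = solve3(m, v)
    if c2 >= 0.0:
        raise ValueError("fitted parabola has no maximum")
    x_peak = -c1 / (2.0 * c2)
    ll_peak = c0 - c1 ** 2 / (4.0 * c2)
    resid = [y - (c0 + c1 * x + c2 * x * x) for x, y in zip(xs, ll_values)]
    rms = math.sqrt(sum(r * r for r in resid) / len(resid))
    return math.exp(x_peak), ll_peak, rms
```

The returned root mean square of the residuals plays the role of the standard error of the peak LL value described above.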

Quality of fit

The similarity of the gene content and gene order between a pair of genomes is expected to diminish with the phylogenetic distance separating the genomes, and the empirical data indeed display this trend (Figs 2 and 3). The noise, originating both from random evolutionary events and contingencies and from the imperfect match between the model and reality (for example, probably not all lineages evolve at the same rates), limits the maximum quality of fit that is possible between the model and the data. To estimate the upper limit of the goodness of fit between the model and the empirical data that can be obtained using a smooth curve, we obtained the local polynomial regression fit of the gene content and gene order similarity (in log scale) and the distance between the genomes along the tree using the loess function of the R package35. The residual sum of squares for the loess fit estimates the maximum attainable quality of fit, whereas the sum of squares of differences from the mean provides the baseline (no fit) estimate. The residual sum of squares (RSS) for a model fit can be scaled between these two values to assess the relative quality of fit for a model.
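One way to read the scaling described above is as the fraction of the attainable RSS reduction that a model achieves. The sketch below encodes that reading; the function names and the exact form of the ratio are an interpretation of the text rather than a formula given by the authors.

```python
def rss(observed, predicted):
    """Residual sum of squares between observed and predicted values."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted))


def relative_fit(rss_model, rss_baseline, rss_loess):
    """Scale a model RSS between the no-fit baseline (0) and the loess fit (1)."""
    attainable = rss_baseline - rss_loess
    if attainable <= 0.0:
        raise ValueError("loess RSS must be smaller than the baseline RSS")
    return (rss_baseline - rss_model) / attainable
```

For instance, with a baseline RSS of 100 and a loess RSS of 20, a model with an RSS of 40 achieves 60 of the 80 attainable units, that is, 75% of the best possible reduction.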

We assessed the relative quality of fit obtained with the ‘inst+TP, corr shuffle’ model by comparing the RSS of the model prediction to that of the original empirical data and the RSS of the local polynomial regression fit. The reduction in RSS provided by the model is within 77% of the best possible result for the gene content and 88% for the gene order prediction (Supplementary Table 1 and Figs 2 and 3). Thus, although, potentially, there is some room for improvement, the current model captures most of the information contained in the relationship between the phylogenetic distance and genome similarity.

Tree dependence

The topology and the branch lengths of the species tree dictate the phylogenetic distance between the genomes and, accordingly, the expected similarity between their gene content and gene order. Errors in the inference of the species tree result in incorrect estimates of the phylogenetic distances and reduce the predictive power of the model that is based on the imperfect tree. To assess the dependence of the model fit on the tree perturbation, the following procedure was performed. The species tree was scrambled by multiple random swaps of clades within a given distance Hmax from the root. Scrambling was implemented by cutting the tree at a distance Hi and randomly reattaching tips to the parent nodes, where Hi spans the range from 0 to Hmax. The distances between the leaves whose parent nodes lie at the distance H > Hmax from the root remain intact, whereas the distances between more distantly related leaves are affected. The set of genome-to-genome distances from the scrambled tree was superimposed with the empirical gene content and gene order similarities, and the strength of the retained relationship was assessed using the local polynomial regression (loess).

Code availability

All computer code developed for the analysis reported in this study is available at ftp://ftp.ncbi.nih.gov/pub/wolf/_suppl/archgen/.

Data availability

The data used for the analysis reported in this study are publicly available at ftp://ftp.ncbi.nlm.nih.gov/pub/wolf/COGs/arCOG/.