Introduction

Climate models are routinely applied to situations outside of the regimes in which they have been evaluated during their development cycle. For example, in the framework of the Coupled Model Intercomparison Project Phase 6 (CMIP6) and the Intergovernmental Panel on Climate Change (IPCC), models are used to project future climates under CO2 concentrations substantially higher than those of the recent observational period.

However, there is potential for traditional model evaluation and development to be expanded to utilise proxy data associated with paleoclimate states e.g.1,2,3,4,5,6. In particular, paleoclimate model simulations test model behaviour under a wide range of forcings, which encompass those expected in the timescale of the next few centuries and beyond7,8. The underlying philosophy is that we would expect to have more confidence in future predictions from a model which has successfully simulated both past and modern climate states, than future predictions from a model which has only successfully simulated the modern climate state.

Here we focus on three time periods, chosen firstly because they were subject to substantial CO2 forcing relative to preindustrial, so are of most direct relevance to future projections, and secondly because they have been part of ongoing international modelling efforts in the framework of the Paleoclimate Modelling Intercomparison Project (PMIP)9, so have simulations available from a variety of different climate models. The time periods are (i) the Last Glacial Maximum (LGM, 21,000 years ago), with a CO2 concentration of ~180 ppmv e.g.10 (compared to ~280 ppmv prior to industrialisation, and ~420 ppmv today), and an increase in ice sheet area and volume compared to today, in particular in the Northern Hemisphere e.g.11, (ii) interglacial KM5c within the mid-Pliocene warm period (MPWP; ~3.2 million years ago), with a CO2 concentration of ~400 ppmv e.g.12, and reduced Greenland and Antarctic ice sheets compared with today e.g.13, and (iii) the early Eocene climatic optimum (EECO; ~53.3–49.1 million years ago), with CO2 concentrations of ~1500 ppmv e.g.14, and no ice sheets. In general, older time periods have fewer locations with proxy data, and greater uncertainty in the proxy data that is available.

When evaluating climate models for the purposes of assessing their ability to project the future, the general approach is to focus on properties of the climate system that are routinely used to quantify the magnitude of future climate change, and which are robust inherent features that persist across a range of climate states15,16. It is also useful to evaluate properties that are determined by the combined effect of multiple components of the climate system (e.g. atmosphere, ocean, cryosphere) so that the integrated effect of the whole system can be assessed. Here, we focus on three large-scale properties: global mean surface temperature, polar amplification, and land–sea warming contrast. Global mean surface temperature (GMST) is the most fundamental metric and is a key focus of international agreements to limit global mean warming e.g.17. Changes in GMST are determined by processes throughout the atmosphere, ocean, and land surface; changes in GMST forced by CO2 alone can be quantified by the Equilibrium Climate Sensitivity (ECS)18. Polar amplification is also a key component of the climate system; the Arctic is currently warming at between 2 (ref. 19) and 4 (ref. 20) times the rate of the global mean, with associated impacts including sea level rise21. Polar amplification is determined by a range of processes22, including changes in heat transport23, sea ice/snow feedbacks24, and lapse-rate feedbacks25. Land–sea warming contrast has also been observed over the last 150 years, with 1.6 °C warming over land areas compared with 0.9 °C warming of SSTs, associated with a 1.1 °C GMST warming over the same period26. Land–sea warming contrast is associated with changes to the hydrological cycle and atmospheric circulation e.g.27,28, and the thermal contrast between land and ocean plays a role in monsoon circulations29.

Although these metrics are straightforward to define and quantify in a purely modelling or conceptual framework, estimating them from paleoclimate proxy records is challenging given their sparse distribution and large uncertainties e.g.30. This complicates model-data comparison and means that quantification of model improvements over time is problematic. Here we make use of assessed GMST estimates from the IPCC26, and additionally provide site-specific definitions for all the metrics, that are straightforward to apply in a paleo context (see Online methods, sections “Proxy datasets” and “Definition of metrics”), and apply the metrics to existing simulations from the fourth and third phase of the Paleoclimate Modelling Intercomparison Project (PMIP4, PMIP3). In doing so, we provide a benchmark for paleoclimate model simulations and assess improvements over time, including in some of the very latest CMIP6 models.

Results

The spatial patterns of ensemble-mean (see Online methods, section “Model simulations”) modelled surface temperature change (near-surface air temperature and SST) are shown in Fig. 1, along with paleoclimate proxy estimates at the locations for which they are available (see Online methods, section “Proxy datasets”). In general, the sparsity of the proxy data increases further back in time. An exception is the terrestrial MPWP data, which is more sparse than the (earlier) EECO; this is because of the relatively narrow time period that is used in the Pliocene terrestrial reconstruction (a window of 30 kyr in the MPWP31 compared with 4120 kyr in the EECO32; see Discussion section). Polar amplification (more warming in the polar regions than the tropics under increasing CO2) and land–sea warming contrast (more warming over land than over ocean under increasing CO2) are qualitatively apparent for all three time periods. However, in order to quantify these features in proxies and models, and to assess model-data agreement, quantitative metrics are required that account for the relative sparsity of the paleo proxy data. Here, we define and use two forms of metrics: firstly, ‘true’ metrics based on the globally defined fields, and secondly ‘site-specific’ metrics, which are defined according to a particular paleo proxy dataset and calculated according to the locations of the proxies (see Online methods, sections “Proxy datasets” and “Definition of metrics”).

Fig. 1: Patterns of model and proxy temperature change relative to preindustrial.

Patterns of a, c, e near-surface air temperature (SAT), and b, d, f sea surface temperature (SST), in paleo proxies and models of the a, b last glacial maximum (LGM), c, d the mid-Pliocene warm period (MPWP), and the e, f early Eocene climatic optimum (EECO). Modelled ensemble-mean temperature anomalies compared with pre-industrial are shown in the background colours. Proxy near-surface air temperatures and SST anomalies are shown as coloured circles (see Online methods, section “Proxy datasets”). Note the differing colour scales for each map.

Global mean surface temperature (GMST)

The true GMST metric (\({}^{l,p,e}\Delta T^{t}\)) is shown in Fig. 2, for models and observations (see Online methods, sections “Proxy datasets” and “Definition of metrics”), for the three paleo time periods, and also for the Historical (1850–2014) and post-1975 (1975–2014) periods. The paleoclimate observed true GMST metrics are assessed values from the IPCC26; the equivalent site-specific global SAT and SST modelled and observed metrics (\({}^{l,p,e}\Delta T^{s}\)) are shown in Supplementary Information, Fig. S1. First of all, it is interesting to note that in the observations, the ratio of mean temperature change to uncertainty in this change (i.e. the signal-to-noise ratio) is similar across the five time periods (Fig. 2, black circles and vertical error bars). The LGM has the largest signal-to-noise ratio for GMST, even larger than the historical record, indicating that it may be the most stringent target for model-data comparisons. This is associated with the fact that the LGM has a greater density of proxy data sites than the other paleo time periods. It is also important to note that the LGM has less uncertainty in the forcing boundary conditions than the other two paleo time periods (in particular CO2, for which ice core records e.g.10,33 give more accurate and precise values than is possible for the MPWP or EECO, where only indirect CO2 proxies are available). As such, the uncertainty in the GMST sensitivity to forcing for the Pliocene and EECO compared to the LGM is greater than would be implied from the uncertainties in GMST alone. However, the 5–7 °C IPCC assessment of LGM GMST cooling may be overly narrow; recent work has suggested a central GMST estimate of 4.5 °C of cooling (Fig. 2, black open circle and dashed range)34.

Fig. 2: Model and proxy global mean temperature change relative to preindustrial.

Global mean true surface temperature (GMST) anomaly, \({}^{l,p,e}\Delta T^{t}\), in models and observations from five time periods. a post-1975, b Historical, c Last glacial maximum (LGM, l), d mid-Pliocene warm period (MPWP, p), and e early Eocene climatic optimum (EECO, e). Light grey circles show CMIP6/PMIP4 models with ECS in the very likely range as assessed by Forster et al.18; models in red have an ECS greater than the assessed very likely range (>5 °C); models in blue have an ECS lower than the assessed very likely range (<2 °C). Dark grey large circles show the multi-model ensemble mean for CMIP6/PMIP4. Dark grey small circles show the multi-model ensemble mean for CMIP5/PMIP3. Black circles and very likely ranges show the IPCC-assessed temperature anomaly derived from observations26. For the LGM, the black open circle with a dashed very likely uncertainty range shows the GMST anomaly estimate from Annan et al.34. The historical anomaly in models and observations is calculated as the difference between 2005–2014 and 1850–1900, and the post-1975 anomaly is calculated as the difference between 2005–2014 and 1975–1984. For the LGM, MPWP and EECO, modelled temperature anomalies are compared with pre-industrial. The square symbol denotes the five simulations carried out by CESM2, and the triangle symbol denotes the three simulations carried out by CESM1.2. A version of this figure with all models labelled is in the Supplementary Information, Fig. S5, and all the models in this plot are listed in order of GMST in the Supplementary Information, Tables S1–S5. A similar plot of the paleo time periods, but also showing the site-specific metric, \({}^{l,p,e}\Delta T^{s}\), is shown in Supplementary Information, Fig. S1.

For each paleo time period, the multi-model mean GMST metric sits within the observed range, which is quite remarkable given that from the LGM to EECO this represents a temperature range of about 20 °C. However, the spread across the ensemble is relatively large, and many individual models sit outside the observed range (78%, 65%, 29% for the LGM, MPWP, and EECO respectively).

Previous studies have not always found a clear correlation between modern ECS and paleo GMST e.g.35,36. Although the ECS is not available for every model in this study, there is some indication that models with an ECS that is known to be greater than the IPCC-assessed range of 2–5 °C simulate too great a change in the paleo time periods (red dots in Fig. 2c–e). Similarly, models with an ECS that is known to be lower than this range simulate too small a change in GMST in the paleo time periods (blue dots in Fig. 2c–e). Only one model, CESM2, carried out simulations across all five time periods. Apart from that, CESM1.2 is the only model that carried out simulations across all three paleo time periods. The results from these two models, highlighted in Fig. 2, indicate consistency in relative GMST changes across the paleo time periods for a particular model. However, more models carrying out simulations across multiple paleo time periods would allow this to be explored further, and allow emergent constraints on ECS37 from multiple time periods to be developed. This would also require all PMIP models to carry out 4 × CO2 simulations alongside their paleo simulations in order to calculate their ECS.

It also appears that both high and low ECS models can simulate the Historical period in good agreement with observations (Fig. 2b), and low ECS models can simulate the post-1975 warming (Fig. 2a). Therefore, paleoclimates may be a better discriminator of high- and low-ECS models than the observational periods (which is consistent with findings from an assessment of ECS that included paleoclimate evidence38). This may be because the paleoclimate simulations are close to equilibrium with the CO2 forcing, whereas the Historical simulations are transient and, as such, have a GMST that is influenced by a transient pattern effect e.g.39, and/or it may be related to uncertainties in the aerosol forcing over the historical period40. However, more paleo simulations are required to further confirm this relation. In particular, there is a need for more paleo model simulations to be carried out with the same models that carry out the Historical CMIP simulations (this lack of consistency between the CMIP6 and PMIP4 model ensembles arises, at least in part, from the long integration lengths required for full equilibrium of paleoclimate simulations).

It is also apparent that for all three paleo time periods, there has been an improvement in the modelled GMST in the PMIP4/CMIP6 paleoclimate model simulations compared with the previous CMIP5/PMIP3 simulations (large versus small dark grey dots in Fig. 2c–e). This improvement is likely due to a combination of updated boundary conditions, and improvements to the models themselves. Key changes in boundary conditions in PMIP4 compared with PMIP3 include updated ice sheets for the LGM41, updated palaeogeography and representation of ocean gateways for the Pliocene42, and a consistent experimental design for the EECO including a new palaeogeography43. It is harder to robustly identify particular model improvements that may be relevant, because there is no clear lineage between the models in PMIP3 and PMIP4, but, for some models at least, improvements in model representation of cloud microphysics are playing an important role e.g.44,45.

Polar amplification

The site-specific polar amplification metric (\({}^{l,p,e}\Delta P^{s}\); see Online methods, section “Definition of metrics”) is shown in Fig. 3a. Because the MPWP and EECO are warmer than the preindustrial whereas the LGM is colder, the observed site-specific metric from proxies is positive for the EECO and MPWP but is negative for the LGM (black circles in Fig. 3; in the Online methods, see the subsection “Proxy datasets” for a description of how the error bars are calculated). For all three time periods, this indicates a polar amplification associated with increasing temperature (i.e. a decrease in meridional temperature gradient with increasing temperature).

Fig. 3: Metrics of polar amplification and land-sea warming contrast.

Site-specific metrics for a SST polar amplification (\({}^{l,p,e}\Delta P^{s}\)) and b land–sea warming contrast (\({}^{l,p,e}\Delta L^{s}\)), for last glacial maximum (LGM, l), mid-Pliocene warm period (MPWP, p), and early Eocene climatic optimum (EECO, e). Black circles and very likely ranges show the observed site-specific metric (s), dark grey circles show the model ensemble mean site-specific metric (large circles for CMIP6/PMIP4 and small circles for CMIP5/PMIP3), and light-grey/red/blue circles show the individual CMIP6/PMIP4 model site-specific metric. The EECO observed metric shown with an open circle and dotted error bar excludes SST data from the southwest Pacific. All metrics are calculated relative to the preindustrial. See Supplementary Information, Fig. S2, for a version that also includes the true metrics.

For the LGM, the proxies indicate a site-specific SST polar amplification of about −0.4 °C, whereas the model ensemble mean indicates a greater amplification of −0.7 °C (large dark grey circles in Fig. 3a). The proxy value sits within the model range, but the model range is large compared with the uncertainty range from the proxies, from 0.1 °C (IPSLCM5A2) to −1.4 °C (CESM2). For the MPWP and EECO, the polar amplification indicated by the proxies is greater than in any of the models, although for the MPWP two models do get close to the observed value of 1.7 °C and are within the uncertainty range of the proxy metric. For the EECO, the model-data disagreement is much starker, with nearly double the polar amplification in the proxies (12 °C) than in the model with the greatest value (CESM2; 7 °C). This discrepancy is primarily because of exceptionally warm proxy temperatures in the southwest Pacific. Many reasons for possible warm biases in the proxy temperatures in this region have been proposed, including a seasonal bias in mid- and high-latitude SST proxies32,46, and/or uncertainties in the functional form of different paleo-temperature proxies (e.g., TEX86) in the upper-temperature range47,48. Since data from this region represent a large number of the high latitude records available from the EECO, they bias the proxy-based metric towards extremely high values. With the SSTs from the southwest Pacific excluded, the proxy polar amplification decreases from 12 to 4 °C, and the model and data are in closer agreement (see Supplementary Information, Fig. S2a). Note that our site-specific proxy-based metrics are not comparable with previous estimates of Eocene polar amplification e.g.44,49, which were based on Mg/Ca estimates of deep ocean temperatures, and designed to be comparable with true model metrics.

There has been little change in the ensemble mean LGM or EECO SST polar amplification between PMIP4 and PMIP3 (Fig. 3a, compare large and small dark grey circles), although improvements in cloud parameterisations since PMIP3 have been shown to improve the simulation of polar amplification in the EECO for individual models44,50. However, for the Pliocene, there has been a substantial improvement. At least some of this improvement is likely related to the closure of the Bering Strait in the PMIP4 experimental design, which has been shown to increase Pliocene temperatures in the North Atlantic51. However, the proxies still indicate greater amplification than the models (0.8 °C for PMIP4 and 0.25 °C for PMIP3, compared with 1.7 °C in the proxies).

For all three time periods, the site-specific polar amplification metric (\({}^{l,p,e}\Delta P^{s}\)) has a similar value to the true metric (\({}^{l,p,e}\Delta P^{t}\)) for most models (see Supplementary Information, Fig. S2a). Across the ensemble, the true metric is greater than the site-specific metric in the MPWP (by 0.05 °C), and less than the site-specific metric in the EECO (by 0.4 °C), indicating that despite the sparsity of the proxy data, there is enough data for the site-specific polar amplification metric to be meaningful. The exception is the CESM2 model at the LGM (red dot and star in the LGM panel of Supplementary Information, Fig. S2a), where the site-specific metric (−1.4 °C) is very different from, and even of opposite sign to, the true metric (0.3 °C). This is because, although the CESM2 LGM ΔT metric is greater than that of any other model (Fig. 2), the LGM polar SSTs cannot drop below the freezing point of seawater, resulting in relatively low polar amplification in the true metric (see Supplementary Information, Fig. S3b).

There is not enough proxy SAT data in the tropics to define an SAT polar amplification metric for the MPWP or the EECO, and there is not enough data in the Southern Hemisphere to define a global SAT polar amplification metric for the LGM. However, it is possible to quantify the absolute changes in high-latitude SATs for all three time periods (see Supplementary Information, Fig. S4a–c), and for the LGM a Northern Hemisphere-only polar amplification metric can be defined (see Supplementary Information, Fig. S4a). This shows that the Northern Hemisphere LGM polar amplification is very well simulated by the PMIP4 model ensemble mean (−4.1 °C) compared with the proxies (−4.2 °C). For the Pliocene, the model ensemble is in general colder than the proxies in the Northern Hemisphere high latitudes, related to less warmth in the Eurasian and North American continental interiors than indicated by the proxies. It has been suggested that the warm proxy temperatures in this region may be related to seasonal biases and/or the lack of modern analogues for the associated pollen records52. For the EECO, the Southern Hemisphere high-latitude terrestrial temperatures are well simulated by the ensemble mean, which further supports the suggestion that the southwest Pacific proxy SSTs are biased too warm. For the Northern Hemisphere, the models simulate a greater polar amplification than the proxies, but this is largely due to a set of proxy temperatures at 45°N in North America, which are relatively cold and may be influenced by the local topography of the Rockies.

Land-sea warming contrast (LSWC)

The site-specific land–sea warming contrast (LSWC) metrics (\({}^{l,p,e}\Delta L^{s}\)) are shown in Fig. 3b. The proxies indicate a negative (positive) LSWC for the LGM (MPWP), indicating that for both these time periods the land surface SAT warms more than the ocean SST under warming GMST. However, for the EECO the proxies indicate a negative LSWC under warming GMST. Again, this is related to the exceptionally warm southwest Pacific proxy SSTs, and discounting SSTs from that region results in a positive LSWC for the EECO (see open circle and dotted error bars in Fig. 3b, and see Supplementary Information, Fig. S2b). The terrestrial proxies for the Eocene are from a wider time window (56.0–47.8 Ma) than the marine proxies (53.3–49.1 Ma)32, and in many cases have uncertain paleoaltitude, and so this may also be playing a role. For both the LGM and MPWP, the model ensemble has a lower magnitude LSWC than the proxies, and this discrepancy is greater in the PMIP4/CMIP6 models than in the PMIP3/CMIP5 models. For the MPWP, the proxy SAT locations are all in the mid-latitudes of the Northern Hemisphere, and as discussed above, in this region the models simulate colder temperatures than indicated by the proxies (see Supplementary Information, Fig. S4b); it is this temperature discrepancy that leads to the discrepancy in land–sea warming contrast. The model site-specific and true metrics differ from each other quite considerably (see Supplementary Information, Fig. S2b), with the true metrics being lower than the site-specific metrics for all time periods by 70%, 50%, and 40% for the LGM, MPWP, and EECO, respectively.

Discussion

There is a remarkable relationship between the modelled GMST metric, ΔT, and the polar amplification metric, ΔP, across the three time periods, in both the site-specific and true metrics (Fig. 4a). This is also supported in the proxies, in particular when the southwest Pacific sites are excluded from the EECO; in this case, both models and proxies point to an approximately linear relationship between the two metrics. The fact that this relationship is so linear is surprising given the greatly reduced (or non-existent) sea ice in the EECO, indicating that other mechanisms of polar amplification (for example related to cloud feedbacks) are compensating for each other across different time periods, resulting in the linear relationship. This relationship is also seen in proxy estimates of global mean temperature and meridional temperature gradient from across the last 95 million years53.

Fig. 4: Relationship between global mean surface temperature, polar amplification, and land-sea warming contrast.

Relationship between metrics for a GMST (\({}^{l,p,e}\Delta T^{t,s}\)) and polar amplification (\({}^{l,p,e}\Delta P^{t,s}\)), and b GMST and land–sea warming contrast (\({}^{l,p,e}\Delta L^{t,s}\)), for the last glacial maximum (LGM; blue, l), mid-Pliocene warm period (MPWP; orange, p), and early Eocene climatic optimum (EECO; red, e). Large circles and very likely ranges show the observed site-specific metric (s), small circles show the model site-specific metric for all CMIP6/PMIP4 models, and stars show the true model metric (t) for all CMIP6/PMIP4 models. The square shows the preindustrial. The EECO observed metric shown with an open circle excludes SST data from the southwest Pacific.

In the models, there is also a clear relationship between the GMST metric and the LSWC metric (Fig. 4b). In this case, there is a non-linear relationship, with LSWC increasing at lower GMST, but then flattening out under the high temperatures of the EECO. This relationship, including saturation, is consistent with a theory based on contrasting surface humidities and lapse rates over land and ocean28. The LGM proxy data is consistent with this relationship, but Pliocene LSWC in the proxies is greater than in the models, even accounting for the error bars in the proxy metric. In the EECO, the proxies indicate a complete reversal in this relationship, but when the EECO southwest Pacific sites are excluded again, the models and proxies are more consistent, especially accounting for the large error bars of the EECO proxy estimates of GMST and LSWC.

In this paper, we have used metrics derived from paleo proxy data to evaluate climate model simulations of the LGM, MPWP, and EECO. We find that model ensemble mean GMSTs are in exceptionally good agreement with the proxy data for all three paleo time periods, and that this agreement has improved in CMIP6/PMIP4 compared to CMIP5/PMIP3. The LGM is shown to be a very stringent target for model evaluation and development due to its large signal-to-noise ratio and well-defined boundary conditions. There are indications that model evaluation using the paleo proxy record can be a better discriminator of models with very high or very low climate sensitivity than using the Historical observational period. Models also simulate polar amplification, and the relationship between GMST and polar amplification, in reasonable agreement with proxies. However, there are uncertainties associated with the proxy records (i) in the MPWP, within the Northern Hemisphere continental interiors, and (ii) in the EECO, particularly in the southwest Pacific. In addition, some proxy terrestrial sites are from high-elevation regions that are not resolved in the models, or, for the EECO, are from regions for which the paleoelevation is uncertain. Furthermore, the relatively wide temporal window of the EECO (~4.1 Myr) means that the proxy signal is affected by orbital forcing and temporal variations in CO2. All of these proxy uncertainties should be further explored in future work in order to maximise the utility of the paleoclimate proxy record for model development. Land–sea warming contrast is reasonably well simulated at the LGM, but less so at the MPWP and EECO. The models indicate an increasing but saturating relationship between GMST and LSWC, consistent with theory.

Overall, the paper provides a framework for paleo model evaluation that can be used for future model development in the framework of CMIP7 and beyond6,8,54. The framework also provides a traceability to previous model generations, allowing a robust assessment of model improvements over time, through successive model development cycles.

Online methods

Model simulations

The most recent experimental designs for the three time periods above are described in detail in ref. 41 for the LGM, ref. 42 for the MPWP, and ref. 43 for the EECO. These experimental designs describe standard boundary conditions (e.g. CO2, non-CO2 greenhouse gases, ice sheets, and vegetation) to be implemented in models and protocols for the simulations themselves (e.g. run length and initial conditions). Simulations carried out using these experimental designs are all classified here as PMIP4/CMIP6 simulations. The models that carried out these PMIP4 simulations are of varying complexity and include models developed for use in CMIP6, as well as earlier iterations of CMIP. The large-scale features of these PMIP4 simulation results are discussed in ref. 4 for the LGM, ref. 1 for the MPWP (as part of the PlioMIP2 project), and ref. 3 for the EECO (as part of the DeepMIP project). Simulation results are also presented for previous model simulations in the framework of PMIP3/CMIP5, described in ref. 4 for the LGM, ref. 31 for the MPWP, and ref. 55 for the EECO. Tables listing all the simulations used in this paper are given in Supplementary Information, Tables S1–S5.

Note that for the EECO, the NorESM1_F model uses palaeogeography with a different reference frame than the other models and, as such, is only included in the GMST metric and not in the polar amplification or land–sea warming contrast metrics, which are reference frame-specific. Also for the EECO, there are fewer models presented here than in ref. 3. This is because here we only include those models that carried out simulations in the range ×4–×8 preindustrial levels of CO2, in accordance with CO2 proxy estimates for the EECO3. The exception is CESM2.1slab, which we include for context and which was run at ×3.

Proxy datasets

In order to evaluate the model simulations, we use existing syntheses and compilations of paleo proxy data for all three time periods.

For the GMST metric, we make use of the IPCC AR6 assessments of GMST change for the three paleo time periods26. These are based on a thorough review of the literature and are designed to be global metrics directly comparable with the global mean output from models (i.e., they are ‘true’ metrics, see Online methods, section “Definition of metrics”). For the LGM, we also include the GMST metric of ref. 34.

For the polar amplification and land–sea warming contrast metric, we use site-based data; for the LGM, we use ref. 56 for the sea surface temperatures (SSTs) and ref. 57 (at the locations defined in ref. 58, which are the actual proxy locations that inform the global assimilated dataset of ref. 57) for the land air temperatures (LATs). For the MPWP we use ref. 59 for the SSTs and ref. 60 for the LATs. For the EECO we use ref. 61 for the SSTs and LATs.

Definition of metrics

For changes in GMST, polar amplification, and land–sea warming contrast, we can define two types of quantitative metrics. Firstly, ‘true’ quantities, \(Q^{t}\), which in theory require SST, LAT and near-surface air temperature (SAT) values to be defined over the entire ocean, land and globe, respectively (i.e. at all gridcells of a model or global gridded observational dataset). \(\mathrm{SST}^{t}\) is the ocean-only true global mean SST; \(\mathrm{LAT}^{t}\) is the land-only true global mean SAT; and \(\mathrm{SAT}^{t}\) is the true global mean SAT. Secondly, ‘site-specific’ means: \(\mathrm{SST}^{s}\), \(\mathrm{LAT}^{s}\), and \(\mathrm{SAT}^{s}\). These are similar to the true quantities, but rather than averaging over all gridcells they are defined according to a particular paleo proxy dataset and are averaged only over those cells/locations that include at least one proxy data point in that dataset. True quantities, \(Q^{t}\), can, in theory, only be defined for globally gridded output, whereas site-specific quantities, \(Q^{s}\), can be defined either for global model output or for proxy datasets. In practice, the IPCC-assessed paleoclimate GMST metrics are also considered to be ‘true’ metrics, as discussed in the section “Proxy datasets”. Site-specific quantities are simply the average of the temperatures at each site in the proxy dataset. All quantities can be defined for a particular time period (x, where x can be e for EECO, p for MPWP, l for LGM, or pi for preindustrial) and can also be defined for selected latitude ranges (r), written \({}^{x}_{r}Q\), so that, for example, the site-specific mean SST in the range 90°S to 30°S during the EECO is written \({}^{\mathrm{e}}_{-90:-30}\mathrm{SST}^{s}\).
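The distinction between true and site-specific quantities can be illustrated with a minimal sketch. It assumes a regular latitude–longitude grid, cos(latitude) area weighting for the true mean, and an unweighted average over the unique gridcells that contain a proxy site for the site-specific mean; the function names, the nearest-gridcell lookup and the choice to count each occupied cell once are assumptions made for illustration, not details taken from the analysis itself.

```python
import numpy as np

def true_mean(field, lats, mask=None, lat_range=(-90.0, 90.0)):
    """Area-weighted mean of a (nlat, nlon) field over a latitude band.

    `mask` is an optional boolean array (True = include), e.g. an ocean
    mask for SST or a land mask for LAT. NaN cells are always skipped.
    """
    # Assumption: regular grid, so cell area is proportional to cos(lat).
    weights = np.cos(np.deg2rad(lats))[:, None] * np.ones(field.shape)
    valid = np.isfinite(field)
    if mask is not None:
        valid = valid & mask
    in_band = (lats >= lat_range[0]) & (lats <= lat_range[1])
    valid = valid & in_band[:, None]
    return float(np.sum(field[valid] * weights[valid]) / np.sum(weights[valid]))

def site_specific_mean(field, lats, lons, site_lats, site_lons):
    """Unweighted mean over gridcells that hold at least one proxy site."""
    site_lats = np.asarray(site_lats, dtype=float)
    site_lons = np.asarray(site_lons, dtype=float)
    # Nearest gridcell centre in latitude for every site.
    i = np.abs(lats[:, None] - site_lats[None, :]).argmin(axis=0)
    # Nearest gridcell centre in longitude, allowing for wrap-around.
    dlon = (lons[:, None] - site_lons[None, :] + 180.0) % 360.0 - 180.0
    j = np.abs(dlon).argmin(axis=0)
    # Assumption: each occupied cell counts once, however many sites it holds.
    cells = sorted(set(zip(i.tolist(), j.tolist())))
    return float(np.mean([field[a, b] for a, b in cells]))

# Hypothetical 3 x 4 grid, with three proxy sites falling in two cells.
lats = np.array([-60.0, 0.0, 60.0])
lons = np.array([0.0, 90.0, 180.0, 270.0])
field = np.array([[-2.0] * 4, [28.0] * 4, [4.0] * 4])
print(true_mean(field, lats))
print(site_specific_mean(field, lats, lons, [58.0, 62.0, 1.0], [10.0, 5.0, 200.0]))
```

Because the site-specific mean ignores area weighting and samples only occupied cells, it generally differs from the true mean of the same field, which is why the two forms of metric are reported separately.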

We then define three key metrics as a function of these quantities. In particular, the change in true or site-specific (t,s) mean temperature relative to the preindustrial (ΔT), for the LGM (l), MPWP (p), or EECO (e) is

$${}^{l,p,e}\Delta T^{t,s} = {}^{l,p,e}\mathrm{SAT}^{t,s} - {}^{pi}\mathrm{SAT}^{t,s} \tag{1}$$

for SAT, and similarly for SST and LAT. The polar amplification metric (ΔP) is

$${}^{l,p,e}\Delta P^{t,s} = {}^{l,p,e}_{-30:+30}\mathrm{SST}^{t,s} - {}^{l,p,e}_{\pm 60:\pm 90}\mathrm{SST}^{t,s} - {}^{pi}_{-30:+30}\mathrm{SST}^{t,s} + {}^{pi}_{\pm 60:\pm 90}\mathrm{SST}^{t,s} \tag{2}$$

for SST, and similarly for LAT. The land–sea warming contrast metric (ΔL) is

$${}^{l,p,e}\Delta L^{t,s} = {}^{l,p,e}\mathrm{LAT}^{t,s} - {}^{l,p,e}\mathrm{SST}^{t,s} - {}^{pi}\mathrm{LAT}^{t,s} + {}^{pi}\mathrm{SST}^{t,s}. \tag{3}$$
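As a minimal sketch, the three metrics can be written as plain functions of the mean quantities they combine. The function and argument names are hypothetical, the inputs are assumed to be already-computed means in °C (true or site-specific), and the term order and signs simply follow Eqs. (1)–(3) as written.

```python
def delta_T(sat_paleo, sat_pi):
    """Eq. (1): change in mean temperature relative to preindustrial."""
    return sat_paleo - sat_pi

def delta_P(sst_tropics_paleo, sst_polar_paleo, sst_tropics_pi, sst_polar_pi):
    """Eq. (2), term by term: 30S-30N mean minus 60-90 degree mean,
    relative to the same difference in the preindustrial."""
    return sst_tropics_paleo - sst_polar_paleo - sst_tropics_pi + sst_polar_pi

def delta_L(lat_paleo, sst_paleo, lat_pi, sst_pi):
    """Eq. (3): land minus sea temperature, relative to preindustrial."""
    return lat_paleo - sst_paleo - lat_pi + sst_pi

# Hypothetical means (degrees C), chosen only to exercise the functions.
print(delta_T(15.5, 14.0))
print(delta_L(16.0, 19.0, 14.0, 18.0))
```

Written this way, Eq. (3) is the land temperature change minus the sea surface temperature change, so the land–sea warming contrast is positive whenever land warms more than the ocean.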

The proxy compilations that we use are published with associated uncertainties in temperature for each individual site. However, the meaning of these uncertainty ranges is unclear in some cases, and inconsistent across different time periods. Here we interpret all published uncertainties as representing a range of uniformly distributed uncertainty. In order to estimate the associated uncertainty in the polar amplification and land-sea warming contrast site-specific proxy metrics that we present, we use Monte Carlo sampling to generate 100 proxy datasets and use these to generate 100 associated metrics, from which we report a mean and a 90% uncertainty range (consistent with the IPCC ‘very likely’ range).
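The sampling procedure described above can be sketched as follows. The sketch assumes that each published uncertainty is a half-width around the central site value, that sites are sampled independently, and that the 90% range is taken between the 5th and 95th percentiles of the sampled metrics; the function name, its arguments and the fixed random seed are hypothetical and are not taken from the analysis code.

```python
import numpy as np

def monte_carlo_metric(central, half_width, metric, n_samples=100, seed=0):
    """Propagate per-site uniform uncertainties into a site-specific metric.

    central, half_width : per-site temperature and uncertainty half-width.
    metric              : function mapping one sampled dataset to a number.
    Returns (mean, 5th percentile, 95th percentile) of the sampled metrics.
    """
    rng = np.random.default_rng(seed)
    central = np.asarray(central, dtype=float)
    half_width = np.asarray(half_width, dtype=float)
    # One row per synthetic proxy dataset, one column per site.
    samples = rng.uniform(central - half_width, central + half_width,
                          size=(n_samples, central.size))
    values = np.array([metric(sample) for sample in samples])
    lower, upper = np.percentile(values, [5, 95])
    return float(values.mean()), float(lower), float(upper)

# Hypothetical example: two high-latitude sites and one tropical site,
# with the metric taken as tropical minus mean high-latitude temperature.
mean, lower, upper = monte_carlo_metric(
    central=[10.0, 12.0, 20.0],
    half_width=[1.0, 1.0, 2.0],
    metric=lambda s: s[2] - s[:2].mean(),
)
print(mean, lower, upper)
```

With only 100 samples the percentile bounds carry some sampling noise of their own, so repeating the calculation with a different seed gives slightly different ranges.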

Developments since IPCC AR6

IPCC AR6 includes a figure showing ensemble mean maps and zonal means of the SST and SAT data analysed in this paper (ref. 18, Fig. 7.13 therein). Compared with the IPCC figure, we have made the following developments and incorporated them into our overall analysis: (1) Here, in Supplementary Information, Fig. S4, the horizontal lines showing the banded mean SSTs, and the values given in the plot for the polar amplification associated with these bands, are calculated using the ensemble mean SSTs only for those gridboxes where all models have an ocean gridcell (cdo operator ‘ensavg’). In the equivalent IPCC plot, the values given are the same as in Fig. S1, but the horizontal lines were calculated using the mean of the models for all gridboxes for which at least one model had ocean (cdo operator ‘ensmean’). (2) Here, for extracting the modelled SST at the location of a proxy, for SST proxy locations that are defined as land in the models, the nearest ocean gridcell is used to define the model value. In the IPCC, due to a coding error, the nearest-but-one ocean gridcell was used. (3) Here, we assign an uncertainty of ±5 °C for any proxy data that does not have an associated uncertainty in the original reference. In the IPCC, due to a coding error, an error of zero was assigned. (4) Here, with the exception of NorESM stated above, all models are used to calculate all three metrics. In the IPCC, the EECO CESM2.1slab simulation was not included in the ensemble mean map or in the plot of the zonal mean.
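The difference between the two ensemble-averaging choices in point (1) can be sketched in NumPy. The sketch assumes each model's SST field is on a common grid with land cells set to NaN; the function names are hypothetical, and the code mirrors only the behaviour described in the text rather than reproducing the cdo operators themselves.

```python
import numpy as np

def ensemble_mean_any(ssts):
    """Mean over models wherever at least one model has ocean (NaN = land)."""
    stacked = np.stack(ssts)
    count = np.sum(np.isfinite(stacked), axis=0)
    total = np.nansum(stacked, axis=0)
    # Cells that are land in every model stay NaN.
    return np.where(count > 0, total / np.maximum(count, 1), np.nan)

def ensemble_mean_all(ssts):
    """Mean over models only where every model has ocean."""
    # A single NaN along the model axis makes the plain mean NaN.
    return np.mean(np.stack(ssts), axis=0)

# Two hypothetical 2 x 2 model fields with different land masks.
nan = np.nan
model_a = np.array([[1.0, nan], [3.0, nan]])
model_b = np.array([[3.0, 5.0], [nan, nan]])
print(ensemble_mean_any([model_a, model_b]))
print(ensemble_mean_all([model_a, model_b]))
```

Restricting the mean to cells that are ocean in every model avoids coastal cells whose ensemble mean would otherwise be set by a subset of the models.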