Introduction

Soils have the potential to mitigate climate change1,2 by sequestering and storing carbon3,4 (C). However, warming and land use changes are expected to increase decomposition and lead to soil organic carbon (SOC) losses, causing an amplifying feedback on climate change5. Whether soils mitigate or exacerbate climatic changes depends on the balance between (i) C inputs from vegetation and exogenous organic matter, and (ii) C outputs from microbial respiration, dissolved organic C leaching, soil erosion, combustion by wildfires and volatile organic compound emissions6. Mathematical models are crucial to understand and predict this balance, as they condense complex processes into a mathematical formalism suitable for quantitative analyses.

Soil organic carbon models translate theoretical hypotheses into a simplified overview of the ecological system described by schematic representations and mathematical equations (Fig. 1a). By confronting model simulations against empirical observations, the underlying hypotheses of a model can be tested (Fig. 1b, c). After testing, models can be used to infer the effects of environmental drivers, such as climate change or land use, on SOC dynamics (Fig. 1c). As a tool to connect data and theory, a model is neither true nor false7. Rather, the value of a model comes from its ability to explain and account for a set of phenomena8, so that the validation process of a model depends on the purpose and context in which it is applied (Box 1).

Fig. 1: What is a SOC model?

A schematic representation of (a) a SOC model as a mediator between a theoretical and an empirical field, (b) the links between models and theory, and (c) the links between SOC models and the empirical field. In b, Micr. biom. stands for microbial biomass.

Since the early 1930s, a wide variety of models to represent the dynamics of SOC have been developed for various spatial and temporal scales, climatic conditions, land-uses and land-covers9,10. At least two families of model structures can be identified: (i) approaches relying on conceptual SOC pools decaying according to first-order kinetics11,12, and (ii) approaches that resolve microbial and physical processes controlling SOC decay and stabilisation by describing extra-cellular enzymatic reactions, diffusion and/or sorption kinetics13,14 (Fig. 1b). In the former category, first-order kinetics imply that decay rates are proportional to the SOC stocks of the various pools considered, with rate modifiers to implicitly represent the effects of key factors, namely soil temperature, soil moisture, and clay content, on microbial and physical processes15,16. In the latter category, nonlinear kinetics consider the feedbacks between microbial activity and SOC substrates by representing the decay rate as a function of SOC and/or microbial C stocks17,18. Yet, no consensus has emerged on any approach for understanding and predicting SOC dynamics19. As a consequence of the wide range of approaches, predicted SOC values exhibit large discrepancies across models, irrespective of model category or temporal and spatial scales20,21,22, indicating that more robust validation procedures of SOC models need to be developed23. Comparison of model performance is also complicated by the lack of standardised model validation criteria24 (Fig. 1c). While validation of models is a critical step to improve confidence in SOC model predictions, no comprehensive review of the different approaches to model validation has been undertaken so far.
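As a minimal illustration of the two families, the sketch below contrasts a single-pool first-order model, dC/dt = I − k·f(T)·C, with a two-pool model in which decay follows Michaelis-Menten kinetics and therefore depends on both SOC and microbial C. The Q10 temperature modifier, the function names and all parameter values are illustrative assumptions and are not taken from any of the reviewed models.

```python
def q10_modifier(temp_c, q10=2.0, ref_c=10.0):
    """Hypothetical rate modifier: decay is multiplied by q10 per 10 degC of warming."""
    return q10 ** ((temp_c - ref_c) / 10.0)


def first_order_step(soc, c_input, k, temp_c, dt=1.0):
    """One explicit Euler step of dC/dt = I - k * f(T) * C (decay proportional to stock)."""
    decay = k * q10_modifier(temp_c) * soc
    return soc + dt * (c_input - decay)


def first_order_steady_state(c_input, k, temp_c):
    """Analytical steady state of the first-order model: C* = I / (k * f(T))."""
    return c_input / (k * q10_modifier(temp_c))


def microbial_step(soc, mic, c_input, vmax, km, cue, death, dt=1.0):
    """One explicit Euler step of an illustrative two-pool nonlinear model.

    Decay depends on substrate (soc) and microbial biomass (mic):
        decay = vmax * mic * soc / (km + soc)
    A fraction cue of the decayed C builds biomass; dead biomass returns to soc.
    """
    decay = vmax * mic * soc / (km + soc)
    d_soc = c_input + death * mic - decay
    d_mic = cue * decay - death * mic
    return soc + dt * d_soc, mic + dt * d_mic
```

In the first-order sketch the steady-state stock scales linearly with C input, whereas in the nonlinear sketch the decay flux vanishes without microbial biomass, which is the kind of feedback between microbial activity and substrate that the second family aims to capture.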

We fill this gap by systematically comparing the validation procedures of ~250 SOC models spanning 90 years of model development history in relation to their scope and main features. We raise three interrelated questions: (i) What are the different SOC model scopes, features and validation procedures in the scientific literature and how have they changed since the first SOC models in the 1930s? (ii) How, and to what extent, are different model categories evaluated against observations? (iii) How can the diversity of model scopes and features help enhance the reliability of models aimed at predictions? Answering these questions is crucial for applying and developing adequate SOC models for prediction while leveraging the complementarity of different model scopes and features.

To answer these questions, we summarized model validation methods based on existing definitions in the literature (Box 1) and then conducted a systematic literature review to assess the types of model validation used. The review consisted of four steps: (i) literature screening based on a previous SOC model review10 from 2009 and expert elicitation following a workshop on “Diversity and complementarity in SOC modelling approaches” in October 2021, (ii) literature selection aiming at excluding publications out of the scope of the present study, (iii) systematic literature analysis, and (iv) refinement of literature selection (Fig. S1). Our systematic quantitative review is also complemented by a qualitative analysis of the history of model validation of both first-order and nonlinear SOC models, illustrating different pathways of model validation for three first-order kinetic models (Century, AMG, Q-model) and two nonlinear kinetic models (MIMICS, Millennial) (Figs. S35 and Box S1 & S2).

Model features, scopes and validation across time

Model features, scope, and validation procedures are regarded here as primary data and presented through a historical perspective (Fig. 2) and bar chart analyses (Fig. 3).

Fig. 2: Temporal evolution of model number, features, scopes and validation procedure.

Evolution of (a) the number of publications reviewed in the present research and model features (expressed as percentages of all models in each time interval) regarding: (b) decomposition kinetics, (c) model level of interest, (d) model scope, and model validation according to (e) the procedures applied and (f) validation conditions. Curly brackets in (d) group models including ‘prediction’ among their scopes; curly brackets in (f) highlight the percentage of models undergoing independent diachronic validation.

Fig. 3: Validation of models aiming at prediction.

Types of validation depending on the source of data used with all SOC models reviewed since 2000—accounting for 92 models in 45 publications. Numbers inside the bar charts indicate the total number of each combination of type and condition of model validation.

Regarding model features, we focused on the representation of decomposition kinetics (Fig. 2b) and the ecological system level of the models (Fig. 2c) (see also SI1). These ‘system levels’ characterise the context in which models are used, and partly also reflect the spatial scale of application (starting from smallest scales at the ‘microbial community’ level). We found that 28 to 37% of the 137 models until the 2000s applied nonlinear kinetics including both SOC and microbial biomass or enzymes. This number increased to 55% of the 139 models published after 2010 (Fig. 2b). The recent increase in the number and proportion of SOC models based on nonlinear kinetics can be explained by at least two reasons. First, nonlinear kinetics allow us to account for transient dynamics and feedback responses of soil microorganisms to changing environments and SOC decomposition25,26. While linear models rarely capture these effects27, they were preferred for their simplicity and inherently stable behaviours. In the last decades, the need to include nonlinear effects in the context of climatic changes has motivated a revival of nonlinear models. Second, since the 2010s, an increasing number of Earth System Models (ESM) and Dynamic Global Vegetation Models (DGVM) have been intended to incorporate these feedback effects to predict global C dynamics under climate change18,28. This new direction of ESM and DGVM development was probably motivated by their poor SOC prediction performance29, leading to further publications arguing for inclusion of soil microbial processes in ESM30. Yet, most models in the 2010–2021 period were not developed for global scale applications (only 4%, Fig. 2c), with most models describing processes from the soil to the ecosystem level (69% and 18%, respectively, of all models in that period, Fig. 2c). A minority of models describe processes at microscopic scales, including models at the microbial community level (9%, Fig. 2c).
This dominance of soil- and ecosystem-level models might reflect the spatial scales at which most data are available (Fig. 2f). Thus, despite the call for including nonlinear kinetics in ESM and DGVM, nonlinear models are still primarily developed for smaller scales. Moreover, models were increasingly needed to provide decision support to enhance SOC sequestration at the plot or farm level, which requires local- rather than global-scale predictions31.

Regarding model scope, we distinguished three objectives: hypothesis-testing and formalisation, data interpretation (diagnostic models), and prediction (prognostic models, see also SI1). These categories are not mutually exclusive, which means that a given model can fall into more than one category. Because prediction is the scope for which model validation is the most critical, we focus here on this objective (Fig. 2d). Between one third and one half of the models were developed with this objective in mind, with a notable decline between the 2000s and the 2010s, suggesting that most of these newer models are in an early phase of theoretical development and testing before being possibly used for prediction32,33.

Finally, we characterised validation by considering two relevant axes: the dependence of the validation dataset on the calibration dataset (Fig. 2e) and the source of the validation dataset (Fig. 2f, see also SI1). Four types of model validation are considered: (i) independent diachronic validation, which provides the most confidence in model accuracy34, (ii) independent non-diachronic validation, which tests model ability to reproduce spatial patterns of SOC stocks, (iii) non-independent validation, which corresponds to model calibration35,36, and (iv) no validation/calibration (Box 1). Before the 1970s, most models were not evaluated except via qualitative comparisons with observations (Fig. 2e). Since the 1980s, more models have been evaluated, but non-validated models have remained a substantial fraction. Moreover, as the percentage of non-validated models has remained fairly constant since the 1990s while the total number of models has increased (Fig. 2a), the number of non-validated models has actually increased. Independent diachronic validation, while remaining a minor fraction, became more widespread in the last two decades (Fig. 2a, e), indicating increased attention to model ability to predict temporal changes in SOC.
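The four categories above can be expressed as a simple decision rule. The sketch below is only an illustration: the function names, the boolean flags and the choice of holding out whole sites and scoring with a root-mean-square error are assumptions made for this example, not a procedure prescribed by the review.

```python
import math


def classify_validation(has_comparison, independent, time_series):
    """Map a model evaluation onto the four categories described in the text.

    has_comparison: the model was compared quantitatively with observations
    independent:    the observations were not used for calibration
    time_series:    the observations are repeated measurements over time
    """
    if not has_comparison:
        return "no validation/calibration"
    if not independent:
        return "non-independent validation (calibration)"
    if time_series:
        return "independent diachronic validation"
    return "independent non-diachronic validation"


def split_sites(site_ids, calibration_fraction=0.5):
    """Hold out whole sites so that validation series never overlap calibration data."""
    ordered = sorted(site_ids)
    n_cal = int(len(ordered) * calibration_fraction)
    return ordered[:n_cal], ordered[n_cal:]


def rmse(observed, simulated):
    """Root-mean-square error between paired observations and simulations."""
    if len(observed) != len(simulated) or not observed:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(o - s) ** 2 for o, s in zip(observed, simulated)]
    return math.sqrt(sum(squared) / len(squared))
```

Under these assumptions, a model calibrated on the first group of sites and scored with rmse on SOC time series from the second group would count as independently and diachronically validated, whereas scoring it on the calibration sites would fall under category (iii).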

In addition, we examined four sources of data used for validation: (i) laboratory experiments, (ii) field experiments, (iii) observation networks, and (iv) reconstructed datasets in which observation values from field measurements are scaled up to construct gridded datasets37. Although the validation of SOC models against observation networks and reconstructed datasets has increased since the 1990s, we found that the majority of validations are still based on field or laboratory experiments (Fig. 2f), which is expected as the majority of models describe processes at the level of soil or ecosystem (Fig. 2c).

Model features, scopes and validation since 2000

To assess recent trends in SOC model validation, we analysed the relative distributions of model features and scopes depending on the type of validation, either independent diachronic or other types (Fig. S2). We found that 27% of first-order kinetic models were validated against diachronic independent data, while this was the case for only 15% of nonlinear kinetic models. This indicates that simpler models based on first-order kinetics have been validated more thoroughly with independent time-series observations than the ‘more mechanistic’ microbial models. In contrast, independent validation has been performed in similar proportions for models at the soil level or below and for models at the ecosystem level or above.

Additionally, we analysed the types of validation used for models proposed since 2000 and aiming at prediction. To do that, we considered two relevant axes: (i) types and (ii) conditions of model validation. Approximately 40% of model validations in predictive models were based on field data, but not on independent and diachronic data (Fig. 3). This proportion has remained almost constant since 1933 (Table S1). Similarly, 15% of validations were based on laboratory data, but were not independent and diachronic (Fig. 3). In several cases, the validation consisted of calibrating the model and assessing its performance against the same calibration data38,39. In many cases, independent validation based on space-for-time substitutions was used17,40,41,42. Only 23% of models aiming at prediction were tested using an independent diachronic validation, despite its importance for testing the ability of models to simulate SOC temporal dynamics (Box 1). This group included 14 first-order models, namely the LPJ43, MOMOS44, CN-SIM45, CIPS46, AMG47, Roth-C48, PRIM49, ORCHIDEE-PRIM50, Yasso51, N14CP-Agri52, CASA-CNP53 and DAISY54 models, as well as three unnamed models55,56, and six nonlinear models, namely the SOMKO57, Ecosys58, BACWAVE-WEB59, MIMICS28 and CORPSE53 models and one unnamed model60. Only 7 models (8%) were evaluated against network measurements at the territorial or national level, but not diachronically and independently. However, the type of data does not seem to covary with the evaluation type, in that a fairly constant proportion of model validations are independent diachronic (26% for lab experiments, 24% for field, 25% for reconstructed).

Recommendations for model validation

Our review identified a lack of independent diachronic validation based on SOC network measurements (Fig. 3). Filling this gap will be crucial to objectively evaluate regional to global SOC predictions in the context of climate and land use changes. Reuse of highly valuable data sets and novel measurement networks will be central to this effort. Both incubation and field experiments provide valuable empirical observations to evaluate SOC models. In particular, existing decadal field experiments5,61 represent a valuable source of data. They provide long-term time-series data on SOC stocks in different climatic conditions, but remain underutilised. Given the time required to develop high-value long-term data sets and measurement networks, greatly improved access to these resources, together with their valorisation and more extensive use, will be key to this effort. At a larger scale, validations against observations from measurement networks allow the predictive value of SOC models to be assessed under a vast range of land-use and pedoclimatic contexts.

National and macro-regional soil monitoring networks are routinely resampled and data are becoming more available (e.g., with the second campaign of the French Soil Network Measurement44, the fourth resampling of the European LUCAS topsoil network62 and the regional or national soil monitoring systems existing in 18 European countries63), and will therefore enable validating models in a large set of contexts and at large spatial and temporal scales. To this end, some existing limitations still need to be overcome. In particular, the different soil databases already in use64,65,66 require harmonisation in terms of spatial resolution, reported variables and measurement methods. Harmonized datasets will provide reliable model input variables30, such as C inputs from plants or amendments67 and initialisation of SOC pools68, allowing meaningful model inter-comparisons. Overcoming these limitations has been identified as a key research priority to provide compatible benchmark data sets for a consistent validation of SOC models. In the case of SOC model implementation within ESMs, an additional barrier is that ESMs often lack the adequate forcing data at the site resolution. Because data scarcity is still largely limiting large-scale diachronic validation, global datasets, such as the Soil Respiration Database69, remain invaluable to validate the ability of models to predict the spatial variability of C fluxes from soils, for which time series are not a prerequisite. However, these datasets are insufficient to validate the predictive accuracy of the temporal dynamics of SOC stocks.

Models tend to become increasingly complex, including more and more biogeochemical processes70 (Box S2). Yet, adding processes or compartments to SOC models might result in hidden compensating biases arising from the overfitting of model parameters21,71. Therefore, evaluating model ability to reproduce newly explicitly represented processes and/or soil compartments is required to address this bias. To this end, validation of new models including e.g., microbial processes should be conducted on field experiments that allow controlling the effects of environmental or management changes on these processes. More generally, it is worth emphasising that the more observations are collected—not only on SOC stocks and SOC compartments, but also on bulk density, vertical SOC profile, clay content, moisture content, C inputs from plants, microbial biomass, etc.—the better it is for stringently validating all model outputs.

To summarize, the next steps in model validation to improve reliability and accuracy should be, in order of priority: (i) To gather and homogenize time series of long-term field trials and measurement networks that could be used as benchmark datasets to systematically validate, independently and diachronically, all models aimed at predicting SOC dynamics. Such a benchmark dataset would allow stringent comparisons of SOC models, helping to choose the best-suited models depending on the spatio-temporal scale of interest, land-use and pedoclimatic contexts. (ii) To maintain and develop SOC monitoring networks to build spatially-explicit maps of the temporal dynamics of SOC stocks, which in turn will allow validating SOC models from regional to global scales (this type of validation is currently sorely underrepresented, Fig. 3). (iii) To validate models incorporating new representations of soil processes and/or soil compartments against independent time-series data specifically related to those processes and compartments, in order to avoid compensating biases.

Model diversity can help improve prediction ability

Each model considers a number of biogeochemical processes governing the dynamics of SOC, which are translated into its mathematical formalism (Fig. 1). Can we take advantage of the diversity of SOC models to improve those aimed at predictions or projections and evaluate their relative performance? To answer this question, let us consider the common object of interest for all these models – the dynamics of SOC – and the role of models as a mediator between empirical and theoretical knowledge72 (Fig. 1).

Different SOC modelling approaches, reflecting distinct theoretical assumptions, can use the same empirical observations for calibration or validation53,73. Comparing the accuracies of SOC models describing different processes with respect to the same set of observations allows us to identify the mathematical formalism achieving the best performance. For instance, Lawrence et al.25 developed four SOC models of varying mechanistic complexity and compared their ability to simulate soil respiration observed in a laboratory incubation experiment to test which mechanisms improve model performance. However, the use of the models that showed the best performance might be hindered by the low quality or lack of available input data and parameters when the intended scale is larger or longer than the one for which models were tested and parametrized, for instance when moving from the laboratory incubation to the field level. To overcome this barrier, new indicators based on widely available observational data could be developed to transfer processes identified as important by complex models toward simpler models. One example is the integration of the priming effect (enhanced or inhibited decomposition of native SOC by added fresh organic matter) in SOC models. Priming can be described by modelling microbial biomass and enzymatic reactions (Michaelis-Menten kinetics)25 (Fig. 1c), but including these processes could be hampered by the lack of reliable input data and parameter estimates at the intended scale of application. To circumvent this issue, an empirical relationship between C input and mineralization rate can be used as a surrogate for explicit microbial processes and thus represent the priming effect in a simple way49,74.
Moreover, predictive models can benefit from the comparison with more complex models designed for other goals, as such comparisons provide a safeguard against being locked into one type of model and help to identify processes that may have been neglected or incorrectly represented at the scale at which predictions are made25,49.
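To make the idea of an empirical priming surrogate concrete, the sketch below lets the mineralization rate of a single SOC pool depend on fresh C input through a saturating function, without representing microbial biomass explicitly. The functional form, the parameter names (k0, k_max, sensitivity) and the values used are hypothetical choices for illustration and do not reproduce the formulation of any cited model.

```python
import math


def primed_rate(k0, k_max, c_input, sensitivity):
    """Hypothetical input-dependent mineralization rate.

    Returns the basal rate k0 when there is no fresh C input and approaches
    k_max as input grows. Setting k_max > k0 mimics enhanced decomposition
    (positive priming); k_max < k0 mimics inhibited decomposition.
    """
    return k0 + (k_max - k0) * (1.0 - math.exp(-sensitivity * c_input))


def soc_step(soc, c_input, k0, k_max, sensitivity, dt=1.0):
    """One explicit Euler step of dC/dt = I - k(I) * C with the surrogate rate."""
    decay = primed_rate(k0, k_max, c_input, sensitivity) * soc
    return soc + dt * (c_input - decay)
```

The only additional driver needed by this surrogate is the C input, which is typically already required by first-order models, so the approach trades mechanistic detail for applicability at scales where microbial data are scarce.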

Different SOC models can also be used to run equivalent numerical experiments to compare how their mechanistic representations affect the simulations. Bridging empirical knowledge, condensed into model calibrations, and theoretical knowledge gained by analysing the simulation results, has two virtues in terms of model complementarities. First, it enables researchers to explore how feedback mechanisms can result in distinct behaviours in different SOC models26,75,76. For instance, Sainte-Marie et al.75 explored how different conceptualisations and mathematical formalisations of SOC depolymerisation by different decomposer groups impact the chemistry and amount of organic matter during decomposition and at steady state. This analysis has a heuristic value in itself, independent of the subsequent validation of the model24. Second, bridging empirical and theoretical knowledge allows for testing the sensitivity of different models to shifts in driving variables20,76. For instance, Ito et al.20 highlighted the high sensitivities and uncertainties on the impact of land-use on future SOC dynamics at the global level in 15 DGVMs.

Model diversity can also result in synergies when models are implemented into ensemble modelling approaches77,78,79. Considering multiple structurally diverse models for prediction makes it possible to estimate the uncertainty of the simulated variables due to the different processes represented and their sensitivities to driving factors80. Weighted averages and model selection based on performance criteria reduce the prediction error and provide more robust predictions, relative to single model simulations77. However, standardizing modelling methods remains a fundamental issue in multi-model inter-comparison exercises77,78,79. For example, the choice of the parameters to calibrate, the initialization method to use for conceptually different soil compartments, and the estimation of forcing variables make model inter-comparisons difficult. Protocols are needed that also account for data availability81. Multi-model ensembles thus represent promising tools to account for uncertainties in the simulations and provide greater reliability than individual models.
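A performance-weighted ensemble can be sketched in a few lines. In the example below, each model receives a weight inversely proportional to its validation error, and the weighted spread around the ensemble mean serves as a simple indicator of structural uncertainty. The inverse-error weighting scheme and the function names are assumptions chosen for illustration; other performance criteria could be used in the same way.

```python
def inverse_error_weights(errors):
    """Weights proportional to 1/error (e.g. validation RMSE), normalised to sum to 1."""
    if not errors or any(e <= 0 for e in errors):
        raise ValueError("errors must be a non-empty list of strictly positive values")
    inverse = [1.0 / e for e in errors]
    total = sum(inverse)
    return [w / total for w in inverse]


def weighted_ensemble(predictions, weights):
    """Return the weighted mean of model predictions and the weighted spread around it."""
    if len(predictions) != len(weights):
        raise ValueError("one weight per prediction is required")
    mean = sum(w * p for w, p in zip(weights, predictions))
    variance = sum(w * (p - mean) ** 2 for w, p in zip(weights, predictions))
    return mean, variance ** 0.5
```

For instance, two models with equal validation errors receive equal weights, so predictions of 10 and 20 (in arbitrary SOC stock units) yield an ensemble mean of 15 with a spread of 5. Such weights are only meaningful if the errors come from a common, independent validation dataset.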

Outlook

Soil organic carbon models are expected to predict SOC stock changes and to provide a sound biogeochemical context for simulating coupled soil-vegetation interactions. Yet, reliable model validation lags behind increasing model complexity. Based on a systematic review of ~250 models over 90 years, we advocate for stringent independent diachronic validation of prediction-aimed models at all scales based on sites and networks that provide time series of SOC stocks and/or C fluxes at plot, national and regional scales. Continued efforts to maintain these datasets are thus imperative to increase reliability and accuracy of SOC projections and predictions. In parallel to observation-based validation, predictive models can also be conceptually evaluated by comparing them to models designed for other goals, to identify processes that might have been neglected or incorrectly represented at the scale at which predictions are made.