Introduction

The Arctic is an especially climate-sensitive region of the Earth system, which is evident from the enhanced rate of warming compared to the global average1. In addition to temperature changes, the Arctic has also experienced substantial declines in e.g., sea ice extent, thickness, and concentration2 as well as thawing permafrost3, Greenlandic ice sheet melt4, increasing riverine fresh-water and sediment input into the ocean5, and increasing marine biological activity6. While anthropogenic CO2 emissions are still increasing7, short-lived climate forcers (e.g., black carbon and sulfate aerosols) have been decreasing or plateauing in recent decades8,9. These changes suggest that the anthropogenic influence on Arctic aerosols is decreasing in importance while natural influences on processes affecting aerosols are relatively increasing9. These changes motivate the need to explore the natural components of the Arctic system, especially natural aerosols, which are an important factor in determining the Arctic radiative budget due to their direct and indirect effects10,11. Aerosols are responsible for a large proportion of climate modeling uncertainty12,13,14 and including natural aerosol processes in climate models can help reduce their uncertainty15. Aerosols are expected to respond to the changes in the Arctic environment but determining the direction and magnitude of this response is complex16,17,18. Therefore, investigations on the role and drivers of natural aerosols in the climate system are crucial for evaluating the effects of climate change in this delicate environment9.

Methanesulfonic acid (MSA) in the Arctic is derived mainly from the oxidation of naturally emitted, marine dimethyl sulfide (DMS)19. While the open ocean is the major source of DMS emissions, other sources can contribute as well such as melt ponds, biomass burning, coastal tundra, and lakes20,21,22. DMS is produced from the enzymatic cleavage of dimethylsulfoniopropionate23,24 and represents the largest source of marine biogenic sulfur in the atmosphere25,26. DMS mixing ratios follow a seasonal cycle with near-zero values during polar night (October to February) and elevated and highly variable values during polar day (March to September), typically ranging between 30 and 180 parts per trillion by volume (pptv) for monthly averages21,27,28,29,30,31. Once ventilated into the atmosphere, DMS is oxidized by OH, O3, NO3, and halogens, a process that proceeds via either the addition or abstraction pathways, which are temperature dependent19. In the atmosphere, DMS has a lifetime of approx. 1–2 days21, which is dependent on latitude in the Arctic32. Gas-phase DMS oxidation ultimately yields MSA or SO2 (which is further oxidized to sulfuric acid), with the addition pathway resulting in higher yields of MSA33. DMS can also be oxidized in the aqueous phase (through processing in clouds or on deliquesced particles34) mainly via O3 and OH, which has been demonstrated through modeling and measurements to be an important formation mechanism35,36,37,38,39,40. After dissolving into a droplet, DMS is oxidized into MSA through multi-phase chemistry and during droplet evaporation, MSA is released into the gas-phase and can further influence secondary aerosol production37,39,40. Particulate MSA also follows a seasonal cycle with near-zero values during polar night and elevated values during polar day41,42,43,44. 
Peak MSA concentrations are experienced during May (~10–40 ng m−3) and typically remain elevated until the end of summer (the seasonal cycle of MSA will be discussed in Section “Seasonal cycle and source regions of Pan-Arctic MSA”). Trends in MSA have shown high spatial and temporal variability in the Arctic, for instance, MSA has increased at Utqiaġvik/Barrow during July-September by 2.5% yr−1 over 1998–201745, at Kevo, Finland, MSA has increased during June/July by 0.68% yr−1 over 1964–201046, and at Alert, CA, MSA has shown decreasing trends during the cold and warm periods over 1980-2000 but increased during the warm period over 2000-20098. The variable trends in MSA are likely related to interannual variability in source strength and transport patterns over the different considered periods8,45,47,48. MSA has a lifetime of several days depending on the environmental conditions49. MSA has been observed to reside mainly in the accumulation mode (aerosols with a diameter > 100 nm)50,51,52 although recent observations have indicated MSA to be in the Aitken mode (~25 < diameter < 100 nm) during a formation event53. MSA is removed from the atmosphere via dry and wet deposition (with wet deposition being the most important37) or further oxidized into sulfate in the aqueous phase39.

MSA is an important constituent in the Arctic atmospheric system, as it has mainly been observed to aid in the condensational growth of aerosols, but has also been modeled to participate in the nucleation of new aerosol particles54,55,56. While its role in nucleation has yet to be directly observed in the field57,58, it has been observed in chamber studies59. More importantly, while only representing a fraction of the total aerosol mass, MSA is critical for the growth of particles to cloud condensation nuclei (CCN) sizes31,32,60 thereby affecting cloud microphysical properties such as cloud lifetime, albedo, and precipitation efficiency61,62,63,64. MSA has been shown to selectively condense onto alkaline particles and not acidic or hydrophobic particles in polar regions65,66. The summertime Arctic is characterized by low particle concentrations, and thus an aerosol-sensitive CCN regime67,68,69, therefore, variations in the number concentration of CCN-active aerosols can have a large effect on the radiative balance of clouds12. Understanding the sources and atmospheric drivers of MSA is therefore vital for accurate modeling of the Arctic climate system, given that aerosol-cloud interactions are one of the largest sources of uncertainty in global climate modeling15.

Machine learning (ML) algorithms and models are powerful tools for understanding environmental phenomena since they can account for non-linear, complex relationships in large datasets that simpler analysis methods cannot accommodate. ML has several advantages over numerical models (NMs), namely computational efficiency, flexibility, and the ability to handle complex non-linear tasks70. ML models are data-driven, meaning they rely solely on the explanatory data and not on pre-defined physical equations, which allows them to efficiently capture complex non-linear relationships for many different applications. NMs rely on physical equations, emission inventories, and assumptions that require accurate parameterizations and estimates, thus limiting their flexibility. As NMs increase in complexity, their computational costs and runtime increase disproportionately; capturing complex atmospheric phenomena can therefore become prohibitively expensive. ML is not without its disadvantages, including the need for large amounts of curated training data and the difficulty in interpreting complex models. Efforts are being made to create hybrid ML-NM models that maximize the advantages of both approaches while minimizing their disadvantages. ML has been demonstrated to be effective in the analysis of environmental processes including predicting sea surface DMS concentrations71,72, aerosol-cloud interactions73, CCN concentrations70,74, and global particle number concentrations75. In the Arctic, k-means clustering has been utilized at several sites to understand the dynamics of aerosol particle number size distributions76,77,78,79,80 and their links to environmental conditions (e.g., open water extent, transport patterns) as well as their sources and climate-relevant properties. Song et al. 81 used a random forest regressor to understand the drivers of different aerosol types on Svalbard and found that solar radiation, surface pressure, and temperature were drivers of biogenic-type aerosols. Hu et al. 82 used principal component analysis coupled with a generalized additive model to show the marine and terrestrial influences on isoprene and monoterpene-derived secondary organic aerosols in the Arctic. Overall, this highlights the applicability of ML models to understand atmospheric phenomena and their advantages over NMs.

In this study, we leverage in situ observations of particulate MSA from several stations geographically dispersed around the Arctic (Fig. 1a), along with FLEXible PARTicle dispersion model (FLEXPART83) backward dispersion simulations and ERA584 reanalysis meteorological datasets. Our objectives are to (1) identify the Pan-Arctic seasonal cycle and source regions of particulate MSA, (2) elucidate the relationships between MSA and atmospheric variables, and (3) project the future response of MSA based on trends of environmental variables in ERA5. First, we explore the Pan-Arctic seasonal cycles and source regions of MSA. We then use the average trajectory-weighted source regions of MSA and the derived average trajectory-weighted meteorological conditions from ERA5 as input into an ML model and use the SHapley Additive exPlanations (SHAP85) approach for model explainability (see Methods for further details about trajectory-weighted source regions and meteorological conditions as well as SHAP methodology). Following this, we project the future MSA concentrations using the same trained ML model and extrapolated trends in the ERA5 data. We utilize a leave-one-out approach to explore which trends in ERA5 variables are contributing to future changes in MSA. Finally, we discuss the limitations and implications of our results.

Fig. 1: Locations and MSA seasonal cycle for each station.

a Map of the four Arctic stations used in this study. Stations are indicated with a red star. The map background is from Natural Earth. b The seasonal cycle of MSA at Alert (red), Gruvebadet (blue), Thule (cyan), Utqiaġvik/Barrow (magenta), and all stations combined (Pan-Arctic) (black). The thick lines represent the median and the shading represents the interquartile range for each month over the period 2010–2017 for Alert, Gruvebadet, and Thule. For Utqiaġvik/Barrow, the period is 2008–2014.

Results and discussion

Seasonal cycle and source regions of Pan-Arctic MSA

The seasonal cycle of particulate MSA at each of the four Arctic stations (locations are displayed in Fig. 1a) and for all stations combined, is displayed in Fig. 1b. For a description of each station see the Methods section. For all stations, MSA increases beginning in April and decreases in September, except for Alert which experiences increases beginning in March and decreases beginning in October. This period (April-September) corresponds to polar day, receding sea ice, increase in atmospheric oxidants as well as phytoplankton blooms2,28,86,87. During October-March, MSA is at near-zero concentrations, mainly due to polar night in the Arctic when little biological activity and a minimum of photochemical activity occur26,88. Combining these geographically dispersed stations with their varying seasonal cycles allows us to understand the Pan-Arctic seasonal cycle of MSA as represented by the median of data from all stations (Fig. 1b). Pan-Arctic MSA shows a sharp increase from April to May, stays elevated from May to July (11–13 ng m−3), and gradually decreases until September. This period of elevated MSA during polar day corresponds to enhanced solar radiation and marine biological activity in the Arctic25,26,88. The seasonal cycles presented in this study agree with previous studies which analyzed different periods8,42,47,48. The Pan-Arctic seasonal cycle shows variations from April through September; therefore, we limit our analysis to these months.

The Pan-Arctic seasonal cycle can be attributed to the different geographical locations of each station, as well as varying transport patterns, in relation to biologically active regions of open water. To help contextualize the Pan-Arctic MSA seasonal cycle, the transport climatology was explored for all stations combined for April through September (Fig. 2a). A noticeable feature is the greater geographical extent of air masses during April and September compared to May through August. This likely reflects the extent of the polar dome89,90, which allows for efficient transport from lower latitudes, whereas during May through August air masses are mainly confined to marine areas in the Arctic91. It should be noted that the polar dome begins to recede during May; May is therefore often classified as a transition month. Overall, during the peak summer months, air masses mainly arrive from the Greenland/Barents Seas, Baffin Bay/Canadian Archipelago, the Bering Strait/Sea, and the central Arctic Ocean (Fig. 2a).

Fig. 2: Monthly Pan-Arctic transport pathways, MSA source regions, and model prediction/bias.

a Transport pathways for all stations expressed as the normalized residence time. b Potential source contribution function for MSA. c Concentration weighted trajectory of MSA using in situ observations. d Model prediction of the CWT MSA. e Model bias (difference between modeled and observed values) expressed as a percentage. Each row uses the same color scale on the right.

To elucidate the source regions of MSA between April and September, we applied a Potential Source Contribution Function (PSCF) and a Concentration Weighted Trajectory (CWT) analysis using in situ observations of MSA and FLEXPART (see Methods Section for details). Figure 2b and c display the Pan-Arctic PSCF and CWT, respectively. The PSCF shows the probability of a region being a potential source, but it does not show the contribution to observed concentrations that can be expected for a given air mass origin; we therefore complemented the PSCF with a CWT analysis (Fig. 2c). The PSCF for April shows two regions with a high probability of being source regions: (i) the area extending from the Greenland Sea to lower Baffin Bay, and (ii) continental Alaska. The first area is likely related to the biological activity there6, while the second likely represents transport pathways from the northeastern subarctic Pacific Ocean45,47,71, as observed from May to August (Fig. 2b). During September, the potential source regions of MSA are delocalized around the Arctic. Similar source regions of MSA are observed from May to August (Fig. 2b), which include the Greenland, Barents, Labrador, and Bering Seas. Overall, the marine areas of the Atlantic and Pacific sectors of the Arctic exhibit the highest probability of being an MSA source region during May through July. The CWT analysis supports the PSCF as it shows that the most probable source regions also exhibit the highest contributions to MSA. However, differences exist. For example, during May the PSCF shows that the Atlantic and Pacific sectors have similar probabilities of being source regions, while the CWT shows that the Atlantic sector has a higher relative contribution, highlighting the complementary nature of these two approaches. 
The most probable source regions of the Atlantic and Pacific sectors also make the largest contributions to MSA during the summer, which is likely due to a combination of relatively stronger biological activity and air mass transport from these regions.
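
As a minimal sketch of how the two metrics differ, the following toy calculation assumes the commonly used definitions: PSCF as the fraction of residence time in a grid cell that is associated with samples above a concentration threshold, and CWT as the residence-time-weighted mean concentration in that cell. The function name `pscf_and_cwt`, the threshold, the cell labels, and the input layout are illustrative assumptions and not the implementation used in this study (see Methods for the latter).

```python
from collections import defaultdict


def pscf_and_cwt(samples, threshold):
    """Toy PSCF and CWT on a labelled grid (hypothetical helper).

    samples: list of (concentration, {cell: residence_time}) pairs, one per
             observation and its backward-simulation footprint.
    threshold: concentration above which a sample counts as "high" for PSCF.
    Returns two dicts keyed by grid cell: (pscf, cwt).
    """
    total = defaultdict(float)     # residence time over all samples
    high = defaultdict(float)      # residence time for samples above threshold
    weighted = defaultdict(float)  # sum of concentration * residence time
    for conc, footprint in samples:
        for cell, tau in footprint.items():
            total[cell] += tau
            weighted[cell] += conc * tau
            if conc > threshold:
                high[cell] += tau
    pscf = {c: high[c] / total[c] for c in total if total[c] > 0}
    cwt = {c: weighted[c] / total[c] for c in total if total[c] > 0}
    return pscf, cwt


# Invented example values (ng m-3 and arbitrary residence-time units).
samples = [
    (20.0, {"greenland_sea": 3.0, "central_arctic": 1.0}),
    (2.0, {"central_arctic": 4.0}),
    (12.0, {"greenland_sea": 1.0, "bering_sea": 2.0}),
]
pscf, cwt = pscf_and_cwt(samples, threshold=10.0)
for cell in sorted(pscf):
    print(cell, round(pscf[cell], 2), round(cwt[cell], 2))
```

In this invented example, two cells share the same PSCF (both are only ever crossed by high-concentration samples) while their CWT values differ, which mirrors the complementary behavior of the two approaches described above.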

Elucidating the relationships between MSA and atmospheric variables

A transport climatology and source region analysis are useful, but they cannot elucidate the relationships between atmospheric variables and observed MSA concentrations. To explore these links, we developed an ML approach to model the trajectory-weighted concentration of MSA (or concentration-weighted trajectory, CWT). As input for the ML model, we derived the trajectory-weighted meteorological conditions (or meteorological-weighted trajectory, MWT) using FLEXPART and ERA5 (see Supplementary Fig. 1 for maps of the MWTs). MWTs consider the average meteorological conditions for the actual location of air masses (using the gridded ERA5 data) and are analogous to our target variable (CWT), thus providing appropriate explanatory variables for our specific ML task. Appropriate explanatory variables are important as the prediction efficacy of ML models largely depends on the appropriateness of the explanatory data provided (see Methods for a full description of MWTs and their calculation). The XGBoost library, which implements gradient-boosted decision trees, was selected among other competing models due to its computational efficiency, performance, and high generalization accuracy on many data modeling tasks92. The relevant variables were selected using recursive feature elimination (see Supplementary Fig. 3). The model was optimized using the Optuna93 library, which implements a Bayesian approach to hyperparameter tuning, and the SHAP approach was utilized to explain the model. See the Methods Section for further details about the steps in this methodology.
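
To illustrate the idea behind an MWT, the sketch below assumes it is computed analogously to a CWT: for each grid cell, the meteorological value sampled along each trajectory is averaged with the residence time as the weight. This weighting, the function name `meteorological_weighted_trajectory`, and the toy values are assumptions made for illustration; the actual calculation is described in the Methods.

```python
def meteorological_weighted_trajectory(footprints, met_fields):
    """Residence-time-weighted mean of a meteorological variable per cell.

    footprints[l][cell] : residence time of trajectory l in a grid cell.
    met_fields[l][cell] : value of the variable in that cell at the time
                          trajectory l passed through it (hypothetical layout).
    Returns {cell: weighted mean value}.
    """
    numerator, denominator = {}, {}
    for footprint, field in zip(footprints, met_fields):
        for cell, tau in footprint.items():
            numerator[cell] = numerator.get(cell, 0.0) + tau * field[cell]
            denominator[cell] = denominator.get(cell, 0.0) + tau
    return {c: numerator[c] / denominator[c]
            for c in denominator if denominator[c] > 0}


# Two invented trajectories crossing cells "a" and "b", temperature in K.
footprints = [{"a": 2.0, "b": 1.0}, {"a": 2.0}]
temperature_k = [{"a": 270.0, "b": 275.0}, {"a": 274.0}]
mwt = meteorological_weighted_trajectory(footprints, temperature_k)
print(mwt)
```

Because the explanatory variable is aggregated on the same grid and with the same weights as the target, each grid cell yields one matched pair of CWT and MWT values for model training.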

Before interpreting the model results, we evaluate its performance using spatially stratified groups for the training and test sets for all months combined (See Supplementary Fig. 4 for a schematic of the spatially stratified groups) as well as evaluation of the test set on a monthly basis using the slope, R2, and root mean squared error expressed as a percentage of the mean (RMSE %). Figure 3a and b display the results of this model evaluation. The training and test set evaluation shows overall good performance of our model, with a test set mean (±SD) slope and R2 of 0.82 ± 0.02 and 0.84 ± 0.02, respectively, compared to the training set slope and R2 of 0.91 ± 0.003 and 0.94 ± 0.002, respectively (Fig. 3a). These metrics indicate that our model has good performance (high R2/slope and low RMSE % values) and is not strongly overfitting the data (which would be characterized by very high training set metrics and a large difference between training and test set metrics with the latter usually being very poor). When evaluating our model on a monthly basis, we can see that May through August are better reproduced by the model than April and September. Supplementary Fig. 5 shows further investigation of the model performance with boxplots of the percent bias for deciles of the CWT MSA for each month. In this study, bias is defined as the difference between modeled and observed values. A common theme for the distribution of the percent bias in each decile is apparent: at lower values of CWT MSA, the model overpredicts, while at higher values of CWT MSA, the model underpredicts, leading to an overall underestimation of MSA by our model (Fig. 3a). This pattern is also observed when analyzing the spatial bias of our model. Figure 2d and e shows the predicted MSA source regions and the percent bias from the CWT values, which displays this concentration regime bias. 
It should be noted that the observed CWT MSA is likely biased by the stations being mainly located in the European and North American Arctic (Fig. 1a) and three of the stations experience peak MSA during May and the other in August. There are no long-term observations of MSA in the Siberian Arctic8 and while air masses arriving at these stations cover the entire Arctic area, the Eastern Arctic is likely under-represented in the CWT MSA dataset. Our model overestimates MSA during April/May in the Canadian Archipelago and Alaska, which is likely a result of the low concentrations observed at Utqiaġvik/Barrow during these months (Fig. 2d). These geographic and temporal variabilities should be kept in mind when interpreting the results. Overall, the spatial bias appears to be randomly distributed (with a slight influence from the geography and timing of MSA observations) and the model gives overall satisfactory results, indicating that the model is robust, accurate, and appropriate for the given modeling task.
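
The evaluation metrics above can be sketched as follows. The example assumes that the slope is the ordinary least-squares slope of modeled against observed values, that R2 is the squared Pearson correlation, and that percent bias is normalized by the observed value; these conventions, the function names, and the numbers are illustrative assumptions rather than the exact definitions used in this study.

```python
import math


def evaluate(observed, modeled):
    """Slope, R2, and RMSE as a percentage of the observed mean."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_m = sum(modeled) / n
    sxx = sum((o - mean_o) ** 2 for o in observed)
    syy = sum((m - mean_m) ** 2 for m in modeled)
    sxy = sum((o - mean_o) * (m - mean_m) for o, m in zip(observed, modeled))
    rmse = math.sqrt(sum((m - o) ** 2 for o, m in zip(observed, modeled)) / n)
    return {
        "slope": sxy / sxx,              # least-squares slope, modeled vs observed
        "r2": sxy ** 2 / (sxx * syy),    # squared Pearson correlation
        "rmse_pct": 100.0 * rmse / mean_o,
    }


def percent_bias(observed, modeled):
    """Bias (modeled minus observed) as a percentage of the observed value."""
    return [100.0 * (m - o) / o for o, m in zip(observed, modeled)]


# Invented values: overprediction at low and underprediction at high values.
observed = [2.0, 4.0, 6.0, 8.0]
modeled = [3.0, 4.5, 6.0, 7.5]
metrics = evaluate(observed, modeled)
bias = percent_bias(observed, modeled)
print(metrics)
print(bias)
```

In this toy case the slope is below one even though the correlation is perfect, which is the signature of the concentration-regime bias described above: positive percent bias at low values and negative percent bias at high values.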

Fig. 3: Evaluation metrics for the model performance.

a Training and test set metrics and b monthly metrics for the test sets. The color in each cell indicates the highest value for that row. The upper number in each panel represents the mean and the numbers in brackets represent the standard deviation.

To investigate the drivers of MSA using our model, we used the SHAP methodology which quantifies a variable’s contribution to the model output for a given observation85, whereby variables with positive (negative) SHAP values are actively increasing (decreasing) the model prediction (see Methods for further details). Figure 4 shows the overall variable importance while Supplementary Fig. 6 shows the variable importance on a monthly basis. The overall (or global) variable importance is represented by the median of the absolute SHAP values, with larger values having a greater impact on the model output. It should be noted that the SHAP values do not represent how well these explanatory variables explain the behavior of our target variable in the natural environment but how well these variables explain the behavior of our target variable in our model, therefore representing purely statistical relationships. Given these explanatory variables are environmental data, some correlations between them are unavoidable (Table S3), this can lead to difficulties when explaining the model using the SHAP methodology94. Therefore, to avoid over-interpreting the model results due to correlated variables, we refrain from discussing the rankings of individual explanatory variables but rather discuss which explanatory variables can represent certain processes in the model. From Fig. 4, we can see that the variables with the highest impact on the model are SW radiation, temperature, LW radiation, and low cloud cover, after which the variable importance is low with little variation for the remaining variables. These top four variables are all connected to either the gas- or aqueous-phase oxidation of DMS, which indicates that our model gives physically meaningful and interpretable results. It has been observed before that SW radiation is an important factor connected to MSA81,95,96. 
SW radiation is necessary for the photochemical production of oxidants (OH, O3, halogens)97, both in the gas- and aqueous-phase, and for stimulating marine biological activity98. SW radiation is therefore expected to have a positive impact on MSA concentrations, indeed MSA is mainly observed during the sunlit, summer months8,47. Temperature affects DMS emission and oxidation to MSA through multiple different mechanisms. Importantly, temperature will affect the sea surface temperature and sea ice extent/concentration, with positive temperatures increasing the DMS emission and reducing sea ice cover26,99,100,101. Temperature would also affect the oxidation of DMS after emission. The gas-phase DMS oxidation produces higher yields of MSA at colder temperatures19,102. The thermodynamic phase of clouds will also be affected by temperature, with positive temperatures indicating the presence of mostly liquid-phase clouds and negative temperatures suggesting mixed- or ice-phase clouds, although this relationship is non-linear103. The thermodynamic phase of clouds would affect the aqueous-phase oxidation of DMS19,36,37,38, with liquid-containing clouds being more efficient than ice-containing clouds. Overall, it appears temperature will have a positive relationship with DMS emission and oxidation to MSA. LW radiation is used here as a proxy for the presence of liquid-containing clouds as liquid cloud water will enhance the downwelling LW radiation compared to clear sky conditions104. While low-level liquid-containing clouds will affect the downwelling LW radiation, other factors including temperature and liquid water content will also have an effect, this should be kept in mind when interpreting the results. Low cloud cover indicates the fraction of low-altitude clouds present although it does not specify the phase (which has implications for the aqueous-phase oxidation of DMS and LW downwelling radiation). 
The presence of low-level clouds has been demonstrated to aid in the aqueous-phase processing of DMS39,50, although low-level clouds can also reduce the amount of incoming SW radiation. LW radiation and low cloud cover therefore serve as proxies for aqueous-phase processes and are both expected to have an overall positive impact on MSA. These top four variables are also consistently amongst the most important variables on a monthly basis (Supplementary Fig. 6); we therefore focus our further discussion on them.
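
For readers unfamiliar with the quantity behind SHAP, the sketch below computes exact Shapley values by brute-force enumeration for a two-variable toy model and then summarizes the global importance as the median of the absolute values, as done for Fig. 4. The toy model, its coefficients, the single baseline point, and all function names are hypothetical; the SHAP library uses far more efficient algorithms for tree ensembles, and this is not the code used in this study.

```python
from itertools import combinations
from math import factorial
from statistics import median


def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline)."""
    n = len(x)

    def value(subset):
        # Features in `subset` take their observed value, others the baseline.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        contribution = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for combo in combinations(others, size):
                s = set(combo)
                contribution += weight * (value(s | {i}) - value(s))
        phi.append(contribution)
    return phi


def toy_model(z):
    """Hypothetical model: linear in SW radiation, step at 273 K."""
    sw, temperature = z
    return 0.02 * sw + (5.0 if temperature > 273.0 else 0.0)


baseline = [100.0, 270.0]
observations = [[300.0, 275.0], [50.0, 268.0], [200.0, 274.0]]
shap_rows = [shapley_values(toy_model, obs, baseline) for obs in observations]

# Global importance: median of the absolute Shapley values per variable.
importance = [median(abs(row[i]) for row in shap_rows) for i in range(2)]
print(shap_rows)
print(importance)
```

For each observation, the Shapley values sum to the difference between the model output for that observation and for the baseline, which is why positive (negative) values can be read as increasing (decreasing) the prediction.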

Fig. 4: The overall importance of each explanatory variable.

The bars represent the median of the absolute SHAP value while the black lines represent the interquartile range.

To investigate how SHAP values vary with the values of the explanatory variables (i.e., the variables input into the ML model, the MWTs), we binned each explanatory variable into 10 equally spaced bins and calculated the median and interquartile range of the SHAP values for each bin and each month, as displayed in Fig. 5. Supplementary Fig. 7 shows the same figure but including all explanatory variables instead of only the top four. This allows us to discover how certain values of our explanatory variables affect the model output. We also show the geographic distribution of SHAP values for the top four variables and all variables in Fig. 6 and Supplementary Fig. 8, respectively.
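
The binning step can be sketched as follows, assuming equal-width bins spanning the observed range of the explanatory variable and the inclusive quartile method for the interquartile range. The function name `binned_shap_summary`, the quartile method, and the synthetic temperature and SHAP values are illustrative assumptions.

```python
from statistics import median, quantiles


def binned_shap_summary(values, shap_values, n_bins=10):
    """Median and quartiles of SHAP values in equal-width bins of `values`."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("need a non-zero value range to build bins")
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v, s in zip(values, shap_values):
        k = min(int((v - lo) / width), n_bins - 1)  # maximum goes in last bin
        bins[k].append(s)
    rows = []
    for k, members in enumerate(bins):
        row = {
            "midpoint": lo + (k + 0.5) * width,
            "rel_freq": len(members) / len(values),
            "median": None, "q1": None, "q3": None,  # stay None if bin is empty
        }
        if len(members) == 1:
            row["median"] = row["q1"] = row["q3"] = members[0]
        elif members:
            row["median"] = median(members)
            row["q1"], _, row["q3"] = quantiles(members, n=4, method="inclusive")
        rows.append(row)
    return rows


# Synthetic example: SHAP values switch sign at 273 K.
temperatures = [264.0 + i for i in range(20)]
shap_temperature = [-1.0 if t < 273.0 else 1.0 for t in temperatures]
rows = binned_shap_summary(temperatures, shap_temperature)
for row in rows:
    print(round(row["midpoint"], 2), row["median"], row["q1"], row["q3"])
```

Each row corresponds to one point on the red line in Fig. 5 (the median), with the quartiles giving the shaded range and `rel_freq` the gray bars.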

Fig. 5: Relationships between SHAP and MWT values for the top four explanatory variables.

The relationships between the explanatory and SHAP values for a SW surface radiation, b temperature, c LW surface radiation, and d low cloud cover for each month (columns). Ten equally spaced bins were calculated for each variable, and the median (red line) and IQR (blue shading) of the SHAP values were computed for each bin. The value listed on the x-axis is the midpoint of each bin. The light gray bars represent the relative frequency of the MWT values for each bin (right y-axis).

For SW radiation, a complex pattern emerges for the relationship between the explanatory and the SHAP values. During April, July, and August, a mostly linear, monotonic relationship is displayed, whilst during May and June a positive, non-monotonic relationship exists (Fig. 5a). During May and September, entirely positive and entirely negative relationships are observed, respectively. The overall relationship between explanatory and SHAP values of SW radiation supports the postulate that the photochemical oxidation of DMS or the production of key oxidants (OH, halogens, or O3, either formed within the aqueous phase or transferred into it afterward) is a key atmospheric process leading to the formation of MSA, which is plausible based on theory. Geographically, SW radiation shows ubiquitously positive SHAP values over the Arctic region during May through July and mainly negative SHAP values during April, August, and September (Fig. 6a), corresponding to the seasonal cycle of SW radiation from ERA5 (Supplementary Fig. 2).

Fig. 6: Spatial distribution of SHAP values for the top four explanatory variables.

The spatial distribution of SHAP values for a SW surface radiation, b temperature, c LW surface radiation, and d low cloud cover for each month (columns). Each row uses the same color scale on the right.

For temperature, during each month the overall relationship between the explanatory and SHAP temperature values is non-linear and there appears to be a threshold value of ~273 K, where afterwards SHAP values turn positive. Below this value, the average temperature SHAP values are all negative and mainly show a flat relationship with temperature values (Fig. 5b). This indicates that areas with above-freezing temperatures will make a positive contribution to the model output of MSA and areas with below-freezing temperatures will make a negative contribution. This is likely related to areas of open ocean with no sea ice, as melting sea ice regulates the overlying atmospheric temperature to around 273 K, thus areas with no sea ice can regularly experience above-freezing temperatures, whereas areas with melting sea ice in the summer experience temperatures around freezing. The absence of sea ice also allows for the emission of DMS into the atmosphere22,105. Song et al. 81 observed an opposite relationship between biogenic type aerosols (which were characterized by high fractions of MSA) and temperature, with a positive impact from temperature up until 273 K and a negative impact afterwards81, suggesting above-freezing temperatures would act to decrease MSA. This pattern was attributed to the role of disappearing sea ice with positive temperatures (which likely serves as a habitat for sympagic algae106) rather than the temperature-dependent kinetics of DMS oxidation. It should be noted that Song et al. 81 was only for a single year (2015) and used in situ measurements of temperature (representative of the conditions at the measurement location) as opposed to spatially-resolved temperature as in this study (representative of conditions in source regions and transport pathways), which would affect the interpretations. Studies from Utqiaġvik/Barrow show positive relationships between MSA and air temperature45. 
Above-freezing temperatures will also affect the sea surface temperature, the stability of the boundary layer, and the presence of liquid-containing clouds. While the gas-phase oxidation of DMS into MSA has been shown to proceed at faster rates under colder temperatures19,102, the lifetime of intermediates in the aqueous-phase oxidation of DMS into MSA is shorter at warmer temperatures107 and the aqueous-phase oxidation requires the presence of liquid water. Conversely, in areas with below-freezing temperatures, the presence of sea ice (physical barrier to the emission of DMS) as well as mixed-phase clouds (containing ice and supercooled water) is likely negatively affecting this relationship. The geographic distribution of temperature SHAP values closely follows the relationship between explanatory and SHAP values for temperature, with positive SHAP values in regions with positive temperatures (Fig. 6b), which are mainly located at more southerly latitudes. Explanatory and SHAP values for temperature reach their maximum in the marine areas of the Atlantic and Pacific sectors during June and July (Fig. 6b and Supplementary Fig. 1), highlighting the importance of these two regions to MSA formation.

For LW radiation, a different pattern is observed. During April, the contribution of LW radiation to the model output is mostly flat and negative but becomes positive at the highest values. Conversely, during May, a mostly positive contribution from LW radiation is observed which becomes negative at the higher values (Fig. 5c). During June through September, a U-shaped relationship is observed, with a positive contribution from LW radiation at the highest and lowest values, although the most frequent values of LW radiation make a negative contribution. This pattern also suggests that gas- and aqueous-phase processes are both contributing to the model output, with higher values of LW radiation indicating aqueous-phase oxidation and lower values reflecting gas-phase oxidation. Spatially, LW radiation SHAP values show contributions of little magnitude during April and September (Fig. 6c), with slightly positive values in the Atlantic and Pacific sectors and slightly negative over the central Arctic Ocean and continental regions. During May, LW radiation SHAP values are mainly positive geographically except for the Bering Sea region while during June through August, LW radiation SHAP values are positive over the marine areas of the Atlantic and Pacific sectors as well as over the Greenlandic continent (Fig. 6c).

For low cloud cover, the overall relationship is mainly linearly positive and there also appears to be a threshold (~0.7) for positive SHAP values (Fig. 5d). The highest SHAP values are found for the highest values of low cloud cover (which are also the most frequent) and at lower values of low cloud cover, the SHAP values appear to show an increasing tendency (Fig. 5d). This suggests that for low cloud cover both high and low values are contributing to a positive model output of MSA, which also indicates the dual processes of gas- and aqueous phase oxidation of DMS19 during clear and cloudy conditions, respectively, as highlighted for LW radiation. However, it should be noted that the variability of the response for small values of low cloud cover is large (Fig. 5d). For low cloud cover, positive SHAP values are mainly confined to marine areas while continental areas mainly show negative SHAP values. Low cloud cover SHAP values are consistently positive over the central Arctic Ocean, where the MWT is at its highest values (Supplementary Fig. 1). Low cloud cover SHAP values reach a maximum in the Bering Sea during June and in the Greenland/Barents Sea during July (Fig. 6d). The geographic distribution of low cloud cover SHAP values is similar to those of LW radiation, although with noticeable differences which highlights the complementary nature of these two variables as proxies for aqueous-phase processes. LW radiation SHAP values are negative over the central Arctic Ocean while low cloud cover’s SHAP values are mainly positive. This could indicate differences in the cloud phase for this region compared to the marine regions of the Atlantic and Pacific sectors. 
While low cloud cover indicates that clouds are present, the absence of elevated LW radiation values over the central Arctic Ocean suggests that these clouds are either ice- or mixed-phase clouds, which would emit less LW radiation downwards, as opposed to the more likely liquid-containing clouds in the more southerly regions (Supplementary Fig. 1).

The overall relationships suggest that on average there are two processes making positive contributions to the model output of MSA, gas- and aqueous-phase oxidation of DMS. The positive SHAP values at higher values of SW radiation and lower values of LW radiation suggest that clear sky conditions are also contributing to MSA production through the gas-phase oxidation of DMS. The temperature and low cloud cover threshold values for positive SHAP values as well as the positive SHAP values at higher values of LW radiation (and conversely the negative SHAP values at lower values of SW radiation) indicate that liquid-containing clouds in sufficient amounts are vital for MSA production (i.e., aqueous-phase oxidation of DMS). The model utilized in this study analyzes the geographic and temporal distributions of the explanatory variables based on monthly aggregates, therefore, any temporal information is seasonal and not instantaneous, hence the dual processes identified (gas-phase oxidation in clear conditions and aqueous-phase oxidation in cloudy conditions). This is supported by the relationships observed for other variables. Higher values of total column cloud liquid water and RH both show positive SHAP values (Supplementary Fig. 7), indicating aqueous-phase processes. It should be noted that both these variables are associated with higher uncertainty in ERA5108,109,110. Boundary layer height, wind speed, and mean sea level pressure all show indications of a U-shaped pattern for positive SHAP values (Supplementary Fig. 7). Low values of mean sea level pressure are associated with storms that bring cloudy conditions and high wind speeds (which also increases the boundary layer height and promotes vertical mixing in the ocean surface), these conditions are likely associated with the aqueous-phase processes of MSA production. 
High values of mean sea level pressure are associated with calm and sunny conditions with low wind speeds, which is likely related to the gas-phase processes contributing to MSA production. Peak MSA concentrations are observed in May (Fig. 1b), therefore, we examined the causes for this peak in more detail (Supplementary Note 1). The peak MSA concentrations in May can be linked to strong marine biological activity and meteorological processes which facilitate the efficient transport from southerly latitudes.

Overall, our model indicates that two main processes are affecting Pan-Arctic MSA concentrations: gas-phase oxidation in clear conditions (indicated through large and mainly positive SHAP values of SW radiation) and aqueous-phase oxidation in cloudy conditions (indicated through U-shaped pattern for LW radiation and low cloud cover values against their SHAP values) which is occurring mainly in the marine areas of the Atlantic and Pacific sectors with temperatures above freezing (threshold for positive temperature SHAP values above 273 K).

Future projections of Pan-Arctic MSA using trends in ERA5 data and machine learning

As discussed in the introduction, MSA concentrations have undergone significant trends, positive and negative, in various parts of the Arctic8,45,46,47,48 and it is plausible that MSA concentrations will change also in the future. Given the relevance of MSA for cloud formation in the Arctic, it is important to understand how the compound might evolve as environmental conditions change in the Arctic.

The ability of our model to represent the seasonal cycle of MSA concentrations and source regions with confidence allows us to project the future response of Pan-Arctic MSA based on past trends of ERA5. We took the entire available record (1979–2022) of the surface level ERA5 variables used in the ML model, aggregated it to monthly means, and performed a trend analysis using the Mann-Kendall/Theil-Sen methodology at a confidence level (CL) of 95% for each month separately (see Methods for further details). Monthly trends for each variable are displayed in Supplementary Fig. 9. The percentage of variables with statistically significant (SS) trends in each grid cell is displayed in Supplementary Fig. 10 which shows that during April and September, most variables are changing in the central Arctic Ocean and during the other summer months the SS trends are more geographically dispersed. Temperature and LW radiation display mainly positive trends during all months and the highest fraction of grid cells with SS trends, wind speed and mean sea level pressure show the lowest fraction of grid cells with SS trends, while all other variables display a mix of positive and negative trends (Supplementary Fig. 10). We project the slope of the trends for SS grid cells forward for different decadal intervals in the future (10 through 50 years in steps of 10 years) and add the total change over a given period to the original value. We assume only SS grid cells will change and the linear trends will continue at the same rate (see Supplementary Note 2 for discussion of limitations and assumptions of this methodology). We use these projected atmospheric conditions to model the future Pan-Arctic MSA, over different decadal intervals up to year 50, using the previously trained ML model. We have constrained these projected atmospheric conditions to physically realistic values (see Table S3 for details on the constraints). 
Overall, this resulted in a small fraction of data being imputed, which did not significantly affect the results but ensures our projected atmospheric conditions are physically reasonable.
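The trend-and-extrapolate procedure described above can be sketched for a single grid cell and month as follows. This is a minimal illustration in pure Python, not the implementation used in this study: the function names, the `baseline` argument (standing in for the "original value"), and the `lower`/`upper` bounds (standing in for the constraints of Table S3) are hypothetical, and the Mann-Kendall test here uses the normal approximation without a tie correction.

```python
import math
from itertools import combinations
from statistics import median


def theil_sen_slope(years, values):
    """Theil-Sen estimator: median of all pairwise slopes."""
    slopes = [
        (values[j] - values[i]) / (years[j] - years[i])
        for i, j in combinations(range(len(years)), 2)
    ]
    return median(slopes)


def mann_kendall_p(values):
    """Two-sided Mann-Kendall p-value (normal approximation, no tie correction)."""
    n = len(values)
    # S statistic: number of increasing pairs minus number of decreasing pairs.
    s = sum(
        (values[j] > values[i]) - (values[j] < values[i])
        for i, j in combinations(range(n), 2)
    )
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    return math.erfc(abs(z) / math.sqrt(2.0))


def project_value(baseline, years, values, horizon_years,
                  alpha=0.05, lower=None, upper=None):
    """Add slope * horizon to the baseline, but only for significant trends.

    `baseline`, `lower`, and `upper` are hypothetical inputs for illustration.
    """
    if mann_kendall_p(values) >= alpha:
        return baseline  # non-significant cells are assumed not to change
    projected = baseline + theil_sen_slope(years, values) * horizon_years
    if lower is not None:
        projected = max(projected, lower)
    if upper is not None:
        projected = min(projected, upper)
    return projected
```

Calling `project_value` with `horizon_years` set to 10, 20, 30, 40, and 50 would give the decadal intervals; repeating the call with the lower and upper confidence limits of the slope in place of the median slope would give the corresponding range.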

The median and interquartile range of the projected Pan-Arctic MSA concentrations for each month and decadal interval are displayed in Fig. 7a. The distributions show slight changes over the decadal intervals; therefore, we tested for SS differences between MSA concentrations for year 0 and each decadal interval. Depending on the properties of the data (normality or homogeneity of variance), either a two-tailed Welch’s or Student’s t-test was applied at the 95% CL. All of the distributions for each month and decadal interval are significantly different from year 0 (p < 0.05) except for year 10 in May (p = 0.59). The median and interquartile range of the absolute and relative changes in MSA concentrations for each month and decadal interval are displayed in Supplementary Fig. 11. It is evident that there are positive and negative changes in MSA over the entire Arctic. Therefore, we calculated the total relative changes in Pan-Arctic MSA for each decadal interval compared to year 0 (see Methods); the results are displayed in Fig. 7b. The bars and vertical lines represent the total relative changes using the slope and the lower/upper confidence intervals (CIs) from the Theil-Sen estimator, respectively, to project changes in MSA. Two patterns are readily evident from Fig. 7b: first, the slightly negative, non-monotonic changes during April and May, and second, the positive, monotonic changes during June through September.
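For readers less familiar with the two tests, the difference between them lies in how the standard error and degrees of freedom are formed. The sketch below computes only the test statistics and degrees of freedom (the p-value lookup from the t distribution is omitted); the function names are illustrative and not taken from the study's code.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    Does not assume equal variances in the two samples.
    """
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def student_t(a, b):
    """Student's t statistic with pooled variance (assumes equal variances)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / math.sqrt(pooled * (1.0 / na + 1.0 / nb))
    return t, na + nb - 2
```

With equal sample sizes the two statistics coincide and only the degrees of freedom differ, which is why the choice matters most when the variances of the year 0 and projected distributions are unequal.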

Fig. 7: Projected changes in the seasonal cycle of Pan-Arctic MSA.

Changes in the distribution of projected Pan-Arctic MSA values for each decadal interval are displayed in a. The middle line represents the median while the upper and lower limits of the box represent the 25th/75th percentiles. The total relative change in MSA expressed as a percent for each decadal interval compared to year 0 is displayed in b. The vertical lines for each bar represent the upper/lower confidence intervals (CIs) of the Theil-Sen slope used for extrapolating the trends in ERA5.

The two patterns presented in Fig. 7b point to a shift in the seasonal cycle of future projections of Pan-Arctic MSA, with small decreases during late spring/early summer and larger increases during mid to late summer. Currently, the Pan-Arctic seasonal cycle of MSA peaks during May and stays elevated until July (Fig. 1b), whereas our results suggest that the peak of the future seasonal cycle will broaden and that the peak month may shift from May to later in the summer. Interannual variation in MSA concentrations is not uncommon in the Arctic8,41,42,43,47. Here we project (over long time-scales on the order of decades) that peak Pan-Arctic MSA concentrations will likely undergo a systematic shift to later in the summer, although interannual variability is still observed (non-monotonic changes in April/May). The geographic distributions of the changes in MSA reveal that each month experiences both positive and negative changes depending on the geographic location (Supplementary Fig. 12). We focus on the geographic distributions of the year 50 projections as other decadal intervals produce similar patterns but with smaller magnitudes. During April, decreases can be observed over the Arctic Ocean and parts of Siberia while sporadic increases are mainly confined to the periphery of the Arctic; overall, the large area with decreases outweighs the sporadic increases. During May, strong decreases are observed in the Greenland, Barents, and Kara Seas as well as the Arctic Ocean, while increases are mainly observed around southern Greenland, the Bering Strait, and the Chukchi Sea. The strong decreases offset the increases to produce an overall decrease during May. During June and July, decreases are mainly confined to the Barents Sea and the northern Pacific Ocean while increases are evident in all other regions, except for the central Arctic Ocean, which shows little to no change. 
During August, decreases are mostly distributed around the marine areas while increases are observed mainly around the Canadian Archipelago and Siberian coast. The Canadian Archipelago is an area where our model overestimates MSA (Fig. 2e) and the increases in this region are likely an upper limit given the overestimation by our model in this region. The low CWT values and presence of model bias in these regions likely contribute to the large CIs for the projections of MSA in Fig. 7b during August. During September, increases are observed mainly in the Greenland Sea region. Interestingly, the regions around Svalbard are experiencing decreases in MSA during all months except for September and the regions around the Bering Strait are mainly experiencing increases during all months (Supplementary Fig. 12). The geographic distribution of the projected changes in MSA is a combination of the trends for each variable and their contribution to the model output in these regions. Overall, while the Pan-Arctic MSA concentrations are projected to change in the future, these changes are heterogeneously distributed geographically.

To elucidate which variables are responsible for the shift in the seasonal cycle, we devised a leave-one-out approach to calculate the contribution of each variable to the projected changes. To inspect the relationship between the trends in a variable and how said variable contributes to the projected changes, we classified each grid cell according to the direction of the trend and contribution. The leave-one-out and classification methodology are described in more detail in the Methods section. The results of the leave-one-out analysis and classification of the year 50 projections are displayed in Fig. 8 and Supplementary Fig. 15, respectively. The effects of the main variables and their trends are summarized in Table 1. During April, positive RH trends are contributing to the projected decrease in MSA. During May and June, increases in MSA due to positive temperature trends are counterbalanced by decreases in MSA due to positive LW radiation trends. During July and August, a more complex picture is presented. For instance, increases in MSA can be attributed to positive LW radiation and negative BLH trends, and decreases in MSA are due to negative SW radiation trends. During September, positive temperature trends are driving the projected increase in MSA, whereas negative low cloud cover trends are contributing to decreases in MSA. Overall, the contribution of a variable and the relationship to said variable’s trends to the projected changes in MSA is highly complex with different variables contributing in different ways in different months. The variables making the highest contribution to projected increases in MSA are temperature, LW radiation, and BLH, while RH, LW, and SW radiation are mainly contributing to the projected decreases in MSA. Of these only BLH and SW radiation show negative trends, highlighting the intricate balance between the (linear) trends of a variable and the (non-linear) relationships with said variable and MSA.
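The exact leave-one-out definition is given in the Methods and is not reproduced here. As a rough sketch of one plausible reading, assumed for illustration only, the projection can be repeated with a single variable held at its present value, and the difference from the full projection taken as that variable's contribution. The `model` callable, the variable names, and the toy numbers in the usage note are hypothetical.

```python
def leave_one_out(model, present, future):
    """Estimate each variable's contribution to a projected change.

    model   : callable mapping a dict of variable values to a prediction
    present : dict of present-day (year 0) variable values
    future  : dict of projected variable values

    Assumed definition: contribution = full projected change minus the
    change obtained when the variable is held at its present value.
    """
    base = model(present)
    full_change = model(future) - base
    contributions = {}
    for name in present:
        held = dict(future)
        held[name] = present[name]  # leave this variable's trend out
        contributions[name] = full_change - (model(held) - base)
    return full_change, contributions
```

With a toy linear model such as `lambda x: 2.0 * x["temperature"] - x["lw"]`, present values of 1.0 for both variables, and future values of 2.0 and 3.0, the full change is zero while the two contributions are +2.0 and -2.0. This mirrors the kind of counterbalancing described above for May and June, where a near-zero net change can hide large opposing contributions.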

Fig. 8: Explanatory variable contribution to the projected changes in the Pan-Arctic MSA seasonal cycle.

The total relative change in Pan-Arctic MSA as a result of the leave-one-out analysis for each explanatory variable and decadal interval. The colors represent the different explanatory variables and the hatching represents different decadal intervals.

Table 1 Summary of future projections of MSA, which main explanatory variables are contributing to these projections, and how

Atmospheric implications

Changes to the MSA seasonal cycle, as projected by our model, will have an effect on the properties of secondary particles in the Arctic which, in turn, will have direct and indirect implications for the radiative balance. Changes in MSA will affect the physical properties of aerosols, namely mass and volume, which would be especially meaningful during summer and early autumn when total aerosol mass/volume concentrations are at a minimum8,91. The increase in mass/volume will also increase the scattering efficiency, thus having direct implications on the radiative balance. This will also depend on future changes to the underlying surface properties (snow-covered vs bare land or sea ice vs open ocean in a future environment). The changes in MSA concentrations will also likely indirectly affect the aerosol number concentration and therefore CCN concentrations. The additional particulate MSA could affect the number concentration by partitioning into the gas-phase following aqueous-phase formation and droplet evaporation thus providing an additional precursor (or condensable) vapor source. The additional gas-phase MSA could aid in the growth of newly formed particles and possibly participate in nucleation54,55, although this will also depend on changes in sulfuric acid111, iodic acid56,112, and other constituents (e.g., reduced nitrogen compounds and organic acids)113. The number of particles arising from the additional gas-phase MSA that could potentially reach CCN sizes will impact cloud microphysical properties and thus the radiative balance114. This also applies to particles whose hygroscopicity changes from changes in MSA115, as more hygroscopic particles (from additional or less MSA depending on the chemical composition) will more effectively form clouds. 
These effects would be most pronounced during summer when the sub-micron organic aerosol fraction (which is less hygroscopic) is high44,116,117,118,119, sea salt aerosol is low120, the condensation sink (CS) is lowest, and the cloud regime is CCN-sensitive67. This would also depend on the meteorological conditions (updraught velocities and cloud cover)69 and pre-existing aerosol load in a future environment. Clouds mainly reflect solar radiation during the summer (due to decreases in the underlying surface albedo) and our model projects the largest relative increase in MSA during June and July, therefore, the changes in MSA would likely result in additional negative radiative forcing thus cooling the surface. The changes in MSA could also affect the fate of aerosols. The increase in CCN active particles (either through growth to CCN sizes or changing hygroscopicity in existing particles) will affect the wet removal of the aerosol population. More numerous or more active CCN particles might be preferentially removed via precipitation, therefore even though the source strength is increasing the overall effect on the atmospheric burden might be decreasing, highlighting the complex nature of projecting future changes in aerosol properties9,16,18.

Two important aspects of the environmental system that are pertinent to MSA burdens but are missing from our model are precursor sources (DMS and oxidants) and MSA sinks (wet and dry removal). Studies have shown that in recent decades DMS emission, net primary production, and chlorophyll-α are all increasing in the Arctic6,101,121,122, which could be attributed to a combination of increased open water and nutrient input. Changes in the Arctic ecosystem (shallowing of the oceanic mixed layer depth, increases in sea surface temperature, and acidification)105 will also likely affect the emission of DMS. Geographically, ice break-up, chlorophyll-α, and polar spring blooms are all moving northward123. Biogeochemical modeling studies suggest that under enhanced CO2 levels, DMS emissions will decrease annually overall although with vernal decreases related to increased oceanic mixed layer depth and summer increases due to decreases in sea ice extent124. These studies suggest that not only are sources of DMS strengthening but they are also moving increasingly northward which will positively affect the overall MSA production. Trends in atmospheric oxidants relevant for MSA production indicate that in the Arctic, ozone125,126 and bromine monoxide127 are increasing also. A recent modeling study showed bromine levels are expected to increase in the future due to an increase in first-year sea ice and sea salt aerosol128. Stevenson et al. 129 showed that OH levels are increasing globally and in the Arctic. Together these studies advocate for increasing emission of DMS and atmospheric oxidant levels in the future, both of which will positively affect MSA production. Changes in future precipitation patterns will also affect future MSA concentrations, by physically removing aerosols from the atmosphere but also by lowering the CS for vapors. 
Using accumulated precipitation along back-trajectories, studies have shown that accumulated precipitation is increasing at two High Arctic sites, mainly during autumn80,130. Interestingly, modeling studies have shown precipitation to increase substantially, especially during summer131,132,133,134. These studies suggest wet removal will become more important in a future Arctic climate and will likely affect our projections of MSA.

Anthropogenic activities (i.e., shipping, tourism, and resource extraction) will also undergo future changes135,136, with predicted increases due to sea ice retreat137,138 and will have an effect on future aerosol and cloud properties139. Increases in anthropogenic activity will likely result in increased emissions of nitrogen oxides (NOx), volatile organic compounds (VOCs), BC, CO, sulfur oxides (SOx), greenhouse gases, and production of O3140. Increased atmospheric burdens of BC and organic aerosol (from VOC oxidation and condensation) from Arctic anthropogenic activities would increase the CS. This would suggest more condensation of MSA onto pre-existing aerosol surfaces rather than contributing to growth of newly formed particles. This would depend on the chemical composition of the coating material and aerosol physicochemical properties65,66. Indeed increased SOx emissions (and thus increased sulfate levels) would increase aerosol acidity thus affecting the partitioning of MSA to the gas-phase following aqueous-phase formation114. Increased levels of NOx and O3 could alter the branching pathways of DMS oxidation to produce more SO2 instead of MSA19,37,96. These changes in the aerosol burden due to anthropogenic activities will in turn affect the Arctic environment, such as increased sea ice/snow melt due to increased BC deposition as well as altered nutrient cycling due to inorganic aerosol deposition. The expected increase of Arctic anthropogenic emissions of CO2 and BC (and reduced emissions of SOx due to regulations) will lead to positive radiative forcing140. These increased emissions will likely have the most impact on local air quality in regions experiencing expanded anthropogenic activities (shipping lanes and areas in proximity to industrial activities) and their overall impact on the aerosol population might be small and will likely have negligible impacts on the radiative budget138. 
Overall, the increased anthropogenic activity in the Arctic will affect natural aerosol burdens and dynamics, however, how exactly remains elusive.

This analysis shows that the seasonal cycle of the geographic source regions of MSA can be accurately modeled using the average atmospheric conditions and ML, highlighting the potential use of this methodology to be applied to other natural components of the Arctic atmospheric system, i.e., sea salt aerosol and gas-phase DMS. We also provide guidance as to which atmospheric variables are most important for explaining variations in MSA and how, thus emphasizing the need for an accurate representation of the conditions affecting gas- and aqueous-phase oxidation of DMS in chemical transport models. We also show, based only on changes in atmospheric conditions and not changes in oceanic and biological factors affecting DMS emission/oxidation, that MSA is projected to undergo a shift in the seasonal cycle. Other models and observations have shown that these oceanic and biological factors (sea ice extent/concentration, chlorophyll-α, and DMS emissions) are changing over these time ranges, however, precipitation patterns and intensity are also changing. This suggests that changes in oceanic variables will act in conjunction with changes in atmospheric variables to increase Pan-Arctic MSA concentrations although precipitation will likely affect these changes. The competition between source and sink strength will ultimately determine how aerosol components change in the future. How the Arctic environment responds to future climate change (including atmospheric and oceanic conditions) remains an open research question and provides an avenue for future research.

Methods

In situ aerosol observations

In situ filter samples of methanesulfonic acid (MSA) were collected at four Arctic stations (Alert, Gruvebadet, Thule, and Utqiaġvik/Barrow). For Alert, Gruvebadet, and Thule, the sampling period covered 2010–2017, while for Utqiaġvik/Barrow, it covered 2008–2014. The period 2010–2017 was selected since MSA measurements were available at most stations from April through September. Although MSA observations were available at Utqiaġvik/Barrow for 2015–2017, the sampling frequency changed relative to the previous years (2008–2014) and the data coverage is sparse45. Therefore, 2008–2014 was selected as the period for Utqiaġvik/Barrow due to a consistent sampling frequency and adequate data coverage. Figure 1 displays the location of each station and details about each station are given below. While there are differences in sampling (different inlet and temporal resolution) and analysis (different ion chromatographs) at each station, these are not expected to drastically affect the concentration levels reported from each station. Ion chromatography analysis by two different laboratories for samples from Alert in 2018 showed good agreement118 and in general ion chromatography displays reproducible results141. Therefore, we can be confident that the reported MSA concentrations from different stations are comparable.

Alert, located on Ellesmere Island in Nunavut, Canada (82.5° N, 62.4° W, 210 m above sea level (asl)) is the most northern land-based research station in the High Arctic and has been operational since 1980142. Aerosols, with no upper size limits, were sampled onto filters (20 × 25 cm Whatman 41) using a high-volume sampler with a duration of seven days. Filters were shipped from Alert to Toronto for analysis. MSA was extracted from one-eighth of the filters using 12 mL of deionized water and sonication for 1 h (>97% extraction efficiency)143. Samples were left at room temperature for 48 h before analysis by ion chromatography, with a detection limit ranging from 0.03 to 0.4 ng m−3 47,48. MSA was quantified using a Dionex 4500i ion chromatograph with a 200 µL injection loop, AS4A column, conductivity detector, and an eluent of 5 mM Na2B4O7 at 2.0 mL min−1. In-between analyses, the column was flushed with 28 mM Na2B4O7 and then re-equilibrated with 5 mM Na2B4O7. A micromembrane suppressor (H2SO4) was included to reduce the baseline conductivity and therefore background noise. The step function for the eluent concentration during analysis was incorporated to avoid interferences from the later elution of stronger anions143. The analytical precision and accuracy are listed at 5 and 2%, respectively, while the uncertainty from random errors during field sampling and laboratory handling/analysis is estimated to be <13%143,144.

Gruvebadet Observatory, located in Ny-Ålesund, Norway on the Svalbard Archipelago (78.9° N, 11.9° E, 50 m asl), has been sampling aerosol chemical composition since 2010 from March to September. The time resolution of each sample was 24 h. Main wind directions indicate that the station is not influenced by local anthropogenic emission sources which are a few kilometers away145.

The Thule High Arctic Atmospheric Observatory (THAAO) is located in Thule, Greenland (76.5° N, 68.8° W, 220 m asl) off the coast of Baffin Bay. The time resolution of each sample was 48 h. The dominant wind directions indicate that local anthropogenic emission sources, which are located a few kilometers away, are not influencing the station146.

At Gruvebadet and Thule, aerosols with a diameter of less than 10 micrometers (PM10) were collected on 47 mm diameter Teflon filters (2 μm nominal porosity and 99% capture efficiency for 0.3 μm diameter particles) using a TECORA Skypost sequential sampler at a flow rate of 2.3 m3 h−1 according to UNI-EN123442. The filters were placed in plastic Petri dishes, frozen, and shipped to Italy for extraction and analysis. Filter quarters were extracted using 10 mL of Milli-Q ultrapure water (>18 MΩ) in an ultrasonic bath for 20 min. The complete ionic content of the extracts was determined by using 3 chromatographic systems: one system for cations, one for inorganic anions and oxalate, and one for fluoride, MSA, and other low molecular weight organic acids (e.g., acetate, glycolate, propionate, formate, and pyruvate). MSA concentrations were determined by injecting 1 mL, using a Gilson 222 autosampler, into a Dionex Thermo-Fischer DX600 ion chromatograph utilizing a Thermo-Fischer Dionex TAC-2 pre-concentration column (50 mL dead-volume) and a Thermo-Fischer Dionex AS11 separation column using a gradient elution of Na2B4O7 solution from 0.075 mM to 2.5 mM as well as electrochemical suppression. This chromatographic method allows the complete separation of MSA from the nearest peak of pyruvate (peak resolution = 0.9). Prior to each analysis, a cleaning step using a 45 mM Na2B4O7 solution was implemented to ensure reproducible results147. The detection limit was 0.1 µg L−1 147, and field blanks were always below this limit. The reproducibility of the MSA analysis is better than 5% (<5% relative standard deviation)41,42.

The Barrow Atmospheric Baseline Observatory site is located near the town of Utqiaġvik, Alaska (referred to as Utqiaġvik/Barrow) (71° N, 156.6° W, 10 m asl) and is part of the National Oceanic and Atmospheric Administration (NOAA) Earth System Research Laboratory (ESRL) Global Monitoring Division (GMD). Real-time wind direction was used to restrict sampling to the clean air sector (0°–129°) and avoid influence from local pollution148. Aerosols were collected using a Berner-type multi-jet cascade impactor with aerodynamic D50 cutoff diameters of 1 and 10 µm (only submicron chemical composition data was used in this study). Sampling durations of 1 day (43 m3) to 5 days (172 m3) were used depending on the time of year. After collection, samples were sealed in tubes and shipped to NOAA’s Pacific Marine Environmental Laboratory (PMEL) for chemical analysis by ion chromatography (Metrohm Compact IC 761). Filters were extracted by wetting with 1 mL of spectral grade methanol followed by an additional 5 mL of distilled deionized water and sonicating for 30 min. MSA was analyzed using a Phenomenex Star Ion™ A300 anion column in front of a Metrosep ASUPP5 250/4.0 column with a 1.0 mM NaHCO3 and 5.0 mM Na2CO3 eluent and a 70 mM H2SO4 regenerant. The column and eluent combination ensures that MSA is completely resolved from other organic acids (e.g., pyruvic acid). The detection limit was 0.001 mg L−1 45. Concentrations of MSA are reported at 0 °C and 1013 mbar. The relative uncertainty including errors arising from the ion chromatography analysis, blanks, liquid extract volume, and volume of sampled air is ±11% (95% confidence level)149,150,151.

ERA5

ERA5 is the fifth-generation atmospheric reanalysis product from ECMWF84. The Integrated Forecast System (IFS) cycle 41r2 is the numerical model forming the basis of ERA5. ERA5 is an improvement on the previous reanalysis from ECMWF (ERA-Interim), with improved spatial and temporal resolution, and is considered the best guess of the state of the atmosphere. ERA5 is available for the entire globe at hourly resolution, although in this study ERA5 output for north of 45° N and every third hour was used to match the geographical extent and temporal resolution of the FLEXPART output. The surface-level (or single-level) ERA5 dataset on a 0.5° × 0.5° regular latitude-longitude grid with a spatial resolution of ~31 km was utilized. Relative humidity was calculated using 2 m air temperature and dew point temperature following the method of Pernov et al. 110. Here we use ERA5 data from 1 April to 30 September for 2008–2017 for the MWTs and data from 1979–2022 for the trend analysis. Data was obtained from the ECMWF Climate Data Store (CDS) (https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysis-era5-single-levels?tab=overview, last accessed November 24, 2022). For a complete list of the variables used in this study see Table S1. While ERA5 is one of the best-performing reanalysis datasets available152, it is not without its limitations, including non-physical variability and trends due to changes in the observing system (although this mainly affects high altitudes), shifts in tropospheric oceanic humidity due to the incorporation of microwave imager data (only present in the 1990s), and a warm bias of 5 K in the surface temperature over Arctic sea ice (mainly present during winter)84,153,154. Recently, ERA5 surface level variables were compared against continental ground-based stations spanning at least one decade for most sites. 
Overall ERA5 performed well for temperature, solar radiation, and pressure although less so for relative humidity and wind speed/direction110. These limitations will not affect our results or our interpretations.

FLEXPART

Simulations of gridded air mass residence times were performed with the Lagrangian particle dispersion model FLEXPART v9.183 to determine the source origins and transport pathways. The model was driven with meteorological data from the ERA5 reanalysis at 0.5° × 0.5° resolution and 137 vertical levels, available every third hour. A release of 50,000 tracer particles was initialized every three hours at the location of the atmospheric observatories and traced backward in time for up to 10 days. For Alert, Thule, and Utqiaġvik/Barrow, particles were released at 10 m above ground level (agl). For Gruvebadet, to account for the complex topography, particles were released within the altitude range of 10–100 m agl. The simulations were computed for a passive air tracer without removal processes. The main output from FLEXPART consists of 3-dimensional fields of residence time in units of seconds. Lagrangian dispersion models have been shown to be more representative than Eulerian or semi-Lagrangian models, but they are not without limitations, including simplification of atmospheric processes (deposition, advection, and diffusion)83, limitations in the vertical resolution155, and choice of input meteorological data156. We calculated a Pan-Arctic transport climatology, expressed as a normalized residence time, for each month separately. For every FLEXPART release during a given month, the residence time in each grid cell was summed below an altitude of 1000 m and for a transport length of seven days backward in time. This total residence time was then divided by the grid cell area, normalized by the sum over all grid cells in the geographical domain (>50° N), and expressed as a percentage to give the normalized residence time. Because FLEXPART residence times are in seconds, dividing by the grid cell area gives units of s km−2; this accounts for the decrease in grid cell area toward the pole, so the normalized residence time allows for a meaningful comparison across all latitudes.
Similar approaches have been utilized in the Arctic before to display transport patterns80,91,157. We selected a transport length of 7 days to capture the timeframe of DMS emission and oxidation to MSA, and a maximum altitude of 1000 m to capture transport pathways not only in the boundary layer but also in the lower free troposphere. We performed a sensitivity analysis for these selections by varying the maximum altitude (200, 500, and 1000 m) and transport length (5, 7, and 10 days). Overall, extending the transport length and maximum altitude expanded the trajectory footprint, although the general interpretations of the transport climatologies, potential source contribution function, and concentration-weighted trajectory remained unchanged.
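
As a rough illustration of the normalization described above, the following Python sketch sums residence times over releases, divides by grid cell area, and expresses the result as a percentage of the domain total. The array shapes, the spherical-Earth area formula, the random input data, and all function names are assumptions made for this example; they are not taken from the FLEXPART output format or from the authors' code.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # assumption: spherical Earth


def grid_cell_area_km2(lat_edges_deg, lon_edges_deg):
    """Area (km^2) of each cell of a regular lat-lon grid, shape (n_lat, n_lon)."""
    lat = np.deg2rad(np.asarray(lat_edges_deg, dtype=float))
    lon = np.deg2rad(np.asarray(lon_edges_deg, dtype=float))
    band = np.sin(lat[1:]) - np.sin(lat[:-1])  # shrinks toward the pole
    width = lon[1:] - lon[:-1]
    return EARTH_RADIUS_KM**2 * np.outer(band, width)


def normalized_residence_time(residence_s, area_km2):
    """residence_s: (n_releases, n_lat, n_lon) in seconds, assumed to be
    already restricted to <1000 m altitude and 7 days backward in time."""
    total = residence_s.sum(axis=0)          # s, summed over all releases
    density = total / area_km2               # s km^-2
    return 100.0 * density / density.sum()   # percent of the domain total


# Hypothetical 0.5-degree grid north of 50 N with random residence times.
lat_edges = np.linspace(50.0, 90.0, 81)
lon_edges = np.linspace(-180.0, 180.0, 721)
area = grid_cell_area_km2(lat_edges, lon_edges)
rng = np.random.default_rng(0)
residence = rng.random((8, 80, 720)) * 3600.0
nrt = normalized_residence_time(residence, area)
```

By construction the normalized field sums to 100% over the domain, so months with different numbers of releases can be compared directly.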

Potential source contribution function

The Potential Source Contribution Function (PSCF) was used to identify the probability of MSA source regions at the different stations158,159,160. It should be noted that the PSCF represents the probability of potential source regions and preferred transport pathways for an atmospheric species at a receptor site; it is a non-quantitative source analysis and as such is limited in distinguishing strong from moderate sources160. The 75th percentile of MSA in a given month was used as the threshold value, as is common in the literature79,161,162,163. Due to the long sampling times of the filter packs, all trajectories occurring during a sample above the threshold were used in the PSCF calculation. The sum of the residence time for samples above the 75th percentile was normalized by the sum of the residence time for all periods when MSA observations were available, according to Eq. (1).

$${{PSCF}}_{i,j}=\frac{{m}_{i,j}}{{n}_{i,j}}$$
(1)

Where nij is the total residence time in grid cell ij and mij is the total residence time when the concentration was above the 75th percentile. The weighting scheme (W) adapted from Bressi et al. 164 was used to down-weight grid cells with low residence times.

$${W}_{i,j}=\begin{cases}0.75, & 0.75\times \max \left(\log \left({n}_{i,j}+1\right)\right) < {n}_{i,j} < \max \left(\log \left({n}_{i,j}+1\right)\right)\\ 0.5, & 0.5\times \max \left(\log \left({n}_{i,j}+1\right)\right) < {n}_{i,j} < 0.75\times \max \left(\log \left({n}_{i,j}+1\right)\right)\\ 0.15, & 0.25\times \max \left(\log \left({n}_{i,j}+1\right)\right) < {n}_{i,j} < 0.5\times \max \left(\log \left({n}_{i,j}+1\right)\right)\\ 0.05, & {n}_{i,j} < 0.25\times \max \left(\log \left({n}_{i,j}+1\right)\right)\end{cases}$$
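
A minimal Python sketch of Eq. (1) combined with the weighting scheme is given below for illustration. It rests on several assumptions that are not stated in the text: the quantity compared against the thresholds is taken to be log(n + 1), interval boundaries are treated as inclusive on the lower side, and grid cells with zero residence time are returned as NaN. The function names and the small example arrays are hypothetical.

```python
import numpy as np


def pscf_weights(n):
    """Step weights based on log-scaled total residence time (assumed reading)."""
    x = np.log(n + 1.0)
    xmax = x.max()
    w = np.full(n.shape, 0.05)
    w[x >= 0.25 * xmax] = 0.15
    w[x >= 0.50 * xmax] = 0.5
    w[x >= 0.75 * xmax] = 0.75
    return w


def weighted_pscf(m, n):
    """m: residence time above the 75th percentile; n: total residence time."""
    pscf = np.divide(m, n, out=np.full(n.shape, np.nan), where=n > 0)
    return pscf * pscf_weights(n)


# Hypothetical 2 x 2 grid of residence times (seconds).
n_total = np.array([[0.0, 10.0], [100.0, 1000.0]])
m_high = np.array([[0.0, 5.0], [25.0, 1000.0]])
result = weighted_pscf(m_high, n_total)
```

In this toy case the cell with the longest residence time keeps a weight of 0.75, while the sparsely sampled cell with n = 10 is reduced from a raw PSCF of 0.5 to 0.075.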

Concentration weighted trajectory

The concentration-weighted trajectory (CWT) was used to estimate the source contribution of geographical regions to the observed MSA at the receptor sites and expresses the average contribution to the concentration expected if a trajectory passes over a given grid cell165,166. CWT is capable of determining the relative strength of potential sources and transport pathways by incorporating MSA concentrations at the receptor site with FLEXPART trajectories according to Eq. (2):

$${C}_{i,j}=\frac{{\sum }_{l=1}^{S}{C}_{l}{\tau }_{i,j,l}}{{\sum }_{l=1}^{S}{\tau }_{i,j,l}}$$
(2)

where Ci,j is the trajectory-weighted average concentration in grid cell ij, l is the index of the sample, S is the total number of samples, Cl is the concentration observed at the sampling location (receptor site) during sample l, and τi,j,l is the residence time in grid cell ij during sample l. The CWT was down-weighted in the same manner as the meteorological weighted trajectory (MWT), as described below.

Applying the weighting scheme used for the PSCF would reduce meteorological variables substantially, beyond realistic values, in grid cells that fall in the lowest level of the weighting scheme. That scheme is valid for variables that have a lower limit of zero, but for variables with a continuous distribution (e.g., temperature) it would result in unrealistic values. To overcome this, we applied a cutoff value of 0.25*max(log(nij + 1)), with grid cells below this value being replaced with NaN and grid cells above this value being kept as is. Different levels for this threshold were tested and were found not to substantially influence the results. We ultimately chose this value (i.e., the second level of the weighting scheme) as a compromise between including grid cells with a large enough residence time to be statistically robust and obtaining a large enough geographic footprint that represents the actual transport pathways.
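
The CWT of Eq. (2) with the NaN cutoff can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the cutoff is applied to log(n + 1), where n is the total residence time per grid cell, and the function name, array shapes, and example values are hypothetical.

```python
import numpy as np


def concentration_weighted_trajectory(conc, tau, cutoff_fraction=0.25):
    """conc: (S,) concentration per sample; tau: (S, n_lat, n_lon) residence time.

    Returns the residence-time-weighted mean concentration per grid cell,
    with low-residence cells set to NaN instead of being down-weighted.
    """
    conc = np.asarray(conc, dtype=float)
    tau = np.asarray(tau, dtype=float)
    n = tau.sum(axis=0)                               # total residence time
    weighted = np.tensordot(conc, tau, axes=(0, 0))   # sum_l C_l * tau_ijl
    cwt = np.divide(weighted, n, out=np.full(n.shape, np.nan), where=n > 0)
    log_n = np.log(n + 1.0)
    cwt[log_n < cutoff_fraction * log_n.max()] = np.nan  # assumed cutoff form
    return cwt


# Two hypothetical samples on a 1 x 3 grid.
conc = [1.0, 3.0]
tau = np.array([[[100.0, 0.0, 1.0]],
                [[300.0, 50.0, 0.0]]])
cwt = concentration_weighted_trajectory(conc, tau)
```

Because the cutoff masks cells rather than scaling them, the retained values stay within the range of the observed concentrations. The same weighted average applies to any per-sample quantity, not only concentrations.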

Meteorological weighted trajectory

The MWT was used to estimate the environmental conditions of the geographical regions traversed by air masses that influenced the receptor sites. The MWT can be thought of as the trajectory-weighted average meteorological conditions of air masses ending at the receptor site. The MWT was calculated using ERA5 reanalysis data and FLEXPART back trajectories. The ERA5 data were averaged using the arithmetic mean over the filter collection period as well as the preceding days corresponding to the trajectory length (7 days). This allows us to weight the average meteorological conditions by the air mass residence time, thus producing the trajectory-weighted meteorological conditions for each filter sample. Averaging over the filter collection period and preceding days was employed to resemble the situation for MSA collection, as conditions during filter collection and the preceding days of transport to the measurement site affect the measured concentration for a particular filter. The MWT closely reflects the monthly mean ERA5 values (Supplementary Figs. 1 and 2) but accounts for spatial variability in meteorological conditions and air mass transport pathways. The MWT also requires a cutoff value to reduce the influence of grid cells with low residence times (in which we have less statistical confidence); this cutoff is the same as that used in the CWT calculation. The MWT was calculated according to Eq. (3):

$${M}_{i,j}=\frac{{\sum }_{l=1}^{S}{M}_{l}{\tau }_{i,j,l}}{{\sum }_{l=1}^{S}{\tau }_{i,j,l}}$$
(3)

where Mi,j is the trajectory weighted average meteorological conditions in the grid cell ij, l is the index of the sample, S is the total number of samples, Ml is the mean meteorological conditions during sample l and preceding days, and τi,j,l is the residence time in grid cell ij during sample l.
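
The per-sample mean Ml can be illustrated with a short Python sketch that averages a 3-hourly series over the filter collection period plus the preceding 7 days. The function name, the inclusive window boundaries, and the synthetic data are assumptions made for this example. The function works for a scalar time series or for gridded fields with time as the first axis; which of the two the authors used is not specified here.

```python
import numpy as np


def sample_mean_with_lead(times, values, start, end, lead_days=7):
    """Arithmetic mean of `values` over [start - lead_days, end] (inclusive)."""
    times = np.asarray(times, dtype="datetime64[h]")
    window_start = np.datetime64(start, "h") - np.timedelta64(lead_days * 24, "h")
    mask = (times >= window_start) & (times <= np.datetime64(end, "h"))
    return np.asarray(values, dtype=float)[mask].mean(axis=0)


# Synthetic 3-hourly series whose value equals the number of days since 1 June.
times = np.arange(np.datetime64("2010-06-01T00"),
                  np.datetime64("2010-06-21T00"),
                  np.timedelta64(3, "h"))
values = (times - times[0]) / np.timedelta64(1, "D")

# Hypothetical filter sample collected from 10 June to 12 June.
m_l = sample_mean_with_lead(times, values, "2010-06-10T00", "2010-06-12T00")
```

Here the averaging window runs from 3 June to 12 June, so the mean of the synthetic series is 6.5 days. The resulting Ml values would then be weighted by residence time as in Eq. (3).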

Machine learning methodology

In this study, we utilize a supervised, regression form of ML. Here we describe the explanatory variables and target variables, variable selection, hyperparameter tuning, the machine learning model, and the explainability of the model.

The target variable was the CWT value of MSA for individual grid cells over all summer months. The explanatory variables (or features) were the MWT values of the ERA5 data for individual grid cells over all summer months. No additional feature engineering (e.g., standardization or normalization) was applied prior to modeling. All summer months were modeled together to better capture the seasonal cycle of the target and explanatory variables as well as the dynamic range of MSA values. Since the target is distributed geographically rather than temporally (i.e., it is not a time series), the variables and target were split into training and test sets based on their geographic distribution. The area north of 50° N was divided into 20 equally spaced groups of 18° each, and these groups were further divided into 4 equally spaced sub-groups of 4.5° each to obtain a 75/25 train/test split. Three sub-groups were used as a training set while the fourth was used as a test set; this process was repeated so that each sub-group served as the test set once. A schematic of the geographic distribution of train/test sets is depicted in Supplementary Fig. 4.
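
One way to realize such a split is sketched below in Python. It assumes that the 18° groups and 4.5° sub-groups are longitude sectors (20 × 18° = 360°), which the text implies but does not state explicitly; the function names and the example longitudes are hypothetical.

```python
import numpy as np


def longitude_fold(lon_deg, group_width=18.0, n_subgroups=4):
    """Fold index (0..n_subgroups-1) from the position within each longitude group."""
    lon = np.mod(np.asarray(lon_deg, dtype=float), 360.0)
    sub_width = group_width / n_subgroups
    return (np.mod(lon, group_width) // sub_width).astype(int)


def train_test_masks(lon_deg, test_fold):
    """Boolean masks: three sub-groups for training, one for testing."""
    test = longitude_fold(lon_deg) == test_fold
    return ~test, test


# Grid-cell longitudes on a hypothetical 0.5-degree grid.
lon = np.arange(-180.0, 180.0, 0.5)
folds = [train_test_masks(lon, k) for k in range(4)]
```

Each of the four folds holds out every fourth 4.5° sector, so the test cells are spread around the whole Arctic instead of being concentrated in one region.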

An initial set of 15 ERA5 variables that could plausibly contribute to atmospheric MSA production was selected as candidate explanatory variables (Supplementary Table 1). From this initial set, the final explanatory variables were selected using recursive feature elimination. Starting with 15 variables, one variable was eliminated per iteration. Elimination was based on the importance of each variable as assigned by the model, whereby the variable with the lowest importance was eliminated. This process was repeated until one variable remained. The optimal number of variables was evaluated by comparing the R2 scores from the training sets obtained with each number of variables. The R2 scores from the test sets were also calculated for comparison. The results of this analysis are displayed in Supplementary Fig. 3. Ultimately, nine variables were selected, because R2 scores decreased when fewer variables were used. The recursive feature elimination was implemented via the scikit-learn package (v1.0.2). All modeling was performed in the Python programming language (v3.10.2).
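
For illustration, recursive feature elimination with scikit-learn's `RFE` class can be sketched as follows. The data are synthetic, and scikit-learn's `GradientBoostingRegressor` is used here as a stand-in estimator so that the example depends on a single library; it is not the model or configuration used in this study.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import RFE

# Synthetic data: 15 candidate variables, only the first three are informative.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 15))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + X[:, 2] ** 2 + 0.1 * rng.normal(size=400)

# Stand-in estimator (assumption); any model exposing feature importances works.
estimator = GradientBoostingRegressor(n_estimators=50, max_depth=3, random_state=0)

# Drop the least important variable one at a time until nine remain.
selector = RFE(estimator, n_features_to_select=9, step=1).fit(X, y)
selected = np.flatnonzero(selector.support_)  # column indices of kept variables
ranking = selector.ranking_                   # 1 = kept, larger = dropped earlier
```

Repeating the fit for each candidate number of retained variables and recording the R2 score would give a curve comparable to the one used to choose the final number of variables.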

We tested ensemble and boosted (ensemble) methods, such as Random Forests, Adaptive Boosting, (extreme) gradient-boosted regression trees, and gradient-boosted generalized additive models, as implemented in several Python libraries (scikit-learn, AdaBoost, XGBoost, CatBoost, LightGBM, and InterpretML). All tested models produced similar validation metrics, although we found gradient-boosted regression trees as implemented in XGBoost to be the most computationally efficient approach. The XGBoost library implements gradient boosting of shallow decision trees. It works by sequentially training decision trees, each fitted to the residuals of the trees already in the model92. In this way, each new tree learns the errors left by the previous trees, reducing the regression loss function, and accurate predictions are obtained by passing data through the trained (shallow) trees sequentially. XGBoost implements a regularized model formalization to prevent overfitting and to improve computational efficiency. Finally, XGBoost is also able to handle collinearity amongst the explanatory variables, which is important for this study as we include several variables related to aqueous-phase processes. We use the XGBoost package (v1.6.2)85.

The XGBoost implementation of gradient-boosted trees requires hyperparameter tuning to ensure optimal performance. We used a Bayesian search approach to find the optimum configuration of hyperparameters, implemented using the Optuna93 library (v3.0.3). The tested hyperparameters, the range of values explored, and the optimum values are listed in Supplementary Table 2. The objective of the hyperparameter tuning procedure is to maximize the mean R2 score. Tuning was performed for 250 trials; preliminary tuning revealed a plateauing of the mean R2 score after ~150 trials, indicating that 250 trials are sufficient to ensure the optimal setting was selected. Hyperparameter values were sampled using the Tree-structured Parzen Estimator (TPE) algorithm167 and trials were pruned using the Hyperband pruner168. The final set of hyperparameters was selected as a compromise between overall performance (high R2 scores) and agreement between the training and test set evaluation metrics (prevention of overfitting).

We utilized SHapley Additive exPlanations (SHAP) values85, which are based on Shapley values169, to evaluate the impact of the explanatory variables on the model output. SHAP is a model-agnostic methodology designed to evaluate variable importance based on coalitional game theory170, where each explanatory variable is treated as a “player” in a “game” (model framework) and SHAP aims to evaluate each player’s contribution to the “payout” (model output). For each grid cell, the SHAP values represent a variable’s marginal contribution over the mean model output considering all possible combinations of explanatory variables. SHAP can provide both local and global explanations, as opposed to other commonly used variable importance methods (such as those provided by XGBoost itself: gain, split count, permutation importance) that only provide a measure of global importance171. The global importance for each variable is calculated as the median of the absolute SHAP values for said variable, which gives an overview of the most important variables. However, it does not capture the relationship between the SHAP values and the values of the explanatory variable (positive or negative, linear or non-linear). Therefore, we assessed this relationship by discretizing the explanatory values into 10 equally spaced bins and calculating the median and 25th/75th percentiles of the SHAP values for each bin. These two approaches allow us to evaluate the overall global importance and the relationship between explanatory and SHAP values for each variable. The local importance was assessed by creating maps of SHAP values for each variable, which allows us to illuminate the geographic distribution of each variable’s contribution to the model output. The SHAP approach was applied via the shap package (v0.41.0).
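
The two summaries described above can be sketched with NumPy alone, assuming the SHAP values have already been computed and stored as an array with one row per grid cell and one column per variable. The function names and the synthetic linear example are hypothetical, and the binned statistics are assumed to be taken over the SHAP values within each bin.

```python
import numpy as np


def global_importance(shap_values):
    """Median absolute SHAP value per variable; shap_values: (n_cells, n_vars)."""
    return np.median(np.abs(shap_values), axis=0)


def binned_shap_profile(x, shap_x, n_bins=10):
    """25th/50th/75th percentiles of SHAP values in equal-width bins of x."""
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    idx = np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)
    stats = np.full((n_bins, 3), np.nan)   # rows stay NaN for empty bins
    for b in range(n_bins):
        in_bin = shap_x[idx == b]
        if in_bin.size:
            stats[b] = np.percentile(in_bin, [25, 50, 75])
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, stats


# Synthetic example: variable 0 has a linear SHAP response, variable 1 has none.
x = np.linspace(0.0, 1.0, 1001)
shap_values = np.column_stack([2.0 * x - 1.0, np.zeros_like(x)])
importance = global_importance(shap_values)
centers, stats = binned_shap_profile(x, shap_values[:, 0])
```

A monotonically increasing bin median, as in this synthetic case, would indicate a positive relationship between the explanatory variable and its contribution to the model output.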

Trend analysis

Trends in ERA5 environmental variables for each grid cell were determined from monthly means using the non-parametric Mann-Kendall test and the Theil-Sen slope estimator172,173,174,175. These tests require the data to be serially independent and homogeneously distributed; however, environmental variables often exhibit statistically significant (SS) lag-1 autocorrelation and strong seasonality. We therefore calculated trends for each month separately to account for seasonality and used monthly means to minimize autocorrelation. SS trends were determined at the 95% confidence level (CL). Uncertainty in trends was assessed using the 95% confidence interval (CI) of the slope.
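
A compact Python sketch of the two statistics for a single grid cell and calendar month is shown below. It uses the normal approximation for the Mann-Kendall test without a correction for tied values, which is a simplifying assumption; the function names and the synthetic series are hypothetical and this is not the authors' implementation.

```python
import math

import numpy as np


def mann_kendall(y):
    """Return (S, two-sided p-value); normal approximation, no tie correction."""
    y = np.asarray(y, dtype=float)
    n = y.size
    i, j = np.triu_indices(n, k=1)                # all pairs with i < j
    s = np.sign(y[j] - y[i]).sum()
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    z = 0.0 if s == 0 else (s - np.sign(s)) / math.sqrt(var_s)  # continuity correction
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return s, p


def theil_sen_slope(t, y):
    """Median of all pairwise slopes."""
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    i, j = np.triu_indices(t.size, k=1)
    return np.median((y[j] - y[i]) / (t[j] - t[i]))


# Synthetic July means for 1979-2022 with a trend of 0.05 units per year.
years = np.arange(1979, 2023)
rng = np.random.default_rng(0)
july_means = 0.05 * (years - 1979) + rng.normal(scale=0.3, size=years.size)

s_stat, p_value = mann_kendall(july_means)
slope = theil_sen_slope(years, july_means)
is_significant = p_value < 0.05   # 95% confidence level
```

Running the same two functions on each calendar month and grid cell in turn yields one slope and one significance flag per month and cell.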

Future projections methodology

To project the response of MSA concentration in the future, we extrapolated the MWT values for grid cells that exhibited SS trends. We assumed the SS trends would continue at the same rate in the future, calculated the absolute change over a given interval (10, 20, 30, 40, and 50 years in the future), and added these changes to the original MWT values (referred to as year 0). Extrapolating the trends of meteorological variables into the future can result in physically unrealistic values (e.g., fractional cloud cover greater than 1); therefore, we applied constraints to the extrapolated values (Table S4). For explanatory variables that are physically bounded (low cloud cover, relative humidity, cloud liquid water), we imputed the lower/upper limit wherever extrapolated values exceeded these limits. For boundary layer height, we imposed a lower limit of 20 m to avoid physically unrealistic values; using lower limits of 10 and 30 m did not produce any meaningful differences. Overall, only a small percentage of the data was imputed to avoid physically unrealistic values (Table S4). We then used the previously trained XGBoost model to project the MSA concentrations for each of the five decades. To assess the total changes in Pan-Arctic MSA, we use Eq. (4) to calculate the total relative change of MSA expressed as a percentage for each year T.

$${{Total\; Relative\; Delta}}_{T}=\frac{{\sum }_{i=1}^{I}{\sum }_{j=1}^{J}({{MSA}}_{T,i,j}-{{MSA}}_{0,i,j})}{{\sum }_{i=1}^{I}{\sum }_{j=1}^{J}{{MSA}}_{0,i,j}}* 100 \%$$
(4)

Where I and J are the total number of meridional and zonal grid cells, respectively. MSAT and MSA0 are the Pan-Arctic MSA concentrations for year T in the future and year 0, respectively.
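
The extrapolation with physical bounds and the total relative change of Eq. (4) can be sketched in Python as follows. The function names and example values are hypothetical, and the sketch assumes that any NaN grid cells (from the residence-time cutoff) are the same in the year-0 and future fields.

```python
import numpy as np


def extrapolate(mwt0, slope_per_year, significant, years, lower=None, upper=None):
    """Add trend * years to year-0 values where the trend is SS, then clip."""
    projected = mwt0 + np.where(significant, slope_per_year * years, 0.0)
    if lower is not None or upper is not None:
        projected = np.clip(projected, lower, upper)
    return projected


def total_relative_delta(msa_future, msa_year0):
    """Eq. (4): percent change of the domain-summed MSA, skipping NaN cells."""
    return 100.0 * np.nansum(msa_future - msa_year0) / np.nansum(msa_year0)


# Hypothetical low cloud cover (bounded between 0 and 1), 20 years ahead.
cloud0 = np.array([0.2, 0.9, 0.5])
slope = np.array([0.01, 0.01, -0.02])          # per year
significant = np.array([True, True, False])    # only SS trends are applied
cloud20 = extrapolate(cloud0, slope, significant, years=20, lower=0.0, upper=1.0)

# Hypothetical modeled MSA fields for year 0 and year T.
msa0 = np.array([[1.0, 2.0], [3.0, 4.0]])
msa_t = np.array([[1.5, 2.0], [3.0, 5.0]])
delta_percent = total_relative_delta(msa_t, msa0)
```

In this toy case the second cell would reach 1.1 without the constraint and is imputed to the upper limit of 1, and the total relative change is 15%.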

To evaluate how each variable contributed to the future projections of MSA in certain decades, we applied a leave-one-out approach to the trends for each variable. We projected the future changes in MSA by not applying the extrapolated trends for one variable at a time for each decade interval. We compared these results to the results obtained by including all variables for the same decade interval to give the total relative delta for a decade interval T and without variable V, as displayed in Eq. (5).

$${{Total\; Relative\; Delta}}_{T,V}=\frac{{\sum }_{i=1}^{I}{\sum }_{j=1}^{J}\left({{MSA}}_{T,A,i,j}-{{MSA}}_{T,V,i,j}\right)}{{\sum }_{i=1}^{I}{\sum }_{j=1}^{J}{{MSA}}_{T,A,i,j}}* 100 \%$$
(5)

Where I and J are the total number of meridional and zonal grid cells, respectively. MSAT,V and MSAT,A are the Pan-Arctic MSA concentrations for year T without variable V and for year T with all variables included, respectively. To investigate the relationship between the contribution of each variable to the leave-one-out analysis and the trends in said variable, we classified each grid cell according to the direction of the trend (positive or negative) and the contribution to the leave-one-out analysis (positive or negative). This was performed only for grid cells that exhibited SS trends. The classification included four categories: positive trend and positive variable contribution, positive trend and negative variable contribution, negative trend and positive variable contribution, and negative trend and negative variable contribution. The results of this analysis are displayed in Supplementary Fig. 15.
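
The leave-one-out calculation of Eq. (5) is sketched below in Python. The `predict` callable is a placeholder for the trained model, and a simple linear function is used in its place so that the example is self-contained; the function names, array layout (one row per grid cell, one column per variable), and numbers are assumptions for illustration only.

```python
import numpy as np


def leave_one_out_deltas(predict, x0, x_future):
    """Eq. (5) per variable, in percent.

    predict:  callable mapping (n_cells, n_vars) -> (n_cells,) MSA values
    x0:       explanatory variables at year 0
    x_future: explanatory variables with all extrapolated trends applied
    """
    msa_all = predict(x_future)
    deltas = np.empty(x0.shape[1])
    for v in range(x0.shape[1]):
        x_without = x_future.copy()
        x_without[:, v] = x0[:, v]          # do not apply the trend of variable v
        msa_without = predict(x_without)
        deltas[v] = 100.0 * np.sum(msa_all - msa_without) / np.sum(msa_all)
    return deltas


def toy_predict(x):
    """Placeholder linear model (assumption), standing in for the trained model."""
    return x @ np.array([1.0, -0.5]) + 2.0


x0 = np.array([[1.0, 1.0], [2.0, 2.0]])
x_future = np.array([[2.0, 1.0], [3.0, 4.0]])
deltas = leave_one_out_deltas(toy_predict, x0, x_future)
```

A positive value means that the trend in that variable raises the projected MSA, and a negative value means that it lowers it. Combining the sign of each cell's contribution with the sign of its trend gives the four categories described above.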