Generating high-resolution land use and land cover maps for the greater Mariño watershed in 2019 with machine learning

Vallet, Améline; Dupuy, Stéphane; Verlynde, Matthieu; Gaetano, Raffaele

doi:10.1038/s41597-024-03750-x

Data Descriptor
Open access
Published: 23 August 2024

Scientific Data volume 11, Article number: 915 (2024)

Abstract

Land Use and Land Cover (LULC) maps are important tools for environmental planning and social-ecological modeling, as they provide critical information for evaluating risks, managing natural resources, and facilitating effective decision-making. This study aimed to generate a very high spatial resolution (0.5 m) and detailed (21 classes) LULC map for the greater Mariño watershed (Peru) in 2019, using the MORINGA processing chain. This new method for LULC mapping consists of a supervised object-based LULC classification, using the random forest algorithm along with multi-sensor satellite imagery from which spectral and textural predictors were derived (a very high spatial resolution Pléiades image and a time series of high spatial resolution Sentinel-2 images). The random forest classifier performed very well, and the LULC map was further improved through additional post-processing steps that included cross-checking with external GIS data sources and manual correction using photointerpretation, resulting in a more accurate and reliable map. The final LULC map provides new information for environmental management and monitoring in the greater Mariño watershed. With this study we contribute to the efforts to develop standardized and replicable methodologies for high-resolution and high-accuracy LULC mapping, which is crucial for informed decision-making and conservation strategies.

Background & Summary

Land Use and Land Cover (LULC) play a key role in environmental planning, management and monitoring. Accurate LULC information is essential for evaluating potential risks to ecosystems and biodiversity, ensuring food security, mitigating natural hazards, and facilitating effective urban planning. LULC maps are often used as an indicator or a proxy of natural and economic processes in environmental modeling. For instance, they are used as inputs in models aiming to map population distribution^1,2, poverty or income^3,4, ecosystem services (carbon storage, water yield, etc.)^5,6, and ecological accounting⁷.

Over the last decades, remote sensing and satellite products have revolutionized the detection and mapping of LULC, as they provide a spatially extensive, multi-temporal and time-saving source of information about LULC⁸. Earlier LULC mapping studies made intensive use of medium and low-resolution earth observation satellites, such as LANDSAT (MSS and TM), ASTER, MODIS and SPOT, but with important limitations. First, they often led to confusion between land-cover types because of the limited number of spectral bands available to distinguish them. Second, they poorly captured changes in vegetation over time, because of low return frequencies. Finally, they showed a limited ability to capture fine details and small-scale features on the Earth's surface because of their coarse spatial resolution^9,10,11. New satellites, such as Pléiades, Landsat 9 and Sentinel-2, with high return frequencies of multitemporal products, large multispectral sensors and very high-resolution imagery, address the above-mentioned limitations and offer new opportunities for LULC mapping^12,13.

The methods used for classifying LULC from remote sensing products have also evolved considerably in recent years, with machine learning algorithms driving the latest developments in LULC mapping. Techniques such as Random Forest, Support Vector Machine, and Artificial Neural Networks were found to significantly improve the accuracy and efficiency of traditional approaches, which historically relied on manual interpretation of satellite imagery or simple spectral analysis^8,14,15. Machine learning algorithms are very flexible regarding input data, which enables them to process multisource remote sensing products - including LiDAR, radar, and hyperspectral imagery - of varying resolution and spectra. In addition, they allow full automation of the classification process and enable efficient analysis of large volumes of data.

Recently published high resolution global LULC datasets make use of new remote sensing products and advanced machine learning classification algorithms. For instance, WorldCover, launched in 2020 by the European Space Agency, is an open-access global land cover map at 10 m resolution, including 11 classes, based on both Sentinel-1 and Sentinel-2 images¹⁶. Other similar initiatives include GlobeLand30¹⁷, the ESRI 2020 global LULC map¹⁸ and Google Earth Engine Dynamic World NRT¹⁹. While these global datasets have the advantage of providing new information about regions with limited data until now (e.g. South America, Africa), they often contain a limited number of LULC classes, and show varying levels of accuracy, strongly depending on ecological biomes^19,20. Indeed, the main challenges for LULC mapping lie in the detection of specific ecosystems, such as wetlands or mangroves, and the detection of small-scale features, such as agro-forest mosaics, urban areas and dispersed settlements. Integrating multiple sources of remote sensing products, at different time periods to capture changes in vegetation, with precise in-situ data is often mentioned as the way to improve their detection^9,21,22.

The aim of this study is to apply a new method, the MORINGA processing chain, to generate a high resolution and detailed (21 LULC classes) LULC map for the greater Mariño watershed (Peru) in 2019, using the most recent remote sensing imagery (Sentinel-2 and Pléiades) and a random forest algorithm. The greater Mariño watershed is an important area for biodiversity conservation and water management in the Andes, and accurate LULC mapping is crucial for informed decision-making about natural resources. Identifying changes in LULC over time will support more effective management and conservation strategies. With this study we also contribute to the efforts to develop standardized and replicable methodologies for high resolution and high accuracy LULC mapping.

Material and Methods

Study site

The greater Mariño watershed stretches over 403 km² along the eastern slopes of the Southern Peruvian Andes, in the region of Apurimac, Peru (Fig. 1). The local climate is dry and hot in the interandean valleys and cold and humid in the highlands. Annual precipitation is also highly variable, with a dry season (June to August) characterized by lower rainfall, in contrast with the wet season (December to March)²³. The elevation varies from 1614 to 5180 m, with very diverse landscapes and ecosystems: dry forests, glaciers, wetlands (bofedales) and more than a dozen high-elevation lakes. Approximately 70,000 people live in the watershed, mostly in two urban areas, Abancay and Tamburco. Agriculture at high and mid elevations is subsistence oriented, whereas at low elevations both crop and livestock farming are commercially oriented and more intensive. There are also tourism activities in the Ampay Forest Sanctuary, which protects 36 km² of land. Like other mountain social-ecological systems, the greater Mariño watershed provides important but vulnerable ecosystem services that contribute substantially to people's quality of life in the area. Some landscape planning instruments oriented toward biodiversity and ecosystem conservation have been implemented in the past, or are under implementation. These include, for example, the creation of a protected area (the Ampay National Sanctuary), a payment for hydrological services, and several nature-based solution programs, such as reforestation schemes or wetland restoration projects²⁴.

Overview of the MORINGA processing chain

The LULC classification was produced with the MORINGA processing chain, a supervised object-based LULC classification method using multi-sensor satellite imagery²⁵. It has been applied recently to several tropical agrosystems of the world, including La Réunion island¹², Madagascar^26,27, Senegal²⁸ and Haiti^29,30. The MORINGA chain is composed of four steps: (1) segmentation of a Very High Spatial Resolution (VHSR) satellite image (such as Spot 6/7 or Pléiades); (2) object level extraction of spectral and textural predictors derived from several High Spatial Resolution (HSR) satellite images (such as Sentinel-2, or Landsat 8) at different dates, along with the VHSR satellite image and other remote sensing products (such as DEM); (3) training and validation of a random forest classifier using a field database (possibly at different levels of a LULC nomenclature); (4) application of the classifier to the whole study area to map LULC (Fig. 2). The pre-processing of satellite images, so that they can be used at steps 1 and 2, is also part of the MORINGA processing chain.

The MORINGA processing chain runs within a Python 3.8 environment and relies mainly on the GDAL/OGR library and the Orfeo ToolBox (OTB) version 7.2 (https://www.orfeo-toolbox.org). It is complemented with custom modules for specific steps (e.g., for computing reasons the calculation of object-based statistics at step 2 makes use of an ad-hoc C++ module, “obiatools”, whose source code is available at https://gitlab.irstea.fr/raffaele.gaetano/obiatools). Some pre-processing steps are also performed outside Python, in QGIS (e.g. slope calculation). The source code of the MORINGA processing chain is available at https://gitlab.irstea.fr/raffaele.gaetano/moringa. The implementation of these different steps in the greater Mariño watershed, as well as the satellite images used, are described in more detail in the following sections.

Field database and land-use land-cover nomenclature

Fieldwork was carried out in May and June 2016, which corresponds to the beginning of winter and the dry season (i.e. the end of the peak of the growing season), and in agricultural areas, to the beginning of harvest. Sampling sites were selected through a mix of systematic sampling (points distributed across the whole study area to capture the altitudinal gradient effect) and stratified sampling (to ensure that sufficient observations were collected for each LULC class). Some sampling sites were located outside the greater Mariño watershed, although in close proximity to it, in order to sample specific LULC classes which are scarce in the study area (e.g. Pine plantations, Polylepis sp. forests). At each sampling site, we recorded GPS coordinates, took pictures in the direction of the four cardinal points, and registered the vegetation and species observed. Each sampled site was then digitized into a polygon delimiting a plot with homogeneous LULC, which was classified into one of the categories presented in Table 1 (level 3). This nomenclature is aligned with other LULC maps provided at national^31,32 or regional^33,34 scale. The VHSR image was used for the delineation of the polygons, based on photointerpretation. The final field database is composed of 1698 polygons, covering a total of 16.75 km² (Table 2, Table SM 2).

Table 1 Description of the land-use and land-cover classes.

Table 2 Three level LULC nomenclature.

Satellite images and their pre-processing

Topography

The TanDEM-X Digital Elevation Model (DEM) was obtained from the European Space Agency (ESA), through its scientific research support program. TanDEM-X is part of the ESA Third Party Missions Programme, which comprises 50 satellites dedicated to earth observation (https://earth.esa.int/eogateway/missions/terrasar-x-and-tandem-x). TanDEM-X is almost identical to its twin, TerraSAR-X, with which it flies in close formation to produce elevation models of high accuracy and resolution (12 m spatial resolution), using a powerful radar system: Synthetic Aperture Radar (SAR)³⁵. Pixels with no data were filled with the mean elevation of the study area, using the OTB BandMathX application.
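The no-data filling step can be illustrated with a minimal NumPy sketch. This is not the OTB BandMathX expression used by the authors: the function name, the no-data value of -32767 and the toy array are assumptions made for illustration, and the mean is taken over the valid pixels of the array rather than over the actual study area.

```python
import numpy as np

def fill_nodata_with_mean(dem, nodata=-32767.0):
    """Replace no-data pixels with the mean of the valid pixels.

    `nodata` is a hypothetical sentinel value; the real DEM may use another one.
    """
    dem = np.asarray(dem, dtype=float)
    valid = dem != nodata
    filled = dem.copy()
    filled[~valid] = dem[valid].mean()
    return filled

# Toy 2 x 2 DEM (elevations in metres) with one missing pixel
dem = np.array([[1614.0, 2000.0],
                [-32767.0, 5180.0]])
filled_dem = fill_nodata_with_mean(dem)
```

Filling with a single mean value keeps the raster free of gaps for orthorectification and slope calculation, at the cost of a flat patch wherever data were missing.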

Very high spatial resolution (VHSR)

We used three Pléiades images (of different sizes) acquired on the 7th of October 2019 (i.e. at the end of the dry season) simultaneously in both the panchromatic and the multispectral mode, at a spatial resolution of 0.5 and 2 m respectively (Table SM 1). These images are distributed commercially by AIRBUS Defense and Space at primary geometric processing level and with basic radiometric processing (12-bit native). Access to the Pléiades images was funded and facilitated by DINAMIS, a French institutional data hub that provides access to high and very high resolution optical and radar data (https://dinamis.data-terra.org). DINAMIS is part of the Data Terra national research infrastructure, whose main mission is to develop an integrated platform for Earth system data, services and products (https://www.data-terra.org).

Pre-processing consisted of (1) the calculation of Top Of Atmosphere (TOA) reflectance, by correcting distributed images for sensor calibration and radiation incidence, and (2) the orthorectification of images using the TanDEM-X DEM (with the OTB OrthoRectification application). The three pre-processed tiles of Pléiades panchromatic and multispectral images were then mosaicked, and finally, the two resulting mosaics were pansharpened using the Bayesian fusion algorithm (OTB Pansharpening application), to obtain a multispectral image at 0.5 m spatial resolution. The Pléiades multispectral image at 2 m resolution was then no longer used in the processing chain (only the pansharpened image at 0.5 m resolution is used).

High spatial resolution (HSR)

We also used a time series of 333 Sentinel-2 images, acquired between the 1st of January 2018 and the 30th of October 2019, to capture the vegetation dynamics throughout the year before the acquisition date of the Pléiades image (Table SM 1). Sentinel-2 images are provided by two satellites (Sentinel-2A and Sentinel-2B), deployed by the European Space Agency (ESA) and the Copernicus program. The time span between acquisitions by either satellite is five days. The images were downloaded free of charge through the PEPS platform (https://peps.cnes.fr) at level 1C (i.e. orthorectified TOA reflectance). The Sen2Cor (https://step.esa.int/main/snap-supported-plugins/sen2cor/) atmospheric correction processor for Sentinel-2 images was used to obtain a level 2A Bottom-Of-Atmosphere (BOA) reflectance product from the distributed level 1C images, as well as cloud, cloud shadow and snow masks³⁶.

Two Sentinel-2 tiles (T18LYL and T18LYK) were necessary to cover the whole study area: they were mosaicked to generate a time series of Sentinel-2 mosaics at different dates. Although already orthorectified, Sentinel-2 images were also readjusted to the VHSR Pléiades image using the OTB HomologousPointsExtraction application with the red band (Pléiades band 1, Sentinel-2 band 3) as a reference (step 2 of the MORINGA processing chain). To eliminate clouds, we created synthetic images every 20 days (gapfilling processing, ImageTimeSeriesGapFilling OTB application). The final Sentinel-2 time series is thus composed of 22 synthetic images, from the 25th of July 2018 to the 5th of October 2019.
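The idea behind the gapfilling step can be sketched for a single pixel as follows. This is a simplified illustration, not the ImageTimeSeriesGapFilling implementation: it assumes linear temporal interpolation between cloud-free observations, and the function name, day offsets and index values are invented for the example.

```python
import numpy as np

def gapfill_pixel(days, values, cloudy, out_days):
    """Interpolate cloud-free observations of one pixel onto regular output dates.

    Linear interpolation is an assumption made for this sketch.
    """
    days = np.asarray(days, dtype=float)
    values = np.asarray(values, dtype=float)
    clear = ~np.asarray(cloudy, dtype=bool)
    # Masked (cloudy) dates are simply left out of the interpolation support
    return np.interp(out_days, days[clear], values[clear])

# Hypothetical 5-day series of a vegetation index, with two cloudy dates
days = [0, 5, 10, 15, 20, 25, 30, 35, 40]
ndvi = [0.20, 0.22, 0.90, 0.26, 0.28, 0.30, 0.05, 0.34, 0.36]
cloudy = [False, False, True, False, False, False, True, False, False]

out_days = np.arange(0, 41, 20)  # one synthetic value every 20 days
synthetic = gapfill_pixel(days, ndvi, cloudy, out_days)
```

Resampling onto a fixed 20-day grid gives every object the same set of dates, which is what allows each date and band combination to be used as a separate predictor.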

LULC classification with the MORINGA processing chain

Predictors calculation: topographic, textural and spectral indices

Several indices were calculated from the VHSR and HSR images (Table 3), to be later used as predictors in the classification model. Following previous studies, four textural indices developed by Haralick³⁷ were calculated using the panchromatic Pléiades image^27,38,39. Textures are important for detecting landscape patterns, such as tree or crop rows, which are easily detectable in the VHSR image. Textural indices were computed with the HaralickTextureExtraction OTB application. Four sizes of sliding window were used for each index, with radius values of 1 (i.e. a sliding window of 3 × 3 pixels), 5 (11 × 11 pixels), 11 (23 × 23 pixels) and 21 (43 × 43 pixels) (Table 3).

Table 3 Textural and spectral indices computed from VHSR and HSR images.

Nine spectral indices were also calculated from the Pléiades pansharpened image and from the Sentinel-2 time series of synthetic images (Table 3), using the OTB RadiometricIndices application. The Sentinel-2 sensor delivers 13 spectral bands, ranging from 10 to 60 m resolution, but only the 10 bands with a resolution of 20 m or finer were exploited in this study (i.e. the three 60 m resolution bands were discarded). These bands were used as direct predictors in the classification model, and also to compute 6 spectral indices that are commonly used to characterize and classify LULC (Table 3).

Finally, slope was calculated from TanDEM-X DEM with QGIS and used as a predictor in addition to elevation. To classify LULC, we therefore used a total of 352 Sentinel-2 derived predictors ( = 22 dates * 10 bands + 22 dates * 6 spectral indices), 20 Pléiades derived predictors ( = 4 spectral indices + 4 textural indices * 4 radius) and 2 TanDEM-X derived predictors (elevation and slope).
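The predictor count given above can be checked with a few lines of arithmetic. The variable names are chosen for this illustration only; the figures are those stated in the text.

```python
# Sentinel-2: 22 synthetic dates, each with 10 bands and 6 spectral indices
n_dates, n_bands, n_s2_indices = 22, 10, 6
sentinel2_predictors = n_dates * n_bands + n_dates * n_s2_indices

# Pléiades: 4 spectral indices plus 4 textural indices at 4 window radii
pleiades_predictors = 4 + 4 * 4

# TanDEM-X: elevation and slope
tandemx_predictors = 2

total_predictors = sentinel2_predictors + pleiades_predictors + tandemx_predictors
print(sentinel2_predictors, pleiades_predictors, total_predictors)  # 352 20 374
```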

Object detection by segmentation of the VHSR image

For the segmentation of Pléiades pansharpened mosaic, we used a method proposed by Baatz and Schäpe⁴⁰ and implemented in OTB LargeScaleGenericRegionMerging remote application, available at https://gitlab.irstea.fr/remi.cresson/LSGRM⁴¹. Various tuning tests were performed on different sub-regions of the study area before selecting the following values (tested values are indicated between brackets):

scale parameter: 350 [70–450]
weight parameter on the shape: 0.3 [0.1–0.8]
weight parameter on compactness: 0.7 [0.5–0.7]

This segmentation step partitions the image into homogeneous objects and extracts their contours. The geometries delimited in the image were exported as a shapefile, and for each object of the segmentation, we extracted the mean values of each of the 374 textural, spectral and topographic predictors presented in the previous section. The segmentation was then intersected with the polygons of the field database for which LULC was recorded, and for each element of the intersection the mean values of predictors were also extracted, which composed the training dataset of the classification algorithm (35 392 training elements). Extractions were made with the C++ “obiatools” module.
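Extracting the mean value of a predictor for each segmentation object amounts to a zonal statistic. The sketch below shows the principle with NumPy on a tiny label raster; it is not the “obiatools” implementation, and the function name and arrays are hypothetical.

```python
import numpy as np

def object_means(labels, predictor):
    """Mean predictor value per object, for objects labelled 0..n-1."""
    labels = np.asarray(labels).ravel()
    values = np.asarray(predictor, dtype=float).ravel()
    sums = np.bincount(labels, weights=values)   # per-object sum of pixel values
    counts = np.bincount(labels)                 # per-object pixel count
    return sums / np.maximum(counts, 1)          # guard against unused label ids

# Toy segmentation with three objects and one predictor layer
labels = np.array([[0, 0, 1],
                   [0, 1, 1],
                   [2, 2, 2]])
ndvi = np.array([[0.2, 0.4, 0.8],
                 [0.3, 0.6, 0.7],
                 [0.1, 0.1, 0.1]])
means = object_means(labels, ndvi)
```

Repeating this for every predictor layer yields one row per object and one column per predictor, which is the table shape a classifier expects.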

Random forest training

The random forest algorithm was used to classify LULC from the training dataset produced at the previous step. This algorithm is based on an ensemble of classification or regression decision trees, each created using random subsets of predictors and training data, whose predictions are combined by majority voting or averaging^42,43. Over the last two decades, the use of random forest for remote sensing applications has received increasing attention due to its capacity to handle large datasets (of observations and predictors) and missing data, its processing speed, and its high accuracy^8,14. Applications have focused, for instance, on mapping LULC^27,44, vegetation biomass^45,46, urban areas^47,48 and habitat quality and health^49,50,51.

One random forest model was trained at level 3 of the LULC nomenclature, using the OTB TrainVectorClassifier application and the following tuning options:

Maximum depth of the tree: 25
Minimum number of samples in each node: 10 (OTB default value)
Cluster possible values of a categorical variable into K <= cat clusters to find a suboptimal split: 10 (OTB default value)
Size of the randomly selected subset of features at each tree node: square root of the total number of predictors (OTB default value, in this application: \(\sqrt{374}=19.34\))
Maximum number of trees in the forest: 800
Sufficient accuracy (OOB error): 0.01 (OTB default value)

All observations in the training dataset whose size was greater than 25 m² were used for training the classifier, which was then applied to each element of the segmentation for which we extracted predictor values, in order to generate a level 3 LULC classification.
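For readers more familiar with scikit-learn than with OTB, the training setup can be approximated as below. This is a rough analogue under stated assumptions, not the authors' OTB call: the mapping of the OTB option "minimum number of samples in each node" to `min_samples_split` is an assumption, the data are synthetic, and the number of trees is reduced from 800 to 50 so that the example runs quickly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the object-level predictor table (hypothetical data)
X, y = make_classification(n_samples=600, n_features=40, n_informative=10,
                           n_classes=3, random_state=0)
rng = np.random.default_rng(0)
area_m2 = rng.uniform(5, 500, size=len(y))  # hypothetical object areas

# Keep only training objects larger than 25 m², as described in the text
keep = area_m2 > 25.0

clf = RandomForestClassifier(
    n_estimators=50,        # the study used up to 800 trees
    max_depth=25,           # maximum depth of the tree
    min_samples_split=10,   # assumed analogue of the OTB minimum node size
    max_features="sqrt",    # square root of the number of predictors
    oob_score=True,         # out-of-bag accuracy estimate
    random_state=0,
)
clf.fit(X[keep], y[keep])

# Apply the trained model to every object, including those excluded from training
predicted = clf.predict(X)
```

With 374 predictors, `max_features="sqrt"` would consider about 19 candidate predictors at each split, which matches the value reported in the tuning options.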

Predictor importance (also called variable or feature importance) was calculated in order to highlight which predictors contributed most to the classification. Predictor importance is commonly used as a tool for interpreting machine learning algorithms and explaining how particular predictions are made⁵². Predictor importance was calculated using the Python module scikit-learn, with a random forest model-specific importance score based on the mean accumulation of impurity decrease (https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html).

Elevation showed the highest importance, followed by two textural indices (Haralick contrast with radius of 21 and 11), and two vegetation and water spectral indices derived from Sentinel-2 HSR images in August 2019 (Fig. 3A). Slope also appeared as an important predictor, which suggests that considering topography is crucial for LULC classification in areas of high relief such as the Andes. Half of the 16 textural indices were among the most important predictors, which also indicates that Pléiades-derived textures drove the LULC classification and explained a large amount of the variance in our training dataset. Finally, several Sentinel-2 spectral indices and bands at different dates were among the most important predictors, which underlines the importance of considering time series of multispectral images for characterizing vegetation dynamics during the classification process.
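The impurity-based importance score mentioned above is exposed by scikit-learn as the `feature_importances_` attribute of a fitted forest. The sketch below uses synthetic data in which the class depends on a single predictor; the predictor names are placeholders and do not reproduce the study's results.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
names = ["elevation", "slope", "texture_a", "texture_b", "index_a"]  # placeholders

# Synthetic table: the class is fully determined by the first column
X = rng.normal(size=(500, len(names)))
y = (X[:, 0] > 0).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Mean decrease in impurity, normalised so that the scores sum to 1
importances = clf.feature_importances_
ranking = np.argsort(importances)[::-1]
for i in ranking:
    print(f"{names[i]:>10s}  {importances[i]:.3f}")
```

Impurity-based scores are computed on the training data, so they are best read as a ranking of how much each predictor was used by the trees rather than as a measure of predictive power on unseen data.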

Post-processing procedure

The post-processing of the LULC classification produced by the MORINGA chain consisted of four steps: (1) conversion to raster; (2) smoothing by majority filter; (3) cross-checking with GIS data and (4) manual correction by photo-interpretation. All post-processing operations were conducted at the finest level of the classification (level 3), and then scaled up using the nested structure of the nomenclature.

First, the vector classification obtained with the MORINGA chain was converted to a raster format, at the resolution of the Pléiades pansharpened and panchromatic images (0.5 m). The resolution of the Pléiades image was preferred over that of the Sentinel-2 images (10 and 20 m), as the Pléiades image is the one used for the construction of the field database (polygon delineation based on Pléiades image photointerpretation) and for segmentation, which are two crucial steps for the supervised classification. As the objects were identified at a 0.5 m resolution, it is essential to convert the MORINGA classification into a raster at the same resolution to ensure their integrity. Indeed, the 0.5 m resolution preserved the isolated landscape features identified during segmentation (such as rural buildings or roads), which would be merged with neighboring LULC classes if rasterized at a lower resolution.

Second, a majority filter was used to remove isolated pixels and smooth out the classification contours, with the OTB ClassificationMapRegularization application and a radius of 3 (corresponding to a sliding window of 7 × 7 pixels). This smoothing only removed objects whose size was smaller than 1.75 m² (in comparison, the size of a residential house in rural areas is approximately 10 m²), and therefore did not alter the identification of the isolated landscape features mentioned above.
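The principle of majority filtering can be sketched in NumPy as follows. This is a simplified stand-in for ClassificationMapRegularization, not its implementation: edge handling by replication and the tie-breaking rule (the lowest class label wins) are assumptions of this sketch.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def majority_filter(classes, radius=3):
    """Assign to each pixel the most frequent class in its (2r+1) x (2r+1) window."""
    classes = np.asarray(classes)
    size = 2 * radius + 1                      # radius 3 -> 7 x 7 window
    padded = np.pad(classes, radius, mode="edge")
    windows = sliding_window_view(padded, (size, size))
    labels = np.unique(classes)
    # Count occurrences of each class label inside every window
    counts = np.stack([(windows == lab).sum(axis=(2, 3)) for lab in labels])
    # argmax returns the first maximum, so ties go to the lowest label
    return labels[counts.argmax(axis=0)]

# A 9 x 9 patch of class 1 with a single isolated pixel of class 2
patch = np.ones((9, 9), dtype=int)
patch[4, 4] = 2
smoothed = majority_filter(patch, radius=3)
```

In this toy case the isolated pixel is absorbed by the surrounding class, which is the behaviour the smoothing step relies on to clean up the classification contours.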

Third, we cross-checked the LULC classification with external data sources to detect unexpected behavior of the MORINGA classifier. For each LULC class of the nomenclature at level 3, specific GIS references, all available in open access, were identified (Table SM 3) and intersected with the classification to highlight potential errors. All disagreements between the classification and the reference GIS data were systematically inspected and, where necessary, corrected manually by photo-interpretation of the Pléiades image, using the Thematic Raster Editor (ThRasE), a QGIS Python plugin that allows flexible and fast raster editing (https://plugins.qgis.org/plugins/ThRasE/). For instance, crop and pasture classes were compared to the map of agricultural areas (https://siea.midagri.gob.pe/portal/informativos/superficie-agricola-peruana) developed by the Peruvian ministry in charge of agriculture⁵³, and water bodies to the Global Surface Water Explorer⁵⁴. Other data sources were provided by the European Commission, the Peruvian Ministry of Agrarian Development and Irrigation, the Ministry of the Environment of Peru and the OpenStreetMap community.

Finally, the classification was carefully screened using the tile-by-tile navigation option of ThRasE (tile size of approximately 4 km²), and the Pléiades image as a reference (with true and false color composites to highlight vegetation areas). All the classification errors detected were manually corrected. The road network LULC category was added at this stage, by combining elements of the classification from different LULC classes (built-up areas mainly, but also other land use classes at lower percentage). OpenStreetMap data was used to confirm the location of photo-interpreted roads⁵⁵.

Vectorization

The post-processed classification raster was converted to a vector database using the Raster To Polygon conversion tool from ArcGIS Pro, with the polygon simplification option activated to smooth contours. The Repair Geometry tool was then applied to inspect polygons for geometry problems and repair them, with the “Delete Features with Null Geometry” option turned off.

Data Records

The final LULC classification (Fig. 4) and its description are available at the Recherche Data Gouv repository under the CC BY 4.0 license, in both raster and vector format (https://doi.org/10.57745/DDP1ZR)⁵⁶. The raster format is only provided for level 3 of the LULC nomenclature at 0.5 m resolution, but the three nomenclature levels are provided in separate layers of the geopackage file (Table SM 5). The field database used to train the random forest is accessible at the same repository and under the same license; this dataset contains LULC observations at the three nomenclature levels, in a geopackage file (Table SM 6). All three datasets are delivered in the local UTM projection (WGS 84 UTM 18 S, EPSG code 32718).

Technical Validation

Random forest cross-validation and performance metrics

In the random forest algorithm, the subset of training data left out from each tree (also called Out-Of-Bag, or OOB, observations) can be used for assessing the prediction error rate, yielding the so-called OOB error, a measure of the classifier performance. Random forests can therefore be trained and validated using all available observations. However, as some authors have noted, this approach can lead to a biased estimation of performance, because of overfitting and because it does not consider the size of training observations^57,58. In this study we therefore decided to implement, in addition to OOB error, a second approach for estimating the random forest classifier performance, based on cross-validation.

Cross-validation is a procedure to estimate classification performance, where the training dataset is split into K separate folds. For each fold k, a random forest model is trained on the K-1 other folds (i.e. excluding k fold data), and then applied to the k fold data, to assess its performance, taking into account the size of training observations. It is worth noting that the K models developed during the cross-validation procedure are slightly different from the overall model fitted using all observations from the training dataset, as they are trained with only a subset of the data: the objective is not to generate final predictions (i.e. final LULC classification), but to evaluate the quality of the classification model⁵⁸. In this study we implemented a 5-fold cross-validation, and we estimated the quality of the classification in each fold using the following performance metrics (that were calculated on training observations weighted by their surface, Fig. 3B). The same metrics were also calculated before and after the post-processing, considering all observations available in the training database (Table 1).

F1 score, a harmonic mean of the precision and recall, ranging from 0 to 1, computed for each LULC class separately⁵⁹.
An overall accuracy score, computed as the average of each LULC accuracy score (corresponding to the total surface of correctly classified objects divided by the total surface of training observations)⁵⁹.
Cohen’s kappa, which reflects the level of agreement between the proposed classification and the reference data, relative to the agreement expected from a random classification⁶⁰.
Pontius’ quantity disagreement (Q, which measures the differences in the proportion of area or quantity of each LULC class), allocation disagreement (A, which measures the differences in the spatial arrangement or allocation of each LULC class) and total disagreement (D, calculated as the average of Q and A). Pontius metrics have been proposed to address some of the limitations of Cohen’s kappa, by explicitly considering the spatial allocation of LULC classes, distinguishing between false positives and false negatives, and not assuming that the disagreement is due to chance⁶¹.
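The quantity and allocation components can be derived from a confusion matrix as sketched below. This follows the usual definitions of Pontius' Q and A computed from matrix proportions; the function name and the 2 × 2 example matrix are illustrative, and the orientation of the matrix (rows for the map, columns for the reference) is an assumption of this sketch. Matrix entries may be counts or areas.

```python
import numpy as np

def pontius_disagreement(confusion):
    """Quantity (Q) and allocation (A) disagreement from a confusion matrix.

    Rows are taken as the classified map and columns as the reference.
    """
    p = np.asarray(confusion, dtype=float)
    p = p / p.sum()                        # convert to proportions
    row, col, diag = p.sum(axis=1), p.sum(axis=0), np.diag(p)
    # Quantity: mismatch between mapped and reference proportions of each class
    quantity = np.abs(row - col).sum() / 2
    # Allocation: commission and omission errors that offset each other
    allocation = (2 * np.minimum(row - diag, col - diag)).sum() / 2
    return quantity, allocation

confusion = np.array([[40, 5],
                      [10, 45]])
Q, A = pontius_disagreement(confusion)
print(round(Q, 4), round(A, 4))  # 0.05 0.1
```

In this toy matrix, 5% of the area is disagreement in class proportions, while 10% comes from pixels of matching quantity placed in the wrong location.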

We used the Python module sklearn.metrics (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.cohen_kappa_score.htm) to calculate the F1 score, accuracy and Cohen’s kappa during cross-validation. Pontius’ metrics (A, Q, and D) were manually calculated for the validation observations of each fold, from the corresponding confusion matrix generated with the OTB ComputeConfusionMatrix application. The same application was used to calculate all performance metrics before and after post-processing, considering all available observations from the training database.
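A 5-fold cross-validation with area-weighted metrics can be sketched with scikit-learn as follows. The data, object areas and forest settings are synthetic placeholders, and the way folds were drawn in the study is not specified here, so the shuffled `KFold` split is an assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score
from sklearn.model_selection import KFold

# Synthetic stand-in for the training table and object areas (hypothetical)
X, y = make_classification(n_samples=500, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)
rng = np.random.default_rng(0)
area_m2 = rng.uniform(25, 500, size=len(y))

scores = []
folds = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in folds.split(X):
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    y_pred = clf.predict(X[test_idx])
    w = area_m2[test_idx]  # weight each held-out observation by its surface
    scores.append({
        "accuracy": accuracy_score(y[test_idx], y_pred, sample_weight=w),
        "kappa": cohen_kappa_score(y[test_idx], y_pred, sample_weight=w),
        "f1_per_class": f1_score(y[test_idx], y_pred, labels=[0, 1, 2],
                                 average=None, sample_weight=w),
    })

mean_accuracy = np.mean([s["accuracy"] for s in scores])
```

Passing the areas as `sample_weight` makes a misclassified large object count for more than a misclassified small one, which is the sense in which the metrics are weighted by surface.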

Corrections applied to the classification during post-processing

Corrections were applied to 8.5% of the study area (a map locating the exact changes is provided in Figure SM 1). The most frequent error was crops confused with dry shrublands and semi-arid steppes of the valley (15% of the area corrected during post-processing) (Table 4). Mixed shrublands, found at higher elevation, and grasslands were often misclassified as wetlands (14% of corrections), and eucalyptus plantations were confused with other types of woodlands (8% of the corrected surface). Frequent confusions were also observed between types of shrublands (11% of total corrections).

Table 4 Corrections applied to the MORINGA classification (rows) during post-processing (columns).

Some LULC classes that did not cover large portions of the study area showed higher levels of post-processing corrections (Table SM 4). For instance, 77% of the areas classified as beach and riverine rocks by the MORINGA chain were in fact rocks and natural bare soils, as were 57% of the area classified as lakes. The confusion between surface water and bare soils can be explained by relief and shadow effects, as observed in other publications^62,63,64. The presence of clouds on the Pléiades image affected the quality of the segmentation in small areas of the study site: the contours of the objects affected by clouds were corrected manually during this post-processing stage. The confusion between riverine rocks and bare soils is due to the close resemblance of their multispectral signals, and suggests that other topographic parameters, such as distance to the river network, could be added to the MORINGA predictors to improve the distinction between these LULC classes.

Final classification validation

The overall accuracy (i.e. the arithmetic mean of the F1-scores of each LULC class) and Cohen’s K index showed very high agreement between the post-processed map and the training database. The final quantity disagreement obtained after post-processing (Pontius Q) was 0.0042, while the allocation disagreement (Pontius A) was 0.0053 (Table 1). This means that most of the disagreement (approx. 60%) is explained by the precise location of the different LULC classes in the maps (Pontius A), and not by the relative importance of each LULC class (Pontius Q). Pontius’ total disagreement (D) was very low, which confirms the strong agreement between the post-processed map and the training database.
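For readers unfamiliar with these metrics, the sketch below computes quantity, allocation and total disagreement from a confusion matrix, following the definitions of Pontius and Millones (2011). It assumes a square matrix of raw counts that is simply normalized to proportions, without any sampling-design correction. The function name `pontius_disagreement` and the toy matrix are hypothetical.

```python
import numpy as np


def pontius_disagreement(confusion):
    """Return (Q, A, D) from a square confusion matrix of counts."""
    p = np.asarray(confusion, dtype=float)
    p = p / p.sum()                      # convert counts to proportions
    row = p.sum(axis=1)                  # class proportions in the map
    col = p.sum(axis=0)                  # class proportions in the reference
    diag = np.diag(p)                    # agreement per class
    # Quantity disagreement: mismatch in class proportions.
    quantity = 0.5 * np.abs(row - col).sum()
    # Allocation disagreement: paired commission and omission errors.
    allocation = 0.5 * (2 * np.minimum(row - diag, col - diag)).sum()
    # Total disagreement, equal to quantity + allocation.
    total = 1.0 - diag.sum()
    return quantity, allocation, total


# Toy confusion matrix (rows: map, columns: reference), 100 observations.
toy = [[40, 6, 0],
       [2, 30, 2],
       [1, 0, 19]]
Q, A, D = pontius_disagreement(toy)

# Share of disagreement due to allocation, using the values reported above.
allocation_share = 0.0053 / (0.0042 + 0.0053)
```

With the reported values, `allocation_share` evaluates to about 0.56, consistent with the statement that allocation accounts for roughly 60% of the disagreement.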

The slight decrease in overall accuracy and Cohen’s K index observed after post-processing can be explained by changes in F1-score for two LULC classes (Table 1): “Beach and riverine rock” and “Fruit crop”. Fruit crops are among the most difficult LULC classes to detect, along with wetlands, small-scale fields, and urban areas, which machine learning algorithms typically tend to misidentify^65,66,67,68. For “Beach and riverine rock”, the change in accuracy can be explained by an error in the training database: among the 14 polygons identified as “Beach and riverine rock” (Table SM 2), one polygon of 5451 m² was wrongly labelled “Beach and riverine rock” instead of “Rock and natural bare soil”.

Code availability

The source code of the MORINGA processing chain is available at https://gitlab.irstea.fr/raffaele.gaetano/moringa.git. It is complemented by custom modules for specific steps (e.g., for computational reasons, the calculation of object-based statistics at step 2 makes use of an ad-hoc C++ module, “obiatools”), whose source code is available at https://gitlab.irstea.fr/raffaele.gaetano/obiatools.

References

Sorichetta, A. et al. High-resolution gridded population datasets for Latin America and the Caribbean in 2010, 2015, and 2020. Sci. Data 2, 150045 (2015).
Stevens, F. R., Gaughan, A. E., Linard, C. & Tatem, A. J. Disaggregating Census Data for Population Mapping Using Random Forests with Remotely-Sensed and Ancillary Data. PLOS ONE 10, e0107042 (2015).
Bosco, C. et al. Exploring the high-resolution mapping of gender-disaggregated development indicators. J. R. Soc. Interface 14, 20160825 (2017).
Steele, J. E. et al. Mapping poverty using mobile phone and satellite data. J. R. Soc. Interface 14, 20160690 (2017).
Cabral, P., Feger, C., Levrel, H., Chambolle, M. & Basque, D. Assessing the impact of land-cover changes on ecosystem services: A first step toward integrative planning in Bordeaux, France. Ecosyst. Serv. 22, 318–327 (2016).
Vallet, A. et al. Dynamics of Ecosystem Services during Forest Transitions in Reventazón, Costa Rica. PLOS ONE 11, e0158615 (2016).
Chen, Y., Vardon, M., Keith, H., Van Dijk, A. & Doran, B. Linking ecosystem accounting to environmental planning and management: Opportunities and barriers using a case study from the Australian Capital Territory. Environ. Sci. Policy 142, 206–219 (2023).
Talukdar, S. et al. Land-Use Land-Cover Classification by Machine Learning Classifiers for Satellite Observations—A Review. Remote Sens. 12, 1135 (2020).
Ban, Y., Gong, P. & Giri, C. Global land cover mapping using Earth observation satellite data: Recent progresses and challenges. ISPRS J. Photogramm. Remote Sens. 103, 1–6 (2015).
Guo, H., Fu, W. & Liu, G. Scientific Satellite and Moon-Based Earth Observation for Global Change. (Springer Singapore, 2019).
Kramer, H. J. Observation of the Earth and Its Environment: Survey of Missions and Sensors. (Springer Science & Business Media, 2002).
Dupuy, S., Gaetano, R. & Le Mézo, L. Mapping land cover on Reunion Island in 2017 using satellite imagery and geospatial ground data. Data Brief 28, 104934 (2020).
Wang, D. et al. Evaluating the Performance of Sentinel-2, Landsat 8 and Pléiades-1 in Mapping Mangrove Extent and Species. Remote Sens. 10, 1468 (2018).
Belgiu, M. & Drăguţ, L. Random forest in remote sensing: A review of applications and future directions. ISPRS J. Photogramm. Remote Sens. 114, 24–31 (2016).
Wang, J., Bretz, M., Dewan, M. A. A. & Delavar, M. A. Machine learning in modelling land-use and land cover-change (LULCC): Current status, challenges and prospects. Sci. Total Environ. 822, 153559 (2022).
Zanaga, D. et al. ESA WorldCover 10 m 2020 v100. Zenodo https://doi.org/10.5281/zenodo.5571936 (2021).
Chen, J., Cao, X., Peng, S. & Ren, H. Analysis and Applications of GlobeLand30: A Review. ISPRS Int. J. Geo-Inf. 6, 230 (2017).
Karra, K. et al. Global land use/land cover with Sentinel 2 and deep learning. in 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS 4704–4707 https://doi.org/10.1109/IGARSS47720.2021.9553499 (2021).
Brown, C. F. et al. Dynamic World, Near real-time global 10 m land use land cover mapping. Sci. Data 9, 251 (2022).
Venter, Z. S., Barton, D. N., Chakraborty, T., Simensen, T. & Singh, G. Global 10 m Land Use Land Cover Datasets: A Comparison of Dynamic World, World Cover and Esri Land Cover. Remote Sens. 14, 4101 (2022).
Szantoi, Z. et al. Addressing the need for improved land cover map products for policy support. Environ. Sci. Policy 112, 28–35 (2020).
Zhang, C. & Li, X. Land Use and Land Cover Mapping in the Era of Big Data. Land 11, 1692 (2022).
SENAMHI. Caracterización Climática de Las Regiones Apurímac y Cusco. (2012).
SUNASS. Documento de orientación para la implementación de los Merese Hídricos. https://www.sunass.gob.pe/sunass-te-informa/publicaciones/documento-orientacion-implementacion-merese-hidricos/ (2021).
Gaetano, R. et al. The MORINGA processing chain: Automatic object-based land cover classification of tropical agrosystems using multi-sensor satellite imagery. https://agritrop.cirad.fr/594650/ (2019).
Dupuy, S., Defrise, L., Gaetano, R., Andriamanga, V. & Rasoamalala, E. Land cover maps of Antananarivo (capital of Madagascar) produced by processing multisource satellite imagery and geospatial reference data. Data Brief 31, 105952 (2020).
Dupuy, S. et al. Analyzing Urban Agriculture’s Contribution to a Southern City’s Resilience through Land Cover Mapping: The Case of Antananarivo, Capital of Madagascar. Remote Sens. 12, 1962 (2020).
Jolivot, A. Cartographie de l’occupation du sol de la zone des Niayes (Sénégal) en 2018 (1.5 m de résolution). CIRAD Dataverse https://doi.org/10.18167/DVN1/KJAS6S (2021).
Dupuy, S., Lelong, C. & Gaetano, R. Rapport méthodologique: Cartographie de l’occupation du sol sur le site des NIPPES à Haïti. https://agritrop.cirad.fr/597938/ (2021).
Gaetano, R., Dupuy, S. & Lelong, C. Nippes - Haïti - 2020, Land Cover Map at high spatial resolution. CIRAD Dataverse https://doi.org/10.18167/DVN1/ZAN2WN (2021).
MINAM. Mapa nacional de ecosistemas del Perú - Memoria descriptiva. https://geoservidor.minam.gob.pe/wp-content/uploads/2017/06/MEMORIA-DESCRIPTIVA-MAPA-DE-ECOSISTEMAS.pdf (2019).
MINAM. Mapa nacional de cobertura vegetal: Memoria descriptiva. http://www.minam.gob.pe/patrimonio-natural/wp-content/uploads/sites/6/2013/10/MAPA-NACIONAL-DE-COBERTURA-VEGETAL-FINAL.compressed.pdf (2015).
Cuadros Loayza, J. A., Peña Caytuiro, R. & Valenzuela Trujillo, J. J. Memoria Descriptiva de La Cobertura y Uso de La Tierra Del Proceso de Meso Zonificación Ecológica Económica de La Región Apurímac. 201 http://sigrid.cenepred.gob.pe/docs/PARA%20PUBLICAR/OTROS/Estudio_de_cobertura_y_uso_de_la_tierra_del_proceso_de_meso_ZEE_de_la_region_Apurimac.pdf (2016).
UE-Prodesarrollo Apurímac. Caracterización Ecológica Económica de La Microcuenca Mariño. (2010).
Wessel, B. TanDEM-X Ground Segment – DEM Products Specification Document. https://tandemx-science.dlr.de/ (2018).
Main-Knorn, M. et al. Sen2Cor for Sentinel-2. in Image and Signal Processing for Remote Sensing XXIII (eds. Bruzzone, L., Bovolo, F. & Benediktsson, J. A.) 3 https://doi.org/10.1117/12.2278218 (SPIE, Warsaw, Poland, 2017).
Haralick, R. M., Shanmugam, K. & Dinstein, I. Textural Features for Image Classification. IEEE Trans. Syst. Man Cybern. SMC-3, 610–621 (1973).
Beguet, B., Chehata, N., Boukir, S. & Guyon, D. Classification of forest structure using very high resolution Pleiades image texture. in 2014 IEEE Geoscience and Remote Sensing Symposium 2324–2327 https://doi.org/10.1109/IGARSS.2014.6946936 (2014).
Rajendran, G. B., Kumarasamy, U. M., Zarro, C., Divakarachari, P. B. & Ullo, S. L. Land-Use and Land-Cover Classification Using a Human Group-Based Particle Swarm Optimization Algorithm with an LSTM Classifier on Hybrid Pre-Processing Remote-Sensing Images. Remote Sens. 12, 4135 (2020).
Baatz, M. & Schäpe, A. Multiresolution Segmentation: an optimization approach for high quality multi-scale image segmentation. in (2000).
Lassalle, P., Inglada, J., Michel, J., Grizonnet, M. & Malik, J. Large scale region-merging segmentation using the local mutual best fitting concept. in 2014 IEEE Geoscience and Remote Sensing Symposium 4887–4890 https://doi.org/10.1109/IGARSS.2014.6947590 (2014).
Kuhn, M. & Johnson, K. Applied Predictive Modeling. https://doi.org/10.1007/978-1-4614-6849-3 (Springer New York, New York, NY, 2013).
Lantz, B. Machine Learning with R: Learn How to Use R to Apply Powerful Machine Learning Methods and Gain an Insight into Real-World Applications. (Packt Publ, Birmingham, 2013).
Gislason, P. O., Benediktsson, J. A. & Sveinsson, J. R. Random Forests for land cover classification. Pattern Recognit. Lett. 27, 294–300 (2006).
Baccini, A., Laporte, N., Goetz, S. J., Sun, M. & Dong, H. A first map of tropical Africa’s above-ground biomass derived from satellite imagery. Environ. Res. Lett. 3, 045011 (2008).
Karlson, M. et al. Mapping Tree Canopy Cover and Aboveground Biomass in Sudano-Sahelian Woodlands Using Landsat 8 and Random Forest. Remote Sens. 7, 10017–10041 (2015).
Deng, C. & Wu, C. The use of single-date MODIS imagery for estimating large-scale urban impervious surface fraction with spectral mixture analysis and machine learning techniques. ISPRS J. Photogramm. Remote Sens. 86, 100–110 (2013).
Xia, N., Cheng, L. & Li, M. Mapping Urban Areas Using a Combination of Remote Sensing and Geolocation Data. Remote Sens. 11, 1470 (2019).
Fraser, B. T. & Congalton, R. G. Monitoring Fine-Scale Forest Health Using Unmanned Aerial Systems (UAS) Multispectral Models. Remote Sens. 13, 4873 (2021).
Ozigis, M. S., Kaduk, J. D. & Jarvis, C. H. Mapping terrestrial oil spill impact using machine learning random forest and Landsat 8 OLI imagery: a case site within the Niger Delta region of Nigeria. Environ. Sci. Pollut. Res. 26, 3621–3635 (2019).
Wang, H., Zhao, Y., Pu, R. & Zhang, Z. Mapping Robinia Pseudoacacia Forest Health Conditions by Using Combined Spectral, Spatial, and Textural Information Extracted from IKONOS Imagery and Random Forest Classifier. Remote Sens. 7, 9020–9044 (2015).
Molnar, C. Interpretable Machine Learning. (2020).
Livia Alejandro, L. et al. Atlas de la superficie agrícola del Perú. https://repositorio.ana.gob.pe/handle/20.500.12543/4895 (2021).
Pekel, J.-F., Cottam, A., Gorelick, N. & Belward, A. S. High-resolution mapping of global surface water and its long-term changes. Nature 540, 418–422 (2016).
OpenStreetMap contributors. Planet dump retrieved from https://planet.osm.org (2022).
Vallet, A. High resolution land use and land cover map for the greater Mariño watershed in 2019. Recherche Data Gouv https://doi.org/10.57745/DDP1ZR (2024).
Janitza, S. & Hornung, R. On the overestimation of random forest’s out-of-bag error. PLOS ONE 13, e0201904 (2018).
Makowski, D., Brun, F., Doutart, E., Duyme, F. & Jabri, M. E. Data science pour l’agriculture et l’environnement: Méthodes et applications avec R et Python. (Ellipses, 2021).
Tharwat, A. Classification assessment methods. Appl. Comput. Inform. 17, 168–192 (2020).
Cohen, J. A Coefficient of Agreement for Nominal Scales. Educ. Psychol. Meas. 20, 37–46 (1960).
Pontius, R. G. & Millones, M. Death to Kappa: birth of quantity disagreement and allocation disagreement for accuracy assessment. Int. J. Remote Sens. 32, 4407–4429 (2011).
Ji, L., Gong, P., Geng, X. & Zhao, Y. Improving the Accuracy of the Water Surface Cover Type in the 30 m FROM-GLC Product. Remote Sens. 7, 13507–13527 (2015).
Myeong, S., Nowak, D. J., Hopkins, P. F. & Brock, R. H. Urban cover mapping using digital, high-spatial resolution aerial imagery. Urban Ecosyst. 5, 243–256 (2001).
Van de Voorde, T., De Genst, W. & Canters, F. Improving Pixel-based VHR Land-cover Classifications of Urban Areas with Post-classification Techniques. Photogramm. Eng. Remote Sens. 73 (2007).
Ozesmi, S. L. & Bauer, M. E. Satellite remote sensing of wetlands. Wetl. Ecol. Manag. 10, 381–402 (2002).
Rapinel, S. et al. National wetland mapping using remote-sensing-derived environmental variables, archive field data, and artificial intelligence. Heliyon 9, e13482 (2023).
Yang, X. et al. Detection and characterization of coastal tidal wetland change in the northeastern US using Landsat time series. Remote Sens. Environ. 276, 113047 (2022).
Zhou, X.-X. et al. Research on remote sensing classification of fruit trees based on Sentinel-2 multi-temporal imageries. Sci. Rep. 12, 11549 (2022).
Rouse, J. W., Haas, R. H., Schell, J. A. & Deering, D. W. Monitoring vegetation systems in the Great Plains with ERTS. (1974).
Pearson, R. L. & Miller, L. D. Remote Mapping of Standing Crop Biomass for Estimation of the Productivity of the Shortgrass Prairie. 1355 (1972).
Barnes, E. M. et al. Coincident detection of crop water stress, nitrogen status and canopy density using ground-based multispectral data. in Proceedings of the 5th International Conference on Precision Agriculture and other resource management July 16-19, 2000, Bloomington, MN USA (2000).
McFeeters, S. K. The use of the Normalized Difference Water Index (NDWI) in the delineation of open water features. Int. J. Remote Sens. 17, 1425–1432 (1996).
Gao, B. NDWI—A normalized difference water index for remote sensing of vegetation liquid water from space. Remote Sens. Environ. 58, 257–266 (1996).
Xu, H. Modification of normalised difference water index (NDWI) to enhance open water features in remotely sensed imagery. Int. J. Remote Sens. 27, 3025–3033 (2006).
Escadafal, R. Remote sensing of arid soil surface color with Landsat thematic mapper. Adv. Space Res. 9, 159–163 (1989).
Inglada, J. et al. Assessment of an Operational System for Crop Type Map Production Using High Temporal and Spatial Resolution Satellite Optical Imagery. Remote Sens. 7, 12356–12379 (2015).

Acknowledgements

The authors are grateful to DINAMIS for providing them with access to Pléiades images. They also thank the European Space Agency for granting them free access to TanDEM-X images. This study was funded by CLAND and MSH Paris Saclay (grant 20-EM-06). Without the support and resources provided by these organizations, this research would not have been possible. The authors are also thankful to Yésica Quispe Conde and Jaime J. Valenzuela Trujillo for their help with the organization of fieldwork and their useful feedback on the LULC nomenclature and classification.

Author information

Authors and Affiliations

Université Paris-Saclay, CNRS, AgroParisTech, Ecologie Systématique et Evolution, 91190, Gif-sur-Yvette, France
Améline Vallet & Matthieu Verlynde
Université Paris-Saclay, AgroParisTech, CNRS, Ecole des Ponts ParisTech, Cirad, EHESS, UMR CIRED, 94130, Nogent-sur-Marne, France
Améline Vallet & Matthieu Verlynde
TETIS, Univ Montpellier, AgroParisTech, CIRAD, CNRS, INRAE, 34398, Montpellier, France
Stéphane Dupuy & Raffaele Gaetano

Authors

Améline Vallet
Stéphane Dupuy
Matthieu Verlynde
Raffaele Gaetano

Contributions

A.V. supervised the study and collected the field data. A.V. and S.D. designed the methodology and processed the data. M.V. carried out the post-processing, validation and visualization. A.V. prepared the manuscript, with contributions from all co-authors.

Corresponding author

Correspondence to Améline Vallet.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary materials

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Vallet, A., Dupuy, S., Verlynde, M. et al. Generating high-resolution land use and land cover maps for the greater Mariño watershed in 2019 with machine learning. Sci Data 11, 915 (2024). https://doi.org/10.1038/s41597-024-03750-x

Received: 02 February 2024
Accepted: 05 August 2024
Published: 23 August 2024
DOI: https://doi.org/10.1038/s41597-024-03750-x
