Introduction

Biodiversity monitoring is critical because it serves as a foundation for assessing ecosystem integrity, disturbance responses, and the effectiveness of conservation and recovery efforts1,2,3. Traditionally, biodiversity monitoring has relied on empirical data collected manually4, which is time-consuming, labor-intensive, and costly. Moreover, such data can contain sampling biases, because observer subjectivity and animals' responses to observer presence are difficult to control for5. These constraints severely limit our ability to estimate the abundance of natural populations and community diversity, and hence to interpret their dynamics and interactions. Human counts of wildlife tend to greatly underestimate the number of individuals present6,7. Furthermore, population estimates based on extrapolation from a small number of point counts are subject to substantial uncertainties and may fail to represent the spatio-temporal variation in ecological interactions (e.g. predator-prey), leading to incorrect predictions or extrapolations7,8. While human-based data collection has a long history of providing the foundation for much of our knowledge of where and why animals dwell and how they interact, present challenges in wildlife ecology and conservation are revealing the limitations of traditional monitoring methods7.

Recent improvements in imaging technology have dramatically increased data-gathering capacity, lowering costs and widening the scope and coverage compared to traditional approaches, and opening new paths for large-scale ecological studies7. Many formerly inaccessible places of conservation interest can now be examined using high-resolution remote sensing9, and digital devices such as camera traps10,11,12 collect vast volumes of data non-invasively. Camera traps are low-cost, simple to set up, and provide high-resolution image sequences of the species that trigger them, allowing researchers to identify animal species, their behavior, and their interactions, including predator-prey, competition and facilitation. Cameras are already used to monitor biodiversity around the world, including underwater13,14, making camera traps one of the most widely used sensors12. In biodiversity conservation initiatives, camera trap imaging is quickly becoming the gold standard10,11, as it enables monitoring of unparalleled precision across enormous expanses of land.

However, analyzing the massive amounts of data provided by these devices is challenging. The volume of image data generated by modern acquisition technologies for ecological studies is too large to be processed and analyzed at scale to derive compelling ecological conclusions15. Although online crowd-sourcing platforms can be used to annotate images16, such systems are unsustainable because of the exponential growth in data acquisition and the expert knowledge that annotation most often requires. In other words, we need tools that can automatically extract relevant information from the data and help us reliably understand how ecological processes act across space and time.

Machine learning has proven to be a suitable methodology for extracting ecological insights from massive amounts of data17. Detection and counting pipelines have evolved from imprecise extrapolations of manual counts to machine learning-based systems with high detection rates18,19,20. Using deep learning (DL) to detect and classify species for the purpose of population estimation is becoming increasingly common18,19,20,21,22,23,24,25,26,27. DL models, most often with convolutional neural network (CNN)-like architectures18,19,20,22,24,26, have been the standard thus far in biodiversity monitoring. Although these models perform acceptably, they often detect minority classes unreliably22, require carefully tailored model selection and training as well as large amounts of data20, and have a non-negligible error rate that negatively influences the modeling and interpretation of the resulting data. It has therefore been argued that many DL-based monitoring systems cannot be deployed in a fully autonomous way if one wants to ensure a sufficiently reliable classification28,29.

Recently, following their success in natural language processing applications30, transformer architectures were adapted to computer vision. The resulting architectures, known as vision transformers (ViTs)31, differ from CNN-based models in that they use image patches, rather than pixels, as units of information, and employ an attention mechanism to weigh the importance of different parts of the input. Vision transformers have demonstrated encouraging results in several computer vision tasks, outperforming the state of the art (SOTA) on several paradigmatic datasets and paving the way for new research areas within deep learning.

In this article, we use a specific kind of ViT, Data-efficient image Transformers (DeiTs)32, for the classification of biodiversity images such as plankton, coral reefs, insects, birds and large animals (though our approach can also be applied in other domains). We show that while the single-model performance of DeiTs matches that of alternative approaches, ensembles of DeiTs (EDeiTs) achieve very good performance without requiring any hyperparameter tuning. We show that this is mainly due to a higher disagreement between the predictions of independent DeiT models, compared to other model classes.

Results

A new state of the art

We trained EDeiTs on several ecological datasets, spanning from microorganisms to large animals, including images in color as well as in black-and-white, with and without background, and including datasets of diverse sizes and with varying numbers of classes, both balanced and unbalanced. Details on the datasets are provided in section "Data". As shown in Fig. 1, the error rates of EDeiTs are either close to or smaller than those of the previous SOTA. In the SI, we provide a detailed comparison between our models' accuracy and F1-score and those of the previous SOTA. Details on models and training are provided in sections "Models", "Implementation" and "Ensemble learning".

Figure 1

Comparing EDeiTs to the previous SOTA. For each dataset, we show the error, which is the fraction of misclassified test images (\(1-\text {accuracy}\)). The error of the existing SOTA model is shown in orange. For the ensembles of DeiTs, we show two ways of combining the individual learners: through arithmetic (blue) and geometric (purple) averaging.

Individual models comparison

We now show that the better performance of EDeiTs is not a property of the single models, but rather stems from the ensembling step. To do this, we focus on the ZooLake dataset, where the previous state of the art is an ensemble of CNN models22 consisting of EfficientNet, MobileNet and DenseNet architectures. In Table 1, we show the single-model performances of these architectures, and those of the DeiT-Base model ("Implementation" section), which is the one we used for the results in Fig. 1. The accuracies and (macro-averaged) F1-scores of the two families of models (CNN and DeiT), when compared individually, are in a similar range: the accuracies are between 0.96 and 0.97, and the F1-scores between 0.86 and 0.90.

Table 1 Summary of the performance of the individual models on the ZooLake dataset.

Ensemble comparison

We train each of the CNNs in Table 1 four times (as described in Ref.22), with different realizations of the initial conditions, and show their arithmetic-average ensemble and geometric-average ensemble ("Ensemble learning" section) in the last two columns. We also show the performance of the ensemble model developed in Ref.22, which ensembles over the six CNN architectures shown. We compare these with the ensembled DeiT-Base model, obtained through arithmetic and geometric average ensembling over three different initial conditions of the model weights.

As can be expected, the individual model performance improves noticeably upon ensembling. However, the improvement is not the same across all models. The CNN family reaches a maximum F1-score of 0.920, obtained by the ensemble of EfficientNet-B7 networks across initial conditions. When the best CNNs are picked and ensembled (Best_6_avg), the F1-score reaches 0.924. The DeiT ensemble was carried out without picking the best model across different DeiTs, but still reaches a similar classification performance (with the F1-score reaching 0.924) with no hyperparameter tuning.

Why DeiT models ensemble better

To understand the better performance of DeiTs upon ensembling, we compare CNNs with DeiTs when ensembling over three models. For CNNs, we take the best EfficientNet-B7, MobileNet and DenseNet-121 models from Ref.22 (each had the best validation performance out of 4 independent runs). For DeiTs, we train a DeiT-Base model three times (with different initial weight configurations) and ensemble over those three.

Since the only input to average ensembling is the confidence vectors of the models, we identify two possible reasons why EDeiTs perform better, despite the single-model performance being similar to that of CNNs:

(a) Different CNN models tend to agree on the same wrong answer more often than DeiTs do.

(b) The confidence profile of the DeiT predictions is better suited for average ensembling than that of the other models.

We will see that both (a) and (b) are true, though the dominant contribution comes from (a). In Fig. 2a we show a histogram of how many models gave a right (R) or wrong (W) prediction (e.g. RRR denotes three correct predictions within the individual models, RRW denotes one mistake, and so on).

In Fig. 2b,c we show the same quantity, but restricted to the examples that were correctly classified by the arithmetic- and geometric-averaged ensemble models, respectively. The CNN ensemble has more RRR cases (2523) than the EDeiT (2515), but when the three models show some disagreement, the EDeiTs catch up with the CNN ensembles.
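
As an illustration of how the counts in Fig. 2 are obtained, the following sketch tallies the R/W agreement patterns of three models from their predicted classes and the ground-truth labels (the arrays below are synthetic stand-ins, not our actual predictions):

```python
import numpy as np
from collections import Counter

# Synthetic stand-ins: 'labels' holds the ground truth, 'preds' holds the class
# predicted by each of the three models for every test image.
rng = np.random.default_rng(0)
labels = rng.integers(0, 35, size=1000)
preds = np.stack([np.where(rng.random(1000) < 0.96, labels, rng.integers(0, 35, size=1000))
                  for _ in range(3)])

n_correct = (preds == labels).sum(axis=0)            # 0..3 correct models per test image
patterns = Counter({"RRR": 0, "RRW": 0, "RWW": 0, "WWW": 0})
for k in n_correct:
    k = int(k)
    patterns["R" * k + "W" * (3 - k)] += 1           # e.g. 2 correct models -> "RRW"
print(patterns)                                      # counts corresponding to the bars of Fig. 2a
```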

Figure 2

Comparison between three-model ensembles based on CNNs and on DeiTs on the ZooLake test set. The bar heights indicate how often each combination (RRR, RRW, RWW, WWW) appeared. RRR indicates that all the models gave the right answer, RRW means that one model gave a wrong answer, and so on. The numbers below each bar explicitly indicate its height. In panel (a) we consider the whole test set; in panel (b) we only consider the examples that were correctly classified by the arithmetic ensemble average, and in panel (c) those correctly classified through the geometric ensemble average.

In particular, the correct RWW cases are about 2.0× more common for the EDeiT, under both geometric and arithmetic averaging (geometric: 8 for the CNN ensemble vs. 15 for the EDeiT; arithmetic: 8 vs. 16). In the SI (See Footnote 1) we show that the probability that an RWW case results in a correct ensemble prediction depends on the ratio between the second and third components of the ensembled confidence vector, and that the better performance of DeiT ensembles in this situation is explained by the shape of the confidence vector.

We thus measure the mutual agreement between different models. To do so, we take the confidence vectors \(\vec c_0\), \(\vec c_1\) and \(\vec c_2\) of the three models, and calculate the similarity

$$\begin{aligned} S = \frac{1}{3}(\vec c_0\cdot \vec c_1+\vec c_0\cdot \vec c_2+\vec c_1\cdot \vec c_2)\,, \end{aligned}$$
(1)

averaged over the full test set. For DeiTs, we have \(S=0.799\pm 0.004\), while for CNNs the similarity is much higher, \(S=0.945\pm 0.003\). This is independent of which CNN models we use: if we ensemble Eff2, Eff5 and Eff6, we obtain \(S=0.948\pm 0.003\). The lower correlation between predictions of different DeiT learners is even more striking when one considers that the three DeiT learners share the same architecture (the same DeiT-Base model trained three times), whereas the CNN learners have different architectures. This suggests that the CNN predictions focus on similar sets of characteristics of the image, so when they fail, all models fail similarly. On the contrary, the predictions of separate DeiTs are more independent. Given a fixed budget of single-model correct answers, RWW combinations are more likely to result in a correct ensemble answer when the two wrong answers are different (see SI (See Footnote 1)). The situation is analogous for geometric averaging (Fig. 2c).
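
A minimal sketch of how the similarity of Eq. (1) can be computed from the models' softmax outputs (the array names are ours and hypothetical):

```python
import numpy as np

def mean_pairwise_similarity(c0, c1, c2):
    """Eq. (1): mean pairwise dot product between the confidence vectors of three
    models, averaged over the test set. Each c_i has shape (n_test, n_classes)."""
    s = ((c0 * c1).sum(axis=1) + (c0 * c2).sum(axis=1) + (c1 * c2).sum(axis=1)) / 3.0
    return s.mean()
```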

Comparison to vanilla ViTs: For completeness, in the SI (See Footnote 1) we also provide a comparison between DeiTs32 and vanilla ViTs31. Here, too, we find analogous results: despite similar single-model performance, DeiTs ensemble better, and this can again be attributed to the lower similarity between predictions coming from independent models. This suggests that the better performance of DeiT ensembles is not related to the attention mechanism of ViTs, but rather to the distillation process that is characteristic of DeiTs ("Models" section).

Discussion

We presented Ensembles of Data-efficient Image Transformers (EDeiTs) as a standard go-to method for image classification. Though the method we presented is valid for any kind of image, we provided a proof of concept of its validity on biodiversity images. Besides being simple to train and deploy (we performed no dataset-specific tuning), EDeiTs achieve results comparable to those of earlier, carefully tuned state-of-the-art methods, and even outperform them in classifying biodiversity images in four of the ten datasets.

Focusing on a single dataset, we compared DeiT with CNN models (analogous results stem from a comparison with vanilla ViTs). Despite the similar performance of individual CNN and DeiT models, ensembling benefits DeiTs to a larger extent. We attributed this to two mechanisms. To a minor extent, the confidence vectors of DeiTs are less peaked on the highest value, which slightly benefits ensembling. To a major extent, the predictions of CNN models are very similar to each other regardless of the architecture (and of whether the prediction is wrong or right), whereas different DeiTs have a lower degree of mutual agreement, which turns out to be beneficial for ensembling. This greater independence between DeiT learners also suggests that the loss landscape of DeiTs is qualitatively different from that of CNNs, and that DeiTs might be particularly suitable for algorithms that average the model weights throughout learning, such as stochastic weight averaging33, since different weight configurations seem to interpret the image in different ways.

Unlike many kinds of ViTs, the DeiT models we used have a number of parameters similar to that of CNNs, and require comparable computational power to train. Besides requiring similar deployment efforts while delivering higher performance, DeiTs have the advantage of being more straightforwardly interpretable by ecologists than CNNs, thanks to the attention map that characterizes transformers. The attention mechanism makes it possible to effortlessly identify where in the image the model focused its attention (Fig. 3), rendering DeiTs more transparent and controllable by end users.

Figure 3

Examples of DeiTs identifying images from different datasets: (a) Stanford Dogs, (b) Sri Lankan tiger beetles, (c) Florida Wildtrap, and (d) NA-Birds. In each panel, the original image is shown on the left, while the right shows where our model is paying attention while classifying the species in the image.

All these observations position EDeiTs as a solid go-to method for the classification of ecology-monitoring images. Though EDeiTs are likely to be an equally solid method in other domains, we do not expect them to beat the state of the art on mainstream datasets such as CIFAR34 or ImageNet35. In fact, for such datasets, immense efforts were made to achieve the state of the art, the top architectures are heavily tailored to these datasets36, and their training required huge numerical efforts. Even reusing those same top architectures, it is hard to achieve high single-model performance with simple training protocols and moderate computational resources. In addition, while ensembling provides benefits37, well-tailored architectural choices can provide the same benefits38. Therefore, the SOTA models trained on these datasets are expected to benefit less from ensembling.

Finally, we note that the nominal test performance of machine learning models often decreases when the models are deployed on real-world data. This phenomenon, called data shift, can essentially be attributed to the fact that the training data often do not adequately represent the distribution of images sampled at the moment of deployment39. This can be due to various reasons (sampling method, instrument degradation, seasonal effects, and so on) and is hard to control for. However, it was recently shown that vision transformer models (here, ViT and DeiT) are more robust to data shift40,41,42 and to other kinds of perturbations such as occlusions41, which is a further reason to favor EDeiTs for ecological monitoring.

Methods

Data

We tested our models on ten publicly available datasets. In Fig. 4 we show examples of images from each of the datasets. When applicable, the training and test splits were kept the same as in the original dataset. The ZooScan, Kaggle, EILAT, and RSMAS datasets lack a dedicated training and test set; in these cases, benchmarks come from k-fold cross-validation43,44, and we followed the exact same procedures to allow for a fair comparison.
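
For these four datasets, the evaluation protocol can be sketched as follows; this is a minimal illustration with placeholder labels standing in for the real annotations, not our actual pipeline:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.random.randint(0, 14, size=766)      # placeholder labels (e.g. an RSMAS-sized dataset)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

for fold, (train_idx, test_idx) in enumerate(skf.split(np.zeros_like(y), y)):
    # train on train_idx, evaluate on test_idx, then average the per-fold metrics
    print(fold, len(train_idx), len(test_idx))
```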

Figure 4

Examples of images from each of the datasets: (a) RSMAS, (b) EILAT, (c) ZooLake, (d) WHOI, (e) Kaggle, (f) ZooScan, (g) NA-Birds, (h) Stanford Dogs, (i) Sri Lankan Beetles, (j) Florida Wildtrap.

RSMAS This is a small coral dataset of 766 RGB image patches of \(256\times 256\) pixels each45. The patches were cropped out of larger images obtained by the University of Miami's Rosenstiel School of Marine and Atmospheric Science. These images were captured using various cameras in various locations. The data are separated into 14 unbalanced classes, whose labels correspond to the Latin names of the coral species. The current SOTA44 for this dataset uses an ensemble of the 11 best-performing CNN models, chosen with a sequential forward feature selection (SFFS) approach. Since an independent test set is not available, performance is benchmarked using 5-fold cross-validation.

EILAT This is a coral dataset of 1123 RGB image patches of \(64\times 64\) pixels45, created from larger images taken of coral reefs near Eilat in the Red Sea. The dataset is partitioned into eight classes, with an unequal distribution of data. The class names correspond to abbreviated scientific names of the coral species. The current SOTA44 for this dataset uses an ensemble of the 11 best-performing CNN models, as for the RSMAS dataset, and 5-fold cross-validation for benchmarking.

ZooLake This dataset consists of 17,943 images of lake plankton from 35 classes, acquired using a Dual-magnification Scripps Plankton Camera (DSPC) in Lake Greifensee (Switzerland) between 2018 and 202014,46. The images are in color, with a black background and an uneven class distribution. The current SOTA22 on this dataset is based on a stacking ensemble of 6 CNN models, evaluated on an independent test set.

WHOI This dataset47 contains images of marine plankton acquired by Imaging FlowCytobot48 from Woods Hole Harbor water. The sampling was done between late fall and early spring in 2004 and 2005. It contains 6600 greyscale images of different sizes, from 22 manually categorized plankton classes with an equal number of samples per class. The majority of the classes belong to phytoplankton at the genus level. This dataset was later extended to include 3.4M images and 103 classes. The WHOI subset that we use was previously used for benchmarking plankton classification models43,44. The current SOTA22 on this dataset is based on an average ensemble of 6 CNN models, evaluated on an independent test set.

Kaggle-plankton The original Kaggle-plankton dataset consists of plankton images acquired with In-situ Ichthyoplankton Imaging System (ISIIS) technology between May and June 2014 in the Straits of Florida. The dataset was published on Kaggle (https://www.kaggle.com/c/datasciencebowl), with images originating from the Hatfield Marine Science Center at Oregon State University. A subset of the original Kaggle-plankton dataset was published by Ref.43 to benchmark plankton classification tasks. This subset comprises 14,374 greyscale images from 38 classes; the distribution among classes is not uniform, but each class has at least 100 samples. The current SOTA22 uses an average ensemble of 6 CNN models and benchmarks performance using 5-fold cross-validation.

ZooScan The ZooScan dataset consists of 3771 greyscale plankton images acquired using the ZooScan technology from the Bay of Villefranche-sur-mer49. This dataset was used for benchmarking classification models in previous plankton recognition papers43,44. The dataset consists of 20 classes, with a variable number of samples per class, ranging from 28 to 427. The current SOTA22 uses an average ensemble of 6 CNN models and benchmarks performance using 2-fold cross-validation.

NA-Birds NA-Birds50 is a collection of 48,000 captioned pictures of North America's 400 most commonly observed bird species. For each species, more than 100 images are available, with distinct annotations for males, females, and juveniles, totaling 555 visual categories. The current SOTA51, called TransFG, modifies the pure ViT model by adding contrastive feature learning and a part-selection module that replaces the original input sequence to the transformer layer with tokens corresponding to informative regions, so that the distance between representations of confusing subcategories can be enlarged. An independent test set is used for benchmarking model performance.

Stanford Dogs The Stanford Dogs dataset comprises 20,580 color images of 120 dog breeds from around the globe, separated into 12,000 training images and 8,580 test images52. The current SOTA51 uses the modified ViT model TransFG, described above for the NA-Birds dataset, and benchmarks performance on an independent test set.

Sri Lankan Beetles The arboreal tiger beetle dataset53 consists of 380 images taken between August 2017 and September 2020 at 22 locations in Sri Lanka, covering all climatic zones and provinces, as well as 14 districts. The nine species discovered belong to the genera Tricondyla (3 species), Derocrania (5 species), and Neocollyris (1 species), six of them endemic. The current SOTA53 uses the CNN-based SqueezeNet architecture, trained from ImageNet pre-trained weights, with performance benchmarked on an independent test set.

Florida Wild Traps The wildlife camera-trap54 classification dataset comprises 104,495 images featuring visually similar species, varied lighting conditions, a skewed class distribution, and samples of endangered species, such as the Florida panther. The images were collected at two locations in Southwestern Florida and are categorized into 22 classes. The current SOTA54 uses the CNN-based ResNet-50 architecture, with performance benchmarked on an independent test set.

Models

Vision transformers (ViTs)31 are an adaptation to computer vision of the Transformers, which were originally developed for natural language processing30. Their distinguishing feature is that, instead of exploiting translational symmetry, as CNNs do, they use an attention mechanism that identifies the most relevant parts of an image. ViTs have recently outperformed CNNs in image classification tasks where vast amounts of training data and processing resources are available30,55. However, for the vast majority of use cases and users, where data and/or computational resources are limiting, ViTs are essentially untrainable, even when the network architecture is defined and no architectural optimization is required. To address this issue, Data-efficient Image Transformers (DeiTs) were proposed32: transformer models designed to be trained with much less data and far fewer computing resources32. In DeiTs, the transformer architecture is modified to allow native distillation56, in which a student neural network learns from the results of a teacher model; here, a CNN is used as the teacher model, and the pure vision transformer is used as the student network. All the DeiT models we report on here are DeiT-Base models32. The ViTs are ViT-B16, ViT-B32, and ViT-L32 models31.
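
To illustrate the distillation mechanism, below is a minimal sketch of the hard-label distillation objective introduced with DeiT32, in which the output of the student's distillation token is trained to match the hard predictions of a frozen CNN teacher. In this work we fine-tune DeiT models that were already pre-trained with this procedure, so the function is shown only to clarify the idea; the names are ours.

```python
import torch
import torch.nn.functional as F

def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, targets):
    """Sketch of DeiT-style hard-label distillation.

    cls_logits:     student output from the class token
    dist_logits:    student output from the distillation token
    teacher_logits: output of the frozen CNN teacher
    targets:        ground-truth labels
    """
    teacher_labels = teacher_logits.argmax(dim=1)              # teacher's hard predictions
    loss_cls = F.cross_entropy(cls_logits, targets)            # supervised term
    loss_dist = F.cross_entropy(dist_logits, teacher_labels)   # distillation term
    return 0.5 * loss_cls + 0.5 * loss_dist
```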

Implementation

To train our models, we used transfer learning57: we took a model that was already pre-trained on the ImageNet35 dataset, changed the last layers to match the number of classes, and then fine-tuned the whole network with a very low learning rate. All models were trained on two Nvidia RTX 2080 Ti GPUs.

DeiTs We used the DeiT-Base32 architecture, through the Python package TIMM58, which includes many well-known deep learning architectures along with their pre-trained weights computed from the ImageNet dataset35. We resized the input images to 224 × 224 pixels and then, to prevent the model from overfitting at the pixel level and to help it generalize better, we employed typical image augmentations during training, such as horizontal and vertical flips, rotations up to 180 degrees, small zooms of up to 20%, a small Gaussian blur, and shearing up to 10%. To handle class imbalance, we used class reweighting, which reweights the error on each example according to the prevalence of its class in the dataset59. We used sklearn utilities60 to calculate the class weights, which we employed during the training phase.
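
For concreteness, a minimal sketch of this setup is given below. It assumes the TIMM model name deit_base_patch16_224 and torchvision-style augmentations with placeholder labels; the exact augmentation parameters and data pipeline in our code may differ.

```python
import numpy as np
import timm
import torch
from torchvision import transforms
from sklearn.utils.class_weight import compute_class_weight

n_classes = 35   # e.g. ZooLake

# Pre-trained DeiT-Base with a freshly initialized head for our classes
model = timm.create_model("deit_base_patch16_224", pretrained=True, num_classes=n_classes)

# Training-time augmentations: flips, rotations, zoom, blur, shear
train_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.RandomRotation(180),
    transforms.RandomAffine(degrees=0, scale=(0.8, 1.2), shear=10),
    transforms.GaussianBlur(kernel_size=3),
    transforms.ToTensor(),
])

# Class reweighting: errors on rare classes weigh more in the loss
rng = np.random.default_rng(0)
train_labels = rng.integers(0, n_classes, size=5000)     # placeholder labels
weights = compute_class_weight("balanced", classes=np.unique(train_labels), y=train_labels)
criterion = torch.nn.CrossEntropyLoss(weight=torch.tensor(weights, dtype=torch.float32))
```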

The training phase started from the default pytorch61 initialization (Kaiming uniform initializer) and used an AdamW optimizer with cosine annealing62, a base learning rate of \(10^{-4}\), a weight decay of 0.03, and a batch size of 32; training was supervised using the cross-entropy loss. We trained with early stopping, interrupting training if the validation F1-score did not improve for 5 epochs; the learning rate was then dropped by a factor of 10. We iterated until the learning rate reached its final value of \(10^{-6}\). This procedure amounted to around 100 epochs in total, independently of the dataset. The training time varied with the size of the dataset, ranging from 20 min (Sri Lankan Beetles) to 9 h (Florida Wildtrap). We used the same procedure for all the datasets: no extra time was needed for hyperparameter tuning.
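
The schedule can be summarized by the following skeleton, where train_one_epoch and evaluate_f1 are hypothetical placeholders for the usual training and validation loops:

```python
import copy
import torch

def train_with_lr_drops(model, train_one_epoch, evaluate_f1,
                        learning_rates=(1e-4, 1e-5, 1e-6),
                        weight_decay=0.03, patience=5):
    """Skeleton of the schedule described above; train_one_epoch(model, optimizer)
    and evaluate_f1(model) are hypothetical placeholders for the usual loops."""
    best_f1, best_state = -1.0, copy.deepcopy(model.state_dict())
    for lr in learning_rates:                           # drop the learning rate by 10x per stage
        optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
        scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)
        epochs_without_improvement = 0
        while epochs_without_improvement < patience:    # early stopping on validation F1
            train_one_epoch(model, optimizer)
            scheduler.step()
            f1 = evaluate_f1(model)
            if f1 > best_f1:
                best_f1, best_state = f1, copy.deepcopy(model.state_dict())
                epochs_without_improvement = 0
            else:
                epochs_without_improvement += 1
        model.load_state_dict(best_state)               # resume from the best weights so far
    return model
```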

ViTs We implemented the ViT-B16, ViT-B32 and ViT-L32 models using the Python package vit-keras (https://github.com/faustomorales/vit-keras), which includes pre-trained weights computed from the ImageNet35 dataset and the Tensorflow library63.

First, we resized the input images to 128 × 128 pixels and employed typical image augmentations during training, such as horizontal and vertical flips, rotations up to 180 degrees, small zooms of up to 20%, a small Gaussian blur, and shearing up to 10%. To handle class imbalance, we calculated the class weights and used them during the training phase.

Using transfer learning, we imported the pre-trained model and froze all of its layers. We removed the last layer and replaced it with a dense layer with \(n_c\) outputs (\(n_c\) being the number of classes), preceded and followed by a dropout layer. We used the Keras-tuner64 with Bayesian optimization search65 to determine the best set of hyperparameters, which included the dropout rates, learning rate, and dense-layer parameters (10 trials and 100 epochs). The model with the best hyperparameters was then trained from the default tensorflow63 initialization (Glorot uniform initializer) for 150 epochs with early stopping, halting the training if the validation loss did not decrease for 50 epochs and retaining the model parameters with the lowest validation loss.
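
A minimal sketch of this setup, assuming the vit_keras.vit.vit_b16 entry point and illustrative dropout and dense-layer values (in practice these hyperparameters were chosen by the Keras-tuner search described above):

```python
import tensorflow as tf
from vit_keras import vit

n_classes = 35        # e.g. ZooLake
image_size = 128

# Pre-trained ViT-B16 backbone without its original classification head
backbone = vit.vit_b16(image_size=image_size, pretrained=True,
                       include_top=False, pretrained_top=False)
backbone.trainable = False            # freeze the pre-trained layers

model = tf.keras.Sequential([
    backbone,
    tf.keras.layers.Dropout(0.2),     # dropout rate tuned with Keras-tuner in practice
    tf.keras.layers.Dense(n_classes, activation="softmax"),
])

model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```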

CNNs The CNNs included DenseNet66, MobileNet67, EfficientNet-B268, EfficientNet-B568, EfficientNet-B668, and EfficientNet-B768 architectures. We followed the training procedure described in Ref.22 and carried out the training in Tensorflow.

Ensemble learning

We adopted average ensembling, which takes the confidence vectors of different learners and produces a prediction based on their average. With this procedure, all individual models contribute equally to the final prediction, irrespective of their validation performance. Ensembling usually results in superior overall classification metrics and model robustness69,70.

Given a set of n models with prediction vectors \(\vec c_i~(i=1,\ldots ,n)\), these are typically aggregated through an arithmetic average. The component of the ensembled confidence vector \(\vec c_{AA}\) related to class \(\alpha\) is then

$$\begin{aligned} c_{AA,\alpha } = \frac{1}{n}\sum _{i=1}^n c_{i,\alpha }\,. \end{aligned}$$
(2)

Another option is to use a geometric average,

$$\begin{aligned} c_{GA,\alpha } = \root n \of {\prod _{i=1}^n c_{i,\alpha }}\,. \end{aligned}$$
(3)

We could normalize the vector \(\vec c_{GA}\), but this is not relevant, since we are interested in its largest component, \(\displaystyle \max _\alpha (c_{GA,\alpha })\), and normalization affects all components in the same way. In fact, the nth root also does not change the relative magnitude of the components, so instead of \(\vec c_{GA}\) we can use a product rule: \(\displaystyle \max _\alpha (c_{GA,\alpha })=\max _\alpha (c_{PROD,\alpha })\), with \(\displaystyle c_{PROD,\alpha } = \prod _{i=1}^n c_{i,\alpha }\).
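
The two aggregation rules of Eqs. (2) and (3), and the equivalence between the geometric average and the product rule at the level of the predicted class, can be illustrated in a few lines (the confidence values below are made up):

```python
import numpy as np

# Confidence vectors of n models on one image: shape (n_models, n_classes)
conf = np.array([[0.7, 0.2, 0.1],
                 [0.5, 0.4, 0.1],
                 [0.6, 0.1, 0.3]])

c_arith = conf.mean(axis=0)                          # arithmetic average, Eq. (2)
c_geom = np.prod(conf, axis=0) ** (1 / len(conf))    # geometric average, Eq. (3)
c_prod = np.prod(conf, axis=0)                       # product rule

pred_arith = int(np.argmax(c_arith))
pred_geom = int(np.argmax(c_geom))
assert pred_geom == int(np.argmax(c_prod))   # the n-th root does not change the argmax
```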

While these two kinds of averaging are equivalent in the case of two models and two classes, they are generally different in any other case71. For example, it can easily be seen that the geometric average penalizes more strongly the classes for which at least one learner has a very low confidence value, a property that was termed the veto mechanism72 (note that, while in Ref.72 the term veto is used when the confidence value is exactly zero, here we use the term in a slightly looser way).