Introduction

Behavior is the external output of an animal’s nervous system. The ability to systematically observe, extract, and quantify an animal’s motion is a prerequisite for investigating and ultimately understanding its behavioral repertoire. Alterations of an organism’s natural behavior are a phenotypic readout of the neural and other molecular changes that cause them. To fully understand how neural mechanisms function, it is therefore essential to dissect their effects on an animal’s behavior.

Capturing behavior requires video acquisition systems that allow the entire posture of an organism, and its change over time, to be either viewed or inferred. One of the main challenges in obtaining complete and precise posture measurements is the occlusion of animal body parts in a 2D video recording, especially if more than one individual is being imaged. To resolve this, extensive 3D motion capture systems have been developed1, as well as methods that impute the occluded parts of the posture2.

These challenges have not yet been resolved for the model organism C. elegans. While imaging the nematode’s behavior is less complex than imaging larger organisms, and massively parallel recording systems can capture thousands of worms at a time3,4, there are currently no end-to-end methods that resolve worm postures when occlusions occur. The quantification of C. elegans strains’ behavior and the characterization of their phenotypes are therefore based on segments of worm motion in which the worm does not coil or intersect with another worm. As a result, a large portion of worm behavior, including group behavior, cannot be quantitatively analyzed.

Here we propose an automated method for C. elegans posture extraction from 2D video recordings. Based on a deep learning transformer architecture and a classical instance segmentation training objective, our solution correctly infers the outline of an individual worm body in overlapping and occluded configurations. We train the neural network on randomly generated image data, obtaining a solution that generalizes to various real datasets. With the segmentation outputs of our method we are able to correctly infer worm trajectories using a simple position-matching heuristic. WormSwin opens up new opportunities to study the full repertoire of C. elegans behavior, including behaviors such as mating that were previously inaccessible to quantitative analysis.

Related work

Over the past years, different methods for C. elegans detection and segmentation have been proposed, either as part of general approaches to tracking and behavioral studies or as stand-alone methods.

One of the first methods for automated worm tracking and behavior quantification was proposed by Baek et al.5. The method uses a computer-controlled tracker to record grayscale videos of single worms. The frames of a video are binarized based on the mean and standard deviation of pixel intensities and a predefined threshold. Features such as the foreground area or the movement between two frames are computed from the binarized videos and used as input to a classification algorithm6 that distinguishes different C. elegans strains. Swierczek et al.7 proposed a tracking approach called the Multi Worm Tracker. The method calculates a background estimate from pixel intensity values. Moving objects are found by searching for pixels darker than the background by a specific threshold; in the next frame, objects are searched for in the vicinity of their previous locations.

The arrival of deep learning offered new opportunities to build more accurate methods for worm segmentation and tracking. Javer et al.8 developed a multi-object tracking framework able to track C. elegans as well as fish and Drosophila larvae. The method requires manual tuning of segmentation parameters to perform best on the given recordings and comes with a graphical user interface for ease of use and evaluation of the results. Using the motion data, the framework extracts a large number of features characterizing worm movement. Hebert et al.9 proposed a pose estimation method for videos of single moving C. elegans in challenging poses such as coiling: a ResNetV210-like architecture predicts the worm centerline, and temporal information is used to determine the head and tail positions. Wählby et al.11 proposed a phenotype analysis toolbox based on the open-source CellProfiler12 project. To untangle clusters of worms, the authors describe them as a mathematical graph and, using a learned model of worm postures, search for the best representation of true worms. The worm posture model is based on a training dataset of isolated single C. elegans shapes and on computed angle-based shape descriptors. One downside of this approach is that unexpected phenotypes are likely to be discarded as debris. Banerjee et al.13 introduced a deep learning C. elegans tracking method in which detection is based on YOLOv514 and tracking on the Strong SORT algorithm15. For each detected object the method outputs a bounding box; threshold-based segmentation and skeletonization are then applied to infer the object’s shape. Fudickar et al.16 developed a two-shot segmentation method based on Mask R-CNN17 with a ResNet-10118 backbone to segment C. elegans imaged in petri dishes with a low-cost image capturing system. However, the method does not resolve overlapping worms and segments them as one object. Mais et al.19 developed a proposal-free instance segmentation method, called PatchPerPix, based on a convolutional neural network (CNN) trained to predict the shape of a C. elegans in a small patch of the whole image (local shape patches). The method uses a modified U-Net20 deep neural network and patch affinity graphs to reconstruct individual worm shapes. For each pixel the method predicts which shape patch it belongs to and, using a patch affinity graph, merges the patches to form complete instance shapes. Lalit et al.21 proposed an embedding-based instance segmentation method for 2D and 3D microscopy data, called EmbedSeg. The method is based on ERF-Net22 and predicts spatial embeddings of pixels, which are then clustered into object instances. Training this method requires an additional pre-processing step to generate object-centered image patches for every object. The method was tested on different datasets, including the C. elegans BBBC010 dataset.

The methods described above include one- and two-shot detectors. One-shot detector architectures like YOLO23 detect objects in a single step: pre-defined boxes (also called anchors) are placed onto a grid laid over the image and, for each box, the network predicts whether it contains an object. Two-shot detectors, on the other hand, consist of a region proposal network (RPN) proposing regions of interest (RoI) to a second network, which refines these proposals into the actual predictions. One-shot object detection methods (e.g.13) are in general less computationally expensive than two-shot approaches (e.g.16), although the latter achieve higher precision, especially in more challenging scenes. This is one of the reasons for the popularity of two-shot networks such as Mask R-CNN in the domain of instance segmentation.

Usually more than one box is predicted per object. To keep only the best-matching box, many methods apply non-maximum suppression (NMS): among highly overlapping predicted bounding boxes, those with lower confidence scores are removed as potential false positive detections of the same object. However, NMS can remove correct detections, especially in dense scenes where many objects overlap.

In this paper we address the problem of segmenting objects in dense scenes by combining the well-established two-shot detector architecture with a state-of-the-art vision transformer. To avoid the pitfalls of the NMS algorithm, we apply Soft Non-Maximum Suppression (Soft-NMS)24.
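For illustration, a minimal NumPy sketch of the linear variant of Soft-NMS (following the score-decay idea of ref.24, which also proposes a Gaussian penalty) might look as follows; the helper names are hypothetical and not part of any published implementation:

```python
import numpy as np

def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def soft_nms(boxes, scores, iou_thresh=0.5, score_thresh=0.001):
    """Linear Soft-NMS sketch: instead of discarding boxes that overlap the
    current best detection, their scores are decayed by (1 - IoU)."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float).copy()
    keep, idxs = [], list(range(len(scores)))
    while idxs:
        best = max(idxs, key=lambda i: scores[i])  # highest remaining score
        keep.append(best)
        idxs.remove(best)
        for i in idxs:
            iou = box_iou(boxes[best], boxes[i])
            if iou > iou_thresh:
                scores[i] *= (1.0 - iou)           # decay instead of removing
        idxs = [i for i in idxs if scores[i] > score_thresh]
    return keep
```

In contrast to hard NMS, a heavily overlapping but genuine second worm keeps a (reduced) score and can still survive the final score threshold.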

Methods

Datasets

Figure 1. Example images from the datasets used in this study: (a) synthetic dataset example with added ring, (b) synthetic dataset without ring, (c) BBBC010 dataset example with mostly alive C. elegans, (d) BBBC010 dataset patch with mostly dead C. elegans, (e) mating dataset with petri-dish ring, (f) zoomed-in mating dataset patch with many overlaps.

CSB-1 dataset

The CSB-1 dataset consists of 56 videos with a length of \(\sim 1.5\) min, a frame rate of 5 Hz and a frame size of \(912 \times 736\) px, which were generated to describe the new C. elegans csb-1 strain25. We annotated 10 of these videos, reserving nine for training and one for testing. The videos do not contain any visible petri-dish edges, have different backgrounds and show varying numbers of worms. We extracted frames from the videos using FFmpeg (https://ffmpeg.org) and used them to generate our synthetic training dataset described below.

Worms were annotated individually with a binary mask labeling foreground pixels, resulting in one mask image per worm per frame. These separate masks make it possible to mark all worms even in cases where pixels of individual C. elegans overlap. The labeled CSB-1 dataset contains more than 60,500 individual worm masks. C. elegans at the image borders were ignored during the labeling process. Our data is available at https://doi.org/10.5281/zenodo.7456803 as a rich resource for developing better methods for animal tracking.
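Detection frameworks typically consume COCO-style annotations, so each per-worm binary mask can be converted into an RLE-encoded annotation entry. A minimal pycocotools sketch (hypothetical helper, not the script used to build the dataset):

```python
import numpy as np
from pycocotools import mask as maskUtils

def worm_mask_to_coco_annotation(binary_mask, image_id, ann_id, category_id=1):
    """Encode one per-worm binary mask (2D NumPy array) as a COCO-style
    RLE annotation dict."""
    rle = maskUtils.encode(np.asfortranarray(binary_mask.astype(np.uint8)))
    area = float(maskUtils.area(rle))
    bbox = [float(v) for v in maskUtils.toBbox(rle)]  # x, y, width, height
    rle["counts"] = rle["counts"].decode("ascii")     # make JSON-serializable
    return {
        "id": ann_id,
        "image_id": image_id,
        "category_id": category_id,
        "segmentation": rle,
        "area": area,
        "bbox": bbox,
        "iscrowd": 0,
    }
```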

Synthetic dataset

For training the model, we generated a synthetic dataset using the nine annotated videos from the CSB-1 dataset described above. We automatically cut out foreground objects from the original grayscale images according to their polygon annotations, creating patches with a worm in the foreground and a transparent background. Additionally, we created background templates by removing all foreground objects with standard graphics software and filling the resulting holes with background patches taken from the same images.

The following pipeline was applied to create each image of the synthetic dataset; a minimal sketch of the compositing step is shown after the list:

1. Randomly select 5–30 foreground objects and a background template.

2. Randomly flip and rotate foreground objects and their corresponding annotations.

3. Apply blurring to foreground objects by averaging the pixel values using a \(2\times 2\) px kernel.

4. Place foreground object patches on the background image:

   (a) in 20% of cases, place a foreground object on top of another one;

   (b) in 80% of cases, place a foreground object randomly on the background image.
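A minimal Pillow-based sketch of the compositing step, assuming the worm patches are stored as RGBA images with transparent backgrounds (hypothetical helper; the blurring, the ring artifacts described below and the bookkeeping of the per-worm ground-truth masks are omitted for brevity):

```python
import random
from PIL import Image

def compose_synthetic_image(background, worm_patches, overlap_prob=0.2):
    """Paste 5-30 randomly transformed worm patches onto a clean background
    template; roughly 20% of worms are placed on top of a previous worm to
    create overlapping configurations."""
    canvas = background.convert("RGBA").copy()
    placed = []  # top-left corners of already placed worms
    n_worms = min(len(worm_patches), random.randint(5, 30))
    for patch in random.sample(worm_patches, k=n_worms):
        patch = patch.convert("RGBA").rotate(random.uniform(0, 360), expand=True)
        if random.random() < 0.5:
            patch = patch.transpose(Image.FLIP_LEFT_RIGHT)
        if placed and random.random() < overlap_prob:
            x, y = random.choice(placed)   # place on top of a previous worm
        else:
            x = random.randint(0, max(0, canvas.width - patch.width))
            y = random.randint(0, max(0, canvas.height - patch.height))
        canvas.alpha_composite(patch, (x, y))
        placed.append((x, y))
    return canvas.convert("L")  # grayscale output frame
```

In the full pipeline, the same flips and rotations are also applied to the per-worm annotation masks so that the ground truth stays aligned with the composited image.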

The generated training dataset consists of 10,000 grayscale images with a size of \(912 \times 736\) px and more than 175,000 labeled C. elegans, plus an additional 1000 images for testing (see Fig. 1a). We randomly added grayscale rings of random sizes surrounding the image centers (see Fig. 1b) to make the network robust against similar artifacts (e.g. petri-dish edges) in other test datasets. Foreground objects may overlap with the artificial petri-dish edges but are only placed inside the rings. Using the object masks of the original data, we generated for each foreground object a binary mask corresponding to its artificially generated location and shape. These masks are used as ground truth for model training and testing on this dataset. Our synthetic training dataset is available at https://doi.org/10.5281/zenodo.7456803.

BBBC010

The “BBBC010—C. elegans live/dead assay”26 (BBBC010) dataset consists of 200 images, divided into 100 bright-field and 100 green fluorescent protein (GFP) microscopy images of the same scenes. The images have a size of \(696 \times 520\) px and are saved as 16-bit grayscale TIFF files. For our experiments we converted the images to 8-bit grayscale PNG images. The images contain a black border surrounding the region of interest (ROI) with the C. elegans in the center (Fig. 1c), which makes up around 50% of the image. Ground truth consists of a separate binary foreground/background image for each worm, making it possible to disentangle overlapping shapes.

The images show C. elegans exposed to Enterococcus faecalis, with a negative control group containing dead worms and a positive control group that was treated with ampicillin and contains live worms. While the live C. elegans have natural curved shapes (Fig. 1c), the worms of the negative control group appear rod-like with an uneven texture (Fig. 1d).

Mating dataset

The mating dataset (MD) was created from a 10-min-long video with a frame rate of 25 Hz and a frame size of \(3036 \times 3036\) px. It contains freely moving worms as well as mating ones. Mating behavior is particularly difficult to segment as the two individuals strongly overlap and lie parallel to one another (Fig. 1e). This dataset therefore represents the most challenging segmentation task for our method.

We downsampled the video to 5 Hz and randomly selected 50 frames for annotation and testing of our approach. More than 3900 individual worm postures were labeled in this dataset. The labeling includes only mature C. elegans; worms touching the image boundary were ignored. We split the frames into 450 non-overlapping images with a size of \(1012 \times 1012\) px. The grayscale images show C. elegans in a petri dish with the edges visible (see example patch in Fig. 1f).
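Splitting each frame into non-overlapping square tiles is a simple slicing operation; a minimal NumPy sketch (hypothetical helper; a 3036 px frame yields nine 1012 px tiles, so 50 frames give 450 images):

```python
import numpy as np

def split_into_tiles(frame, tile=1012):
    """Split a large frame (2D NumPy array) into non-overlapping square tiles."""
    h, w = frame.shape[:2]
    tiles = []
    for y in range(0, h - tile + 1, tile):
        for x in range(0, w - tile + 1, tile):
            tiles.append(frame[y:y + tile, x:x + tile])
    return tiles
```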

Network architecture

To predict bounding boxes and instance segmentation masks we use the Hybrid Task Cascade (HTC)27 neural network architecture, combined with Swin Transformer28 as backbone (similar to28).

Swin Transformer is a Vision Transformer (ViT)-based backbone architecture29 that can be applied to different vision tasks (e.g. classification, detection or segmentation). Previous ViTs divided the input image into relatively large patches and computed self-attention among them; this kept the computational complexity low but failed to account for small details in large images. To tackle this problem, Swin Transformer introduced a shifted-window approach that reduces the computational complexity of the standard multi-head self-attention (MSA) modules. Additionally, Swin Transformer builds hierarchical feature maps by merging image patches in deeper layers, which enables small patch sizes and leads to more detailed predictions. We chose the Swin-L architecture variant in our study, pre-trained on ImageNet-21K30 with an image size of \(384 \times 384\) px (similar to28).

HTC improves the architecture of Cascade Mask R-CNN31 by interleaving bounding box regression and instance segmentation mask prediction. The information flow is further optimized by adding direct connections between the individual mask branches (Fig. 2). Additionally, a semantic segmentation branch is added to the original architecture to help distinguish between foreground and background. In our experiments we do not use this additional semantic segmentation branch.

To further improve accuracy when training with small batches, we replaced the default Batch Normalization (BN)32 in HTC with Group Normalization (GN)33 and Weight Standardization (WS)34 (similar to34). We also replaced the Shared 2 Fully-Connected bounding box heads (Shared2FC) with Shared 4 Convolution + 1 Fully-Connected bounding box heads (Shared4Conv1FC) (as described in33). To suppress low-quality detections while keeping high-quality predictions in dense and overlapping scenes, we use Soft-NMS instead of the traditional NMS algorithm for the R-CNN at test time (see HTC++28).
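For illustration, these modifications can be expressed as configuration fragments in the style of the MMDetection framework used in our experiments (keys follow MMDetection 2.x and may differ between versions; this is a simplified sketch, not the published WormSwin configuration, which is available in our repository):

```python
# Illustrative MMDetection-style fragments for the described architecture.
norm_cfg = dict(type='GN', num_groups=32, requires_grad=True)  # GN instead of BN
conv_cfg = dict(type='ConvWS')                                 # weight standardization

model = dict(
    type='HybridTaskCascade',
    backbone=dict(
        type='SwinTransformer',       # Swin-L variant
        embed_dims=192,
        depths=[2, 2, 18, 2],
        num_heads=[6, 12, 24, 48],
        window_size=12,
    ),
    roi_head=dict(
        type='HybridTaskCascadeRoIHead',
        num_stages=3,
        bbox_head=[
            dict(type='Shared4Conv1FCBBoxHead',  # replaces Shared2FCBBoxHead
                 conv_cfg=conv_cfg, norm_cfg=norm_cfg),
            # ... one bbox head and one mask head per cascade stage
        ],
    ),
    test_cfg=dict(
        rpn=dict(nms=dict(type='nms', iou_threshold=0.7)),
        rcnn=dict(
            score_thr=0.001,
            nms=dict(type='soft_nms', iou_threshold=0.5),  # Soft-NMS at test time
            max_per_img=100,
        ),
    ),
)
```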

Figure 2. Network architecture based on the Swin-L backbone and HTC. Batch norm (BN) layers in HTC are replaced by group norm (GN) + weight standardization (WS). Bounding box heads are changed from the original Shared2FC architecture to Shared4Conv1FC.

Training

We used multi-scale training with a size between 480 and 800 px for the shorter side and at most 1333 px for the longer side, AdamW35 as optimizer, a Cosine Annealing learning rate scheduler36 and Linear Warm-Up37 (similar to28). The learning rate was set to 2.5e-5 and the weight decay to 0.1. The number of warm-up iterations of the linear warm-up and learning rate scheduler was set to 1000, the warm-up ratio to 0.1 and the minimum learning rate ratio to 1e-5. During training and testing the NMS threshold of the RPN was set to 0.7; the Soft-NMS threshold of the R-CNN was set to 0.5 at test time. We used random flipping with a probability of 0.5 and AutoAugment38 for multi-scale resizing and cropping. Additionally, we used pre-trained weights for the Swin backbone, trained on ImageNet-21K with an image size of \(384 \times 384\) px (similar to28). We tested our approach on three different datasets: the publicly available BBBC010, MD and CSB-1 datasets. During testing, all images were resized to 800 px on the shorter side and at most 1333 px on the longer side, preserving the original aspect ratio. We excluded all instances touching image borders as incomplete C. elegans instances.

In all our experiments we used the MMDetection framework39. Our code and network configuration file for the MMDetection framework are available at https://github.com/bozeklab/worm-swin.
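For orientation, the optimizer and learning-rate schedule described above correspond roughly to the following MMDetection 2.x-style fragment (illustrative only; the actual settings are in the configuration file linked above):

```python
# Optimizer and learning-rate schedule, MMDetection 2.x config style.
optimizer = dict(type='AdamW', lr=2.5e-5, weight_decay=0.1)
lr_config = dict(
    policy='CosineAnnealing',
    warmup='linear',
    warmup_iters=1000,
    warmup_ratio=0.1,
    min_lr_ratio=1e-5,
)
runner = dict(type='EpochBasedRunner', max_epochs=36)
```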

WormSwin was trained using 4 Nvidia Tesla V100-SXM2 32 GB GPUs, 6 cores of an Intel Xeon Gold 6248 CPU @ 2.50 GHz and 100 GB of RAM. With a batch size of four (one image per GPU) and two workers per GPU, training for 36 epochs took \(\sim 19\) h. Evaluation on the test set runs at a speed of 2.7 images/s.

Results

Figure 3. Example from the CSB-1 dataset (box and mask colors are selected randomly). (a) Ground truth annotations, (b) predicted bounding boxes and masks, (c) TP (green), FP and FN (red) pixels.

We trained WormSwin on data synthetically generated from the CSB-1 dataset. This data generation procedure allows us to control the degree of overlap among individuals and to train the network on a large number of images containing overlapping worms, improving the segmentation of dense scenes. Once trained, we evaluated the model on a synthetic test set (see Table 1) as well as on three independent datasets: BBBC010, MD and CSB-1. These datasets come from different labs, show visual variability, and contain different numbers and degrees of overlapping C. elegans. We report our results mostly as COCO Average Precision (AP)40 calculated using pycocotools (https://github.com/cocodataset/cocoapi). For the BBBC010 dataset, we report our results as DSB AP for comparison to other methods. AP is the area under the precision-recall curve; its values lie between 0 and 1, with higher AP representing better performance. Precision and recall of the detections are calculated for instances whose intersection over union (IoU) with the ground truth exceeds a predefined threshold. DSB mAP calculates a mean Jaccard index over different IoU thresholds. COCO mAP uses a more complex approach: detections are sorted by descending confidence score and processed in this order, each being marked as a true positive (TP) or false positive (FP) and added to the precision calculation, until a recall of 1.0 is reached or all detections have been processed. Different IoU thresholds are used to label detections as TP or FP.
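For reference, COCO AP for segmentation masks is computed with pycocotools along these lines (file names are placeholders):

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Standard COCO-style evaluation of instance segmentation masks.
coco_gt = COCO("annotations_ground_truth.json")
coco_dt = coco_gt.loadRes("predictions.json")  # detections with RLE masks + scores

evaluator = COCOeval(coco_gt, coco_dt, iouType="segm")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints mAP@[0.50:0.95], AP@0.50, AP@0.75, ...
```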

We report our results mostly for two IoU thresholds, 0.5 and 0.75, as well as a mean AP (mAP) over thresholds from 0.5 to 0.95 with a step size of 0.05. One of the most challenging aspects of instance segmentation of C. elegans, as well as of other biological systems, is overlapping objects in dense configurations. To measure the accuracy of our approach explicitly for overlapping objects, we added a dedicated AP metric. We defined overlapping objects as those whose ground truth bounding boxes overlap by more than 25% or whose segmentation masks show any overlap (IoU > 0). We report the AP for all objects as well as for the overlapping objects separately (Table 2).
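A minimal sketch of how this “overlapping” subset could be selected from the ground truth (hypothetical helpers; bounding-box IoU is used here as the 25% overlap measure, which is one possible reading of the criterion):

```python
import numpy as np

def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def is_overlapping(idx, boxes, masks, box_thresh=0.25):
    """Flag ground-truth instance `idx` as overlapping if its bounding box
    overlaps another instance's box by more than 25% or its mask shares
    any pixel with another mask."""
    for j in range(len(boxes)):
        if j == idx:
            continue
        if box_iou(boxes[idx], boxes[j]) > box_thresh:
            return True
        if np.logical_and(masks[idx], masks[j]).any():
            return True
    return False
```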

Table 1 Test results on all instances.
Table 2 Test results for overlapping worms only (* multi-scale testing, additional training data).

CSB-1 dataset

Although trained on synthetically generated data, our method generalizes fairly well to the real video data, with a mAP of 0.819 for bounding boxes and 0.585 for masks, only \(\sim 0.09\) mAP lower than on the synthetic data. The same metrics on the overlapping worms in the CSB-1 dataset are 0.551 and 0.527. While the mAP is lower for the overlapping C. elegans than for the entire dataset, the bounding box and mask \(\hbox {AP}_{0.50}\) on the overlapping worms are 0.883 and 0.975, respectively. This suggests that the worms are in principle detected correctly but that the mask predictions contain errors whose nature is not obvious at first glance. Despite the difference between \(\hbox {AP}_{0.50}\) and \(\hbox {AP}_{0.75}\) on the overlapping worms, we found that the segmentation masks generally align well with the ground truth (Fig. 3); however, pixels at the edges of each object tend to be segmented imprecisely. Because a worm mask is small, covering only \(\sim 500\) px, errors at the edges of the predicted masks account for \(\sim 30\%\) of all foreground pixels.

To test the hypothesis that most errors occur on the mask edges, we implemented an alternative version of the IoU: if a pixel in either the ground-truth or the predicted mask lies at the border of an object (i.e., its 4-way neighborhood is not fully foreground), it is set to the value of the pixel at the same position in the other mask. This way, object border pixels that would otherwise be counted as false negatives (FN) or false positives (FP) do not negatively influence the IoU calculation. With this calculation, the mean IoU on the CSB-1 test subset rose from 0.827 to 0.961 (+0.134).
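A minimal sketch of this edge-tolerant IoU, assuming binary NumPy masks and using SciPy’s binary erosion with a 4-connected structuring element to find object borders (hypothetical helper):

```python
import numpy as np
from scipy.ndimage import binary_erosion

def border_tolerant_iou(gt, pred):
    """IoU where border pixels of either mask (pixels whose 4-neighborhood is
    not fully foreground) are copied from the other mask before computing
    the standard intersection over union."""
    structure = np.array([[0, 1, 0],
                          [1, 1, 1],
                          [0, 1, 0]], dtype=bool)  # 4-connectivity
    gt, pred = gt.astype(bool), pred.astype(bool)
    gt_border = gt & ~binary_erosion(gt, structure=structure)
    pred_border = pred & ~binary_erosion(pred, structure=structure)
    gt_adj, pred_adj = gt.copy(), pred.copy()
    gt_adj[gt_border] = pred[gt_border]      # take value from the other mask
    pred_adj[pred_border] = gt[pred_border]
    inter = np.logical_and(gt_adj, pred_adj).sum()
    union = np.logical_or(gt_adj, pred_adj).sum()
    return inter / union if union else 1.0
```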

BBBC010 dataset

Because of the very limited number of training samples (50 images), the predictions of a network trained only on BBBC010 were of poor quality. Therefore, we used the network pre-trained on our synthetic data and fine-tuned it on 50 randomly selected images from the BBBC010 dataset. We compared the performance of our approach to two existing methods: PatchPerPix19 and EmbedSeg21. To enable this comparison, instead of the COCO AP metric (see Table 1) we evaluated accuracy on this dataset with the Data Science Bowl (DSB) AP (https://www.kaggle.com/competitions/data-science-bowl-2018/overview/evaluation), which was used in the original EmbedSeg publication21 (see Table 3).

Table 3 Test results using DSB metric (* multi-scale testing, additional training data, alternative IoU without object edges, mAP for IoUs in range 0.5–0.95, step size 0.05).

We applied the alternative IoU calculation already used for the CSB-1 dataset to compute the DSB accuracy without considering object edges. With the IoU defined this way, we achieve a DSB mAP of 0.769 (+0.233), an \(\hbox {AP}_{0.50}\) of 0.929 (+0.012) and an \(\hbox {AP}_{0.80}\) of 0.823 (+0.487) (compare to Table 3).

Mating dataset

Finally, we tested WormSwin on the MD dataset using weights pre-trained on our synthetic dataset. For this dataset we annotated 50 images, which are larger and contain more C. elegans than the BBBC010 images, and split them into patches of size \(1024 \times 1024\) px. We report our results in Table 1. Despite the challenging configurations of worms in this dataset, our method correctly identifies the segmented objects, as indicated by an \(\hbox {AP}_{0.50}\) comparable to that on the other datasets. However, the \(\hbox {AP}_{0.75}\) and \(\hbox {mAP}_{0.50:0.95}\) suggest that, while the objects are detected correctly, their segmentation masks are imprecise. As for the other datasets, we hypothesize that these errors occur on the boundaries of the segmentation masks (Fig. 4) and are also due to the very challenging object overlaps in this dataset.

Figure 4. Results on the mating dataset (box and mask colors are selected randomly). (a,c,e,g) Segmentation results, (b,d,f,h) TP (green), FP and FN (red) pixels.

Tracking

To test whether our segmentation results are sufficiently accurate for worm tracking and further behavioral analysis, we implemented a simple IoU-based matching method (Fig. 5) and applied it to the predicted instance segmentation masks of the CSB-1 test set. Between two consecutive frames, objects with the highest mask overlap are matched into a trajectory, and iterating this matching over all video frames yields object trajectories. In this simple approach, if an object is missed in one frame but detected again in a subsequent frame, its trajectory is broken and two separate trajectories are created instead. We attempt to reconnect such trajectories in a post-processing step: for 10 frames after an object is lost, the starting points of new trajectories are compared with the endpoint of the lost trajectory, and if the segmentation masks at these points overlap by at least 50%, the trajectories are reconnected. In frames with missing segmentation masks, the positions of the C. elegans can be interpolated between the two ends of the reconnected trajectories.
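A minimal sketch of the frame-to-frame matching step, assuming the per-worm masks of two consecutive frames are given as binary NumPy arrays (hypothetical helper; the 10-frame re-linking and interpolation steps are omitted):

```python
import numpy as np

def link_masks(prev_masks, curr_masks):
    """Greedy IoU matching between consecutive frames: each current mask is
    linked to the previous mask with the highest overlap (or to None)."""
    def mask_iou(a, b):
        inter = np.logical_and(a, b).sum()
        union = np.logical_or(a, b).sum()
        return inter / union if union else 0.0

    links = {}  # index in curr_masks -> index in prev_masks (or None)
    for i, curr in enumerate(curr_masks):
        ious = [mask_iou(curr, prev) for prev in prev_masks]
        best = int(np.argmax(ious)) if ious else None
        links[i] = best if (best is not None and ious[best] > 0) else None
    return links
```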

Figure 5. Example of tracked C. elegans.

While a full tracking method is outside the scope of this study, our simple approach makes it possible to build trajectories of interacting, mating worms (Fig. 5). Tracking these challenging C. elegans interactions opens up new possibilities for studying their behavior.

Discussion

In this work we present WormSwin, a deep learning approach for instance segmentation of microscopy images of C. elegans. Our method combines several recent advances in deep learning and instance segmentation (transformer networks, HTC, Group Normalization, Weight Standardization, Soft-NMS) into a single approach trained end-to-end. WormSwin does not require any pre-processing of the image data, enabling researchers to apply it directly to their video or image data.

Together with our method, we provide a large dataset of C. elegans images with instance mask annotations to help researchers develop better segmentation approaches in the future. The new dataset is an order of magnitude larger than the BBBC010 dataset, enabling the training of deeper network architectures.

The small size of the BBBC010 benchmark dataset is a limiting factor for extensively training and testing our method on it. The accuracy of our method is lower on this dataset than on CSB-1, which might be attributed to differences in intensity, size and appearance of the C. elegans between the two datasets. Since retraining WormSwin on a small number of BBBC010 images improved the method’s performance, we suggest that a similar retraining is necessary to accurately segment datasets whose characteristics differ from those of CSB-1.

Notably, our method shows a decrease in AP at the higher IoU thresholds (e.g. \(\hbox {AP}_{0.80}\) in Table 3). Despite this precision drop, the segmentation masks appear correct overall (Figs. 3, 4). We therefore hypothesize that the major errors in the segmentation masks occur on the boundaries of the foreground area, and we substantiate this by calculating an accuracy metric that does not take boundary pixels into account. A possible reason for this type of error is variation in the human-generated labeling. We also introduce blurring in the synthetic training data, which might additionally change the appearance of object contours. Despite these errors, individual C. elegans poses are captured by the predicted segmentation masks and can be subjected to further quantitative analysis.

As a major future improvement of this work, we see models that exploit temporal information to improve the segmentation of overlapping objects. Information on how C. elegans individuals arrive at a specific configuration is of great help in disentangling their postures. Previous work by Fontaine et al.41 models C. elegans using planar curves and a Central Difference Kalman Filter (CDKF) to track multiple worms; this approach shows good results even when occlusions occur. Similarly, Alonso et al.42 proposed a deep learning approach for detection and tracking in high-density microscopy data based on splines as shape descriptors. They test their approach on different datasets, including videos of C. elegans, and achieve high accuracy in dense scenes with a high degree of occlusion. Such methods are a step towards combining segmentation and tracking in a single training objective. While generating training datasets for multi-object tracking is a massive work burden, the accuracy of our segmentation approach makes it possible to build preliminary trajectories in an automated fashion.