Introduction

Facial anatomical structures are small and interconnected. Although these structures can be observed and distinguished well through dissection, target muscle structures cannot be easily identified using imaging modalities such as magnetic resonance imaging (MRI) or computed tomography (CT). Distinguishing facial anatomical structures is important for detecting various diseases and for performing cosmetic procedures such as botulinum neurotoxin1,2,3,4,5,6,7,8 and filler injections9,10,11.

While MRI and CT are standard medical imaging modalities that provide high-resolution images of anatomical structures, their potential disadvantages include radiation exposure (for CT), high costs, and long analysis times12,13. As an alternative, ultrasonography (US), one of the most widely used imaging modalities, is a powerful and widely available screening and diagnostic tool for clinicians1,4,5,6,8,14. Over the decades, US has demonstrated several major advantages over other medical imaging modalities such as X-ray, MRI, and CT because of its convenience and cost-effectiveness1,4,5,6,8,12,13. However, US also has unique drawbacks, such as low image quality caused by artifacts, high dependence on practitioner experience, and variability among manufacturers’ US systems12,13.

To overcome these drawbacks, automated image analysis based on deep learning has recently been developed12,13. The three basic tasks of medical image analysis, namely classification, detection, and segmentation, have been widely applied to various anatomical structures in medical US, including the breast15,16, prostate17,18, liver19, heart20,21, carotid artery22, thyroid23, intravascular structures24,25, lymph nodes26, kidney27, bone28,29, muscle30, and nerve structures31. However, there has been no attempt to apply this approach to facial US anatomy, which provides the main cue for several non-invasive surgical procedures32.

Deep learning has rapidly advanced the automatic analysis of low- and high-quality medical images for diagnosis as well as for image-based interventions12,13. Most classification models in the medical imaging field have been created through transfer learning from models pre-trained on ImageNet (Stanford Vision Lab, Stanford, CA), which contains a wide variety of photographs ranging from faces to cats, cars, and mountains33,34. However, intrinsic differences in image quality and complexity can affect deep learning performance and require special consideration in US applications34. US images differ markedly in quality from ImageNet photographs and from other medical images34; therefore, it is crucial to evaluate several deep learning models on US images before relying on them, so that US diagnoses and US-guided, non-invasive facial surgical procedures and therapies become more objective, precise, and reliable.

Deep learning has been applied in facial esthetic research, for example in facial esthetic prediction35,36,37 and in a facial rejuvenation recommendation system38. However, few studies have examined facial anatomical structures, even though such examination helps in diagnosing facial skin disease39, preventing iatrogenic side effects, and establishing the safest and most effective treatment plan1,4,6,8,9,32,40,41. Moreover, previous work has not established which deep learning model is suitable for classifying facial US images or how much data is required, even though this anatomical information is crucial for clinical tasks such as identifying facial structures on US images before a procedure. Therefore, we aimed to estimate the value of deep learning for facial US imaging by assessing classification performance on facial US images through transfer learning with current representative deep learning models and by analyzing the classification criteria.

Materials and methods

All experimental procedures in this study were performed in accordance with the Declaration of Helsinki of the World Medical Association (version of October 2013). The study was approved by the Institutional Review Board of Yonsei University Dental Hospital (approval no. 2-2019-0026, granted on July 30, 2019). A real-time two-dimensional B-mode US system (E-CUBE 15 Platinum, ALPINION Medical Systems, Seoul, Korea) with a 60-mm-wide linear-array transducer (8.0–17.0 MHz; L8-17X, ALPINION Medical Systems) was used to obtain US images of the masseter muscle of healthy young individuals. These US images are unpublished data. The tables and figures in this paper were constructed based on data from the Supplementary Information.

Participant selection and data acquisition

Signed written informed consent and facial US image data were obtained from 86 healthy, young individuals (48 males and 38 females, aged 25.4 ± 4.1 years). The exclusion criteria were orthodontic treatment, temporomandibular joint disorder, plastic surgery, or botulinum neurotoxin injection within the previous 6 months. The participants were placed in a supine position on a chair reclined at 45°. The US sampling frequency was adjusted to 15.0 MHz, which is an ideal frequency for observing depths between 1.5 and 4 cm, depending on the presence of skin, fat, and muscle tissues. The US transducer was positioned perpendicular to the skin surface over the scanning site. US scanning was performed on the midline and left side of the face. We used MATLAB deep-learning tools to implement the predictive model.

Deep learning models pre-trained on ImageNet data were evaluated for the classification of nine facial regions. A total of 1440 US images were obtained from the volunteers, 160 from each region. All US images were transverse cross-sectional images. The facial landmarks and the corresponding US images for each facial region are shown in Fig. 1.

Figure 1

Nine facial regions, their landmarks, and the US images corresponding to each landmark. Transverse US images at each region were used for the deep learning models. Forehead: 1, trichion (hairline at the midline); 2, metopion (midpoint between the bilateral frontal eminences); 3, halfway point between 2 and 4; 4, glabella; 5, frontal eminence; 6, meeting point between the lines passing through 3 and the medial canthus; 7, meeting point between the lines passing through 3 and the mid-pupil; 8, meeting point between the lines passing through 3 and the lateral canthus. Oral: 9, halfway point between the subnasale and 10; 10, lower point of the Cupid’s bow; 11, stomion; 12, midpoint of the lower vermillion border. Mentum: 13, deepest point of the chin at the midline; 14, pogonion; 15, gnathion. Nose: 16, sellion; 17, rhinion; 18, pronasale. Supraorbital: 19, meeting point between the lines passing through 20 and the medial canthus; 20, superior orbital rim at the mid-pupillary line; 21, meeting point between the lines passing through 20 and the lateral canthus; 22, meeting point between the lines passing through 20 and the lateral orbital rim. Lateral nose: 23, meeting point between the lines passing through 26 and the medial canthus; 24, point between 23 and 25; 25, alare. Infraorbital: 26, superior orbital rim at the mid-pupillary line; 27, meeting point between the lines passing through 26 and the lateral canthus; 28, meeting point between the lines passing through 26 and the lateral orbital rim; 29, point between 26 and 32; 30, point between 27 and 33; 31, point between 28 and 34; 32, meeting point between the lines passing through the alare and the mid-pupil; 33, meeting point between the lines passing through the alare and the lateral canthus; 34, meeting point between the lines passing through the alare and the lateral orbital rim. Anterior cheek: 35, meeting point between the line passing through 9 and the nasolabial fold; 36, meeting point between the lines passing through the stomion and the mid-pupil; 37, meeting point between the lines passing through the stomion and the lateral canthus. Posterior cheek: 38–41, points that divide the masseter by its upper and lower boundaries.

CNN models for the classification of facial US images

Models pre-trained on the ImageNet database, the most common and representative deep learning database comprising millions of images, were fine-tuned by transfer learning and compared for their performance in classifying the facial US images. The evaluated CNN models were (1) GoogleNet, (2) SqueezeNet, (3) MobileNet-v2, (4) ResNet-18, (5) ResNet-50, (6) ResNet-101, (7) Inception-v3, (8) Inception-ResNet-v2, (9) AlexNet, (10) VGG-16, (11) VGG-19, (12) DenseNet-201, (13) Xception, (14) NasNet-Mobile, and (15) ShuffleNet (Table 1).

Table 1 Pre-trained deep learning models using ImageNet.
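As an illustration, the following minimal MATLAB sketch shows how a pre-trained network can be adapted for the nine-region task; this is an assumption-laden sketch, not the authors' published code. The layer names shown are specific to GoogleNet (other architectures use different names for their final layers), and the corresponding support package for each pre-trained model must be installed.

net = googlenet;                                   % ImageNet pre-trained network (support package required)
lgraph = layerGraph(net);                          % editable layer graph
inputSize = net.Layers(1).InputSize;               % 224 x 224 x 3 for GoogleNet

numClasses = 9;                                    % nine facial regions
newFC  = fullyConnectedLayer(numClasses, 'Name', 'fc_face9');
newOut = classificationLayer('Name', 'out_face9');

% 'loss3-classifier' and 'output' are GoogleNet-specific layer names.
lgraph = replaceLayer(lgraph, 'loss3-classifier', newFC);
lgraph = replaceLayer(lgraph, 'output', newOut);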

Verification of the ability to classify the nine facial regions using the selected models

We trained the 15 deep learning models to classify the nine facial regions (Fig. 1). Training was conducted after resizing the US images to 224 × 224 × 3, 227 × 227 × 3, or 299 × 299 × 3 to match the input size of each pre-trained model and, for the augmented dataset, after augmenting the images. The training images were randomly translated by up to 30 pixels and scaled up or down by up to 10%, horizontally and vertically.
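A minimal MATLAB sketch of this resizing and augmentation step is shown below, assuming the Deep Learning Toolbox; the folder layout and variable names are illustrative assumptions rather than the authors' exact pipeline.

% Image datastore: nine sub-folders, one per facial region (hypothetical layout).
imdsAll = imageDatastore('facialUS', 'IncludeSubfolders', true, ...
    'LabelSource', 'foldernames');

% Random translation up to 30 pixels and scaling within +/-10%, horizontally and vertically.
augmenter = imageDataAugmenter( ...
    'RandXTranslation', [-30 30], ...
    'RandYTranslation', [-30 30], ...
    'RandXScale', [0.9 1.1], ...
    'RandYScale', [0.9 1.1]);

% Resizes to the network input size and augments on the fly during training
% (inputSize comes from the transfer-learning sketch above).
augTrain = augmentedImageDatastore(inputSize(1:2), imdsAll, ...
    'DataAugmentation', augmenter);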

We evaluated the performance of each model using tenfold cross-validation. Of the 160 US images for each region, 20 were used as a test set, and the remaining 140 were divided into ten folds. Each model therefore yielded ten trained sub-models, and each sub-model was evaluated against the test set.
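One way to realize this split in MATLAB is sketched below, continuing the sketch above; it assumes cvpartition from the Statistics and Machine Learning Toolbox, and the random seed and variable names are illustrative.

rng(0);                                                    % fixed seed for reproducibility (assumption)
labels  = imdsAll.Labels;
hold20  = cvpartition(labels, 'HoldOut', 20/160);          % stratified: 20 of the 160 images per region
testIdx = test(hold20);
imdsTest = imageDatastore(imdsAll.Files(testIdx), 'Labels', labels(testIdx));

cvFiles  = imdsAll.Files(~testIdx);
cvLabels = labels(~testIdx);
folds = cvpartition(cvLabels, 'KFold', 10);                % stratified tenfold partition of the remaining 140
for k = 1:10
    trIdx = training(folds, k);
    imdsTrainK = imageDatastore(cvFiles(trIdx), 'Labels', cvLabels(trIdx));
    % Train sub-model k on imdsTrainK (augmented as above) and evaluate it on imdsTest;
    % how the held-out tenth fold is used during training is not specified in the paper.
end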

Training used a mini-batch size of 20 and the stochastic gradient descent with momentum (SGDM) optimizer. The maximum number of epochs was 20, and the learning rate was 0.0003, kept constant throughout training.
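These hyperparameters map directly onto MATLAB's trainingOptions; the sketch below continues the previous ones (the shuffle setting is an assumption not stated in the paper).

options = trainingOptions('sgdm', ...
    'MiniBatchSize', 20, ...
    'MaxEpochs', 20, ...
    'InitialLearnRate', 3e-4, ...          % 0.0003
    'LearnRateSchedule', 'none', ...       % constant learning rate throughout training
    'Shuffle', 'every-epoch', ...          % assumption, not stated in the paper
    'Verbose', false);

% Augment/resize the k-th training fold and train one sub-model (lgraph from the earlier sketch).
augTrainK = augmentedImageDatastore(inputSize(1:2), imdsTrainK, 'DataAugmentation', augmenter);
subModelK = trainNetwork(augTrainK, lgraph, options);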

Evaluation metrics

Precision and recall

We calculated the precision by dividing the number of True Positive elements by the total number of positively predicted units, where “k” represents a generic class.

$${Precision}_{k}= \frac{{True \; Positive}_{k}}{{True \; Positive}_{k}+{False \; Positive}_{k}}$$

The recall was calculated by dividing the number of True Positive elements by the total number of elements that actually belong to the class (True Positives plus False Negatives).

$${Recall}_{k} = \frac{{True \; Positive}_{k}}{{True \; Positive}_{k}+{False \; Negative}_{k}}$$

The Macro Average Precision and Macro Average Recall are the arithmetic means of the per-class metrics, where K is the total number of classes.

$$Macro\;Average\;Precision = \frac{{\sum }_{k=1}^{K}{Precision}_{k}}{K}$$
$$Macro\;Average\;Recall = \frac{{\sum }_{k=1}^{K}{Recall}_{k}}{K}$$

Accuracy

The accuracy was calculated by dividing the correct predictions (including true positives and true negatives) by the total number of examined cases.

$$Accuracy= \frac{True \;Positive+True \;Negative}{True \;Positive+True\; Negative+False \;Positive+ False \;Negative}$$

F-measure

The F-measure (F1-score), which aggregates Precision and Recall as their harmonic mean, was also computed for each class.

$${F\text{-}measure}_{k}=2\times \left(\frac{{Precision}_{k} \times {Recall}_{k}}{{Precision}_{k}+{Recall}_{k}}\right)$$

Macro F-measure, which is the arithmetic mean of class-wise F-measure, was calculated as shown below.

$$Macro \;F\text{-}measure = \frac{{\sum }_{k=1}^{K}{F\text{-}measure}_{k}}{K}$$

The performance of each deep learning model was evaluated using the abovementioned metrics, and the performance score of a model is the mean of its tenfold scores. Training performance is reported as the final accuracy and loss values, and test-set performance is reported as precision, recall, and F-measure. The results are presented in tables and box plots.
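For illustration, all of the metrics above can be computed from a 9 × 9 confusion matrix, as in the following MATLAB sketch; confusionmat is from the Statistics and Machine Learning Toolbox, and trueLabels and predictedLabels are hypothetical categorical vectors of test-set labels and model predictions.

C = confusionmat(trueLabels, predictedLabels);   % rows: true class, columns: predicted class

tp          = diag(C);                           % true positives per class
precision_k = tp ./ sum(C, 1)';                  % column sums = predicted-positive counts per class
recall_k    = tp ./ sum(C, 2);                   % row sums    = actual-positive counts per class
f1_k        = 2 * (precision_k .* recall_k) ./ (precision_k + recall_k);

macroPrecision = mean(precision_k);              % arithmetic mean over the K = 9 classes
macroRecall    = mean(recall_k);
macroF1        = mean(f1_k);
accuracy       = sum(tp) / sum(C(:));            % correct predictions over all examined cases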

LIME (local interpretable model-agnostic explanations)

Deep learning models are complicated, and their actions may be difficult to comprehend. The LIME approach approximates a deep neural network’s classification behavior with a smaller, more easily interpretable model42. The neural network’s decisions may be deduced by interpreting the decisions of this simpler model.

As the first step of the LIME method, the ultrasound image was divided into a grid of square features; a 10 × 10 grid was used to increase the resolution of the computed importance map. LIME then creates composite images from the original observation by randomly selecting features and replacing all pixels of each selected feature with the average image pixel value, effectively removing those features. The number of random samples was set to 6000, and the surrogate linear regression model was fitted using lasso regression. Finally, the computed map was up-sampled to the image resolution using bicubic interpolation.
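A minimal sketch of this analysis is shown below, assuming MATLAB's imageLIME function (Deep Learning Toolbox R2020b or later) with option names as we understand them, which may differ by release; trainedNet denotes one of the trained sub-models, and usImage is a hypothetical facial US image.

img   = imresize(usImage, inputSize(1:2));            % resize the US image to the network input size
label = classify(trainedNet, img);                    % predicted facial region for this image

scoreMap = imageLIME(trainedNet, img, label, ...
    'Segmentation',     'grid', ...                   % square-grid features instead of superpixels
    'NumFeatures',      100, ...                      % approximately a 10 x 10 grid
    'NumSamples',       6000, ...                     % number of randomly perturbed composite images
    'Model',            'linear', ...                 % linear surrogate fitted with lasso regression
    'OutputUpsampling', 'bicubic');                   % up-sample the map to the image resolution

figure; imshow(img); hold on
imagesc(scoreMap, 'AlphaData', 0.5); colormap jet     % overlay the importance map on the image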

Quality of the facial US images

The US images used in this study ranged in size from 169 × 150 × 3 (smallest) to 848 × 533 × 3 (largest), with a medium size of 567 × 418 × 3 (Fig. 2). When US images of various sizes are resized to fit the input size of a deep learning model, their quality changes. The quality of each resized image and its original was therefore quantified using the Blind/Referenceless Image Spatial Quality Evaluator (BRISQUE) and displayed as box plots (Fig. 3).

Figure 2

Scatter plot of the facial US image sizes.

Figure 3

BRISQUE scores according to changes in facial US image size. 224: 224 × 224 × 3; 227: 227 × 227 × 3; 299: 299 × 299 × 3.

BRISQUE

BRISQUE is a no-reference image quality metric that provides a mathematical, repeatable evaluation rather than a subjective grading performed by human observers43. BRISQUE computes its features directly from image pixels, which makes it highly efficient because no transformations are required. According to the BRISQUE scoring system, image quality values range from 0 (best) to 100 (worst).
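A minimal MATLAB sketch of this quality comparison is given below, assuming the Image Processing Toolbox brisque function; the file name is hypothetical.

original   = imread('faceUS_example.png');            % hypothetical facial US image file
resized224 = imresize(original, [224 224]);
resized227 = imresize(original, [227 227]);
resized299 = imresize(original, [299 299]);

% brisque returns a score of roughly 0 (best) to 100 (worst); if a given release
% requires grayscale input, convert with rgb2gray first.
scores = [brisque(original), brisque(resized224), ...
          brisque(resized227), brisque(resized299)];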

Results

During the training process of all models, the accuracy and loss values reached a plateau between 10 and 15 epochs. All average values are arithmetic mean values and are shown with standard deviation.

Training results of the models

After training for ultrasound facial region classification, the mean of the final accuracy of all models using the non-augmented dataset was 93.56 ± 1.38%. The model with the lowest mean final accuracy of 91.50 ± 3.36% was NasNet-Mobile, while the model with the highest mean final accuracy was VGG-19 with 96.75 ± 1.60% (Table 2 and Fig. 4).

Table 2 Final training accuracy and loss values of the models trained with the non-augmented and augmented datasets (accuracy: mean ± standard deviation %).
Figure 4

Training results for 10 folds of each deep learning model.

The lowest final accuracy among all folds was 87.30%, recorded by a fold of NasNet-Mobile, while the highest was 99.20%, also recorded by a fold of NasNet-Mobile. The mean of the final loss values of all models was 0.22 ± 0.03. VGG-19 showed the lowest average loss value of 0.13 ± 0.07, and NasNet-Mobile showed the highest average loss value of 0.28 ± 0.08. The fold with the lowest loss value was a fold of VGG-19 (0.06), while the fold with the highest loss value was a fold of VGG-16 (0.48) (Table 2 and Fig. 4).

With the augmented dataset, the mean final accuracy was 94.25 ± 1.00%. The lowest mean final accuracy was recorded by GoogleNet at 92.22 ± 3.03%, and the highest by VGG-16 at 96.03 ± 2.01%. Among all folds, a fold of SqueezeNet showed the lowest accuracy of 87.30%, while a fold of VGG-16 showed the highest accuracy of 100%. The mean of the final loss values of all models was 0.19 ± 0.04. DenseNet-201 recorded the lowest average loss value of 0.15, while SqueezeNet recorded the highest average loss value of 0.28. The fold with the lowest loss value was a fold of VGG-16 (0.02), while the fold with the highest loss value was a fold of SqueezeNet (0.78) (Table 2 and Fig. 4).

Test results of models

The mean values of precision, recall, and F-measure on the test set across all models using the non-augmented dataset were 93.88 ± 1.37%, 93.55 ± 1.83%, and 93.52 ± 1.83%, respectively. The models ranked in the same order for precision, recall, and F-measure; the model with the lowest scores was NasNet-Mobile, while the model with the highest scores was VGG-16. The fold with the lowest precision, recall, and F-measure was a fold of NasNet-Mobile (89.11%, 88.33%, and 88.33%, respectively), while the fold with the highest scores was a fold of VGG-16 (97.80%, 97.78%, and 97.76%, respectively) (Table 3 and Fig. 5).

Table 3 Test-set performance of the models trained with the non-augmented and augmented datasets, reported per model (mean ± standard deviation %).
Figure 5

Test results for 10 folds of each deep learning model.

For region-wise classification with the non-augmented dataset, the precision score was lowest in the oral region at 87.85 ± 5.35%, followed by the orbit-upper region at 87.97 ± 7.36%. The recall score was lowest in the anterior cheek at 82.3 ± 6.33%. The F-measure scores were lowest in the anterior cheek and orbit-upper regions at 87.31 ± 4.11% and 87.71 ± 5.48%, respectively. The regions with the highest precision, recall, and F-measure scores were the lateral nose and nose regions: precision and F-measure were 99.8 ± 0.93% and 99.11 ± 1.21% in the lateral nose region, and the recall score was highest in the nose region (98.73%) (Table 4 and Fig. 6).

The mean values of precision, recall, and F-measure on the test set across all models using the augmented dataset were 94.18 ± 1.53%, 93.77 ± 1.63%, and 93.74 ± 1.65%, respectively. The models again ranked in the same order for precision, recall, and F-measure; the model with the lowest scores was NasNet-Mobile, while the model with the highest scores was VGG-16. The fold with the lowest precision, recall, and F-measure was a fold of NasNet-Mobile (88.72%, 87.22%, and 87.23%, respectively), while the fold with the highest scores was a fold of VGG-19 (97.85%, 97.77%, and 97.79%, respectively) (Table 3 and Fig. 5).

The precision score for region classification was lowest in the orbit-upper region at 86.77 ± 7.58%, followed by that of the oral region, which was 89.31 ± 6.36%. The recall score was lowest in the anterior cheek at 84.5 ± 7.25%. The F-measure scores were the lowest in the anterior cheek and the orbit-upper regions at 88.64 ± 4.37% and 86.84 ± 5.14%, respectively. The lateral nose region exhibited the highest precision, recall, and F-measure scores, which were 99.93 ± 0.54%, 99.33 ± 1.7%, and 99.62 ± 0.9%, respectively (Table 4 and Fig. 6).

Table 4 Test-set performance of the models trained with the non-augmented and augmented datasets, reported per region (mean ± standard deviation %).
Figure 6

Test results for 10 folds of each deep learning model in each region.

Discussion

For facial ultrasound image region classification, the relatively classic models VGG-16, VGG-19, and ResNet-50 achieved the highest scores (Table 3 and Fig. 5). In summary, the better-performing models have in common a large number of parameters, a shallow depth, and a small input image size (Tables 1 and 3). The same was observed in previous studies comparing deep learning performance on medical images such as ultrasound and CT, where shallow, classical models outperformed deep, modern architectures44. Considering that performance improved from ResNet-18 to ResNet-50 and then decreased for ResNet-101, a balance between model depth and the number of parameters appears to be necessary.

US images generally show the highest BRISQUE scores among medical images such as MRI and CT, indicating the lowest image quality34. Counterintuitively, in this study the BRISQUE score tended to decrease (i.e., quality improved) as the US image size was arbitrarily reduced. This may be related to the high performance scores of the models that use small input image sizes.

The average performance of the models trained on the augmented dataset was only 0.2% higher than that of the models trained on the non-augmented dataset; thus, there was no meaningful overall difference in performance between the augmented and non-augmented data. A significant performance improvement was observed only for Inception-ResNet-v2, ResNet-50, and ResNet-101 among the 15 models evaluated in this study (Table 3 and Fig. 5). Data augmentation is the most popular method for preventing overfitting45. In this study, the dataset was augmented by horizontal translation and zooming in and out according to the characteristics of neighboring landmarks in each region; however, the effect was weak. This indicates that the effect of data augmentation may vary depending on the data characteristics or the model. As in the case of Inception-v3, the performance score sometimes decreased after augmentation; thus, applying data augmentation unconditionally requires caution.

The average performance scores for the individual regions ranged from about 85% to 99%, differing considerably between regions. Among all regions, the lateral nose and nose were the most clearly distinguished (Table 4 and Fig. 6). Examining the most influential local features of the lateral nose and nose regions through LIME shows that the models clearly distinguish the skin and bone contours of these regions and their features from those of the other regions (Fig. 7). Although the shapes of the other regions under investigation differ, the models mainly considered the hyperechoic skin and bone, or their surroundings, as the main features. Artifacts such as gel and bone shadows were sometimes regarded by the models as genuine features; however, in most cases the artifacts were suitably ignored.

Figure 7

LIME analysis of the classification criteria of VGG-16 and Xception for the facial regions. The top row shows the local features considered by VGG-16, the middle row shows the original images, and the bottom row shows the local features considered by Xception. Red areas indicate strongly weighted local features, and blue areas indicate weakly weighted local features.

Irrespective of the model, the local features of each region identified by LIME were similar. The VGG models had exceptionally high performance scores in the orbital-lower and orbital-upper regions, and their attention areas examined through LIME were the smallest among all models. This tendency may also explain why the VGG models performed worse than other models on the anterior cheek. The mentalis and masseter muscles, which appear as relatively hypoechoic areas in the mentum and posterior cheek regions, were ignored; moreover, the models that considered these muscles as main features showed rather poor performance.

When segmentation is performed on facial ultrasound images, the structures shown in each region are very different; thus, it is critical to label each region separately. If segmentation were performed without pre-classifying the facial regions in this manner, many more images would likely be required to achieve adequate performance. Recently, methods have been introduced that improve segmentation performance by combining the feature maps of each encoder stage of the segmentation model with those of a classification model46.

In conclusion, the quality and characteristics of the input data are a significant part of deep learning training, and training with a small amount of data is particularly sensitive to them. During transfer learning with a model pre-trained on ImageNet, the repetition of a structure with clear contrast on the US images of one class is expected to have a significant impact on feature extraction. When conducting transfer learning with a small number of images, it therefore seems crucial to properly filter the US images and enhance the contrast of the main structures. Deep learning models appear to easily ignore low-contrast structures such as muscles, blood vessels, and nerves when segmenting facial US images. Given the characteristically low quality of US images, the classical deep learning models showed better classification performance. Since LIME is limited to local analysis, it was difficult to compare models with small performance differences; a method capable of global analysis is required for a detailed performance comparison. The results of this study can serve as reference data for future deep learning research on facial US images and for content development (Supplementary Information).