Introduction

Photoacoustic microscopy (PAM) is a non-invasive medical imaging modality that combines optical and ultrasound technology1. Its principle is to detect the ultrasonic waves generated by tissue as it absorbs laser pulses. The penetration depth (\(\sim 1 \,\,\textrm{mm}\)2) is much greater than that of purely optical imaging modalities (\(\sim 10\,\,\upmu\)m3). At the same time, it can detect the optically absorbing components of tissue such as hemoglobin, lipids, water, and other light-absorbing chromophores4.

Similar to other imaging techniques, a trade-off exists between scanned area, scanning time, and spatial resolution: scanning large areas at high resolution increases the time cost of the whole process. Advanced scanning techniques have been developed for high-speed PAM; however, the main factor restricting scanning speed is the laser pulse repetition rate5. Unless advanced lasers are used, the best option for speeding up PAM imaging is to reduce the scanned area and reconstruct the image with computational methods.

Over the past few years, deep learning (DL) has emerged as the state-of-the-art computational approach for various tasks in medical image processing, including reconstruction, enhancement, denoising, and super-resolution6,7,8. Recent studies have highlighted the potential of DL methods to address these challenges in PAM imaging.

Reconstructing a degraded PAM image can be considered a subset of inverse problems in image processing, which are naturally ill-posed. Degradation may arise from, but is not limited to, out-of-focus lasers, noisy measurements, undersampling, and low spatial resolution. One can reduce the spatial resolution or the scanned area by undersampling to accelerate scanning. Super-resolution techniques enhance image resolution, while image inpainting restores missing pixels; both can be used to reconstruct degraded PAM images. The literature offers various approaches to super-resolution and inpainting, chiefly interpolation and DL-based methods. In recent years, deep learning-based approaches have proven superior to interpolation. Early DL studies on super-resolution and inpainting employed methods optimized for the peak signal-to-noise ratio (PSNR).

Zhao et al.9 used a multi-task residual dense network to speed up optical-resolution PAM with low pulse laser energy, addressing image denoising, super-resolution, and vascular enhancement simultaneously through multi-supervised learning. Sharma et al.10 employed a supervised fully dense U-Net to enhance out-of-focus acoustic-resolution PAM images. DiSpirito III et al.11 approached the reconstruction of undersampled PAM images as an inpainting problem, demonstrating that a supervised fully dense U-Net can reconstruct artificially downsampled PAM images. All of these techniques require a large set of PAM images to train the neural network. In contrast, Vu et al.12 used the deep image prior (DIP)13 technique, which employs an untrained neural network as an image prior, to reconstruct undersampled PAM images, similar to DiSpirito et al.11. To our knowledge, that study is the only one in PAM imaging that does not require supervised training.

Today, many super-resolution and inpainting techniques take advantage of deep generative models. The state-of-the-art generative models used for super-resolution and inpainting are encoder-decoder networks, generative adversarial networks (GANs)14, flow-based models15, and denoising diffusion probabilistic models (DDPMs).

Deep generative models can facilitate both conditional and unconditional image generation. Unconditional models learn natural image distributions to generate images from noise, while conditional generation requires a condition in the form of text, class labels, or images. For super-resolution or inpainting, the degraded image serves as the condition. Conditional generation can be achieved either by using the condition directly during training or by leveraging a pre-trained unconditional generative model. Both approaches have demonstrated remarkable performance on inverse problems in image processing16. The former requires a large corpus of high-quality images of interest, a task-specific model design, and a dedicated training process; the latter is more robust and flexible, since a single pre-trained model can serve many tasks.

Recently, a few studies have employed the latter approach for photoacoustic tomography (PAT) images. Tong et al.17 used a score-based generative model with a rotational consistency constraint, controlling the generation process via annealed Langevin dynamics. Dey et al.18 and Song et al.19 employed score-based diffusion models to reconstruct limited PAT measurements. These studies demonstrated that the proposed diffusion-model methods achieve higher-quality reconstructions than conventional reconstruction methods and U-Net.

Since we are dealing with ill-posed problems, leveraging prior information can be advantageous. Provost et al.20 demonstrated that photoacoustic images exhibit sparsity under the wavelet transform, so regularizing the wavelet representation may enhance the quality of reconstructed images. Likewise, total variation (TV) regularization21 is a widely used technique in modern computer vision problems22.

Building upon these insights, we propose a novel method that uses a pre-trained diffusion model to reconstruct degraded PAM images with external conditioning and regularization. This approach leverages neural networks pre-trained on natural image data while eliminating the need for an extensive corpus of PAM images and avoiding cumbersome training. Notably, our methodology is adaptable, offering the freedom to select the specific degradation operation, whether super-resolution or inpainting. Furthermore, the proposed framework can integrate and benefit from existing enhancement techniques.

Method

In this section, the development of the DiffPam algorithm is outlined, detailing the transition from theory to practice. An overview of DDPM and its application in image generation is provided. The conditional generation using an unconditional diffusion model is then explained, with regularization techniques incorporated to enhance image reconstruction. The Come-Closer-Diffuse-Faster technique for accelerating the diffusion process is described. Additionally, the datasets used for training and testing are covered, along with the neural network employed as an approximate solution. Finally, the steps of the DiffPam reconstruction algorithm are presented, and the evaluation metrics for assessing image quality are detailed. This methodology aims to accelerate photoacoustic microscopy imaging by reconstructing undersampled images using diffusion models.

Denoising diffusion probabilistic models

DDPMs were first introduced by Sohl-Dickstein et al.23 and later popularized by Ho et al.24 as a parametrized Markov chain trained with variational inference. A Markov process is a stochastic process in which the future depends only on the present state. Dhariwal and Nichol25 showed that DDPMs can achieve superior image quality, beating GANs.

The diffusion process in DDPMs is inspired by the thermodynamic process of diffusion. A DDPM comprises forward and backward diffusion processes: the forward process gradually adds Gaussian noise to an image, while the backward process attempts to recover the original image from its noisy version (Fig. 1).

Figure 1

Forward \(q(x_{t}|x_{t-1})\) and backward \(p_{\theta }(x_{t-1}|x_{t})\) diffusion processes. \(X_{0}\) is the original image, while \(X_{T}\) approaches pure Gaussian noise as \(T\rightarrow \infty\).

For image generation, the backward process starts from pure Gaussian noise at \(t=T\) and is repeated until we obtain \(X_{0}\) at \(t=0\). In the following sections, we use y to represent the measurement, \(X_{ref}\) for the reference (ground truth) image, and \(X_{0}\) for the generated output, in line with the conventions of the diffusion and score-based model literature.
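To make the forward process concrete, the minimal sketch below draws \(x_{t}\) directly from \(x_{0}\) using the closed-form marginal of the forward chain; the linear noise schedule is an illustrative assumption, not necessarily the schedule of the specific pre-trained model used later.

```python
import torch

# Sketch of the DDPM forward (noising) process; linear beta schedule assumed.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def q_sample(x0: torch.Tensor, t: int) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) * x_0, (1 - abar_t) * I)."""
    abar = alphas_cumprod[t]
    noise = torch.randn_like(x0)
    return abar.sqrt() * x0 + (1.0 - abar).sqrt() * noise
```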

Conditional generation from unconditional diffusion model

The sequential nature of diffusion models allows researchers to manipulate the generation process. Several studies have exploited this principle to generate conditional images with an unconditional model. In this framework, the measured image is modeled as a noisy version of the true image transformed by a degradation operation:

$$\begin{aligned} y = Ax+n \end{aligned}$$
(1)

The degradation operation (A) maps the real value (x) to the measured value (y), and (n) represents the measurement noise. The degradation depends on the inverse problem at hand: for super-resolution, the operation is downsampling with Gaussian blurring; for inpainting, it is multiplication by a mask operator.

Choi et al.26 defined a process called Iterative Latent Variable Refinement (ILVR). The proposed algorithm generates \(x'_{t-1}\) from \(x_{t}\) by unconditional generation and then updates the prediction \(x'_{t-1}\) using the condition:

$$\begin{aligned} x_{t-1} = x'_{t-1} + A^{T}(y_{t-1})- A^{T}A \left( x'_{t-1}\right) \end{aligned}$$
(2)

where \(A^{T}\) is the inverse operation of the degradation, e.g., upsampling for super-resolution. This process can be viewed as a projection onto the desired image manifold. The measurement is assumed noiseless (\(n=0\)) in this setting.
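A minimal sketch of this projection step, assuming `A` (degradation) and `A_T` (its approximate inverse, e.g., downsampling and upsampling) are supplied as callables:

```python
import torch

def ilvr_step(x_uncond: torch.Tensor, y_noised: torch.Tensor, A, A_T) -> torch.Tensor:
    """Eq. (2): swap the degraded content of the unconditional prediction
    x'_{t-1} for that of the (forward-noised) measurement y_{t-1}."""
    return x_uncond + A_T(y_noised) - A_T(A(x_uncond))
```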

Chung et al. (a)27 proposed an improvement to projection-based approaches named Manifold Constraint Gradient (MCG) correction. They devised a series of constraints to ensure that the gradient of the measurement term remains on the data manifold.

Projection-based approaches may perform well in some cases, but they have downsides: since the predicted image is projected onto the measurement, any noise in the measurement is amplified at each iteration. Chung et al. (b)28 developed an alternative approach, named Diffusion Posterior Sampling (DPS), that applies a gradient-based correction to the predicted image. In this way, measurement noise is handled within the gradient step, unlike in projection-based methods.

$$\begin{aligned} x_{t-1} = x'_{t-1} - \eta \nabla _{x_{t-1}}||y-A(x'_{0})||^{2}_{2} \end{aligned}$$
(3)

\(\eta\) is the scaling factor for DPS, which determines the strength of the measurement in the final image. Chung et al. (b)28 experimentally found that \(\eta\) should lie between 0.1 and 1. Smaller \(\eta\) values lead to hallucinations in the generated images, while larger values may produce artifacts.
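A sketch of one DPS correction in PyTorch follows. Here `x_t` carries gradients and `x0_hat` is the model's estimate of the clean image derived from `x_t` (computed elsewhere, e.g., via Tweedie's formula); these names and the surrounding sampler interface are our assumptions.

```python
import torch

def dps_correction(x_prev_uncond, x_t, x0_hat, y, A, eta=0.5):
    """Eq. (3): gradient-based correction of the unconditional prediction.
    x_t must have requires_grad=True and x0_hat must depend on x_t."""
    residual = torch.sum((y - A(x0_hat)) ** 2)   # ||y - A(x'_0)||^2_2
    grad = torch.autograd.grad(residual, x_t)[0]
    return x_prev_uncond - eta * grad
```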

The formulation presented in Eq. (4) illustrates our contribution to DPS conditioning, namely regularization with the \(l_{1}\) norm of the wavelet coefficients. In Eq. (4), \(\mathcal {F}\) denotes the wavelet decomposition and \(\lambda _{W}\) the corresponding regularization coefficient. We experimentally set \(\lambda _{W}\) to 10: below this value we observe no regularization effect, while above it the regularization becomes excessive and the resulting smoothing degrades the outcome. We selected the B-spline biorthogonal wavelet with 3 and 5 vanishing moments (bior3.5) to achieve sparse representations, and perform a level-3 discrete wavelet transform using the \(pytorch\_wavelets\) package29 in Python.

$$\begin{aligned} x_{t-1} = x'_{t-1} - \eta \nabla _{x_{t-1}}\left( ||y-A(x'_{0})||^{2}_{2} + \lambda _{W}\Vert \mathcal {F}(x'_{0})\Vert _{1}\right) \end{aligned}$$
(4)

In a similar vein, we use DPS conditioning with TV regularization in Eq. (5), where V denotes the total variation of the predicted image \(x'_{0}\) and \(\lambda _{TV}\) is the corresponding regularization coefficient. To calculate the TV norm, we use the \(total\_variation\) function from the torchmetrics library30. The value of \(\lambda _{TV}\) is experimentally set to \(10^{-4}\).

$$\begin{aligned} x_{t-1} = x'_{t-1} - \eta \nabla _{x_{t-1}}\left( ||y-A(x'_{0})||^{2}_{2} + \lambda _{TV}\Vert V(x'_{0})\Vert \right) \end{aligned}$$
(5)
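Both regularized objectives can be written as a single data-plus-penalty loss whose gradient replaces the plain DPS residual. The sketch below uses the packages named above; exact import paths may vary with the torchmetrics version, and the NCHW input shape is our assumption.

```python
import torch
from pytorch_wavelets import DWTForward                 # level-3 bior3.5 DWT, per the text
from torchmetrics.functional.image import total_variation

dwt = DWTForward(J=3, wave='bior3.5')

def regularized_residual(y, x0_hat, A, lam_w=10.0, lam_tv=1e-4, use_wavelet=True):
    """Data term of Eqs. (4)/(5) plus an l1 wavelet or TV penalty on x'_0.
    x0_hat is an NCHW tensor; lam_w=10 and lam_tv=1e-4 follow the text."""
    loss = torch.sum((y - A(x0_hat)) ** 2)
    if use_wavelet:
        yl, yh = dwt(x0_hat)                            # approximation + detail coefficients
        loss = loss + lam_w * (yl.abs().sum() + sum(h.abs().sum() for h in yh))
    else:
        loss = loss + lam_tv * total_variation(x0_hat)
    return loss
```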

Come-closer-diffuse-faster

Although pre-trained diffusion models are robust and domain-adaptable, they are computationally expensive: producing even a single image is time-consuming. Chung et al. (c)31 argue that, for inverse problems, starting the generation from pure Gaussian noise is redundant. Given an estimate of the initial image \(x_{0}\), the backward diffusion process can begin at a timestep \(t \ll T\); they call this process Come-Closer-Diffuse-Faster. The initial estimate can be as simple as an interpolation or as fine-detailed as the output of a fully trained neural network, and the better the estimate, the shorter the diffusion process can be31. This method allows us to use existing deep-learning models and go beyond their performance.
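A sketch of this shortcut: forward-noise the approximate solution to timestep N using the closed-form marginal, then run the reverse process only from N down to 0. `alphas_cumprod` is the model's cumulative alpha schedule (an assumption about the surrounding code).

```python
import torch

def ccdf_start(x_approx: torch.Tensor, N: int, alphas_cumprod: torch.Tensor) -> torch.Tensor:
    """Noise an approximate solution to timestep N << T; reverse diffusion
    then starts from x_N instead of pure Gaussian noise."""
    abar = alphas_cumprod[N]
    return abar.sqrt() * x_approx + (1.0 - abar).sqrt() * torch.randn_like(x_approx)
```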

Dataset

For all experiments, we employ OpenAI's unconditional DDPM model32 trained on the ImageNet 256\(\times\)256 dataset33. Since we use a pre-trained diffusion model, training and validation of diffusion models are beyond the scope of this study.

The proposed method is tested on the Duke PAM34 dataset, an open-source database acquired with an optical-resolution PAM system at a wavelength of 532 nm in the Photoacoustic Imaging Lab at Duke University. The system has a lateral resolution of 5 \(\upmu\)m and an axial resolution of 15 \(\upmu\)m11. The dataset consists primarily of in vivo images of mouse brain microvasculature, from which 20 images are randomly selected as the test set.

The selected diffusion model was trained on 256\(\times\)256 images, so central cropping is applied to bring the input dimensions to 256\(\times\)256. Images are then normalized by mapping the intensity range to \([-1,1]\) before being passed to the diffusion model. These normalized images serve as the ground truth (\(X_{ref}\)).
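A minimal sketch of this preprocessing, assuming PIL image inputs:

```python
import torchvision.transforms.functional as TF

def preprocess(img):
    """Center-crop to 256x256 and map intensities from [0, 1] to [-1, 1]."""
    img = TF.center_crop(img, [256, 256])
    x = TF.to_tensor(img)        # PIL image -> CHW float tensor in [0, 1]
    return x * 2.0 - 1.0
```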

At this step, our study divides into three paths to generate the input images (y), all three of which are sketched in code after the list:

  1. For super-resolution, a 1/4 downsampling operation is performed, combined with low-pass filtering to prevent aliasing, using the Resizer Python package developed by Shocher et al.35. The resulting image therefore has 6.25% of the pixels of the original image.

  2. For inpainting with uniform undersampling, every four rows out of five are replaced with zeros, leaving 20% of the original pixels.

  3. For inpainting with random undersampling, 80% of the pixels are replaced with zeros, leaving 20% of the original pixels.
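The sketch below implements all three degradation paths on an NCHW tensor; the antialiased bicubic resize stands in for the Resizer package of Shocher et al.35, so the exact kernel is our assumption.

```python
import torch
import torch.nn.functional as F

def downsample_4x(x):
    """Path 1: antialiased 1/4 resize (6.25% of the original pixels)."""
    return F.interpolate(x, scale_factor=0.25, mode='bicubic', antialias=True)

def uniform_mask(x, keep_every=5):
    """Path 2: keep one row in five (20% of the original pixels)."""
    m = torch.zeros_like(x)
    m[..., ::keep_every, :] = 1.0
    return x * m

def random_mask(x, keep_frac=0.2):
    """Path 3: keep 20% of the pixels at random locations."""
    m = (torch.rand_like(x) < keep_frac).float()
    return x * m
```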

In PAM imaging, variations in the step sizes along the fast and slow scanning axes can result in differing resolutions along the x and y axes. Random undersampling maintains the same sparsity level without a fixed pattern, thus reducing potential biases from the original scanning step sizes. Differences in reconstruction quality between uniform and random undersampling can therefore be attributed to inherent biases within the original images. For a fair comparison with random undersampling, uniform downsampling was also performed equally along both the x and y axes.

Each setup corresponds to a distinct real-world scenario, as elaborated in subsequent sections. Researchers are encouraged to adopt the pathway that best aligns with their objectives and experimental setups.

Neural network as an approximate solution

Previous works mainly focused on reconstructing uniformly undersampled PAM images11,12, approaching the task as an inpainting problem solved with U-Net-shaped neural networks. In this study, we demonstrate that such existing solutions can be improved by our algorithm. To this end, we trained a U-Net model with residual connections, which was then evaluated comparatively and incorporated into our proposed algorithmic framework.

We used the 337 samples of the Duke PAM training dataset for model training. Each image is randomly cropped at 10 different locations during data loading. For regularization, data augmentation is applied to each training image: random crops, horizontal and vertical flips with 30% probability, random rotations of up to 20\(^\circ\), and random Gaussian blur. Images are then normalized to the range \([-1,1]\) and used as the ground truth. Input images are constructed by uniformly undersampling the ground truth, replacing every four rows out of five with zeros and leaving 20% of the original pixels; the empty rows are then filled by bilinear interpolation before being fed to the U-Net. Figure 2 illustrates an example image with uniform masking and the resulting bilinear interpolation.
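A sketch of this training pipeline with torchvision; wrapping the blur in RandomApply is our reading of "random Gaussian blur", and the bilinear refill via resize is a stand-in for the interpolation step described above.

```python
import torch
import torch.nn.functional as F
import torchvision.transforms as T

augment = T.Compose([
    T.RandomCrop(256),
    T.RandomHorizontalFlip(p=0.3),
    T.RandomVerticalFlip(p=0.3),
    T.RandomRotation(degrees=20),
    T.RandomApply([T.GaussianBlur(kernel_size=3)], p=0.3),
])

def make_input(gt):                       # gt: NCHW ground truth in [-1, 1]
    """Uniformly undersample (keep 1 row in 5) and refill by bilinear resize."""
    sparse = gt[..., ::5, :]              # keep every fifth row
    return F.interpolate(sparse, size=gt.shape[-2:], mode='bilinear',
                         align_corners=False)
```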

Figure 2

An example mouse brain microvasculature image from the Duke PAM dataset. (a) Original (ground truth) image, (b) uniformly undersampled along the x-axis (every four rows out of five replaced with zeros), and (c) missing pixels filled via bilinear interpolation.

The U-Net model employed in this study comprises convolutional blocks (Conv Blocks) consisting of two-dimensional convolutional layers (Conv2D) followed by Mish activation functions36. The architecture includes three downsampling blocks and corresponding upsampling blocks, with a middle layer incorporating linear attention (Supp. Fig. 1). The network was optimized using the Adam optimizer37 from the PyTorch optimizer package and an \(l_{1}\) loss function with batch size 8. The Adam parameters were \(\beta _{1}=0.9\), \(\beta _{2}=0.999\), \(\epsilon =10^{-8}\), and weight decay \(=0\) (the PyTorch defaults). We used PyTorch version 1.11. Hyperparameters were tuned by splitting off 20% of the training data as a validation set. The initial learning rate was \(3\times 10^{-4}\), with step scheduling halving the learning rate every 2000 epochs. Training ran for 5000 epochs, beyond which we observed no substantial performance increase.
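This optimization setup maps directly onto PyTorch; `model` below stands for the residual U-Net described above (an assumption about the surrounding code).

```python
import torch

# Adam with the stated (default) parameters, l1 loss, and step scheduling
# that halves the learning rate every 2000 epochs.
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4,
                             betas=(0.9, 0.999), eps=1e-8, weight_decay=0)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2000, gamma=0.5)
criterion = torch.nn.L1Loss()
```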

The DiffPam reconstruction algorithm

The steps of the algorithm are the following:

  • The operation (A) transforms the real value (\(X_{ref}\)) into the measured value (y). For super-resolution, the operation is downsampling; for inpainting, it is multiplication by a mask operator.

    $$\begin{aligned} y = AX_{ref} \end{aligned}$$
    (6)
  • Optional: An approximate solution to the inverse problem may be supplied. The better the real value is approximated, the fewer diffusion steps are needed. The solution depends on the transform operator A, but in general an interpolation or a simple trained neural network can serve as the approximator. Depending on the quality of the approximate solution, the starting point (\(N<T\)) of the backward diffusion process is chosen.

  • After obtaining \(X_{N}\), an unconditional diffusion model is employed to generate \(X_{N-1}\). Then, using regularized DPS conditioning, the image generation process is iterated in the direction of interest.

Algorithms 1 and 2 illustrate the DiffPam algorithm without and with an approximate solution, respectively. Table 1 summarizes the experiments conducted in this study regarding the choice of the measurement operation (A), the approximate solution, and the starting point of the diffusion process (N).

Table 1 The experiments conducted in this study, regarding the selection of the measurement operations (A), approximate solutions, and the starting point for the diffusion processes (N).

The diffusion model that we used in this study is trained with \(T=1000\) diffusion steps. Depending on the quality of the approximate solution, we have selected \(N=500\) for bilinear and bicubic interpolation and \(N=200\) for pre-trained U-Net outputs.

Image production time is directly proportional to the number of sampling steps: selecting \(N=T/k\) reduces the computing time by a factor of k.

Algorithm 1

DiffPam reconstruction algorithm

Algorithm 2

DiffPam with approximate solution
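As a rough end-to-end sketch of Algorithm 2 with TV regularization: `reverse_step` is assumed to perform one ancestral sample of \(p_{\theta }(x_{t-1}|x_{t})\) and also return the current estimate \(x'_{0}\); these names and the sampler interface are our assumptions, not the released code.

```python
import torch
from torchmetrics.functional.image import total_variation

def diffpam(y, A, x_approx, N, model, reverse_step, alphas_cumprod,
            eta=0.5, lam_tv=1e-4):
    """Reconstruct from measurement y, starting at t = N << T (CCDF)."""
    abar = alphas_cumprod[N]
    x = abar.sqrt() * x_approx + (1.0 - abar).sqrt() * torch.randn_like(x_approx)
    for t in range(N, 0, -1):
        x = x.detach().requires_grad_(True)
        x_prev, x0_hat = reverse_step(model, x, t)   # unconditional proposal
        loss = torch.sum((y - A(x0_hat)) ** 2) + lam_tv * total_variation(x0_hat)
        grad = torch.autograd.grad(loss, x)[0]
        x = x_prev - eta * grad                      # regularized DPS (Eq. 5)
    return x.detach()
```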

Evaluation

The quality of reconstructed images is evaluated using three key metrics: PSNR, the structural similarity index (SSIM), and the learned perceptual image patch similarity (LPIPS). PSNR is the logarithm of the ratio between the peak signal value and the mean squared error (MSE) between the reference and reconstructed images; it goes to infinity as the MSE approaches zero.

$$\begin{aligned} {\text {MSE}}({\textbf {x}}, {\textbf {y}}) = \frac{1}{mn} \sum _{i=1}^{m} \sum _{j=1}^{n} (\textbf{x}(i,j)-\textbf{y}(i,j))^{2} \end{aligned}$$
(7)
$$\begin{aligned} {\text {PSNR}}({\textbf {x}}, {\textbf {y}}) = 10\log _{10}\frac{\max ^{2}(\textbf{x})}{{\text {MSE}}(\textbf{x}, \textbf{y})} \end{aligned}$$
(8)

On the other hand, SSIM is designed to model image distortion as a combination of three factors: loss of correlation, luminance distortion, and contrast distortion38. Equation (9) gives the specific form of the SSIM index used in this work, where \(\mu\) and \(\sigma\) denote the mean and standard deviation of the given image patches39, \(C_{1} = (K_{1}L)^{2}\) and \(C_{2} = (K_{2}L)^{2}\), with \(K_{1}\) and \(K_{2}\) small constants and L the dynamic range of the image pixels. SSIM values range between zero and one: zero means there is no correlation between the images, and one indicates identical images.

$$\begin{aligned} {\text {SSIM}}(\textbf{x}, \textbf{y})=\frac{\left( 2 \mu _x \mu _y+C_1\right) \left( 2 \sigma _{x y}+C_2\right) }{\left( \mu _x^2+\mu _y^2+C_1\right) \left( \sigma _x^2+\sigma _y^2+C_2\right) } \end{aligned}$$
(9)

While PSNR and SSIM are widely acknowledged, they may occasionally fall short of human perceptual judgments. In response, LPIPS, developed by Zhang et al.40, leverages deep neural networks to offer a measure of image similarity closer to human perception. Contrary to PSNR and SSIM, lower LPIPS scores mean better image quality. In this study, the scikit-image framework41 is employed to calculate PSNR and SSIM, whereas the lpips Python package with AlexNet is used to compute LPIPS scores. For the SSIM calculation, image patches of size \(7 \times 7\) are used, with default parameters \(L=255\), \(K_{1} = 0.01\), and \(K_{2} =0.03\).
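A sketch of the metric computation with the named libraries; `x_ref` and `x_rec` are H\(\times\)W uint8 arrays, and the grayscale-to-RGB repetition for the AlexNet backbone is our assumption.

```python
import lpips
import numpy as np
import torch
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def to_lpips_tensor(a: np.ndarray) -> torch.Tensor:
    """H x W uint8 image -> 1 x 3 x H x W tensor in [-1, 1] (AlexNet expects RGB)."""
    t = torch.from_numpy(a.astype(np.float32) / 127.5 - 1.0)
    return t.unsqueeze(0).unsqueeze(0).repeat(1, 3, 1, 1)

psnr = peak_signal_noise_ratio(x_ref, x_rec, data_range=255)
ssim = structural_similarity(x_ref, x_rec, win_size=7, data_range=255,
                             K1=0.01, K2=0.03)
loss_fn = lpips.LPIPS(net='alex')
lpips_score = loss_fn(to_lpips_tensor(x_ref), to_lpips_tensor(x_rec)).item()
```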

In addition to these metrics, we also include the Mean Absolute Error (MAE) to provide further insights into the performance of our method. MAE measures the average magnitude of the absolute differences between the predicted and true values, providing a straightforward interpretation of prediction accuracy. The MAE is defined as:

$$\begin{aligned} {\text {MAE}}({\textbf {x}}, {\textbf {y}}) = \frac{1}{mn} \sum _{i=1}^{m} \sum _{j=1}^{n} |\textbf{x}(i,j)-\textbf{y}(i,j)| \end{aligned}$$
(10)

Our sample size is relatively small (n = 20), and the data do not satisfy the normality condition according to the Kolmogorov-Smirnov test. Hypothesis testing is therefore performed using non-parametric Wilcoxon signed-rank tests, which test the null hypothesis that two paired samples come from the same distribution42. Paired tests are appropriate because each data point in one sample is directly paired with a data point in the other sample, as is the case for our images. The alpha level for statistical significance is set to 0.05.
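The statistical comparison reduces to a few lines with scipy; `scores_a` and `scores_b` are hypothetical names for per-image metric values of two methods.

```python
from scipy.stats import wilcoxon

# Paired non-parametric test on per-image metric values, alpha = 0.05.
stat, p = wilcoxon(scores_a, scores_b)
significant = p < 0.05
```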

Results

Producing a single 256\(\times\)256 image with \(T=1000\) sampling steps took 18 min on average on Amazon Web Services (AWS) SageMaker Studio Lab accelerated computing. Starting from the halfway point, image reconstruction time is reduced to 9 min; starting from the U-Net model output, it can be reduced to 3.6 min.

Inpainting

For inpainting, 20 images are randomly selected from the test set and subjected to uniform and random masking, leaving 20% effective pixels. The images are then reconstructed using the pre-trained U-Net and the diffusion model conditioned on the undersampled image (DiffPam). Four DiffPam experiments are conducted for inpainting, as described in the “Methods” section.

Three of the four inpainting experiments used uniform masking along the x-axis with differing approximate solutions. Figure 3 shows the ground truth (fully sampled) PAM image, the bilinear interpolation, and the reconstructions from the undersampled version. For simplicity, only the diffusion model initialized with the bilinear input (B-DiffPam) is illustrated in the figure.

Figure 3

An example PAM image from our test set. (a) Original (ground truth) image, (b) bilinear interpolation applied to the uniformly undersampled image, (c) reconstructed image by the pre-trained U-Net, and (d) reconstructed image by DiffPam using (b) as the input. The second row shows a zoomed view of the boxed region.

Our investigation reveals that both wavelet and TV regularization outperform unregularized DPS (\(p<0.01\)) in terms of PSNR and SSIM. While there is no statistically significant difference between TV and wavelet regularization in PSNR or SSIM, TV regularization yields better LPIPS than wavelet regularization (\(p<0.01\)) (Supplementary Table 1). Consequently, we report the TV-regularized DiffPam results and refer readers to the supplementary material for a comprehensive examination of the outcomes.

Table 2 The results of reconstruction methodologies from uniform undersampling in terms of MAE, PSNR, SSIM and LPIPS metrics (mean ± standard error).

Table 2 presents the reconstruction results for uniform undersampling in terms of MAE, PSNR, SSIM, and LPIPS. All methods yield superior results compared to interpolation (\(p<0.0001\)). DiffPam produces lower PSNR values than the pre-trained U-Net model (\(p<0.05\)), as the U-Net was trained for this exact objective; however, U-Net tends to generate smoother outputs, which the PSNR metric rewards more than human observers do. No statistically significant differences in MAE, SSIM, or LPIPS are observed among DiffPam, B-DiffPam, and U-Net, demonstrating that a diffusion model trained solely on natural images can perform on par with a domain-specifically trained U-Net. U-DiffPam achieves higher SSIM and lower MAE and LPIPS scores (superior in all three) than the pre-trained U-Net model (\(p=0.044\)).

As an alternative to uniform masking, we propose scanning randomly distributed pixels with an equal number of scanned pixels. We reconstructed the selected test images under random masking using \(T=1000\) sampling steps with the same diffusion model and compared the results against uniform masking. Table 3 reports the PSNR, SSIM, and LPIPS of images reconstructed from uniform undersampling along both the x- and y-axes and from random undersampling. Random undersampling, at the same effective pixel density, consistently outperforms uniform undersampling across all metrics (\(p<0.05\)).

Table 3 The results of uniform and random undersampling, along with their corresponding reconstructions in terms of PSNR, SSIM, and LPIPS metrics (mean ± standard error).
Figure 4

Visual comparison of DiffPam inpainting results for random and uniform masks. (a) Original (ground truth) image; reconstructions from (b) the uniform mask along the x-axis, (c) the uniform mask along the y-axis, and (d) the random mask. The second row shows a zoomed view of the upper box, and the third row of the lower box. (b1) and (b2) contain artifacts in the microvasculature, whereas (c1, c2, d1, and d2) resemble the ground truth.

Upon examining the reconstructed images, we noted certain microvasculature artifacts exclusively in images uniformly undersampled along the x-axis, a phenomenon absent from the other methods (Fig. 4). Moreover, implementing random undersampling could introduce mechanical instability and necessitate modifications to the existing motor control, which might not be feasible with current hardware configurations. To address these challenges while maintaining mechanical stability, we propose uniform undersampling along both axes. This approach reduces the inherent biases caused by differing step sizes along the fast and slow scanning axes, thus enhancing the quality of the reconstructed images.

Super resolution

Another practical application of the DiffPam algorithm is increasing the spatial resolution of optical-resolution PAM (OR-PAM). The corresponding operation in the image-processing domain is downsampling with cubic-kernel smoothing. We experimented with this setting on six images from the test set and improved PSNR by 18.2% and SSIM by 31.7% while lowering LPIPS by 54.0% on average. Figure 5 illustrates the ground truth, downsampled, and reconstructed images, along with the pixel values across the red line.

Figure 5

An example super-resolution result from a 4\(\times\) downsampled image. (a) Original (ground truth) image, (b) 4\(\times\) downsampled image, (c) image reconstructed by DiffPam, (d) pixel values across the red line in the images. DiffPam reconstructs the downsampled image using prior knowledge of natural images.

Discussion

Key contributions

In this study, we demonstrated that the proposed algorithm, DiffPam, adeptly reconstructs undersampled PAM images without the need for a large dataset or for training a deep learning model. The diffusion model, trained solely on the ImageNet database of natural images, performed on par with a U-Net trained specifically on a mouse brain microvasculature image database. By reducing the number of diffusion steps, the reconstruction process can be accelerated further. Moreover, for users with an existing trained model, our algorithm provides a means to enhance results without interfering with the original model.

Pre-training the U-Net on a large-scale dataset such as ImageNet prior to fine-tuning on the Duke PAM dataset could potentially yield additional insights and improvements. However, it is noteworthy that DiffPam inherently leverages the knowledge from ImageNet without the necessity for explicit transfer learning, thereby demonstrating its robustness and adaptability across different domains. This approach highlights the efficacy of diffusion models in effectively utilizing large-scale datasets for a variety of applications.

The reconstruction of undersampled PAM images has recently received attention in two notable studies11,12. DiSpirito et al.11 undertook a comprehensive investigation, training multiple versions of U-Net in a fully supervised manner. They reported a substantial improvement over the bicubic-interpolation baseline, achieving SSIM and PSNR increases of \(6.42\%\) and \(16.73\%\), respectively, with \(20\%\) effective pixels and uniform undersampling, a strategy akin to ours. In contrast, Vu et al.12 adopted a different methodology, employing DIP to train a U-Net model on a single undersampled image. Although the DIP approach, like our algorithm, requires neither a large dataset nor supervised training, it demands careful selection of the neural network architecture and a delicate optimization process, which requires extensive knowledge of the subject.

In our investigation, we demonstrate substantial advancements over baseline methods. Specifically, we achieved a \(27.54\%\) increase in SSIM, a \(24.70\%\) increase in PSNR, and a remarkable \(45.83\%\) decrease in LPIPS compared to our baseline of bilinear interpolation. Discrepancies among studies may stem from variations in hyperparameter selection and differences in the composition of the test set. Nevertheless, our improvements over the baseline are statistically significant (\(p < 0.0001\)).

Notably, neither study incorporated external regularization akin to the wavelet and TV techniques. We demonstrated the significant impact of regularization (\(p < 0.01\)) on external conditioning in diffusion models, highlighting its effectiveness in improving reconstruction outcomes. Since pixel removal and aliasing cause information loss, generative methods attempting to recover it may produce inaccurate regions, including biologically improbable structures. While regularization plays a crucial role in mitigating these effects, fully addressing this challenge remains an open problem.

A key outcome of our study is that the proposed algorithm can be used by researchers with limited expertise or computational resources in artificial intelligence. The diffusion model employed in our investigation is openly accessible and can be substituted with any other proficient diffusion model. Our source code is available in a public GitHub repository (github.com/iremzog/diffpam). For scientists facing constraints in accessing accelerated computing, truncated diffusion processes offer a viable way to reduce computation time.

Undersampling strategies

Since the speeds of the scanning axes in PAM imaging may differ considerably, skipping every few lines along the slower axis is a convenient way to speed up image acquisition. Uniform masking of PAM images, commonly documented in the literature, represents this process. Our findings reveal that undersampling along the fast versus the slow axis significantly affects results, exposing an inherent bias in PAM images. These results emphasize the critical need to account for axis-specific biases. We recommend that researchers undersample both axes, whether uniformly or randomly, to further enhance the quality and accuracy of their reconstruction outcomes.

An alternative strategy involves employing larger spatial resolutions, followed by utilizing super-resolution techniques to reconstruct finer details. In this study, our objective was to elucidate the limitations inherent in all approaches, leaving the decision regarding methodology selection to the discretion of researchers.

Limitations and future works

Our study is limited to a single image dataset of narrow scope, namely mouse brain microvasculature. While a test set with more images would provide a more robust evaluation, repeating all experiments with a significantly larger dataset is beyond the scope of this manuscript given current computational constraints. Future work should apply DiffPam to a larger selection of images from the Duke PAM dataset to ensure the comparability and robustness of the results.

Future research should employ broader datasets, including various tissue types and pathological conditions, to validate and generalize our findings. Engaging clinicians in the research process will be crucial for obtaining expert insights and ensuring the clinical relevance of the outcomes. Future work could also explore the application of DiffPam to other imaging modalities, such as Raman spectroscopy, where similar challenges with scanning speed and resolution exist. Integrating DiffPam with advanced hardware techniques, optimizing the method for real-time imaging, and developing software tools will further enhance its practical applicability and impact.

Conclusion

In conclusion, our study introduces the DiffPam algorithm, showcasing its efficiency in reconstructing undersampled PAM images without the need for extensive datasets or deep-learning model training. The diffusion model, trained exclusively on a natural image database, demonstrated performance comparable to an in-domain trained neural network. We also address the escalating demand for computing power in advancing AI technology, emphasizing the importance of overcoming obstacles in accessing high computational resources for scientific progress, and propose reducing computing time through shortened diffusion processes without compromising accuracy. Additionally, our exploration of random and uniform sampling techniques in PAM imaging underscores the superiority of random sampling in preserving valuable information. We acknowledge the limitations of our study, namely its confinement to a specific image dataset and its relatively small sample size due to resource constraints. Nevertheless, we expect our findings to inspire further research in this domain, offering researchers with limited resources a valuable algorithmic tool for PAM image reconstruction. The algorithm's accessibility is underscored by the availability of the source code in a public GitHub repository, providing a practical way for researchers to implement and extend our work.