Introduction

The production of nanomedicines has attracted considerable interest in the pharmaceutical field because of its importance for improving drug efficacy. Producing drugs at the nanoscale is one strategy for enhancing drug solubility, since reducing the particle size increases the surface area, which in turn improves solubility13. Given that most newly discovered drugs are poorly soluble in aqueous media, fundamental research is required to enhance drug solubility through various strategies such as nanonization, amorphous solid dispersion, crystallization, and salt formation4,5,6,7,8.

Drug nanonization has been used as an attractive technique for optimizing drugs through enhancement of their solubility in the body. Supercritical solvents are particularly attractive because of the favorable characteristics of supercritical processing for the preparation of nanosized drug particles3,6,9,10,11,12. The use of supercritical CO2 as a safe solvent in pharmaceutical processing has been approved by regulatory authorities. Moreover, supercritical CO2 offers advantages such as low cost, easy operation, and a moderate critical point compared with other gases. Therefore, supercritical CO2 is a good choice as a green solvent for the preparation of nanomedicines in the pharmaceutical field7,13,14,15,16.

Before implementing a nanonization process based on supercritical technology, the dissolution of the drug in the solvent must first be analyzed. Drug solubility in a supercritical process can be determined either experimentally or computationally, and the computational route is the more attractive of the two17,18. Experimental measurement requires extensive time and cost, whereas computational methods save both and can also be used to interpolate and extrapolate the data19,20,21,22,23.

Different computational techniques have been used to model drug solubility, among which thermodynamic models and machine learning have shown the best performance. The thermodynamic model establishes equilibrium between the solid phase and the solvent phase to determine the solubility19,24,25,26,27,28. Machine learning models, on the other hand, need measured data for training and validation of the algorithms29,30. Machine learning methods have been shown to be easier to implement and to offer better accuracy for predicting drug solubility in solvents.

Machine learning (ML), a branch of artificial intelligence, is a set of techniques for learning patterns in data without any prior assumptions about their structure. One strength of these techniques is that they establish relations among the data and then estimate the underlying interactions. An important application of machine learning is regression, which is the type of problem addressed in this study31,32,33,34. In this research, three approaches were chosen to model the drug solubility: Kernel Ridge Regression (KRR), Decision Tree Regression (DTR), and Gaussian Process Regression (GPR). Indeed, we implemented these efficient ML models for the first time for simulation of decitabine solubility in supercritical CO2 as the solvent. The results can help to assess the applicability of the supercritical process for preparing this drug candidate at the nanoscale35.

Kernel Ridge Regression (KRR) combines ridge regression with the kernel approach. KRR has the advantage of capturing nonlinear relationships while avoiding over-fitting through regularization and kernel techniques35.

A decision tree regressor (DTR) is a straightforward, comprehensible, and efficient approach. The core principle of the decision tree algorithm is to divide a large problem into multiple smaller sub-problems, which can lead to an easier-to-interpret response36,37. A DTR consists of a set of conditional queries ordered hierarchically and evaluated from the tree's root to a leaf38. DTRs are easy to understand and have a clear structure. They produce a trained predictor, can express rules, and forecast new data using a splitting procedure that is applied repeatedly39,40.

The other model employed in this study is based on the Gaussian process (GP), a statistical concept that refers to a collection of random variables, any finite number of which have a joint Gaussian distribution41,42. In geostatistics, the Gaussian process is the fundamental stochastic process. Gaussian processes directly represent Gaussian data and form the basis for non-Gaussian models such as linear regression models. As a result, Gaussian process regression (GPR), which is based on GPs, is both accurate and straightforward for small datasets and generalizes well43,43,45. In addition, target prediction studies of decitabine are conducted in this work to gain better insight into the plausible targets of this drug. We use a hybrid approach that combines binding and ligand similarity analysis to predict other putative targets of decitabine. The purpose of this research is to model the solubility of decitabine. GPR was selected as the best model, and it reveals that increasing both inputs generally increases the solubility of the drug. The optimum is found at P = 400 bar, T = 3.38E+02 K, Y = 1.07E−03.

Data set

To build the solubility models, we used a dataset with 32 data points taken from the literature46,47. The experimental data were collected from the reference, and the machine learning models were fitted to these data. A more detailed description of the method and experimental conditions can be found in the source paper46. Here, two inputs are considered, pressure (bar) and temperature (K), and a single output, the solubility of decitabine in supercritical carbon dioxide (CO2). The entire dataset is shown in Table 1.

Table 1 The whole dataset46,47.

Methodology

Kernel ridge regression (KRR)

The first machine learning (ML) method considered here for correlating the drug solubility values is Kernel Ridge Regression (KRR). Suppose a data set \({\left\{({x}_{i},{y}_{i})\right\}}_{i=1}^{N}\) of \(N\) data points is given, and the goal is to estimate a function \(f\) that minimizes the mean squared error (MSE), \({\mathbb{E}}[{(f\left(x\right)-y)}^{2}]\). The conditional mean \({f}^{*}\left(x\right) : ={\mathbb{E}}[Y|X=x]\) is the best such function48. The function \({f}^{*}\) is estimated by

$$\widehat{f}:=\underset{f\in H}{\mathrm{argmin}}\left\{\frac{1}{N}\sum_{i=1}^{N}{\left(f\left({x}_{i}\right)-{y}_{i}\right)}^{2}+ \lambda {\left|\left|f\right|\right|}_{H}^{2}\right\}$$
(1)

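Here, \(H\) is a reproducing kernel Hilbert space, and the coefficient of the norm term controls the strength of the regularization. This estimator is known as kernel ridge regression35.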
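As a rough illustration of Eq. (1), the following minimal Python sketch solves the kernel ridge problem in closed form. It relies on the standard result that the minimizer can be written as \(\widehat{f}(x)=\sum_{i}{\alpha }_{i}k(x,{x}_{i})\) with \(\alpha ={(K+\lambda NI)}^{-1}y\). The Gaussian (RBF) kernel, the function names, the values of `lam` and `gamma`, and the toy data are all assumptions made for illustration; they are not the settings or the data of Table 1.

```python
import math


def rbf_kernel(a, b, gamma=1.0):
    """Gaussian (RBF) kernel between two feature vectors (assumed kernel choice)."""
    return math.exp(-gamma * sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def krr_fit(X, y, lam=1e-3, gamma=1.0):
    """Return dual coefficients alpha = (K + lam * N * I)^-1 y for Eq. (1)."""
    n = len(X)
    K = [[rbf_kernel(X[i], X[j], gamma) for j in range(n)] for i in range(n)]
    for i in range(n):
        K[i][i] += lam * n  # ridge term from the norm penalty
    return solve_linear(K, y)


def krr_predict(X_train, alpha, x_new, gamma=1.0):
    """Evaluate f(x_new) = sum_i alpha_i * k(x_new, x_i)."""
    return sum(a * rbf_kernel(x_new, xi, gamma) for a, xi in zip(alpha, X_train))


# Toy one-dimensional data (NOT the decitabine measurements).
X_demo = [[0.0], [1.0], [2.0], [3.0], [4.0]]
y_demo = [0.0, 1.0, 4.0, 9.0, 16.0]
alpha_demo = krr_fit(X_demo, y_demo, lam=1e-8)
print(krr_predict(X_demo, alpha_demo, [2.0]))
```

With a very small regularization weight the fit nearly interpolates the training points, while a large weight shrinks the predictions toward zero, which is the over-fitting control mentioned above.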

Decision tree regression (DTR)

A regression tree, or decision tree regressor38, uses the simulation inputs and outputs to build a structure whose nodes are either leaves (terminal nodes), each holding an estimated value, or internal nodes (decision nodes), each specifying a query on an input, with a branch and child for each possible outcome of the query. For continuous inputs, there are two branches, depending on whether the condition is true or not. The structure of the data is thus encoded at every node of the regression tree. To estimate the output for an unobserved data point, the inputs of that point are used to traverse the tree until a terminal node is reached. The estimated value is determined from the output values of the training instances that end up at that terminal node51.

At each node of the tree, an impurity measure is evaluated for every input feature, and the optimal split is the one that yields the largest reduction in impurity. For a particular input, the MSE of a split A is formulated as follows52:

$$MSE\left(A\right)={p}_{L}\cdot s\left({t}_{L}\right)+{p}_{R}\cdot s\left({t}_{R}\right)$$
(2)

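Here, tL and tR denote the sets of instances that fall on the left and right sides of the split, and pL and pR are the corresponding fractions of instances. Also, s(t) indicates the standard deviation of the N(t) target values, ci, of the instances within t: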

$$s(t)=\sqrt{\frac{1}{N(t)}\sum_{i=1}^{N(t)}{\left({c}_{i}-\overline{c\left(t\right)}\right)}^{2}}$$
(3)

Here, \(\overline{c(t)}\) is the average of the values in t. The split that minimizes mean square error across all input features for instances at each node of the regression tree is used at each node. Overfitting can occur in tree-based algorithms if the data is split too finely53,54,52.
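The split selection of Eqs. (2) and (3) can be sketched for a single input as follows. This is a minimal illustration, assuming that pL and pR are the fractions of instances on each side of the split and that candidate thresholds are the midpoints between consecutive distinct input values; the function names and the toy data are hypothetical and are not taken from Table 1.

```python
import math


def std_dev(values):
    """Standard deviation s(t) of the target values in a node, as in Eq. (3)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((c - mean) ** 2 for c in values) / n)


def split_score(x, y, threshold):
    """Weighted impurity of Eq. (2) for the split 'x <= threshold'."""
    left = [yi for xi, yi in zip(x, y) if xi <= threshold]
    right = [yi for xi, yi in zip(x, y) if xi > threshold]
    if not left or not right:
        return math.inf  # a split must leave instances on both sides
    n = len(y)
    return len(left) / n * std_dev(left) + len(right) / n * std_dev(right)


def best_split(x, y):
    """Pick the threshold with the lowest weighted impurity."""
    xs = sorted(set(x))
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    return min(candidates, key=lambda t: split_score(x, y, t))


# Toy data with two clearly separated groups (NOT the decitabine data).
x_demo = [1, 2, 3, 10, 11, 12]
y_demo = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8]
print(best_split(x_demo, y_demo))
```

A full regression tree would apply this search to every input feature at every node and then recurse on the two child subsets, stopping before the leaves become too small in order to limit over-fitting.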

Gaussian process regression (GPR)

Surrogate models, such as the Gaussian process (GP), provide predictions as well as the degree of uncertainty associated with those predictions. A GP is a collection of random variables, any finite number of which have a joint Gaussian distribution42. GPs can be regarded as an infinite-dimensional generalization of multivariate Gaussian distributions. An N-instance training set can be considered a single sample drawn from an N-variate Gaussian distribution; thus, it can be matched to a Gaussian process. Typically, the mean of this Gaussian process is set to zero.

We describe GP55 using a one-dimensional problem, for ease of exposition, with an N-instance training set [xi | i = 1,2,…,N] and the corresponding output values y = [y1, …, yN]. Two instances xi and xj in the training set are related to each other through the covariance function k(xi, xj). The squared-exponential function is employed here56:

$$k\left({x}_{i}, {x}_{j}\right)={\sigma }_{f}^{2}\mathrm{exp}\left(\frac{-{\left({x}_{i}-{x}_{j}\right)}^{2}}{2{l}^{2}}\right)$$
(4)

where \({\sigma }_{f}^{2}\) is the maximum allowable covariance and l is a length parameter that controls the extent of influence of each point. \({\sigma }_{f}^{2}\) should be set to a large value for functions covering a broad range of values. If data points xi and xj are close to each other, their output values are highly correlated, whereas if they are far apart, the value at one point does not influence the value at the other. Accordingly, the hyper-parameter l determines the smoothness of the interpolation.

Suppose we wish to use the training data to estimate the output \({y}_{*}\) at an unseen data point \({x}_{*}\). The outputs can be regarded as a sample from a multivariate Gaussian distribution:

$$\left[\begin{array}{c}y\\ {y}_{*}\end{array}\right]\sim N\left(0, \left[\begin{array}{cc}K& {K}_{*}^{T}\\ {K}_{*}& {K}_{**}\end{array}\right]\right)$$
(5)

where y denotes the vector of outputs at the N training data points, \({y}_{*}\) is the estimated output at \({x}_{*}\), and the sub-matrices are as follows:

$$K=\left[\begin{array}{cccc}k\left({x}_{1},{x}_{1}\right)& k\left({x}_{1},{x}_{2}\right)& \cdots & k\left({x}_{1},{x}_{N}\right)\\ k\left({x}_{2},{x}_{1}\right)& k\left({x}_{2},{x}_{2}\right)& \cdots & k\left({x}_{2},{x}_{N}\right)\\ \vdots & \vdots & \ddots & \vdots \\ k\left({x}_{N},{x}_{1}\right)& k\left({x}_{N},{x}_{2}\right)& \cdots & k\left({x}_{N},{x}_{N}\right)\end{array}\right]$$
(6)

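And:

$${K}_{*}=\left[\begin{array}{cccc}k({x}_{*},{x}_{1})& k({x}_{*},{x}_{2})& \cdots & k({x}_{*},{x}_{N})\end{array}\right]$$

$${K}_{**}=k({x}_{*},{x}_{*})$$

The conditional probability of \({y}_{*}\), that is, the output at the new data point, is formulated as:

$${y}_{*}|y\sim N\left({K}_{*}{K}^{-1}y, {K}_{**}-{K}_{*}{K}^{-1}{K}_{*}^{T}\right)$$

The mean of this distribution, \({K}_{*}{K}^{-1}y\), is the best estimate of \({y}_{*}\), and the variance indicates the degree of uncertainty in the estimate:

$$\mathrm{var}({y}_{*})={K}_{**}-{K}_{*}{K}^{-1}{K}_{*}^{T}$$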

The parameters l and σf of the Gaussian process regressor can be computed from the training subset using a maximum likelihood method. It is also feasible to incorporate a Gaussian noise component in the output variable; however, we have assumed the noise to be zero in the current research.
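The noise-free prediction described above can be sketched for a one-dimensional input as follows. This is a minimal illustration using the squared-exponential covariance of Eq. (4), with the mean taken as the standard GP estimate \({K}_{*}{K}^{-1}y\). The hyper-parameters are fixed by hand rather than fitted by maximum likelihood, the tiny `jitter` added to the diagonal is only a numerical safeguard assumed here, and the function names and toy data are hypothetical (not the data of Table 1).

```python
import math


def sq_exp_kernel(xi, xj, sigma_f=1.0, length=1.0):
    """Squared-exponential covariance, as in Eq. (4)."""
    return sigma_f ** 2 * math.exp(-((xi - xj) ** 2) / (2.0 * length ** 2))


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def gp_predict(x_train, y_train, x_star, sigma_f=1.0, length=1.0, jitter=1e-10):
    """Return (mean, variance) of y* at x_star for a zero-mean, noise-free GP."""
    n = len(x_train)
    K = [[sq_exp_kernel(x_train[i], x_train[j], sigma_f, length)
          + (jitter if i == j else 0.0)  # assumed numerical safeguard only
          for j in range(n)] for i in range(n)]
    k_star = [sq_exp_kernel(x_star, xi, sigma_f, length) for xi in x_train]
    k_ss = sq_exp_kernel(x_star, x_star, sigma_f, length)
    weights = solve_linear(K, y_train)  # K^-1 y
    v = solve_linear(K, k_star)         # K^-1 K_*^T
    mean = sum(k * w for k, w in zip(k_star, weights))
    var = k_ss - sum(k * vi for k, vi in zip(k_star, v))
    return mean, max(var, 0.0)  # clamp tiny negative round-off


# Toy data (NOT the decitabine measurements).
x_demo = [0.0, 1.0, 2.0, 3.0]
y_demo = [0.0, 0.8, 0.9, 0.1]
print(gp_predict(x_demo, y_demo, 1.5))
```

At a training point the predicted variance is essentially zero, and far from all training points it returns to \({\sigma }_{f}^{2}\), which is how the model signals where its estimates are reliable.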

Prediction of decitabine putative targets

The SMILES string of decitabine was obtained from PubChem (https://pubchem.ncbi.nlm.nih.gov/compound/Decitabine) and then submitted to the LigTMap server (https://cbbio.online/LigTMap/?action=home) to identify plausible targets from seventeen target classes and more than six thousand different proteins.

Results and discussion

Analysis of model outcomes

The three abovementioned models were applied to the collected dataset to build models of the drug solubility. The hyper-parameters of the models were optimized using Grid-Search57. More than 1000 distinct combinations were evaluated to obtain the optimal parameters for each model. Then, the models were tested in their optimal configurations, and their performance was evaluated.
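The idea behind a grid search can be sketched in a few lines of Python: every combination of candidate hyper-parameter values is scored, and the combination with the lowest error is kept. The sketch below is only illustrative. The grid values, the parameter names `alpha` and `gamma`, and the toy scoring function are assumptions, not the grids or the validation error used in this study.

```python
import itertools
import math


def grid_search(param_grid, score_fn):
    """Return (best_params, best_score), minimising score_fn over the full grid."""
    names = list(param_grid)
    best_params, best_score = None, math.inf
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)  # e.g. a validation error in practice
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(alpha, gamma):
    """Stand-in for a validation error; minimal at alpha = 1e-3, gamma = 10."""
    return (math.log10(alpha) + 3.0) ** 2 + (math.log10(gamma) - 1.0) ** 2


# Hypothetical logarithmic grid: 12 x 10 = 120 combinations.
grid = {
    "alpha": [10.0 ** k for k in range(-8, 4)],
    "gamma": [10.0 ** k for k in range(-5, 5)],
}
best_params, best_score = grid_search(grid, toy_score)
print(best_params, best_score)
```

In a real study the scoring function would train the model with the given hyper-parameters and return an error on held-out data, so that the selected configuration is not tuned to the training points alone.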

Three standard statistical metrics are used to assess and compare the performance of each model: R2, Mean Absolute Error (MAE), and Mean Absolute Percentage Error (MAPE). R2 and MAE are calculated as follows52:

$${R}^{2}=\frac{{({\sum }_{i=1}^{\text{n}}({Y}_{i,m}-{\overline{Y}}_{i,m})({Y}_{i,o}- {\overline{Y}}_{i,o}))}^{2}}{{\sum }_{i=1}^{\text{n}}{({Y}_{i,m}-{\overline{Y}}_{i,m})}^{2}{\sum }_{i=1}^{\text{n}}{({Y}_{i,o}- {\overline{Y}}_{i,o})}^{2}}$$
(7)
$$MAE=\frac{1}{n}\sum_{i=1}^{\text{n}}\left|{Y}_{i,m}-{Y}_{\text{i,o}}\right|$$
(8)

In these equations, n is the size of the data set, Yi,m is the estimated value, and Yi,o is the actual (observed) value. Likewise, \({\overline{Y}}_{i,m}\) is the average of the estimated values and \({\overline{Y}}_{i,o}\) is the average of the actual values. A comparison between the estimated and observed values is shown in Figs. 1, 2, and 3 for KRR, DTR, and GPR, respectively. The red dots indicate the test data, the blue dots the training data (estimated values), and the green line the observed values. Comparing these three figures clearly shows that the GPR method generalizes better than the other methods. The statistical results for all methods are listed in Table 2. All methods are capable of fitting and correlating the experimental data, which indicates that these models are suitable for application in the production of nanomedicines using supercritical technology. The best results are obtained with GPR, which fits the solubility data with R2 higher than 0.99.
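For reference, the metrics can be computed as in the following Python sketch. R2 follows Eq. (7) (the squared correlation between estimated and observed values) and MAE follows Eq. (8). No equation for MAPE is given above, so the form used here, the mean of |Ym − Yo| / |Yo| expressed as a fraction rather than a percentage, is an assumption. The function names and the toy numbers are illustrative only.

```python
def r_squared(y_model, y_obs):
    """Squared correlation between estimated and observed values, as in Eq. (7)."""
    n = len(y_obs)
    mean_m = sum(y_model) / n
    mean_o = sum(y_obs) / n
    cov = sum((m - mean_m) * (o - mean_o) for m, o in zip(y_model, y_obs))
    var_m = sum((m - mean_m) ** 2 for m in y_model)
    var_o = sum((o - mean_o) ** 2 for o in y_obs)
    return cov ** 2 / (var_m * var_o)


def mae(y_model, y_obs):
    """Mean absolute error, as in Eq. (8)."""
    return sum(abs(m - o) for m, o in zip(y_model, y_obs)) / len(y_obs)


def mape(y_model, y_obs):
    """Mean absolute percentage error as a fraction (assumed definition)."""
    return sum(abs(m - o) / abs(o) for m, o in zip(y_model, y_obs)) / len(y_obs)


# Toy values (NOT results from Table 2).
y_obs_demo = [1.0, 2.0, 4.0]
y_model_demo = [1.1, 1.9, 4.2]
print(r_squared(y_model_demo, y_obs_demo))
print(mae(y_model_demo, y_obs_demo))
print(mape(y_model_demo, y_obs_demo))
```

Note that R2 in the form of Eq. (7) measures linear agreement only, so it is worth reading it together with MAE and MAPE, which measure the size of the deviations.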

Figure 1 Observed vs estimated values (KRR) (Y: solubility/mole fraction).

Figure 2 Observed vs estimated values (DTR) (Y: solubility/mole fraction).

Figure 3 Observed vs predicted values (GPR) (Y: solubility/mole fraction).

Table 2 The statistical results of all models used in this study.

The validated GPR method, as the best-performing method, was applied to calculate the solubility data and to determine the influence of temperature and pressure on the solubility of decitabine in supercritical CO2. The 3D surface plot in Fig. 4 shows that the impact of temperature and pressure on the solubility of decitabine is significant, and the highest solubility is observed at the maximum values of T and P. The increase in solubility with temperature and pressure can be attributed to the change in solvent density and the resulting change in the solvation capacity of the solvent. The 2D graphs of solubility versus temperature and pressure are shown in Figs. 5 and 6, respectively. The optimum values calculated using the GPR model are listed in Table 3.

Figure 4 3D projection of inputs/outputs (GPR method) (T: temperature, K), (P: pressure, bar).

Figure 5 Trend of variable T (temperature, K) calculated using GPR model (Y: solubility, mole fraction).

Figure 6 Trend of variable P (pressure, bar) calculated using GPR model (Y: solubility, mole fraction).

Table 3 Optimized parameters using the GPR method.

Figure 7 evaluates the impact of pressure on the solubility of decitabine at different temperatures. As indicated, the solubility isotherms exhibit a crossover region. Indeed, temperature has a dual effect on drug solubility in SC-CO2. On the one hand, increasing the temperature raises the sublimation pressure of the drug, which increases solubility. On the other hand, increasing the temperature reduces the molecular packing and hence the density of SC-CO2, which has a negative effect on the solubility of decitabine. The pressure of 18 MPa (180 bar) is identified as the crossover pressure. At pressures between 12 and 18 MPa, the negative effect of the density reduction outweighs the favorable effect of the vapor pressure increase, so in this range an increase in temperature leads to a reduction in solubility. Above the crossover pressure (18 MPa), the solubility of decitabine increases significantly because the positive effect of the drug's vapor pressure dominates the negative effect of the density reduction. Therefore, in this pressure range, raising the temperature improves the solubility.

Figure 7 The impact of pressure on the solubility of Decitabine considering disparate temperatures.

Correlation of the solubility data with semi-empirical models

Figure 8a–d presents the correlation outcomes of the Decitabine-SC-CO2 system obtained with semi-empirical models. In this investigation, four principal semi-empirical density-based models (Sodeifian et al., K-J, Bartle et al. and Bian et al.) were considered for correlating the experimental solubility data of decitabine in SC-CO258,56,57,58,62. The adjustable parameters (a0, a1, a2, a3, a4 and a5), the average absolute relative deviation (AARD%) and R2 are listed in Table 4. The AARD values for the Sodeifian et al., K-J, Bartle et al. and Bian et al. models were 12.15%, 11.61%, 14.46%, and 13.25%, respectively. Comparison of the results shows that the K-J model is the best of the four, since it gives the lowest AARD (11.61%).

Figure 8 Comparison of correlation outcomes for Decitabine-SC-CO2 system using various semi-empirical models. (a) Sodeifian et al. (b) K-J, (c) Bartle et al. and (d) Bian et al.

Table 4 Correlation outcomes of Decitabine-SC-CO2 system obtained by various semi-empirical models.

Table 4 presents the correlation outcomes of Decitabine-SC-CO2 system obtained by semi-empirical models.

Additionally, using the LigTMap server we found more than one hundred predicted targets for decitabine. These targets are classified by disease target class as follows: 34 (29%) kinase, 30 (25%) transferase, 28 (24%) hydrolase, 15 (13%) tuberculosis, 5 (4%) H. pylori, 3 (2.5%) influenza, and 1 (0.8%) beta-secretase. A supplementary data file accompanying this work lists the targets together with their docking scores in the binding sites of the specified proteins. The PDB IDs of each protein are also included. The optimum binding of decitabine with these target proteins, the binding modes, the predicted affinities, and the docking scores were all obtained through the automated workflow of LigTMap. The results (supplementary data) reveal that decitabine has a ligand similarity score above 0.6 with deoxycytidine kinase and thymidylate kinase (TMK), in the kinase and tuberculosis target classes, with PDB IDs 3ipx and 1w2g, respectively. Decitabine showed a binding affinity above 7 in the influenza disease target class, for the target polymerase basic protein 2 (PDB IDs: 5efc, 4or6 and 4q46). Decitabine showed its best docking score, −7.007 kcal/mol, in the thymidylate kinase binding site (PDB ID: 1mrs) in the tuberculosis disease class. From these results we conclude that thymidylate kinase (tuberculosis) and polymerase basic protein 2 (influenza) are plausible targets for decitabine. Figure 9 illustrates the binding mode with thymidylate kinase (tuberculosis).

Figure 9 3D interactions and binding mode of decitabine drug with Thymidylate kinase (PDB ID: 1mrs).

Four semi-empirical models (Sodeifian et al., Bartle et al., K-J and Bian et al.) were considered for correlating the results of the solubility experiments. The precision of all applied methods was analysed through AARD% and R2. Comparison of the results shows that the K-J model is the best of the four, since it gives the lowest AARD (11.61%). Despite the good performance of the K-J model in predicting drug solubility, the GPR model performs better than K-J, as it has a higher R2.

Conclusion

Computational simulation of the solubility of decitabine in supercritical carbon dioxide was carried out in this study using three different machine learning models. We used a dataset of 32 data points with two inputs (P and T) to create the solubility models. In this dataset, Y (solubility, mole fraction) is the only output predicted by the models. Kernel Ridge Regression (KRR), Decision Tree Regression (DTR), and Gaussian Process Regression (GPR) were employed in this work for correlating the solubility data. Hyper-parameter tuning was used to optimize these models, and standard metrics were used to assess their performance. KRR, DTR, and GPR have R2 scores of 0.806, 0.891, and 0.998, respectively. The corresponding MAE values are 1.08E−04, 7.40E−05, and 9.73E−06. The MAPE values are 4.64E−01 for KRR, 1.63E−01 for DTR, and 5.06E−02 for GPR, which makes GPR the best option. In conclusion, the best model (GPR) shows that increasing both inputs generally raises the output. The best outcome is obtained at P = 400 bar, T = 3.38E+02 K, Y = 1.07E−03. Finally, the LigTMap workflow revealed the promiscuity of decitabine toward thymidylate kinase (disease class: tuberculosis) and polymerase basic protein 2 (disease class: influenza).

In this paper, the solubility of decitabine was evaluated at different pressures (120, 160, 200, 240, 280, 320, 360 and 400 bar) and temperatures (308, 318, 328, and 338 K). Four semi-empirical models (Sodeifian et al., Bartle et al., K-J and Bian et al.) were considered for correlating the results of the solubility experiments. The precision of all applied methods was evaluated through AARD% and R2. Comparison of the results shows that the K-J model is the best of the four, since it gives the lowest AARD (11.61%). Despite the good performance of the K-J model in predicting drug solubility, the GPR model performs better than K-J, as it has a higher R2.