Introduction

As an important step in drug design, the quantitative structure-activity relationship (QSAR) study has become one of the most active branches because it can improve the efficiency of drugs by computer simulation and provide ideas for designing new drugs1. The QSAR study is also important in computer science, chemistry, pharmacy and life sciences2. The efficacy of drugs is mainly achieved by activating the acidity coefficient, which is called pKa constant and denotes the capability of an acid to dissociate hydrogen ions3,4. The pKa value is an important parameter in drug design and determines pharmacological activity. The experimental measurement method of pKa value is relatively cumbersome and time-consuming. Therefore, it is necessary to establish an accurate and efficient pKa prediction model5,6.

Model establishment is one of the key steps in QSAR research. The traditional model establishment methods include linear regression and least square method7,8. The modern computing methods consist of support vector machines (SVM)9, artificial neural networks (ANN)10,11,12,13, and various intelligent algorithms14,15. Polanski16 proposed a model utilizing ANN and partial least squares (PLS) to study the relationship between molecular surface area and pKa value and predicted the pKa values of aromatic acids and alkyl acids. Luan17 proposed a pKa prediction model based on the heuristic method (HM) and radial basis function artificial neural network (RBF ANN) and obtained the better prediction performance. Previous studies confirmed that the ANN model had the better performance in QSAR modelling18,19, but the performance of ANN depended on its training algorithm. The training algorithm plays a decisive role in RBF ANN model and various evolutionary algorithms have been successfully applied in the training of RBF ANN20,21,22,23.

The selection of molecular descriptors largely determines the quality of QSAR model24,25,26. There are many selection methods of molecular descriptors, which can be mainly divided into two categories: traditional stepwise selection methods (including the PLS method and its variants) and the modern search algorithm based on optimization strategy27,28. The first category is simple, direct, and efficient, but it fails to achieve the global optimum, especially in the complex data sets. The second category is a global optimal method and shows significant advantages. It is easy to search the optimal solution and suitable to deal with complex large data sets. Therefore, the second category has become one of the hotspots29.

Several commonly used evolutionary algorithms, such as genetic algorithm, particle swarm optimization algorithm, ant colony algorithm, and firefly algorithm, have been successfully applied in the modeling of QSAR and the selection of molecular descriptors30,31,32. However, evolutionary algorithms have many shortcomings, such as premature convergence, and slow local search and the developed QSAR models are unsatisfactory. Therefore, an improved PSO algorithm, called CSAPSO-EDCD algorithm, based on population entropy diversity, convergence/divergence strategy and the self-adaptive adjustment strategy of weight factor was proposed. The CSAPSO-EDCD algorithm was applied in RBF ANN training and the selection of molecular descriptors in order to develop an efficient and accurate hybrid intelligent QSAR model for predicting the pKa values of neutral and basic drugs and exploring other QSAR models based on protonation changes upon the binding33,34,35,36,37,38,39 and molecular fingerprint similarity search40,41,42,43.

Theories and Methods

The model proposed in this paper involves several theories: RBF ANN, PSO and its improved algorithm.

RBF ANN

Radial basis function artificial neural network (RBF ANN) is one of the most widely used forward neural network models44,45. It has three layers of network structures: input layer, hidden layer, and output layer. The activation function adopted in the paper is the gauss function and defined as:

$${{g}}_{{i}}({{x}}_{{k}})=\exp (-\frac{{\Vert {{x}}_{{k}}-{{c}}_{{i}}\Vert }^{2}}{{\sigma }_{{i}}^{2}}),$$
(1)

where \({x}_{k}(1\le k\le n)\) is the kth output vector; \({c}_{i}(1\le i\le c)\) is the basis function center; σ i is the spreading constant; n is the number of samples; c is the number of hidden nodes. The network output is defined as:

$$O({x}_{k})=\sum _{i=1}^{c}{w}_{i}\,{g}_{i}({x}_{k}),$$
(2)

where w i is the connection weight of the ith hidden node.

However, the artificial neural network has many problems to be solved. For example, the performance is directly related to the optimization of the network weight. The training process of RBF ANN can be considered as the optimization process of function center, spreading constant and connection weight, namely, c i , σ i , w i .

Improved PSO Algorithm

Standard PSO algorithm

PSO is a widely applied population evolutionary algorithm proposed by Eberhart and Kennedy46,47 and characterized by fast convergence, simple parameter adjustment and easy realization. The standard PSO algorithm updates its own speed and position, as expressed in Eqs (1) and (2):

$${{\rm{v}}}_{{\rm{i}},{\rm{d}}}^{{\rm{k}}+1}={\omega }{{\rm{v}}}_{{\rm{i}},{\rm{d}}}^{{\rm{k}}}+{{\rm{c}}}_{1}({{\rm{p}}}_{{\rm{i}},{\rm{d}}}^{{\rm{k}}}-{{\rm{x}}}_{{\rm{i}},{\rm{d}}}^{{\rm{k}}})+{{\rm{c}}}_{2}({{\rm{p}}}_{{\rm{g}},{\rm{d}}}^{{\rm{k}}}-{{\rm{x}}}_{{\rm{i}},{\rm{d}}}^{{\rm{k}}});$$
(3)
$${{\rm{x}}}_{{\rm{i}},{\rm{d}}}^{{\rm{k}}+1}={{\rm{x}}}_{{\rm{i}},{\rm{d}}}^{{\rm{k}}}+{{\rm{v}}}_{{\rm{i}},{\rm{d}}}^{{\rm{k}}+1};$$
(4)

where i = 1, …, m; ω is the inertia weight factor, which controls the inertia of the particles and possesses the capability of expanding search space; C1 and C2, which are the learning factors, represent the statistical acceleration weight when each particle arrives at the extreme-value position; \({{\rm{v}}}_{{\rm{i}},{\rm{d}}}^{{\rm{k}}}\) and \({{\rm{x}}}_{{\rm{i}},{\rm{d}}}^{{\rm{k}}}\) respectively denote the velocity and position of the i-th particle in the d-dimensional space at k-th iteration; \({{\rm{p}}}_{{\rm{i}},{\rm{d}}}^{{\rm{k}}}\) is the local best position of i-th particle in the d-dimensional space; \({{\rm{p}}}_{{\rm{g}},{\rm{d}}}^{{\rm{k}}}\) is the global best position of the population upon arriving at the d-dimensional space.

However, PSO algorithm cannot ensure the optimal solution in each execution. In order to obtain the optimization network parameters, an improved PSO algorithm, called CSAPSO-EDCD, based on population entropy diversity, convergence/divergence strategy and the self-adaptive strategy, has been developed in this paper and then applied in the optimization of the function center, spreading constant and connection weight of RBF ANN.

Chaotic self-adaptive strategy

Thus, in this study, the chaotic self-adaptive PSO algorithm, or CSAPSO algorithm, is deduced by applying Lorenz chaos equations and self-adaptive strategies in the adjustment of the learning factors and inertia weight in the PSO algorithm. The inertia weight factor, ω, is changed to Eq. (5).

$$\omega ={\omega }_{\max }-\mathrm{Pgbest}(k)/{{\rm{Plbest}}}_{{\rm{ave}}}-({\omega }_{\max }-{\omega }_{\min })\times k/{k}_{\max },$$
(5)

where ωmax and ωmin respectively denote the maximum and minimum inertia weights; Pgbest(k) denotes the global best fitness in the k-th iteration; Plbestave denotes the average local best fitness; kmax denotes the maximum number of iterations; and k denotes the current iteration.

The learning factors C1 and C2 are obtained with the chaotic sequences generated by the Lorenz equations in Eq. (6).

$$\{\begin{array}{c}\frac{dx}{dt}=-a(x-y)\\ \frac{dy}{dt}=rx-y-xz\\ \frac{dz}{dt}=xy-bz\end{array},$$
(6)

where a, b, and r are the positive control parameters. When a, b, and r are respectively set to be 10, 8/3, and 28, the learning factors (i.e, c1 and c2) are in a chaotic state and defined as follows:

$$\{\begin{array}{c}{c}_{1}=x(t)\\ {c}_{2}=y(t)\end{array}.$$
(7)

Chaotic variables are characterized by randomness, ergodicity, and regularity. By changing the characteristics of chaotic variables, the algorithm can simultaneously increase population diversity and solve the premature convergence problem.

Population entropy diversity strategy

Entropy is used to describe the state of a system and indicate the uncertainty in the system or the degree of confusion. The level of entropy can directly reflect the degree of chaos of a system. The higher the entropy is, the more chaotic the system is. On the basis of the definition of entropy in information theory, the population entropy is used to describe the population diversity in the PSO algorithm.

Definition

Population entropy. Population size is set to be N. Q disjoint subgroups \(\{{s}_{1}^{t},{s}_{2}^{t},\mathrm{...},{s}_{Q}^{t}\}\) exist in the t-th iteration. The number of particles in each subgroup is described as \(\{|{{\rm{s}}}_{1}^{{\rm{t}}}|,|{{\rm{s}}}_{2}^{{\rm{t}}}|,\mathrm{...},|{{\rm{s}}}_{{\rm{Q}}}^{{\rm{t}}}|\}\). Thus, the population entropy in the t-th iteration is defined as follows:

$${E}_{t}=-k\sum _{i=1}^{Q}{p}_{i}^{t}{\mathrm{log}}_{2}({p}_{i}^{t}),\,{\rm{where}}\,{p}_{i}^{t}=\frac{|{s}_{i}^{t}|}{N}.$$
(8)

The population entropy reflects the distribution situation of the population particles. When the population entropy is higher, the particles are in a more chaotic state. The more uniformly distributed the particles are in space, the better the population diversity is. Conversely, the lower the population entropy is, the less chaotic the population is. In this situation, the population particles may converge in the nearest region with one or few extreme points and the population diversity is poor.

In order to evaluate population diversity with population entropy, maximum threshold and minimum threshold are respectively set as E high and E low . If the population entropy is between E low and E high , then the population is in the equilibrium state. If the population entropy is higher than E high , the population is in a state of exploration. If the population entropy is lower than E low , the population is in a local search state.

Convergence/divergence strategy

In the convergence and divergence strategies, the necessary conditions for the convergence and divergence of the particles should be determined. In a previous study, the conditions of convergence depend on the inertia weight and learning factor48,49. When the inertia weight and the sum of two learning factors are less than 1 and 3, respectively, the particles are always convergent. For example, when the inertia weight and the sum of the two learning factors are respectively 0.65 and 0.1, the particles rapidly converge. When the inertia weight is greater than 1, the particles are always divergent. The higher the inertia weight is, the faster the spread rate of the particles is. In the study, Eqs (9) and (10) respectively express the control strategies of convergence and divergence.

$$\{\begin{array}{c}\omega =0.65|\sin (\frac{1}{{E}_{t}})|\\ \phi =0.1|\sin (\frac{1}{{E}_{t}})|\end{array};$$
(9)
$$\omega =1+\lambda \phi \frac{1}{{E}_{t}};$$
(10)

where ω indicates the inertia weight; E t indicates the population entropy; φ is the sum of c1 and c2; λ is the positive divergence coefficient and set to be 2.

CSAPSO-EDCD algorithm

The CSAPSO-EDCD algorithm is deduced by combining the convergence/divergence and chaotic self-adaptive strategies on the basis of the diversity of the population entropy. When the population entropy is higher than the set maximum threshold, E high , and the particles are in the exploration state, the CSAPSO-EDCD algorithm uses the convergence strategy to induce the particle to move to group center. When the population entropy is lower than the set minimum threshold, E low , and the particles enter the state of exploitation, the algorithm uses the divergence strategy to force the particle to move away from group center. If the population entropy is between E low and E high , the existing search strategy is maintained. The procedure of CSAPSO-EDCD algorithm is shown as follows:

Step 1: Initialization. Initialize the population size, the maximum and minimum numbers of iterations, the maximum and minimum population entropy, etc.

Step 2: Fitness evaluation. Calculate the fitness value of each individual.

Step 3: Update the extreme value.

Step 4: Population entropy evaluation.

if (the population entropy is greater than the max population entropy E high ) then

Particles with the convergence strategy

else if (the population entropy is less than the min population entropy E low ) then

Particles with the divergent strategy

else

Particles with the chaotic self-adaptive strategy

end if

Step 5: Finished. Confirm whether the iterative conditions are satisfied. If they are satisfied, then the evolution is finished, otherwise jump to Step 2 to continue.

Figure 1 shows the CSAPSO-EDCD algorithm flowchart.

Figure 1
figure 1

Flowchart of CSAPSO-EDCD algorithm.

Hybrid Intelligent Model

The relationship between the output and input of the RBF ANN is defined as follow:

$$y=f({w}_{h,o},{B}_{h,o},{C}_{basis-fun}),$$
(11)

where W h,o and B h,o \((1\le h\le c)\), \((1\le o\le p)\) are respectively the weight matrix and deviation matrix of the hidden node h and the output node o; Cbasisfunction is the base function center; is the number of output nodes.

Three parameters of RBF ANN, namely, c i , σ i , w i , are optimized through the above CSAPSO-EDCD algorithm. Thus, the structure of the particle is defined as:

$$particle(i)=[{W}_{h,o},{B}_{h,o},{C}_{basis-fun}]$$
(12)

RBF ANN was trained by the CSAPSO-EDCD algorithm and the pKa prediction model was developed and called CSAPSO-EDCD RBF ANN.

Experimental

Experimental Database

The experimental database of neutral and basic drugs from the previous study50 is provided in Table 1. The database, which consists of 74 sets of data, is divided into two subsets by using the random selection method in order to obtain a more reasonable prediction model, and the two subsets denote a training set and a testing set, respectively. The training set is used to establish the model and the testing set is applied to test the performance of the model.

Table 1 Experimental database.

In this study, the 70% of the data in the database (52 sets of data) were used for training; the remaining data (22 sets of data) were applied to test the model.

Molecular Descriptors

To establish the relationship between the pKa value and the molecular structure, the molecular structure was indirectly characterized by molecular descriptors. In this paper, molecular descriptors were generated by the following steps. Firstly, the molecular structure was established with Chemdraw UItra 7.0. Secondly, the established molecular structure was optimized with Hyper Chem 7.5. Then, the molecular descriptors were calculated in CODESSA software and 686 molecular descriptors were obtained. In order to reduce the complexity and obtain the most relevant descriptors for the pKa value, the CSAPSO-EDCD algorithm is adopted in the selection of molecular descriptors as follows:

  • Step 1: Population initialization. Initialize population size, the max and min number of iterations, etc. and set the population individual as molecular descriptors. Table 2 shows the parameters of the CSAPSO-EDCD algorithm.

    Table 2 Parameters of the CSAPSO-EDCD algorithm.
  • Step 2: Fitness evaluation. Calculate the fitness value of molecular descriptors of each individual.

  • Step 3: Update the molecular descriptors. The velocity and position of the particles are updated with the local and global extreme values and the next population of molecular descriptor will be obtained.

  • Step 4: Update the fitness of molecular descriptor.

  • Step 5: Finished. Confirm whether the iterative conditions are satisfied. If they are satisfied, then the evolution is finished, otherwise jump to Step 2 to continue.

Finally, 5 molecular descriptors are selected by CSAPSO-EDCD algorithm (Table 3).

Table 3 Molecular descriptors selected by CSAPSO-EDCD algorithm.

Model Establishment

The CSAPSO-EDCD RBF ANN model was developed according to the molecular descriptors selected by CSAPSO-EDCD algorithm and consisted of 3 layers: the input layer, hidden layer, and output layer. The input layer has 5 input parameters representing the 5 molecular descriptors, the relative number of N atoms, the Randic index (order 3), the RNCG relative negative charged (QMNEG/QTMINUS) [Quantum-Chemical PC], the RNCS relative negative charged SA (SAMNEG*RNCG) [Zefirov’s PC], and the maximum net atomic charge. The output layer has only 1 output parameter representing the pKa value.

Two methods are commonly used to determine the number of nodes in hidden layer: the formula method and the heuristic method. In this study, the two methods were adopted. First, the number of the hidden nodes was calculated through the formula method (2*sqrt (m*n) + 1, where m and n denote the numbers of input nodes and output layer nodes, respectively). Then, the heuristic method was adopted to confirm the optimal number of hidden nodes. There were 5 input nodes and 1 output node in this study. Therefore, 5 hidden nodes were obtained through the formula method. It is assumed that the number of nodes in hidden layer is explored from 3 to 13. Figure 2 shows the relationship between the prediction error and the number of hidden nodes.

Figure 2
figure 2

Relationship between MSE and the number of hidden nodes.

As shown in Fig. 2, with the increase in the number of hidden nodes, MSE decreases firstly and then increases. When the number of hidden nodes is 6, the MSE reaches its minimum value and the structure of the model is optimal. Therefore, the CSAPSO-EDCD RBF ANN model structure is 5-6-1.

Model Evaluation

The prediction capabilities of different models are evaluated in terms of absolute average relative deviation (AARD), root mean square error of prediction (RMSEP), and squared correlation coefficient (R2). The three indicators are defined as follows:

$$AARD=\frac{1}{N}\sum _{i=1}^{N}\frac{|{\overline{y}}_{i}-{y}_{i}|}{{y}_{i}};$$
(13)
$$RMSEP=\sqrt{\frac{1}{N}\sum _{i=1}^{N}({y}_{i}-{\bar{y}}_{i}{)}^{2}};$$
(14)
$${R}^{2}=\frac{{[\sum _{i=1}^{N}({y}_{i}-{y}_{ave})({\bar{y}}_{i}-{\bar{y}}_{ave})]}^{2}}{\sum _{i=1}^{N}({y}_{i}-{y}_{ave}{)}^{2}\sum _{i=1}^{N}({\bar{y}}_{i}-{\bar{y}}_{ave}{)}^{2}};$$
(15)

where N is the number of data samples; is the predicted value; y i denotes the experimental value; y ave denotes the average of the experimental values; \({\bar{y}}_{ave}\) denotes the average of the predicted values.

Results and Discussion

Results of the proposed model

Experiments were performed in Windows 7 SP1 64-bit operating system (4.00 GB of memory and 4 Intel (R) Core(TM) i5-4460 CPU @ 3.20 GHz processors). Through Matlab 2010a software programing, a 5-6-1 CSAPSO-EDCD RBF ANN model was proposed to predict the pKa values of 74 neutral and basic drugs. The model was trained with the 52 data points in the training set and tested with the 22 data points in the testing set. Figure 3 shows the correlogram between the experimental data and the predicted values. The straight line denotes the experimental data, whereas the stars denote the predicted values in this paper. The vertical distance between the star-shaped point and the straight line indicates the absolute error between the predicted value and the experimental value.

Figure 3
figure 3

Correlations between predicted values and experimental data in the training set.

As shown in Fig. 3, in the training set, the prediction values of the model are close to the experimental data. The vertical distances between the star-shaped point and the straight line show that the prediction error of the model is small and that the prediction accuracy is high. Training graph indicates that the training effect of model is better. Figure 4 shows the correlogram between the experimental data and the predicted value in the testing set. The predicted values of the model are consistent with the experimental data in the testing set. Table 4 shows the statistical parameters of the proposed model.

Figure 4
figure 4

Correlation between predicted values and experimental data in the testing set.

Table 4 Statistical parameters of the proposed model.

In the two subsets, the model shows the better comprehensive performance. According to the statistical data, the prediction effect of the proposed model is good and the prediction error is small. Therefore, the comprehensive performance is good. The prediction accuracy and correlation analysis showed that the model had the good prediction performance (Table 4) and the above results confirmed that the prediction performance of the model was excellent.

Validation analysis with other test databases

Furthermore, in order to verify the robustness and scalability of the proposed model, another testing database containing 20 data points was additionally established for the performance validation. The database from previous studies51,52 is provided in Table 5.

Table 5 Additional testing database.

Figure 5 shows the correlation between the predicted pKa data and experimental pKa data. The predicted pKa data obtained by the proposed model were well consistent with experimental pKa. Table 6 displays the prediction performance verified with the testing database. The results indicated the superior prediction performance of the proposed model.

Figure 5
figure 5

Correlation between predicted values and experimental data in testing database.

Table 6 Statistical results of the proposed model in testing database.

The data in this testing database were not trained by the prediction model. The testing results indicated that the proposed model had the good prediction performance with the better robustness and scalability.

Comparison of the Proposed Model against Other Models

In this paper, to verify the performance of the CSAPSO-EDCD RBF ANN model, we chose two congeneric models (PSO RBF ANN and RBF ANN model) as the comparison models and each model was tested with the testing set. Figure 6 shows the correlation between prediction and experimental values of comparison models.

Figure 6
figure 6

Correlation between prediction and experimental values of the comparison models.

The vertical distances between the prediction data points and experimental data points showed that the prediction performances of the three models were increased according to the order: RBF ANN, PSO RBF ANN, and CSAPSO-EDCD RBF ANN. PSO RBF ANN is slightly better than RBF ANN. Figure 7 shows the residual curve between the experimental and predicted values of each comparison model. Table 7 shows the statistical parameters of the comparison models.

Figure 7
figure 7

Residual curve for each comparison model.

Table 7 Statistical results of each comparison model.

According to the statistical results, the accuracy of the CSAPSO-EDCD RBF ANN model is good. Execution time of the CSAPSO-EDCD RBF ANN model is close to that of the RBF ANN model, and the CPU utilization is less than that of others. In fact, the intervention of intelligent algorithm is bound to consume more computation time. However, the computation time is acceptable because the improved PSO algorithm enhances the training and prediction performances of the model.

Discussion

Discussion of molecule descriptor selection using CSAPSO-EDCD algorithm

Based on the results of molecular descriptor selection, 5 molecular descriptors were adopted based on CSAPSO-EDCD algorithm.

As for the constitutional descriptors, the relative number of N atoms, has a great influence on the density of electron cloud. The relative number of N atoms is generally proportional to the electron cloud density. Therefore, the polarity of the positive and negative charges in the molecule becomes larger and the pKa value is smaller. The relative number of N atoms can be used to characterize the constituent of the molecular structure.

As for the topological descriptor, the Randic index (order 3) indicates the size and shape of a molecule, or the degree of branching, and shows the molecular dispersion force. The molecular volume increases with the increase in the dispersion force of the molecule, thus leading to the decrease in the pKa value. The Randic index (order 3) can better characterize the topological structure of the molecule.

As for the electrostatic descriptor, the RNCG relative negative charged (QMNEG/QTMINUS) [Quantum-Chemical PC] and RNCS Relative negative charged SA (SAMNEG * RNCG) [Zefirov’s PC] depend on the distribution of the calculated electric charge on the molecule. The negative coefficient of relative negative charge shows an inverse relationship with pKa value. In terms of the relative negative charge surface area, the probability of the positive ion to replace the proton shows an inverse relationship with the most negative atom solvent contact area and pKa value. The relative negative charge and its surface area can be used to characterize the molecular electrostatic parameters.

As for the quantum chemical descriptors, the maximum net atomic charge is related to the polarization action and directly proportional to the pKa value. Simultaneously, it can also be used to characterize the quantum chemical structure of molecules.

According to the results of molecular descriptors, the molecular descriptors selected by the CSAPSO-EDCD algorithm can better characterize the molecular structure and reflect the close relationship between the structures of various drug compounds and the pKa value.

Discussion of the model of CSAPSO-EDCD RBF ANN

The CSAPSO-EDCD algorithm can effectively overcome the shortcomings of the PSO. It can provide an efficient training algorithm for RBF ANN through the combination of chaotic self-adaptive, the diversity of population entropy and the convergence/divergence strategies. The good performance is ascribed to the CSAPSO-EDCD algorithm for selecting molecular descriptors and training RBF ANN.

The QSAR model based on the hybrid intelligent method has the less computation load and good prediction performance. If the molecular structure is unknown, the structure activity relationship can be predicted accurately and effectively. However, the model is established based on data training and obviously affected by the training data. Moreover, the physical meaning of the structure-activity relationship is one of the most important challenges.

Conclusions

The CSAPSO-EDCD RBF ANN model can accurately predict the pKa values of various drug molecules. The model shows the good performance in predicting the pKa values of neutral and basic drugs and its prediction accuracy and correlation are higher. The model can provide a reference for QSAR modeling.

The molecular descriptors selected by the CSAPSO-EDCD algorithm can better characterize the molecular structure and provide ideas for the selection of molecular descriptors in QSAR.