Introduction

Acinetobacter species belong to the group of aerobic gram-negative bacteria that are non-motile, non-fermentative, catalase positive, oxidase negative encapsulated coccobacilli, having a DNA G+C content of 39 to 47 mol1,2. Taxonomically, scientists have identified 68 validated species in the genus Acinetobacter, with numerous others yet to be delineated into species3,4,5. Many Acinetobacter species are found naturally in different environments, including soil, water, air, wastewater, fomites, human skin, animals, and even on plants6,7,8. Some species can utilise different substrates, such as amino acids, carbohydrates, organic acids, and hydrocarbons, while some can secrete industrial enzymes like lipase and protease9,10. However, few species are human opportunistic pathogens. For instance, Acinetobacter baumannii is a well-known notorious species in hospital settings that cause life-threatening infections such as pneumonia, respiratory and urinary tract infections, septicaemia, and wound infections, among others, especially in immune-compromised patients11,12,13.

Acinetobacter species are widely spread via the environmental milieu and may alarmingly spread antimicrobial resistance genes in the environment14,15. In addition, wastewater treatment plants (WWTPs) feed by hospital and municipal wastewater inflows have been reported to contribute multidrug-resistant (MDR), and extensively drug-resistant (XDR) Acinetobacter isolates to their effluents receiving waterbodies compared with other sources15,16. Discharging WWTP effluents increases the prevalence of Acinetobacter in the receiving river waterbodies and promotes antimicrobial resistance and transmission to irrigated vegetables15. The transmission of Acinetobacter spp. (especially A. baumannii)—with high antimicrobial resistance and case fatality ratio—onto fresh produce has been demonstrated and reviewed by Carvalheira et al.17. Acinetobacter species with different resistant capabilities ranging from MDR to XDR have been isolated in fresh fruits and vegetables (apples, cabbages, melons, cauliflowers, peppers, mushrooms, lettuce, cucumbers, bananas, radishes, sweet corn carrots, potatoes, peach, pear, strawberry, apple, celery, tomato, and radish) at a density up to 50–1000 CFU/g18 in Hong Kong19, France20, Nigeria21, Lebanon22, Portugal23 and agricultural environment in Algeria24. Furthermore, waterbodies especially rural rivers for instance, support recreational use of considerably high levels by people incognizant of the inflow/inputs of WWTP effluents and the influx of multidrug-resistant pathogens of public health concern including Acinetobacter25.

The routine experimental determination and identification of Acinetobacter species and other bacteria in all matrices (water, food, and clinical samples, etc.) using most probable number, direct plate count, adenosine triphosphate testing, and membrane filtration methods are usually laborious, repetitive, time-consuming (incubation period), and cost-intensive endeavours that required expert knowledge which might not be readily available in most settings. Therefore, there is an urgent need for rapid, reliable, and cost-effective means that required no or low technical know-how to assess Acinetobacter density (AD) in waterbodies and other matrices to ensure short turnaround time necessary to make informed microbiological quality decisions. It is hypothesized that AD in waterbodies could be predicted accurately and dependably by using machine learning intelligence frameworks that depend upon the dynamic’s relationship between AD based on the afore determination methods and physicochemical variables of waterbody and other matrices in a low-cost and time-effective way. Thus, an artificial intelligence system for AD determination in waterbodies receiving WWTP effluents, which are subsequently used as irrigation source waters (ISW), would be an invaluable preventive option for immediate and future public health challenges.

The main merits of ML models lie in their capacity to overcome problems associated with traditional statistical models in capturing and predicting multidimensional interactions in large data by “learning” deep patterns26. ML frameworks and SAIS allow proactive management of events rather than reactive. Thus, MLs and SAIS are finding increasing applications in many sectors, including medicine, precision farming, environmental management, water purification, Vibrio abundance on microplastics, wastewater treatment, watershed typologies and stormwater quality and epidemiology prediction26,27,28,29,30 and the application is endlessly expanding daily.

Therefore, the present study aimed at predicting/determining AD in waterbodies (receiving hospital, municipal and WWTP effluents) using ML without the repetitive, laborious, cost-intensive, and time-consuming laboratory routines to reduce the turnaround time essential to make informed microbiological quality decisions (e.g., for irrigation use and other purposes).

Materials and methods

Sample collection and in-situ determination of physicochemical data

Water samples were collected using grab sampling technique from the Great Fish River, Keiskamma River and Thyume River, serving as receiving waterbodies for municipal and hospital wastewater effluents (MHWE) discharge at one or more points along their courses in the Eastern Cape Province, South Africa. At least, five strategic sampling locations based on socioeconomic importance (e.g., fishing, swimming, nearness to wastewater treatment plants, farming, pasture, irrigation, dam etc.) of each river were selected for sample collection. At the sampling sites, water temperature (TEMP), pH, total dissolved solids (TDS), electrical conductivity (EC), salinity (SAL), and dissolved oxygen (DO) were determined in-situ using a standard multi-parameter device (Hanna, model HI 9828) instrumental protocol. In addition, the rivers’ turbidity (TBS) was assessed using a turbidimeter (HACH, model 2100P). For microbiological analysis and biochemical oxygen demand (BOD) measurement, midstream water samples (25–30 cm depth) were collected at the same sampling sites in three replicates into sterile glass and amber bottles, respectively and stored in iceboxes and transported to the laboratory for analysis with 6 h of collection31. After five days of incubation of samples in amber bottles, the BOD of the samples was determined using a biochemical oxygen demand meter (HACH, HQ 40 days)31. Detailed sampling strategy, sampling points’ description, and study area maps were as described in our previous study32.

Acinetobacter data acquisition

The density of Acinetobacter species in the water samples was estimated via membrane filtration31. Briefly, 100 ml of serially diluted water samples were filtered in three independent iterations using a Ø47 mm 0.45 μm pore-sized cellulose membrane31. These membranes were aseptically placed onto freshly prepared Acinetobacter CHROMagar plates containing selective supplements (CHROMagar, Paris, France) per the manufacturer’s instruction. The plates were incubated at 37 °C for 24 h. All Acinetobacter colonies presented as red colouration on CHROMagar plates post-incubation was counted and log transformed (log CFU/100 mL). All isolates were purified, validated as oxidase negative, and assessed by Acinetobacter-specific polymerase chain reaction. Fifty per cent (50%) of glycerol stocks of the pure culture was prepared and stored at – 80 °C.

Model development

Pre-processing and modelling procedure

The datasets were first subjected to explanatory and bivariate Pearson's correlation (r) [Eq. (1)] analyses. The estimation of 95% confidence intervals (95% CI) of the r-value in bivariate correlation analysis was based on Fisher's r-to-z transformation with bias adjustment [Eq. (2)]. To avoid multicollinearity, where the r-value between two variables ≥ 0.99, one of them was dropped randomly in subsequent models (see Table 2). Any of the two variables can be used in the implementation of the models. Also, for models’ implementation, the datasets were centre scaled such that the mean = 0 and the square root of the variance = 1 for variables. The dataset for DTR was not scaled.

$$r=\sum_{i=1}^{h}({u}_{i}-\overline{u })({w}_{i}-\overline{w })/\sqrt{\sum_{i=1}^{h}{({u}_{i}-\overline{u })}^{2}}\sqrt{\sum_{i=1}^{n}{{(w}_{i}-\overline{w })}^{2}} $$
(1)
$$z=\mathrm{arctanh}\left(r\right)= 1/2\mathrm{log}((1+r)/(1-r))$$
(2)

where r is a Pearson’s correlation coefficient with possible values from − 1 to 1 inclusive. Here, u and w represent a pair of PVs and h is the sample size.

Acinetobacter density (AD) was modelled as a dependent variable of the rivers’ physicochemical variables (PVs). Hence, the conditional expected (CE) AD value at instances of PVs consisting of a vector of TEMP, DO, BOD, TSS, SAL, and pH is derived as \({\mathrm{CE}}_{AD|PVs}(AD)\). Thus, the estimation of the mean AD can be constructed as Eq. (3).

$${\mathrm{CE}}_{AD|PVs}(AD)\approx f\left(PVs\right).$$
(3)

Equation (1) was implemented via 18 regression ML algorithms that have the robust capability to fit multidimensional variables of ordinal/continuous outcome, including linear regression with stepwise selection (LRSS), an RF, XGB, SVR, linear regression (LR), a gradient boosted machine (GBM), neural network (NNT) (6–6–1 network with 49 weights multiple; decay = 0.1), a KNN (k-nearest neighbour), M5P, a boosted regression tree (BRT), a Cubist regression, a decision tree (DTR), multivariate adaptive regression splines (MARS), ANN [with one 6-node hidden layers (ANN6), extreme learning machine (ELM), two 4- and 2- node hidden layers (ANN42), and two 3- and 3-node hidden layers (ANN33), and elastic net (ENR)]. The dataset (540 observations, 6 variables after explanatory feature selection) was split into a learning subset (70%) for the estimate of models’ coefficients and a validation subset (30%) for model substantiation. In all the ML implementations of Eq. (1), ten different learning-validation dataset pairs were generated via tenfold cross-validation accompanied by 3 repeats and 10 tune-lengths. Optimal hyper-parameters were derived and selected through a grid search algorithm. Models’ hyper-parameters are provided in detail in the supplemental material. Detailed discussion on the strengths and weaknesses and previous application of the various algorithms could be found elsewhere and their documentation.

The explanatory rendition of all variables contributions in the models was according to Eq. (4):

$$f\left(w.\right)= {t}_{0} +\sum_{j=1}^{p}t\left(j,w.\right),$$
(4)

where t(j, w.) denotes the jth variable contribution measure to the model’s prediction at instance w and t0 is the average model prediction33.

Assessment of ML model’s performance

The MLI algorithms model’s performance was determined against experimental data based on Eqs. (5)–(8):

$$\mathrm{Mean \, squared}-\mathrm{error}: MSE\left(f,\underline{U}, \underline{w}\right)=\frac{1}{h}\sum_{i}^{h}{\left({\widehat{w}}_{i}-{w}_{i}\right)}^{2}= \frac{1}{h}\sum_{1}^{h}{r}_{i}^{2}$$
(5)
$$\mathrm{Root}-\mathrm{mean}-\mathrm{squared \, error}: RMSE\left(f,\underline{U}, \underline{w}\right)= \sqrt{MSE\left(f,\underline{U}, \underline{w}\right)}$$
(6)
$${R}^{2}\left(f,\underline{U}, \underline{w}\right)=1- \frac{MSE\left(f,\underline{U}, \underline{w}\right)}{MSE\left({f}_{0},\underline{U}, \underline{w}\right)}$$
(7)
$$\mathrm{Median \, absolute \, deviation}: MAD\left(f,U, \underline{w}\right)=median\left(\left|{r}_{1}\right|, \ldots , \left|{r}_{n}\right|\right).$$
(8)

where h = number of the sample; f0(): baseline model; ri: residual for the ith observation, U: matrix of PVs; \(\underline{w}\): vector of AD; \(f\left(\widehat{\underline{\theta }},\underline{U}\right):\) model based on the training dataset; \(\widehat{\underline{\theta }}:\) estimated values of the model’s coefficients; and \({\widehat{\underline{w}}}_{i}:\) model’s prediction equivalent to \({\mathrm{w}}_{i}\).

RMSE was further employed in assessing mean dropout loss for variable importance following 1000 permutation34,35.

Models’ sensitivity analysis

Residual diagnostics and partial-dependence profiles of PVs on the predicted AD was generated to assess the model’s sensitivity. The partial-dependence profile of a model f() (i.e., anticipated/predicted AD value at an instance by the model) and the outcome variable Uj set at s (over the empirical/marginal distribution of U-j (h), i.e., the collective distribution of all other PVs without Uj ) is created according to Eqs. (9) and (10):

$${q}_{P}^{j}\left(\mathrm{s}\right)={E}_{{\underline{X}}^{-j}}\left\{f\left({U}^{j|=s}\right)\right\}.$$
(9)
$${\widehat{q}}_{P}^{j}\left(\mathrm{s}\right)=\frac{1}{h}\sum_{i=1}^{h}f\left({\underline{u}}_{i}^{j|=s}\right).$$
(10)

The implementation of all models was achieved in R v.4.1.2 software.

Results

A descriptive summary of the physicochemical variables and Acinetobacter density of the waterbodies is presented in Table 1. The mean pH, EC, TDS, and SAL of the waterbodies was 7.76 ± 0.02, 218.66 ± 4.76 µS/cm, 110.53 ± 2.36 mg/L, and 0.10 ± 0.00 PSU, respectively. While the average TEMP, TSS, TBS, and DO of the rivers was 17.29 ± 0.21 °C, 80.17 ± 5.09 mg/L, 87.51 ± 5.41 NTU, and 8.82 ± 0.04 mg/L, respectively, the corresponding DO5, BOD, and AD was 4.82 ± 0.11 mg/L, 4.00 ± 0.10 mg/L, and 3.19 ± 0.03 log CFU/100 mL respectively.

Table 1 Descriptive statistics of the physicochemical variables and Acinetobacter density of the waterbodies.

The bivariate correlation between paired PVs varied significantly from very weak to perfect/very strong positive or negative correlation (Table 2). In the same manner, the correlation between various PVs and AD varies. For instance, negligible but positive very weak correlation exist between AD and pH (r = 0.03, p = 0.422), and SAL (r = 0.06, p = 0.184) as well as very weak inverse (negative) correlation between AD and TDS (r = − 0.05, p = 0.243) and EC (r = − 0.04, p = 0.339). A significantly positive but weak correlation occurs between AD and BOD (r = 0.26, p = 4.21E−10), and TSS (r = 0.26, p = 1.09E−09), and TBS (r = 0.26, 1.71E-09) whereas, AD had a weak inverse correlation with DO5 (r = − 0.39, p = 1.31E−21). While there was a moderate positive correlation between TEMP and AD (r = 0.43, p = 3.19E−26), a moderate but inverse correlation occurred between AD and DO (r = − 0.46, 1.26E−29).

Table 2 Bivariate correlational relationship among physicochemical variables and Acinetobacter density in waterbodies receiving municipal and hospital wastewater effluents.

Model predicted AD and explanatory contribution of PVs

The predicted AD by the 18 ML regression models varied both in average value and coverage (range) as shown in Fig. 1. The average predicted AD ranged from 0.0056 log units by M5P to 3.2112 log unit by SVR. The average AD prediction declined from SVR [3.2112 (1.4646–4.4399)], DTR [3.1842 (2.2312–4.3036)], ENR [3.1842 (2.1233–4.8208)], NNT [3.1836 (1.1399–4.2936)], BRT [3.1833 (1.6890–4.3103)], RF [3.1795 (1.3563–4.4514)], XGB [3.1792 (1.1040–4.5828)], MARS [3.1790 (1.1901–4.5000)], LR [3.1786 (2.1895–4.7951)], LRSS [3.1786 (2.1622–4.7911)], GBM [3.1738 (1.4328–4.3036)], Cubist [3.1736 (1.1012–4.5300)], ELM [3.1714 (2.2236–4.9017)], KNN [3.1657 (1.4988–4.5001)], ANET6 [0.6077 (0.0419–1.1504)], ANET33 [0.6077 (0.0950–0.8568)], ANET42 [0.6077 (0.0692–0.8568)], and M5P [0.0056 (− 0.6024–0.6916)]. However, in term of range coverage XGB [3.1792 (1.1040–4.5828)] and Cubist [3.1736 (1.1012–4.5300)] outshined other models because those models overestimated and underestimated AD at lower and higher values respectively when compared with raw data [3.1865 (1–4.5611)].

Figure 1
figure 1

Comparison of ML model-predicted AD in the waterbodies. RAW raw/empirical AD value.

Figure 2 represents the explanatory contributions of PVs to AD prediction by the models. The subplot A-R gives the absolute magnitude (representing parameter importance) by which a PV instance changes AD prediction by each model from its mean value presented in the vertical axis. In LR, an absolute change from the mean value of pH, BOD, TSS, DO, SAL, and TEMP corresponded to an absolute change of 0.143, 0.108, 0.069, 0.0045, 0.04, and 0.004 units in the LR’s AD prediction response/value. Also, an absolute response flux of 0.135, 0.116, 0.069, 0.057, 0.043, and 0.0001 in AD prediction value was attributed to pH, BOD, TSS, DO. SAL, and TEMP changes, respectively, by LRSS. Similarly, absolute change in DO, BOD, TEMP, TSS, pH, and SAL would achieve 0.155, 0.061. 0.099, 0.144, and 0.297 AD prediction response changes by KNN. In addition, the most contributed or important PV whose change largely influenced AD prediction response was TEMP (decreases or decreases the responses up to 0.218) in RF. Summarily, AD prediction response changes were highest and most significantly influenced by BOD (0.209), pH (0.332), TSS (0.265), TEMP (0.6), TSS (0.233), SAL (0.198), BOD (0.127), BOD (0.11), DO (0.028), pH (0.114), pH (0.14), SAL(0.91), and pH (0.427) in XGB, BTR, NNT, DTR, SVR, M5P, ENR, ANET33, ANNET64, ANNET6, ELM, MARS, and Cubist, respectively.

Figure 2
figure 2

PV-specific contribution to eighteen ML models forecasting capability of AD in MHWE receiving waterbodies. The average baseline value of PV in the ML is presented on the y-axis. The green/red bars represent the absolute value of each PV contribution in predicting AD.

Table 4 presents the eighteen regression algorithms’ performance predicting AD given the waterbodies PVs. In terms of MSE, RMSE, and R2, XGB (MSE = 0.0059, RMSE = 0.0770; R2 = 0.9912) and Cubist (MSE = 0.0117, RMSE = 0.1081, R2 = 0.9827) ranked first and second respectively, to outmatched other models in predicting AD. While MSE and RMSE metrics ranked ANET6 (MSE = 0.0172, RMSE = 0.1310), ANRT42 (MSE = 0.0220, RMSE = 0.1483), ANET33 (MSE = 0.0253, RMSE = 0.1590), M5P (MSE = 0.0275, RMSE = 0.1657), and RF (MSE = 0.0282, RMSE = 0.1679) in the 3, 4, 5, 6, and 7 position among the MLs in predicting AD, M5P (R2 = 0.9589 and RF (R2 = 0.9584) recorded better performance in term of R-squared metric and ANET6 (MAD = 0.0856) and M5P (MAD = 0.0863) in term of MAD metric among the 5 models. But Cubist (MAD = 0.0437) XGB (MAD = 0.0440) in term of MAD metric.

The feature importance of each PV over permutational resampling on the predictive capability of the ML models in predicting AD in the waterbodies is presented in Table 3 and Fig. S1. The identified important variables ranked differently from one model to another, with temperature ranking in the first position by 10/18 of the models. In the 10 algorithms/models, the temperature was responsible for the highest mean RMSE dropout loss, with temperature in RF, XGB, Cubist, BRT, and NNT accounting for 0.4222 (45.90%), 0.4588 (43.00%), 0.5294 (50.82%), 0.3044 (44.87%), and 0.2424 (68.77%) respectively, while 0.1143 (82.31%),0.1384 (83.30%), 0.1059 (57.00%), 0.4656 (50.58%), and 0.2682 (57.58%) RMSE dropout loss was attributed to temperature in ANET42, ANET10, ELM, M5P, and DTR respectively. Temperature also ranked second in 2/18 models, including ANET33 (0.0559, 45.86%) and GBM (0.0793, 21.84%). BOD was another important variable in forecasting AD in the waterbodies and ranked first in 3/18 and second in 8/18 models. While BOD ranked as the first important variable in AD prediction in MARS (0.9343, 182.96%), LR (0.0584, 27.42%), and GBM (0.0812, 22.35%), it ranked second in KNN (0.2660, 42.69%), XGB (0.4119, 38.60); BRT (0.2206, 32.51%), ELM (0.0430, 23.17%), SVR (0.1869, 35.77%), DTR (0.1636, 35.13%), ENR (0.0469, 21.84%) and LRSS (0.0669, 31.65%). SAL rank first in 2/18 (KNN: 0.2799; ANET33: 0.0633) and second in 3/18 (Cubist: 0.3795; ANET42: 0.0946; ANET10: 0.1359) of the models. DO ranked first in 2/18 (ENR [0.0562; 26.19%] and LRSS [0.0899; 42.51%]) and second in 3/18 (RF [0.3240, 35.23%], M5P [0.3704, 40.23%], LR [0.0584, 27.41%]) of the models.

Table 3 Feature importance of PVs over 100 permutational resampling on AD prediction.

Figure 3 shows the residual diagnostics plots of the models comparing actual AD and forecasted AD values by the models. The observed results showed that actual AD and predicted AD value in the case of LR (A), LRSS (B), KNN (C), BRT 9F), GBM (G), NNT (H), DTR (I), SVR (J), ENR (L), ANET33 (M), ANER64 (N), ANET6 (O), ELM (P) and MARS (Q) skewed, and the smoothed trend did not overlap. However, actual AD and predicted AD values experienced more alignment and an approximately overlapped smoothed trend was seen in RF (D), XGB (E), M5P (K), and Cubist (R). Among the models, RF (D) and M5P (K) both overestimated and underestimated predicted AD at lower and higher values, respectively. Whereas XGB and Cubist both overestimated AD value at lower value with XGB closer to the smoothed trend that Cubist. Generally, a smoothed trend overlapping the gradient line is desirable as it shows that a model fits all values accurately/precisely.

Figure 3
figure 3

Comparison between actual and predicted AD by the eighteen ML models.

The comparison of the partial-dependence profiles of PVs on AD prediction by the 18 modes using a unitary model by PVs presentation for clarity is shown in Figs. S2–S7. The partial-dependence profiles existed in i. a form where an average increase in AD prediction accompanied a PV increase (upwards trend), (ii) inverse trend, where an increase in a PV resulted in a decline AD prediction, (iii) horizontal trend, where increase/decrease in a PV yielded no effects on AD prediction, and (iv) a mixed trend, where the shape switch between 2 or more of i–iii. The models' response varied with a change in any of the PV, especially changes beyond the breakpoints that could decrease or increase AD prediction response.

The partial-dependence profile (PDP) of DO for models has a downtrend either from the start or after a breakpoint(s) of nature ii and iv, except for ELM which had an upward trend (i, Fig. S2). TEMP PDP had an upward trend (i and iv) and, in most cases filled with one or more breakpoints but had a horizontal trend in LRSS (Fig. S3). SAL had a PDP of a typical downward trend (ii and iv) across all the models (Fig. S4). While pH displayed a typical downtrend PDP in LR, LRSS, NNT, ENR, ANN6, a downtrend filled with different breakpoint(s) was seen in RF, M5P, and SVR; other models showed a typical upward trend (i and iv) filled with breakpoint(s) (Fig. S5). The PDP of TSS showed an upward trend that returned to a plateau (DTR, ANN33, M5P, GBM, RF, XFB, BRT), after a final breakpoint or a declining trend (ANNT6, SVR; Fig. S6). The BOD PDP generally had an upward trend filled with breakpoint(s) in most models (Fig. S7).

Discussion

The present investigation studied the invaluableness of MLs in determining AD in waterbodies to shorten the turnaround time involved in routine determination of the emerging pathogen with significant public health priority and high case-fatality ratio. Jiang et al. previously demonstrated that ML models predicted and offered cost-effective risk assessment options for Vibrio spp. relative abundances on microplastics in the estuarine milieu based on easy-to-measure environmental variables30.

Characteristics of the waterbodies

The pH of the waterbodies (5.05–9.11) did not satisfied South African water guidelines for irrigation purposes and recreational use of a pH range of 6.5–8.4 and 6.5–8.5, respectively36 but the average pH (7.76 ± 0.02) of the waterbodies met the FAO criteria37. In relation to the pathogen, Acinetobacter spp. are known to possess and survive under a wide pH (5–10) and temperature (− 20 to 44 °C) range with an optimal long-term survival temperature of 4–22 °C no matter nutrient availability38.

The observed EC (47.00–561.00 µS/cm) of the waterbodies generally satisfied the WHO guidelines for 2500 μS/cm in surface waters39, and the mean (218.66 ± 4.76 μS/cm) was in accepted limits of 400 µS/cm and 700 to 3000 µS/cm WHO and FAO standard for irrigation water37. The EC of the waterbodies also fell in the categories of Class I (excellent: ≤ 250 µS/cm) and Class II (good: 250–750 µS/cm) irrigation water EC limits classification40. The EC concentrations of the waterbodies will generally impact fishing negatively, as an EC range of 0.15–0.50 μS/cm are necessary to support fisheries according to the USEPA (United States Environmental Protection Agency)41.

TDS summed up organic and inorganic substances in the waterbodies but generally did not exceed the WHO’s maximum permissible limit of 1000 mg/L TDS in drinking water39. The TDS (23.00–279.00 mg/L) of the waterbodies followed the World Health Organization standard of a TDS < 300 mg/L (excellent) and its average (110.53 ± 2.36 mg/L) does not exceed the USEPA and WHO limit for drinking water (500 mg/L)41,42.

However, the TBS average values of the waterbodies exceeded the WHO guideline of 5 NTU39. Higher EC, TDS, and TBS in surface waters are generally attributed to wastewater and anthropogenic activities inputs43. Also, high levels of EC, TDS and TBS are known to impair visibility, cleanliness, safety, aesthetics, and recreational use of river waters44. The mean TSS (80.17 ± 5.09 mg/L) of the waterbodies exceeded the WHO (2006) wastewater discharge limit of 60 mg/L and exceeded the Australia and New Zealand (2000) guideline limits (TSS < 0.03 mg/L) of water quality for aquaculture45,46. In addition, the average BOD level (4.00 ± 0.10 mg/L) of the waterbodies complied with the tolerance limit of 5 mg/L in surface waters for aquatic life47. Higher level of BOD in waterbodies depletes DO available for aquatic organisms48 and generally have negative impacts on fishing and fish harvest.

The average AD (3.19 ± 0.03 log CFU/100 mL) obtained in this study is comparable to AD reported from waterbody impacted by hospital wastewater, WWTP, informal settlements, and veterinary clinics effluents along Umhlangane River course in Durban South Africa49. The observed DO (8.82 ± 0.04 mg/L) and BOD (4.00 ± 0.10 mg/L) both suggested the facultative aerophilic characteristics of Acinetobacter and a relatively high nutrient composition of the rivers’ probable from wastewater effluents. The average EC in the waterbodies was 218.66 ± 4.76 µS/cm. This shows high level of organic carbon (DOC) in the rivers. EC is an indirect indicator of DOC25,50,51 and found to have associations with Acinetobacter-specific ARG and other ARG abundance25,52,53. Generally, A. baumannii in the environment can survive irrespective of the level of DO54.

The finding from this study revealed that AD negligible—positive but very weak—correlated with pH (r = 0.03), and SAL (r = 0.06) and—negatively—with TDS (r = − 0.05) and EC (r = − 0.04) (Table 2). These results can be attributed to the ability of the Acinetobacter to survive under a wide range of harsh environmental conditions. A significantly positive correlation between AD and BOD (r = 0.26), TSS (r = 0.26), and TBS (r = 0.26) indicated a considerable increase AD with an increase in nutrient and DOC pollution in aquatic environments (Fig. S7). Also, findings showed a moderate positive correlation between TEMP and AD (r = 0.43), suggesting that AD improves in abundance with an increase in temperature38 to specific breakpoints. AD moderately and inversely correlated with DO (r = − 0.46), indicating that Acinetobacter abundance increases with an anaerobic condition or low oxygen level.

Model predicted AD and explanatory contribution of PVs

The predicted AD average and range values by the 18 ML models differed. The present study's findings suggested that both lower/upper bound and the general trend characteristic of the prediction is far more important than the average prediction only. Most algorithms had higher average predictions but overestimated or underestimated AD values at lower and upper bounds, respectively. Thus, algorithms other than XGB and Cubist are not suitable for predicting AD in waterbodies. Whereas the performance of most ML algorithms, such as RF, DTR, and MARS43,55, has been praised in terms of average predictions and regression metrics, most studies neglect consideration of the lower/upper bound and the general trend characteristic of their predictions—which are far significant when dealing with infectious organisms/poison that might have low infectivity dose/potent at a very low concentration. Several researchers also reported the superiority of XGB against several ML algorithms in predictive performance in terms of average prediction, and sensitivity43,55. Although a previous study showed that RF models achieved higher level of accuracy than XGB, SVR, and ENR in predicting the Vibrio spp. relative abundance on microplastics, the actual trend characteristics including the lower/upper bounds were not reported30. The difference in the models’ trend coverage and boundary characteristics in AD predictions are attributable to the capability of the models to capture the complex interactions of co-occurrence levels/changes in different environmental variables at different degrees or concentrations. The performance of Cubist [3.1736 (1.1012–4.5300)] was also found to be comparable to XGB [3.1792 (1.1040–4.5828)] in term of trend and boundaries characteristics as both models outshined other models. A typical problem with most algorithms observed in this study was over-estimation and underestimation of AD at lower and higher concentrations, respectively. These limitations suggested that the models could raise false alarm of high risk at lower AD as well as undermine higher risk at higher concentrations of AD. An indication that those models could not capture the nonlinear complex relationships between AD, PVs, and underlying anthropogenic inputs.

Nevertheless, the absolute contributions of individual PV change to models’ prediction of AD from their models attributed mean values varied (Fig. 2). The behaviours could be interpreted in term of the complex interactions among the PVs coupled with the prevailing anthropogenic fluxes in the waterbodies. Several PVs undergo fluctuations co-concurrently unlike behaviours in models in which other PVs are held constant to assess a particular PV’s effects on the outcome variable (AD). These interactions are capture to some great degrees by the algorithms leading to differences in the ranking of PVs contributions to AD predictions by the algorithms. Also, intrinsic characteristics of the distinct algorithms and data noise are major causes of differences in observed contributions of variables in ML models30.

Considering the overall performance of 18 AI-based models assayed in this study using four metrics, XGB (MSE = 0.0059, RMSE = 0.0770; R2 = 0.9912; MAD = 0.0440) and Cubist (MSE = 0.0117, RMSE = 0.1081, R2 = 0.9827; MAD = 0.0437) were the best models ranking in first and second position respectively, to outshined others in AD prediction in waterbodies (Table 4). XGB has reputation of been the best performer ML algorithms in most microbiological regression studies compared with others30. Cubist has been demonstrated to outperformed partial least squares, RF, and MARS in predicting soil property including soil total nitrogen, organic carbon, total sulphur, exchangeable calcium clay; sand, and cation exchange capacity, and pH and RF, classification, and regression trees, SVM, and KNN predicting NH4–N and COD in subsurface constructed wetlands effluents56,57. In forecasting daily dissemination of COVID-19 vaccination, Cubist outperformed ENR, Gaussian Process, Slab (SPIKES), and Spikes ML algorithms58. Also, Cubist has been shown to outmatched XGB in predicting left ventricular pressures, volumes, and stresses59. An ensemble of XGB and Cubist could be further exploited for a better performance in forecasting AD in waterbodies. However, ANN (R2 = 0.953) was demonstrated to show a superior predictive coefficient over Cubist model (R2 = 0.946) and LR (R2 = 0.481) when assaying faecal coliform content in treated wastewater for reuse purposes60. Generally, while XGB involved ensemble of trees that capture multidimensional interactions/relationships, Cubist combined the strengths of both linear regression equations and a committee tree-based structural nodes for capture effectively linear and nonlinear multidimensional relationships among variables and outcome event56. The results show that ANET6, ANRT42, ANET33, M5P, and RF had MSE and RMSE that placed them in the 3, 4, 5, 6, and 7 position among the MLs in predicting AD, their performances are to be avoided for practical forecast of AD for preventive purposes.

Table 4 Predictive performance of eighteen regression algorithms in predicting AD in the waterbodies.

Feature importance of PVs in predicting AD

TEMP was the most important PV in predicting AD in the waterbodies and ranked by 10/18 ML-algorithms including RF, XGB, Cubist, BRT, and NNT accounting 45.90%, 43.00%, 50.82%, 44.87%, and 68.77% in respective models, as well as 82.31%, 83.30%, 57.00%, 50.58%, and 57.58% RMSE dropout loss in ANET42, ANET10, ELM, M5P, and DTR respectively. The observed results can be explained in term of the direct and indirect influence TEMP had on other PVs and AD in the waterbodies. DO decreases with increase temperature, favoured facultative aerobic lifestyle of Acinetobacter. Also, temperature increase decomposition of organic matters in waterbodies, thereby leading to high BOD contents providing more nutrients for AD and other microbial lives. Resultant increase in DOC in waterbodies is an indirect indicator of EC25,50,51 and found to have associations with Acinetobacter-specific ARG abundance in waterbodies25,52,53. BOD was another significant feature identified in forecasting AD in the waterbodies and ranked first in 3/18 [MARS (182.96%), LR (27.42%), and GBM (22.35%)] and second in 8/18 models [KNN (42.69%), XGB (38.60%); BRT (32.51%), ELM (23.17%), SVR (35.77%), DTR (35.13%), ENR (21.84%) and LRSS (31.65%)]. BOD is a measure of nutrient pollution from anthropogenic inputs such as wastewater effluents, agricultural activities, and environmental events such as rainwater runoffs among others. BOD also influence EC, TDS, and TBS in surface waters43 Whereas SAL was identified as first important feature in in 2/18 (KNN, ANET33) and second in 3/18 (Cubist, ANET42, ANET6) models, Acinetobacter can only survive relatively high SAL without improving its population density (Fig. S4). Unlike Vibrio spp, whose high density are linked with high salinity30 as it promotes genes expression and functional proteins61 and eventual vibrio growth and reproduction62, high SAL are not suitable for AD as its inhibitory for growth related gene expression.

The sensitivity analyses of the 18 ML predictive models of AD using the residual diagnostics plots found that LR (A), LRSS (B), KNN (C), BRT (F), GBM (G), NNT (H), DTR (I), SVR (J), ENR (L), ANET33 (M), ANER64 (N), ANET6 (O), ELM (P) and MARS (Q) did not fit the data optimally. This imply that the models are not suitable for forecasting AD in waterbodies. Meanwhile models such as RF (D), XGB (E), M5P (K), and Cubist (R) fitted the data with more alignment and approximately overlapped smoothed trend between the actual and the predicted AD values, RF (D) and M5P (K) over-predicted and under-predicted AD at lower and higher extremities, respectively. Thus, could be interpreted as forecasting exaggerated risk (AD) at probable innocuous level while weakening true risk at higher extremity. Such models are not suitable to assess real life events of AD in waterbodies. Although both XGB and Cubist predicted AD value slightly higher than the actual value at lower extremities, XGB had a closer fit smoothed trend than Cubist. Compared to other models assayed in this study, the duo is the best and could be applied for AD AI-smart system design for water quality monitoring. A stacked model of XGB and Cubist may outmatch and overcome the limitation the two models had at the lower extremity of AD value.

The overall summary of the PDPs of the PVs on AD prediction by the 18 modes (Figs. S2S7), found that any degree of change/flux in a particular PV especially changes beyond its breakpoints attracted a corresponding varied response in AD which could decrease or increase AD prediction response. The various forms of partial-dependence profiles as explained in previous section also showed the direct/indirect/complex interactions between a PV and AD coupled with the sensitivity of a model in mapping the relationships. Summarily, the increase in AD level (PDP) in most models equivalent to a decline trend in DO and SAL especially after its breakpoint(s) excluding ELM where DO had upward trend (i; Figs. S2 and S4). These patterns revealed a nonlinear relationship between AD and the PVs. A near increase-by-increase relationship exist between TEMP and AD in most models coupled with one or more breakpoints. LRSS revealed a zero-relationship between AD and TEMP indicating its inability to map the relationship between them. Although Acinetobacter has been showed to have a broad pH range, a typical downtrend PDP of pH by LR, LRSS, NNT, ENR, ANN6—filled with breakpoint(s) in RF, M5P, and SVR while other models showed a typical upward—is informative of the weakness of the models as increasing in pH from 5.02 to 10 promotes Acinetobacter growth38. AD prediction responses aligned with a general increase in BOD regardless of breakpoint(s) in most models revealed important of nutrients for Acinetobacter population density in waterbodies.

Furthermore, the strengths of this current study aside been the first that assessed AD in waterbodies receiving hospital and municipal wastewater effluents along their courses, two ML algorithms optimally and accurately predict AD, proven to be promising candidates for developing SAIS for AD determination and thereby shorten the turnaround time and reduce labour involved in experimental approaches. Also, the MLs were able to capture nonlinear complex multidimensional interactions between AD and PVs as well as their inherent anthropogenic fuels which conventional mathematical models could not robustly mapped63. In addition, the MLs are amenable to improvements and can be utilized across several water management landscape. However, the shortcoming of the present study lies in the lack of spatiotemporal covariates that could improve upon the ML models’ predictions as stochastic distributions of waterborne pathogens are governed by both spatial extension and temporal duration across depth in water columns. Future studies should seek data from a wide range of socioeconomic activities/areas as well as include spatiotemporal and geospatial inputs in developing AI-based predictive framework for AD determination.

Conclusion

The present study has proven SAIS as an evidence-based strategy to shorten the turnaround time involved in assessing AD in waterbodies; thereby minimizing exposure. The best models (XGB/Cubist) identified in this study could be developed into standalone SAIS (XGB/Cubist, XGB-Cubist ensemble, or web app) or integrated into existing instrumentations for PV estimation in waterbodies to enhance timely decision-making of microbiological qualities of waterbodies for irrigation and other purposes. The study also unveiled temperature and BOD as significant candidates for predicting AD in waterbodies in most models. Finally, AD in waterbodies could accurately and reliably predicted via AI-based smart systems that rely on waterbody physicochemical variables’ dynamics in a low-cost and time-effective manner.