Introduction

Floods harm a growing number of people worldwide. According to UNISDR1, flooding is the leading cause of natural disasters globally, accounting for 90% of all catastrophes. Rivers in Malaysia exhibit strong seasonality: because of the country's proximity to the equator, most peak flows coincide with torrential rains brought by the north-east and south-west monsoons2,3. Sustained rainfall consequently drives a considerable volume of streamflow through the rivers, which then overflow their banks4. Malaysia has experienced several large floods throughout its history, most notably the severe floods of 2006 and 2007, which resulted in significant losses for the government and substantial economic devastation5. Recent rapid population growth within river basins diminishes channel capacity and accelerates streamflow, increasing flood amplitude and duration6. Together with climate change, these factors have further increased the frequency of floods in Malaysia7. A simple and low-cost tool for monitoring flood occurrence is streamflow time series monitoring, an effective indicator of trends and changes in the hydro-climatic system8,9.

In a generic machine learning context, time series analysis can, in principle, be framed as either a classification or a regression problem. Streamflow regression has been the most frequently studied formulation in streamflow forecasting research10,11. Hydrologists often refer to this form of prediction as numerical forecasting, in which a single-point estimate of the expected streamflow is generated. Early time series forecasting relied on models such as ARIMA and ARIMAX, which make predictions from the dataset's correlation structure through the autocorrelation and partial autocorrelation functions. However, there is substantial evidence that models based on the linearity assumption do not provide good streamflow forecasts12. Recognizing that the linear assumption is inadequate for complicated time series forecasting, researchers proposed the artificial neural network (ANN), which acts as a universal function approximator13. Other frequently used machine learning algorithms include random forest (RF)14,15 and gradient boosting (GB)16,17. When uncertainty is factored in, the forecasting process can be quantified through probability forecasting, another form of regression18. In practice, over-fitting makes it difficult for machine learning models to forecast continuous values with perfect accuracy19. A model that performs well on both the training and testing datasets is therefore preferred in machine learning; in essence, such a model learns enough about the dataset from its inputs to make generalized judgments20.

In contrast, a classification task assigns the prediction to one of several predetermined categories21. The simplest way to categorize streamflow is as a binary task, in which streamflow either increases or decreases. The multi-class classification problem is theoretically more complex than the binary task, since streamflow is divided into more than two class labels and additional decision-making is therefore required19,22. It should be stressed that streamflow classification considers more than simply whether or not the streamflow will change today: the predicted streamflow classes are linked to the likelihood of belonging to each class. However, converting a time series regression into a classification needs careful planning, since classification entails a forced-choice presumptive decision with discrete, rather than stochastic, outcomes23. There are real-world situations where an outcome is not definite, such as "It will rain today", and forcing them into categories is not the best course of action. Nevertheless, streamflow classification can be beneficial, especially in reservoir operations, where it is sometimes necessary to discretize the storage stage in order to derive the operational rule for optimizing the reservoir system24. Recently, an illustration of streamflow classification can be seen in the study by Chong, Huang25, who examined two distinct machine learning formulations of streamflow. They discovered that scenario-based streamflow forecasts outperform point forecasts in terms of accuracy. However, they also noted that, in the absence of other predictors or data-preprocessing techniques, their findings could be biased in favour of univariate streamflow. Given the constraints imposed by numerical point forecasting, classifying streamflow outputs warrants a more thorough analysis and is potentially a better choice for developing streamflow forecasting.

Another crucial consideration is the choice of hydrological model. The advent of machine learning may allow a data-driven model to perform better than a process-driven model, but at the price of the physical interpretation of the hydrological processes26. The current transition to data-driven modeling may stem from the difficulty of fully comprehending the interactions underlying the hydrological processes, which limits the efficacy of a process-driven model27. Despite the reformulation from regression to classification, we hypothesize that the streamflow time series still retains its temporally ordered structure, distinguishing it from other time series classification (TSC) tasks that make no assumptions about temporal dependency. Typical classification algorithms are not well adapted to such a task since they do not incorporate the time component28. Developing an effective AI model to carry out this classification task is therefore necessary. Deep learning techniques such as long short-term memory (LSTM) offer additional feature extraction capabilities that can compensate for the lack of a time-dependent component in classic classifier algorithms. Through its memory cell and the gates that regulate the information flow, an LSTM can process time series and memorize long-term associations. Such qualities are evident in a variety of applications where sequential information flow is crucial, including robotic control29, handwriting recognition30, and time series prediction31.

The format of this paper is as follows: Section "Previous works" introduces the previous works related to this study; In Section "The significance of study", the significance of the study is discussed. Section "Materials and methods" describes the dataset used and demonstrates the machine learning and deep learning algorithms used. Section "Results and discussion" presents the results and discussion; Section “Conclusion and future work” summarises the conclusions and recommendations for future research.

Previous works

Probabilistic methods

Several studies have considered probabilistic methods to predict the chance of flooding in the context of water demand, allocation, and flood event prediction. Monte Carlo techniques have been used to estimate the probability of a region being impacted by a cyclone in any given year32. The Monte Carlo method was found to be easy to implement and can be continuously improved as more data are collected over the years.

To respond to emergency cases and sudden rainstorms and flooding, an integration of decision makers' emotions, a dynamic Bayesian network, and Dempster–Shafer (DS) evidence theory was proposed33. The Bayesian network effectively simulated the dynamic change process, while the DS evidence theory reduced the subjectivity of the model in dealing with the uncertainty of the evolution process. Another study addressed the "scenario-response" paradigm: a target heavy-rain event was studied to examine the intricate evolution of emergency response using a constructed scenario Bayesian network34. This network was built by fusing knowledge meta-theory, scenario evolution, and Dempster's rule.

To assess risk and zone flood disasters, another study35 highlighted high-risk areas and clarified the reasons behind the potential hazards. The authors analysed disaster system theory and established a flood disaster evaluation index system for urban agglomerations.

Machine Learning methods

Artificial neural networks (ANNs) have been used as a useful soft computing tool to predict future water availability from a catchment in real-world scenarios36. The use of ANNs was proposed because of the absence of the intensive data normally required for modelling practices in hydrology. A Levenberg–Marquardt ANN was able to give good prediction performance37.

Another study compared a stacked model, which combines random forest and a multilayer perceptron through an elastic net, with bidirectional long short-term memory networks for multiple-steps-ahead streamflow prediction38. The stacked model outperformed the bidirectional LSTM in many cases when predicting the highest flow rates, but it was less accurate in predicting low flow rates. The prediction accuracy of both models decreased as the length of the time series increased. The stacked model also required shorter computation times than the bidirectional LSTM.

An evaluation and comparison of various deep learning models, including convolutional neural networks (CNN), long short-term memory (LSTM), and self-attention (SA)-LSTM models, against a simple extreme learning machine (ELM) model was carried out for monthly streamflow prediction39. The experiments targeted the prediction of unprecedented hydrologic events such as no-flow events and extreme floods. The SA-LSTM model proved to be an effective streamflow prediction model for extreme events.

Explainable AI with long short-term memory (LSTM) has also been explored in the literature for streamflow prediction40. In that study, the authors examined the model's explainability using the Shapley additive explanations (SHAP) method and found that SHAP enhanced the explainability of the LSTM model's streamflow predictions.

The significance of study

Forecasting streamflow lowers flood and reservoir risks while enhancing the management and planning of water resources. Owing to their ability to detect non-linearities and short- or long-term temporal interrelationships, statistical and machine learning techniques have been applied to streamflow forecasting challenges. However, machine learning models for multivariate streamflow forecasting may suffer from over-fitting and an inability to predict exact streamflow values. To address this issue, a streamflow categorization approach is proposed in this study to extract patterns from streamflow data and map these features to specific categories.

Because of the highly non-linear pattern, stochastic nature, and extremely wide range of streamflow in the selected rivers, as shown in Tables 1 and 2, the water resources management strategy is to categorize the streamflow into different classes for each time increment and to treat the streamflow class as an operational constraint and a major component of the water management policy.

Table 1 Total duration for each river from eleven rivers.
Table 2 Descriptive data analysis of streamflow for the eleven rivers.

The motivation of this work is to study the possibility of formulating the streamflow prediction task as a classification problem by dividing streamflow into five and ten class labels.

The transfer from regression to classification opens the door to implementing various classification models to predict streamflow levels, which supports further decision making.

In light of the above, the goal of the current work is to examine how deep learning performs in anticipating streamflow levels compared with other classifier algorithms, namely GB and SVM. Furthermore, an effective technique, stacking ensemble modelling, was also adopted to enhance model performance. Several metrics were used to assess the performance of the ML models, including accuracy, precision, recall, F1 score, the area under the curve (AUC), and quadratic weighted kappa (QWK).

Materials and methods

This section covers the methodology of the presented work, as illustrated in the flow chart in Fig. 1. First, we give an overview of the data collected from eleven rivers used for flow classification. Second, the methods and classification models used for prediction are detailed, together with their optimal architectures and hyperparameters.

Figure 1
figure 1

Flow chart of our methodology for streamflow classification using machine learning models.

Data description

The data used for modelling in this work consist of daily streamflow values collected over a specific duration, as shown in Table 1. The period of data gathering varies from river to river. The Kedah river has the longest record, with a total of 12,419 samples. In contrast, WPKL has the shortest record, with only one year's worth of data (365 samples). Table 2 shows the basic statistical parameters of the streamflow dataset of each river, which differ in sample size.

Figure 2 shows the histogram distribution of streamflow data of each river. As seen in Fig. 2, not all rivers have an identical distribution of streamflow data. The horizontal axis represents the streamflow, and the vertical axis represents the count of the specific range of flow values. The categories (labels) were set according to the range values of streamflow. It is clear that the streamflow samples were abundant in some labels while being scarce in others.

Figure 2
figure 2

The histogram of streamflow values for eleven rivers.

Figure 3 depicts the seasonal variations in streamflow. November and December are when most rivers' average streamflow is at its peak. Additionally, annual variations of streamflow are shown in Fig. 4. Another characteristic of the data is that the average streamflow of many rivers varies depending on the year. The number of years with daily data also differs from one river to another.

Figure 3
figure 3

The seasonal variations of streamflow values for eleven rivers.

Figure 4
figure 4

The annual variations of streamflow values for eleven rivers.

Data partitioning

This section describes the experimental procedure and data partitioning. The streamflow dataset was split into three parts, training, validation, and testing, using a 60%, 20%, and 20% rule, respectively. Alongside the training data, validation data were used to tune the models' hyperparameters so that hidden patterns in the input series could be discovered. Testing data are crucial because they allow generalizability to be evaluated. Finally, the optimized models with the best architectures and hyperparameters were evaluated and compared on the testing dataset.
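A minimal sketch of such a 60/20/20 split (assuming the split is made in time order, which is standard practice for time series; the synthetic data and variable names are illustrative, not taken from the original code):

```python
import numpy as np

def chronological_split(series, train_frac=0.6, val_frac=0.2):
    """Split a time series into train/validation/test segments in time order."""
    n = len(series)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    return series[:n_train], series[n_train:n_train + n_val], series[n_train + n_val:]

# Synthetic daily flows standing in for one river's record
streamflow = np.random.default_rng(0).gamma(2.0, 50.0, size=3650)
train, val, test = chronological_split(streamflow)  # 60% / 20% / 20%
```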

Feature scaling

A MinMax scaler was used to scale the feature vector, which comprises the previous n steps of the streamflow time series. This scaler avoids distorting the data by preserving its shape. Each feature is translated to the range between zero and one as follows:

$$X_{\mathrm{std}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}} \quad (1)$$
$$X_{\mathrm{scaled}} = X_{\mathrm{std}} \times (\mathrm{max} - \mathrm{min}) + \mathrm{min} \quad (2)$$

where \(X_{\min}\) and \(X_{\max}\) are the per-feature minimum and maximum, and (min, max) is the desired feature range.
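The scaling above matches scikit-learn's MinMaxScaler; a brief sketch (the lag-feature matrix below is synthetic and only stands in for the real inputs):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Synthetic lag-feature matrix: each row holds the previous n daily flows
X_train = np.array([[120.4, 98.7], [98.7, 110.2], [110.2, 134.9]])

scaler = MinMaxScaler(feature_range=(0, 1))      # (min, max) in Eq. (2)
X_train_scaled = scaler.fit_transform(X_train)   # applies Eqs. (1)-(2) per feature
# To avoid leakage, the same fitted scaler would be reused on validation/test data:
# X_test_scaled = scaler.transform(X_test)
```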

Category label annotation

The streamflow was separated into ranges, with each range corresponding to one class or label. This paper considers two scenarios for the number of classes: five and ten. Because each river has different characteristics, the hidden patterns the models must identify differ significantly from one river to another. Tables 3, 4, and 6 show the two methods of range division for five and ten categories, described below:

Table 3 Data balanced method to divide streamflow range into 5 categories.
Table 4 Equal range method to divide streamflow range into 5 categories.

Data balanced method

This method divided the streamflow into ranges (categories), each with the same number of samples.

Equal range method

This method divided the streamflow into ranges of equal width, computed as \((\mathrm{maximum} - \mathrm{minimum})/5\) in the five-category scenario or \((\mathrm{maximum} - \mathrm{minimum})/10\) in the ten-category scenario, so that all categories have the same length. A sketch of both labeling methods is given below.
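As an illustration of the two labeling schemes (a sketch only; the actual class boundaries used in this work are those listed in Tables 3, 4, and 6):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
flow = pd.Series(rng.gamma(shape=2.0, scale=50.0, size=1000))  # synthetic daily flows

# Data balanced method: quantile bins, so each of the 5 classes has ~the same count
balanced_labels = pd.qcut(flow, q=5, labels=list(range(5)))

# Equal range method: 5 bins of equal width, i.e. (maximum - minimum) / 5
equal_labels = pd.cut(flow, bins=5, labels=list(range(5)))

print(balanced_labels.value_counts().sort_index())  # roughly 200 samples per class
print(equal_labels.value_counts().sort_index())     # equal widths, uneven counts
```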

Table 5 illustrates the algorithm used to formulate the streamflow prediction as a classification problem. This algorithm uses the ranges given in Tables 3 and 4 for the five-category scenario under the data balanced and equal range methods. The same algorithm was also applied in the ten-category scenario using only the balanced data method, as shown in Table 6.

Table 5 Algorithm to formulate streamflow prediction case as a classification case for five categories.
Table 6 Balanced data method to divide streamflow range into 10 categories.

The proposed classification models

This section discusses the classification models used in this work to classify streamflow values into five or ten categories, along with their optimized architectures and hyperparameters. The models included Gradient Boosting (GB), Support Vector Machine (SVM), a stacked ensemble of SVM and GB, and Long Short-Term Memory (LSTM). For each model, several experiments were conducted to select the best architecture and hyperparameters. The criteria for evaluation and selection were based on classification performance metrics such as the F1 score and quadratic weighted kappa (QWK).

Support vector machine (SVM)

The Support Vector Machine is the first model used for the streamflow classification task. SVM is a supervised learning model that can be used for classification. It works by separating the input data vectors with a hyperplane that maximizes the margin to these vectors; a non-linear decision surface is transformed into a linear one in a higher-dimensional space. SVM offers a number of hyperparameters, including the kernel and the regularization parameter C. The kernel is a crucial hyperparameter that transforms the inputs into the required form41. We tested various linear and non-linear kernel functions, such as the Gaussian (RBF), sigmoid, and polynomial kernels, and selected the one that produced the best results on the validation data. Experiments were conducted to select the regularization parameter and kernel carefully, so that the best values of performance indicators such as F1 score and QWK were obtained. An SVM with an RBF kernel and a regularization factor of 100 was found to deliver the best F1 score.
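A minimal scikit-learn sketch of this configuration (RBF kernel, C = 100); the synthetic arrays only stand in for the scaled lag features and category labels:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.random((500, 7))     # 7 lagged, scaled flow values per sample (synthetic)
y_train = rng.integers(0, 5, 500)  # 5 streamflow categories (synthetic)
X_test = rng.random((100, 7))

# RBF kernel with regularization factor C = 100, as selected on the validation data
svm_clf = SVC(kernel="rbf", C=100)
svm_clf.fit(X_train, y_train)
y_pred = svm_clf.predict(X_test)
```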

During SVM training, the hyperplane is selected to enlarge the distance to the nearest vector. The objective is to minimize the loss function, which is as follows:

$$\underset{w,b}{\mathrm{min}} \ \frac{1}{2} {w}^{T} w + C \sum_{i=1}^{n}\mathrm{max}\left(0, 1 - {y}_{i}\left({w}^{T}\phi \left({x}_{i}\right)+b\right)\right)$$
(1)

where w is the weight vector, b is the bias term, ϕ is the identity function, and C is a regularization constant.

Non-linear classifiers result from non-linear kernels, which compute the inner product between two \(\phi\) mappings as follows:

$${K \left({x}_{i},{x}_{j}\right)=\phi \left({x}_{i}\right)}^{T} \phi \left({x}_{j}\right)$$
(2)

As a result of the optimisation, the predicted class for a given sample x is calculated by summing over the support vectors, where α is the dual coefficient, which equals zero for samples outside the margin:

$$\sum_{i \in SV}{y}_{i } {\alpha }_{i} K \left({x}_{i}, x\right)+b$$
(3)

C has an impact on the decision surface. The SVM was trained by tuning C to balance a high value of C, which favours correct classification, against a low value of C, which favours a smooth decision surface.

The polynomial kernel is a non-linear kernel calculated as follows:

$$K \left(x, x{\prime}\right)= {\left( 1+{x}^{T} {x}{\prime}\right)}^{d}$$
(4)

where d is the degree.

The Gaussian kernel, also called the radial basis function (RBF) kernel, is a non-linear kernel calculated as follows:

$$K \left(x, x{\prime}\right)=\mathrm{exp}(- {\Vert x- {x}{\prime}\Vert }^{2}/2 {\sigma }^{2})$$
(5)

where σ is the standard deviation.
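For reference, Eqs. (4) and (5) can be written directly in NumPy (note that scikit-learn parameterises the RBF kernel through gamma = 1/(2σ²)):

```python
import numpy as np

def polynomial_kernel(x, x_prime, d=3):
    """K(x, x') = (1 + x^T x')^d, Eq. (4)."""
    return (1.0 + x @ x_prime) ** d

def rbf_kernel(x, x_prime, sigma=1.0):
    """K(x, x') = exp(-||x - x'||^2 / (2 sigma^2)), Eq. (5)."""
    return np.exp(-np.sum((x - x_prime) ** 2) / (2.0 * sigma ** 2))

x, x_prime = np.array([0.2, 0.5]), np.array([0.3, 0.4])
print(polynomial_kernel(x, x_prime), rbf_kernel(x, x_prime))
```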

Gradient boosting (GB) classifier

Gradient boosting is another powerful model used in this work for the streamflow classification task. GB, a tree learning system, is based on an ensemble learning approach42. Figure 5 illustrates the structure of the gradient boosting classifier. The performance of GB is significantly affected by hyperparameters such as the learning rate, the number of decision trees, and the maximum depth; they therefore need to be tuned carefully to find an optimal architecture. Several experiments were conducted to evaluate GB's performance for the classification of streamflow values and to find the optimal hyperparameters, i.e., the values that generate the best classification performance in terms of F1 score and QWK43,44. A GB model with 200 trees, a learning rate of 0.01, and a maximum depth of 5 outperformed the other GB models in terms of F1 score.

Figure 5
figure 5

The structure and operation of GB (Hearst et al. 1998; Osman et al. 2021).
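A scikit-learn sketch of the selected configuration (200 trees, learning rate 0.01, maximum depth 5); as before, the synthetic data only stand in for the real features:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X_train = rng.random((500, 7))     # synthetic scaled lag features
y_train = rng.integers(0, 5, 500)  # 5 streamflow categories (synthetic)

# 200 trees, learning rate 0.01, maximum depth 5, as tuned on the validation data
gb_clf = GradientBoostingClassifier(n_estimators=200, learning_rate=0.01, max_depth=5)
gb_clf.fit(X_train, y_train)
```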

Stacked ensemble

The stacked ensemble is the third powerful model used in this work for the streamflow classification task. It is an ensemble learning method to find the optimal combination of a collection of classifiers using a stacking process. In order to get the optimum performance, the stacked ensemble also learns how to combine each of the classifiers45. This work investigated the stacked ensemble learning method, which employed a support vector machine and gradient boosting classifiers. The outputs of these classifiers were connected to the meta-learner of the logistic regression classifier to produce the final classification categories of streamflow. The structure of this stacked ensemble classifier is shown in Fig. 6.

Figure 6
figure 6

The structure of stacked ensemble.
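A sketch of the described stacking setup in scikit-learn, with SVM and GB as base learners and logistic regression as the meta-learner (the hyperparameters reuse the values reported above; the data are synthetic stand-ins):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.random((500, 7))      # synthetic scaled lag features
y_train = rng.integers(0, 10, 500)  # 10 streamflow categories (synthetic)

stack_clf = StackingClassifier(
    estimators=[
        ("svm", SVC(kernel="rbf", C=100, probability=True)),
        ("gb", GradientBoostingClassifier(n_estimators=200, learning_rate=0.01, max_depth=5)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),  # meta-learner
)
stack_clf.fit(X_train, y_train)
```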

Long short-term memory (LSTM)

The fourth model applied to the streamflow classification task was long short-term memory. Recurrent neural networks (RNNs) are commonly used for sequence modeling to capture temporal correlations46. LSTM is an RNN that models long-range sequences using a memory cell, shown in Fig. 7, which acts as an accumulator of state information controlled by gates. The LSTM structure has the advantage of overcoming the vanishing gradient problem47. The parameters of the LSTM were tuned to fit the data. Table 7 describes the architecture of the LSTM, and Fig. 7 shows its structure.

Figure 7
figure 7

The structure of the LSTM neural network47.

Table 7 Architecture of LSTM.

The LSTM model was trained with training data using the following hyperparameters:

1. the learning rate was set to 0.001;
2. the batch size was set to 32;
3. the number of epochs was set to 100;
4. the loss function was categorical cross-entropy;
5. the optimizer was Adam.
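A minimal Keras sketch trained with these settings; the number of LSTM units, the input window length, and the synthetic data are illustrative, since the actual architecture is the one given in Table 7:

```python
import numpy as np
import tensorflow as tf

n_steps, n_classes = 7, 10   # illustrative lag window and class count
rng = np.random.default_rng(0)
X_train = rng.random((500, n_steps, 1)).astype("float32")  # (samples, time steps, features)
y_train = tf.keras.utils.to_categorical(rng.integers(0, n_classes, 500), n_classes)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(n_steps, 1)),
    tf.keras.layers.LSTM(64),                               # illustrative unit count
    tf.keras.layers.Dense(n_classes, activation="softmax"),
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),  # setting (1)
    loss="categorical_crossentropy",                           # setting (4)
    metrics=["accuracy"],
)
model.fit(X_train, y_train, batch_size=32, epochs=100, validation_split=0.2)  # (2), (3)
```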

In summary, the models developed above were used to classify the streamflow. The streamflow category is affected by different factors, such as the history of streamflow values, as discussed in the section on experimental results. Each model was trained and evaluated to find the best architecture and hyperparameters for the comparison reported in that section. Table 8 compares the methods used and lists the pros and cons of each.

Table 8 Comparison between used methods to show Pros and Cons for each.

Performance metrics

The classification performance was evaluated using several metrics such as Accuracy, Precision, Recall, F1 score, Area Under Curve (AUC), and Quadratic Weighted kappa (QWK).

1. Accuracy is a metric that calculates the number of correctly predicted samples over the total number of samples.

$$\mathrm{Accuracy }=\frac{TP+TN}{TP+TN+FP+FN}$$
(6)

where TP: True Positive, TN: True Negative, FP: False Positive, FN: False Negative.

2. Precision (positive predictive value) is a metric to calculate the correctly identified positive samples over all predicted positive samples.

$$\mathrm{Precision}=\frac{TP}{TP+FP}$$
(7)

3. Recall (Sensitivity) is a measure that calculates correctly identified positive samples over all actual positive samples.

$$\mathrm{Recall }= \frac{TP}{TP+FN}$$
(8)

4. F1 score summarizes recall and precision in one metric.

$$\mathrm{F}1\mathrm{ score }= \frac{2 \times precision \times recall}{precision+recall}$$
(9)

5. Area Under Curve (AUC) is a metric that shows how robust a classifier is as the decision threshold varies.

AUC is the area under the receiver operating characteristic (ROC) curve, which plots the true positive rate against the false positive rate.

6. Quadratic Weighted Kappa (QWK): Cohen's weighted kappa is a measure of agreement between raters, interpreted as shown in Table 9. Weighted kappa measures the similarity between predicted and actual values: an optimal score of 1.0 results from a complete match between predicted and actual values, whereas the worst score, -1, results from a large disagreement between them. QWK considers the degree of disagreement between classes rather than only whether the predicted class is correct, which makes it suitable when ordinal or ranked variables are involved, as in this work. The dataset used here has five or ten ratings that represent the streamflow value categories. The weight matrix representing the difference between the categories in the ten-class scenario is shown in Table 10; the same concept can be applied to any number of classes.

Table 9 QWK interpretation.
Table 10 The Weight Matrix W represents the difference between the classes for ten classes scenario.
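These metrics can be computed with scikit-learn; in particular, QWK corresponds to cohen_kappa_score with quadratic weights, where, in our reading of the standard definition, the weight matrix of Table 10 follows \(w_{ij} = (i - j)^2 / (N - 1)^2\). The labels below are illustrative:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, cohen_kappa_score, f1_score,
                             precision_score, recall_score)

y_true = np.array([0, 1, 3, 2, 4, 4, 1, 0])   # illustrative ordinal class labels
y_pred = np.array([0, 1, 2, 2, 4, 3, 1, 0])

print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred, average="macro", zero_division=0))
print(recall_score(y_true, y_pred, average="macro", zero_division=0))
print(f1_score(y_true, y_pred, average="macro"))

# Quadratic Weighted Kappa: distant misclassifications are penalised more heavily
print(cohen_kappa_score(y_true, y_pred, weights="quadratic"))
```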

We evaluated and compared the proposed models using a set of metrics. The training data were balanced because the balanced data method selects ranges with the same number of samples in each class; however, classifier evaluation and comparison were carried out on imbalanced testing data. Accuracy is usually an appropriate metric for evaluating classification performance, but it is misleading when the data are imbalanced and therefore cannot evaluate performance on its own in this work. Other evaluation metrics, namely precision, recall, F1 score, AUC, and QWK, were therefore used; larger values of these five metrics indicate better data fitting and higher classification performance. The F1 score is considered an effective metric for measuring classification performance with imbalanced data, but it depends on a single fixed classification threshold. To address this limitation, AUC was also used to highlight the robustness of the classification model as the threshold varies. Furthermore, confusion matrices were provided to show the details of the four terms: true positive, true negative, false positive, and false negative.

Experimental setup

The SVM, GB, and stacked ensemble models were trained on an Intel i7-5500U CPU using the scikit-learn framework. The LSTM model, on the other hand, was developed on Google Colaboratory on a K80 GPU with 12 GB of RAM using the TensorFlow framework.

Results and discussion

This section presents the experiments carried out to train and evaluate the machine learning models, namely support vector machine, gradient boosting, stacked ensemble, and long short-term memory. These experiments evaluate the models' performance in terms of accuracy, precision, recall, F1 score, AUC, and QWK. In these experiments, the models' hyperparameters were tuned to optimize the models and produce the best results. Two scenarios for the number of categories were considered: the streamflow values were divided into five categories in the five-class scenario and into ten categories in the ten-class scenario. The aim is to discover hidden patterns in the streamflow data for classification purposes.

Support vector machine

The first set of experiments demonstrates the impact of the history of previously observed streamflow on classifying the streamflow value one day ahead using a support vector machine. Using the balanced data method, we examined various history lengths (numbers of previous days) in terms of F1 score, as shown in Tables 11 and 12 for five and ten categories, respectively; the maximum values are highlighted in bold. The F1 scores were calculated using different histories of streamflow values to predict one day ahead: the last one, three, five, seven, fifteen, or thirty days were evaluated to find the best F1 score for each river at each history length. The five-category scenario produced high performance, with maximum F1 scores of 81%, 84%, 82%, 75%, 62%, 80%, 66%, 80%, 73%, and 73% for Johor, Kedah, Kelantan, Melaka, N9, Pahang, Perak, Perlis, Selangor, and Terengganu, respectively. In contrast, owing to the limited data collected in WPKL (only 365 samples covering one year), the F1 score there is low at 37%. The ten-category scenario also produced good performance, with maximum F1 scores of 66%, 69%, 64%, 60%, 65%, 56%, 61%, 58%, and 56% for Johor, Kedah, Kelantan, Melaka, Pahang, Perak, Perlis, Selangor, and Terengganu, respectively, while the F1 score for WPKL is low at 17% for the same reason. Furthermore, the annual variation of N9 shows a small range of streamflow, and the inability of SVM to capture any pattern in the N9 river's data resulted in a low F1 score of 34%.

Table 11 F1 score of SVM for five categories with balanced data method for various previous days.
Table 12 F1 score of SVM for ten categories with balanced data method for various previous days.

The SVM metrics, including average accuracy, average recall, average precision, and average F1 score, were calculated for each river in the five-class and ten-class scenarios using the balanced data method. In this method, the training samples were distributed evenly among all categories; the testing data, however, were imbalanced. The metrics shown in Tables 13 and 14 correspond to the best model selected according to the maximum F1 scores reported in Tables 11 and 12. The empty cells in the AUC column result from some classes being absent from the testing data even though they are present in the training data.

Table 13 Classification report of SVM for five categories with balanced data method.
Table 14 Classification report of SVM for ten categories with balanced data method.

As discussed earlier, the testing data were imbalanced even though the training data were balanced under the balanced data method. Therefore, accuracy alone is not enough to measure model performance, and the F1 score was calculated as well. Additionally, as is well known for machine learning classification, increasing the number of categories leads to a more complex classification problem and lower F1 scores.

Gradient boosting

The second set of experiments used gradient boosting to illustrate the influence of previously observed streamflow on classifying the streamflow value one day ahead. We compared various history lengths (numbers of previous days) in terms of F1 score, as shown in Tables 15 and 16 for five and ten categories, respectively, with the balanced data method; the maximum values are highlighted in bold. The F1 scores were calculated using different histories of streamflow values to predict one day ahead: the last one, three, five, seven, or fifteen days were evaluated to find the best F1 score for each river at each history length. The five-category scenario produced high performance, with maximum F1 scores of 80%, 83%, 82%, 78%, 64%, 79%, 67%, 80%, 76%, and 74% for Johor, Kedah, Kelantan, Melaka, N9, Pahang, Perak, Perlis, Selangor, and Terengganu, respectively. As noted for SVM, the small number of samples collected in WPKL (365 samples) explains the low F1 score there (34%). The ten-category scenario also produced good performance, with maximum F1 scores of 64%, 68%, 62%, 60%, 61%, 57%, 63%, 58%, and 58% for Johor, Kedah, Kelantan, Melaka, Pahang, Perak, Perlis, Selangor, and Terengganu, respectively, while the poor F1 score (19%) in WPKL is again due to the 365 samples available. The explanation given for the annual variation of N9 under SVM also applies to GB: owing to the narrow range of streamflow in the N9 river, GB was unable to identify the patterns, and the F1 score was consequently low (47%).

Table 15 F1 score of GB for five categories with balanced data method for various previous days.
Table 16 F1 score of GB for ten categories with balanced data method for various previous days.

Tables 17 and 18 show the classification report of GB for the five- and ten-category scenarios, respectively, with the balanced data method. In this method, the training samples were distributed evenly among all categories; the testing data, though, were imbalanced. For each river dataset, the macro averages of precision, recall, and F1 score, as well as the average accuracy, were computed. The empty cells in the AUC column result from some classes being absent from the testing data even though they are present in the training data. From these classification reports, it can be deduced that the performance of the GB model with the balanced data method was high; thus, GB was able to learn patterns from the observed streamflow values.

Table 17 Classification report of GB for five categories with balanced data method.
Table 18 Classification report of GB for ten categories with balanced data method.

Stacked ensemble

The third set of experiments demonstrates the impact of the history of previously observed streamflow on classifying the streamflow value one day ahead in each river using a stacked ensemble. With the balanced data method, SVM, GB, and the stacked ensemble were compared in terms of F1 score, as shown in Table 19 for the ten-category scenario. The best F1 values for the various streamflow histories of each river are highlighted in bold. The stacked ensemble shows higher performance than SVM and GB in terms of F1 score in the ten-category scenario, with 67%, 69%, 64%, 61%, 48%, 60%, 59%, and 59% for Johor, Kedah, Kelantan, Melaka, N9, Perak, Selangor, and Terengganu, respectively.

Table 19 Comparison between SVM, GB, and stacked ensemble for ten categories with balanced data method.

The QWK for the stacked ensemble was calculated for the ten-category scenario with the balanced data method, as shown in Table 20. The QWK values were above 0.82 for all rivers except N9 and WPKL; such high values (> 0.82) indicate almost perfect agreement between the actual and predicted classes, since QWK evaluates the similarity between classes rather than exact matches alone. The QWK of 0.807 in N9 indicates substantial agreement, and the QWK of 0.31 in WPKL indicates fair agreement. For several rivers, such as Johor, Kedah, Kelantan, and Pahang, the QWK exceeds 0.95, which reflects the superior performance of the stacked ensemble and its ability to learn informative patterns from the streamflow data of these rivers.

Table 20 QWK for stacked ensemble for ten categories with balanced data method.

Figure 8 shows the confusion matrix for each river using the stacked ensemble in the ten-class scenario. The high capability of the stacked ensemble to classify the streamflow values is clear from these confusion matrices. Since the categories in the streamflow prediction task are ordinal, QWK is an appropriate metric for measuring the model's success in classifying the data: the misclassifications of this model mostly involve predicting an incorrect class that is very close to the actual one. As mentioned before, the testing data were imbalanced, as can be seen in the confusion matrices in Fig. 8. Owing to the limited streamflow classes in its testing data, the Perlis river shows only four outputs. The poor findings for WPKL are due to the dearth of data from this river.

Figure 8
figure 8

Confusion matrix for each river using the stacked ensemble in the ten-class scenario for Johor, Kedah, Kelantan, Melaka, N9, Pahang, Perak, Perlis, Selangor, Terengganu, and WPKL, ordered from left to right and from top to bottom.

Long short-term memory

The results of the fourth set of experiments demonstrate how the LSTM classified the streamflow one day in advance, given the history of previously observed streamflow. As reported in Table 21, we used the balanced data method to compare various history lengths in terms of F1 score for the ten-category scenario when predicting one, three, and five days ahead. The ten-category scenario yielded strong results, with maximum F1 scores of 66%, 69%, 64%, 61%, 63%, 59%, 62%, 60%, and 57% for Johor, Kedah, Kelantan, Melaka, Pahang, Perak, Perlis, Selangor, and Terengganu, respectively. In contrast, the poor F1 score (16%) in WPKL is owing to the limited 365 samples. Because of the narrow range of streamflow in the N9 river, which also explains its annual variation, the LSTM was unable to capture the patterns, and the F1 score was low (47%).

Table 21 F1 score of LSTM for ten categories with balanced data method for various previous days.

The QWK for long short-term memory was calculated for the ten-category scenario with the balanced data method, as shown in Table 22. The QWK values were above 0.82 for all rivers except N9 and WPKL; such high values (> 0.82) indicate almost perfect agreement between the actual and predicted classes. The QWK of 0.79 in N9 denotes substantial agreement, while the QWK of 0.35 in WPKL implies fair agreement.

Table 22 QWK for LSTM for ten categories with balanced data method.

Classification of few days ahead

We added another experiment to explore the capability of the stacked ensemble to generalize and learn new patterns for predicting the category a few days ahead. The F1 scores for category prediction one to three time steps (days) ahead are shown in Table 23. Category prediction of the streamflow one day ahead (SF + 1) outperforms prediction two or three days ahead (SF + 2 and SF + 3) in terms of F1 score and QWK. Table 24 shows the QWK for predicting various numbers of days ahead. The F1 score and QWK for predictions three days ahead are not high because of complex hidden patterns that are not easily discovered in the n-days-ahead prediction task.

Table 23 F1 score of stacked ensemble for ten categories scenario to classify several days ahead.
Table 24 QWK of stacked ensemble for ten categories scenario to classify several days ahead.
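A sketch of how the supervised samples for SF + n classification can be constructed from a labeled series (the window length, horizon, and synthetic data are illustrative; in the actual experiments the inputs are the previous days' flows and the target is the class observed n days ahead):

```python
import numpy as np

def make_supervised(series, targets, n_lags=7, horizon=1):
    """Pair the previous n_lags values of `series` with the class observed
    `horizon` days later (SF + horizon)."""
    X, y = [], []
    for t in range(n_lags, len(series) - horizon + 1):
        X.append(series[t - n_lags:t])       # lagged inputs up to day t-1
        y.append(targets[t + horizon - 1])   # class at day (t-1) + horizon
    return np.array(X), np.array(y)

flows = np.random.default_rng(0).gamma(2.0, 50.0, size=100)  # synthetic daily flows
classes = np.digitize(flows, np.quantile(flows, np.linspace(0.1, 0.9, 9)))  # 10 classes

X1, y1 = make_supervised(flows, classes, n_lags=7, horizon=1)  # SF + 1
X3, y3 = make_supervised(flows, classes, n_lags=7, horizon=3)  # SF + 3
```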

Comparison between models for streamflow classification

Figures 9 and 10 depict the F1 scores of GB, SVM, and LSTM for classification two and three days ahead, respectively. In the ten-category scenario, LSTM outperformed GB and SVM for most rivers, including Johor, Kelantan, Melaka, Perak, Perlis, Selangor, and Terengganu. The models' capacity to anticipate the streamflow class n days ahead makes it possible to act proactively and avoid risks earlier. The LSTM was able to learn the mapping from the streamflow time series to the classes two or three days ahead better than SVM and GB. The relative performance of SVM and GB differs from one river to another: in some rivers SVM outperformed GB, and in others GB surpassed SVM in terms of F1 score.

Figure 9
figure 9

Comparison between GB, SVM, and LSTM in terms of F1 score for classification of 2 days ahead in ten classes scenario.

Figure 10
figure 10

Comparison between GB, SVM, and LSTM in terms of F1 score for classification of 3 days ahead in ten classes scenario.

In summary, the findings in this paper are summarized as follows:

1. The streamflow prediction task was formulated as a time series classification, with discrete ranges of values each representing a class, so that streamflow is classified into five or ten categories.
2. Prediction with five categories is more accurate than prediction with ten categories.
3. LSTM outperformed the other models in predicting n time steps of streamflow, because it learns the mapping from the streamflow time series to the classes two or three days ahead better than the support vector machine (SVM) and gradient boosting (GB).
4. Stacked ensemble learning of SVM and GB achieved higher performance than SVM and GB alone in terms of F1 score and quadratic weighted kappa.

Conclusion and future work

An investigation of streamflow regression recast as a classification machine learning approach has been described. Two scenario-based streamflow classifications were evaluated using four AI-based techniques, namely SVM, GB, LSTM, and an ensemble stacking model, in the majority of the main rivers in Malaysia. Forecasting multiple rivers is essential because it provides spatial forecast information for efficient basin-wide reservoir management. The findings demonstrated that, despite being applied to a streamflow classification problem, LSTM's memory-storing capabilities allow it to extract the temporal pattern from the streamflow time series, as evidenced by the highest F1 scores in the selected rivers. In addition, this work's findings could be exploited in any situation where a time series regression is to be transitioned to classification, provided that the forecast outputs are deterministic or mechanical (e.g., reservoir operation). The limitation of this streamflow prediction task is related to the uncertainty and complex hidden patterns present in each river; these patterns must be extracted well to achieve high performance and accuracy. This leads to the inability to build a single predictive model for all rivers at the same time: each river requires a specific predictive model that can fit its own patterns. For future work, we intend to explore recent attention-based deep learning models after collecting more streamflow data to improve prediction accuracy. The impact of dam construction on regional precipitation has been investigated in the literature, confirming the correlation between dam construction and regional precipitation48. This correlation analysis could be useful in our future work to explore the correlation between dam construction and streamflow level categories, which plays a significant role in planning water resources.