Background & Summary

Operational risk, after credit risk and market risk, is the third type of risk covered in the first pillar of the Basel II Accord1,2. It is defined as the risk of loss resulting from inadequate or failed internal processes, people, and systems, or from external events3. Over the past decades, scores of operational risk events have inflicted severe losses on financial institutions, triggering serious financial market turmoil, including bankruptcies4,5,6. A notable example is the 1995 debacle of Barings Bank: the United Kingdom's oldest bank filed for liquidation after a 1.3 billion USD loss from rogue trading, seizing international headlines and drawing the attention of banks and other financial institutions to operational risk7. To withstand the significant losses arising from risk events, and to formulate effective risk management strategies, regulatory authorities around the world require financial institutions to measure operational risk and allocate corresponding capital. Although institutions have made great efforts to collect operational risk data, operational risk databases worldwide remain insufficient, which is one of the main obstacles to the advancement of operational risk research8,9,10.

Among the various sources of operational risk data, public sources are nowadays of particular interest to financial institutions, commercial corporations, and academia. When operational risk events occur in financial institutions, they are usually disclosed through news reports, regulatory announcements, and other public channels. By tracking these sources and extracting key information such as time, money, and location from the publications, external loss databases can be constructed, which form one of the essential data foundations for operational risk management and measurement11,12. Such databases allow institutions to accumulate knowledge and experience from historical events, enabling them to set aside sufficient capital to safeguard against risks while maximizing the capital available for their operations. Nevertheless, collecting operational risk data from public media, which are mostly unstructured texts, is laborious work for financial institutions, commercial vendors, and researchers. The current manual collection method is time-consuming and labor-intensive, and it relies largely on expert experience, leading to issues such as untimely updates and inconsistent records13. For example, IBM Algorithmics maintains a team of multilingual researchers who gather public data to update the IBM Risk Case Studies database daily14.

In this paper, by extending and integrating various text mining methods, we construct a reliable external operational risk database for the Chinese banking industry, named the Public-Chinese Operational Risk Loss Database (P-COLD). To our knowledge, P-COLD is one of the largest cross-institution external operational risk databases in Chinese academia. It is also the first publicly available external operational risk database worldwide with the longest and most up-to-date time span, the most comprehensive fields, and the largest data volume15,16. An example of the construction procedure from a public data source is illustrated in Fig. 1 (the news text has been simplified for space; real-world cases are considerably more complicated). The procedure downloads news texts from major Chinese financial websites and selects those related to operational risk. Key fields of operational risk such as the exposure time and causal factor are then extracted from the news texts and composed into a database record. The result is an external operational risk database with 12 fields, containing 3,723 cross-institution operational risk loss records from 1986 to 2023.

Fig. 1 An example of the operational risk data extraction procedure from online news.

This is a challenging task for two main reasons. First, there is a massive volume of news text on the internet, and it is difficult to judge which items relate to operational risk. The news, for the most part, describes an event or a fact rather than pointing out operational risk explicitly. For example, when a bank account manager raises funds fraudulently, the news generally reports the process of the fraud rather than labelling it 'operational risk'. Moreover, the wide range of topics covered in financial news, such as stocks, securities, and macroeconomics, compounds the difficulty of identifying operational risk events17,18. At present, in both academia and industry, this process relies on experts manually identifying operational risk events from diverse information based on their professional experience. There is an urgent need for a more efficient method that identifies operational risk events automatically, liberating experts from this repetitive work19.

The second difficulty lies in accurately extracting key fields embedded in the unstructured texts. Operational risk news texts are raw news texts that cannot directly support risk modeling, quantification, and management. Although some operational risk fields are entities implicit in the text, such as the amount and time, and can be extracted by the named entity recognition (NER) method, further judgment and cleaning are indispensable because the same type of entity often appears more than once in the text. For example, the exposure time and end time are both temporal entities and must be identified from among all the times appearing in the text according to the context. Moreover, some key fields crucial for risk measurement and management, such as causal factors, business lines, and event types, cannot be extracted directly and must be inferred from the semantics of the text; they require more advanced and precise text analysis techniques to parse the content.

Therefore, by developing a novel text mining framework, we construct operational risk databases from public data sources more quickly and effectively, relieving experts of these tedious tasks. Within this framework, a new method, the Bidirectional Dictionary Method (BDM), is proposed to select operational risk events automatically based on features mined from operational risk news texts. Since the fields embedded in the texts cannot be captured by a single text analysis technique, we also integrate text mining methods such as NER and semantic analysis according to the textual characteristics of each field. The set of key fields in the database helps describe operational risk events more comprehensively and provides a reference for financial institutions to build operational risk databases in a more standardized way. The disclosure of this database aims to provide strong support for further statistical analysis and risk modeling, for developing scenarios for scenario analysis, for fostering a culture of internal risk management in institutions, and for other aspects of operational risk management and measurement.

Methods

The overall framework for database construction is divided into three steps in Fig. 2. First, financial news texts are downloaded from news websites. Second, a novel method, the BDM, is used to identify texts that may contain operational risk events. Two dictionaries are constructed to select operational risk events more accurately from two directions: the Operational Risk Dictionary (OpRDic) is designed to select news that might report an operational risk, while the Non-Operational Risk Dictionary (Non-OpRDic) is designed to re-identify non-operational risk events within the previously selected set. Third, in the information extraction step, we combine multiple text mining methods to extract key fields of differing complexity, including the NER method, a heuristic method, and semantic analysis; each operational risk field has a specific method for automatic extraction. Finally, the operational risk loss database is output. Since online news websites are among the most common sources for collecting external operational risk loss data, we use news in the following sections to illustrate the process of constructing an operational risk database from public data sources.

Fig. 2 The overview of the framework for operational risk database construction.

Online news collection and cleaning

The three largest Chinese vertical financial news websites are chosen as online news sources (https://www.hexun.com, https://jrj.com, and https://finance.china.com). The financial news information, including the news headline, news text, publication time, publication source, and web page URL, is recorded from these websites through web crawlers written in Python. The data source, time interval, and amount of news from each source are listed in Table 1. The 523,049 raw financial news texts downloaded comprise nearly all the historical news available since these websites were launched.

Table 1 The amount of news in the initial news database.

The data cleaning process contains two steps. The first step removes news items with incomplete text information. The second step removes news items with identical headlines, as different websites often republish the same news under the same headline. If multiple news texts share a headline, they are considered to describe the same event, and the one with the longest text body is retained. Finally, we obtain 484,256 Chinese online financial news texts published from 2004 to 2023 and refer to them as the initial news database. The cleaned text information of each event is stored in a unified format in the online texts database, and each record is assigned a unique number.
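As an illustration, the following minimal Python sketch implements this two-step cleaning, assuming the crawled news is held in a pandas DataFrame with hypothetical column names headline and text:

```python
import pandas as pd

def clean_news(df: pd.DataFrame) -> pd.DataFrame:
    """Two-step cleaning: drop incomplete records, then deduplicate
    identical headlines, keeping the longest text body."""
    # Step 1: remove news with incomplete text information.
    df = df.dropna(subset=["headline", "text"])
    df = df[df["text"].str.len() > 0]
    # Step 2: for identical headlines, keep the longest text body.
    df = df.sort_values("text", key=lambda s: s.str.len(), ascending=False)
    df = df.drop_duplicates(subset="headline", keep="first")
    # Assign each cleaned record a unique number.
    return df.reset_index(drop=True).rename_axis("record_id").reset_index()
```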

Operational risk news selection

The bank sections of major financial news websites report many types of events in the banking industry, such as investment and earnings announcements20, new products and services21, monetary policy22, and risk events23. One difficulty in selecting operational risk news is that the operational risk events are implicit in the facts stated in the text. The headlines and texts of operational risk news generally do not contain the phrase 'operational risk', so further analysis rather than intuitive identification is required.

Based on the observed characteristics of operational risk news, we propose the BDM method, which constructs two dictionaries to select operational risk news from two directions. As shown in Table 2 Panel a, the headlines of news reporting operational risk events contain certain characteristic words, such as 'Stolen', 'Lose', 'Punish', 'Overdraft', and 'Mistake', that signal the presence of an operational risk event. For example, when 'Stolen' appears in a headline in the banking section, the text often describes incidents such as stolen bank credit cards, Automated Teller Machine (ATM) theft, or money stolen by employees, all of which are typical operational risk events. Thus, the first step is to find the typical words of operational risk news, which constitute the OpRDic dictionary.

Table 2 Examples of news headlines on a Chinese financial website.

It is worth noting that some non-operational risk news would be mistaken for operational risk if selection relied only on whether headlines contain typical words. As shown in Table 2 Panel b, when a headline contains not only 'Stolen' but also 'Prevent', the news text mainly introduces how to prevent credit card theft rather than describing a loss event. Likewise, when operational risk keywords appear in the headline together with words indicating location or time range, such as 'National', 'The first half of the year', and 'The second quarter', the news text mainly reports statistical figures rather than specific loss events. It is therefore also necessary to construct a dictionary that explicitly marks headlines containing such words as non-operational risk events, i.e., the Non-OpRDic dictionary.

To make more accurate judgments on the news, we employ a database that has supported previous operational risk studies. Our research group has been collecting operational risk data manually from open sources such as newspapers, websites, books, surveys, and reports since 2003. The resulting dataset, the Chinese Operational Loss Database (COLD), is one of the most comprehensive Chinese external operational risk datasets and has been used in many studies quantifying operational risk24,25. In this paper, the benchmark samples preselected from COLD comprise 1,299 Chinese external operational risk entries with accurate and complete information. Each entry consists of the event headline, event text, exposure time, end time, person involved, loss amount, bank involved, province occurred, city occurred, causal factor, event type, and business line.

Before constructing the OpRDic and Non-OpRDic dictionaries, we pre-process the headlines of the online news and the COLD database, including unrelated text removal and word segmentation. Unrelated text such as punctuation and stop words is removed according to a custom stop-word list. Word segmentation cuts Chinese sentences into words using Jieba, a widely used Chinese text segmentation system known for its accuracy in semantics-based applications.
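A minimal sketch of this pre-processing step, assuming a small hypothetical stop-word list (the real list is project-specific):

```python
import jieba

# Hypothetical excerpt of the custom stop-word list.
STOP_WORDS = {"的", "了", "在", "，", "。", "、", "："}

def preprocess_headline(headline: str) -> list[str]:
    """Segment a Chinese headline with Jieba, dropping stop words
    and punctuation."""
    tokens = jieba.lcut(headline)  # word segmentation
    return [t for t in tokens if t.strip() and t not in STOP_WORDS]
```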

Beyond high-frequency words, the Term Frequency-Inverse Document Frequency (TF-IDF) method is used to measure the importance of each word in the headlines. TF-IDF represents the weight of a word against a specific background corpus and its usefulness for information retrieval. By calculating the TF-IDF weights of pre-marked operational risk news headlines, terms with high weight can be identified as keywords characterizing operational risk26,27. The pre-marked operational risk news headlines comprise the 1,299 benchmark sample headlines and 500 operational risk news items randomly selected from the online texts database and manually verified.

Assume the banking section of a website contains N news items, among which we pre-mark N1 operational risk news items manually. The N banking news items serve as the background corpus for calculating the IDF value of each word, and the N1 operational risk news items are used to calculate the TF of each word in the operational risk news headlines. The IDF value represents the importance of a term relative to the corpus and is inversely proportional to the frequency of the term across all documents. When calculating IDF, each news item is regarded as a document, the total number of documents in the corpus is N, and dfi is the number of documents containing word i. For the TF value, the N1 pre-marked operational risk news items are merged into one document; ni is the number of times word i appears in this document, and \({\sum }_{k}{n}_{k}\) is the total number of word occurrences in it. We calculate the TF-IDF of every word in the operational risk news using Eq. (1).

$${W}_{i}=TF\times IDF=\frac{{n}_{i}}{{\sum }_{k}{n}_{k}}\times \log \left(\frac{N}{d{f}_{i}}\right)$$
(1)

Since the N1 pre-marked operational risk news items are contained in the N news items, every word in the N1 headlines has a corresponding IDF value in the background corpus. The more important a word, the higher its TF-IDF value, reflecting the specific narrative characteristics of operational risk news headlines in the financial context. The custom financial IDF corpus is obtained according to Eq. (1) by calculating the IDF value of each word in 30,000 news headlines randomly selected from the raw news.
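The keyword scoring of Eq. (1) can be sketched as follows; build_idf, oprisk_keywords, and top_k are illustrative names, and the inputs are assumed to be headlines already segmented as above:

```python
import math
from collections import Counter

def build_idf(background_headlines: list[list[str]]) -> dict[str, float]:
    """IDF over the background corpus: each headline is one document."""
    n_docs = len(background_headlines)
    df = Counter()
    for tokens in background_headlines:
        df.update(set(tokens))  # document frequency per word
    return {w: math.log(n_docs / df_w) for w, df_w in df.items()}

def oprisk_keywords(oprisk_headlines: list[list[str]],
                    idf: dict[str, float], top_k: int = 50) -> list[str]:
    """Merge all pre-marked operational risk headlines into one document,
    score words by TF x IDF as in Eq. (1), and return top-k candidates."""
    tf = Counter(w for tokens in oprisk_headlines for w in tokens)
    total = sum(tf.values())
    scores = {w: (n / total) * idf.get(w, 0.0) for w, n in tf.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```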

To improve the accuracy of the selection results of the two dictionaries, we exploit the characteristics of operational risk headlines. Three classic evaluation indexes are used to scrutinize the quality of selection and adjust the keyword dictionaries: accuracy, precision, and recall. Accuracy is the proportion of correctly classified samples among all samples; that is, operational risk events correctly identified as operational risk plus non-operational risk events correctly identified as non-operational risk. Precision is the proportion of correctly identified operational risk samples among all samples the model classifies as operational risk. Recall is the proportion of correctly identified operational risk samples among all samples that should have been identified as operational risk. The formulas are shown in Eqs. (2–4):

$$Accuracy=\frac{TP+TN}{TP+TN+FP+FN}$$
(2)
$${Precision}=\frac{TP}{TP+FP}$$
(3)
$${Recall}=\frac{TP}{TP+FN}$$
(4)

Where True Positive (TP) is the number of positive samples (i.e., operational risk) correctly classified as positive, True Negative (TN) is the number of negative samples (i.e., non-operational risk) correctly classified as negative, False Positive (FP) is the number of negative samples misclassified as positive, and False Negative (FN) is the number of positive samples misclassified as negative.
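Given binary labels, these three indexes can be computed directly, for instance with scikit-learn; a small illustrative sketch with hypothetical labels:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical 0/1 labels (1 = operational risk).
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]

print(accuracy_score(y_true, y_pred))   # Eq. (2)
print(precision_score(y_true, y_pred))  # Eq. (3)
print(recall_score(y_true, y_pred))     # Eq. (4)
```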

We randomly select 200 news headlines from the online texts database in each round and test the selection performance. During this process, two professionals from industry and two from academia are invited to manually refine the keyword list based on their extensive experience. The industry professionals, with nearly two decades of experience in operational risk management, are key members of the risk management departments of well-known banks; the academic professionals have more than 10 years of research experience in the measurement and management of operational risk. Specifically, they help determine which initial OpRDic terms effectively identify operational risk news and which do not, despite being included in the dictionary, and they provide insights on typical terms that should enter the Non-OpRDic dictionary. Figure 3 records the accuracy, precision, and recall after each adjustment of the keyword list. After 37 rounds of tests and revisions, the selection precision stabilizes at 80% and the accuracy exceeds 95%. Finally, the OpRDic and Non-OpRDic dictionaries are obtained to select operational risk news more efficiently from two directions. Both dictionaries are available on GitHub (https://github.com/cy21peng/oprisk).

Fig. 3 The performances of OpRDic and Non-OpRDic dictionaries for selection.

Part of the keywords in the two dictionaries is shown in Table 3. The OpRDic dictionary contains typical words that flag news headlines related to operational risk, such as 'illicit money', 'compensate', 'embezzle', and 'debt'. Empirically, when words from the OpRDic and Non-OpRDic dictionaries appear in a news headline at the same time, the content does not report a specific operational risk loss event.

Table 3 Examples of OpRDic and Non-OpRDic dictionaries.

The BDM, illustrated in Fig. 4, works as a two-stage identification process. First, if any word in the OpRDic dictionary appears in a news headline, the news is likely to report an operational risk event and is regarded as operational risk news; otherwise it is regarded as non-operational risk news. Second, the Non-OpRDic dictionary is used to further check the pre-identified operational risk news: if any of its words appears in a headline previously identified as operational risk, the news is considered unrelated to operational risk and reclassified as non-operational risk. Eventually, the raw news is divided into an operational risk event dataset and a non-operational risk event dataset; the subsequent analysis in this paper focuses on the former.
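A minimal sketch of this two-stage check on a single segmented headline (the dictionary contents passed in are illustrative):

```python
def is_operational_risk(headline_tokens: list[str],
                        oprdic: set[str], non_oprdic: set[str]) -> bool:
    """Two-stage BDM check on a segmented headline."""
    tokens = set(headline_tokens)
    # Stage 1: the headline must contain at least one OpRDic keyword.
    if not tokens & oprdic:
        return False
    # Stage 2: re-identify as non-operational risk if any Non-OpRDic
    # keyword co-occurs (e.g. 'Prevent', 'National').
    return not (tokens & non_oprdic)
```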

Fig. 4 The overview of the Bidirectional Dictionary Method (BDM).

Key information extraction based on named entity recognition method

Key information such as the bank involved, province occurred, city occurred, occurrence time, end time, loss amount, and person involved provides a comprehensive description of an operational risk event and supports supervising operational risk from a macro perspective. These items are primarily entities embedded in the text, so we use named entity recognition technology to extract them.

The bank involved, province occurred, and city occurred are extracted through heuristic named entity recognition, implemented with dictionary matching techniques. To develop the lexical dictionaries, we collect the list of Chinese banking financial institutions from the China Banking and Insurance Regulatory Commission, and the list of Chinese administrative provinces and cities from the Ministry of Civil Affairs of China. The keys of the dictionaries are the full names of the institutions or regions, and their abbreviations are stored as the values. Examples of the lexical dictionaries for each field are shown in Table 4, and the full versions are available on GitHub (https://github.com/cy21peng/oprisk). If a VALUE appears in an operational risk event text, its corresponding KEY is recorded in the database.

Table 4 Examples of lexical dictionaries for heuristic named entity recognition.

We normalize and tokenize the target text and check whether each word or phrase matches any VALUE of any entity in the dictionary. If there is a match, the word is recognized as the entity of the KEY corresponding to that VALUE. Eventually, each target entity appearing in the text is extracted and associated with its official full name.
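A simple sketch of the dictionary matching, using a hypothetical two-entry excerpt of the bank lexicon and plain substring matching as a simplification of the normalize-and-tokenize step described above:

```python
# Hypothetical excerpt of the lexical dictionary: KEY = official full
# name, VALUES = abbreviations/aliases that may appear in the text.
BANK_DICT = {
    "中国工商银行": ["工商银行", "工行", "ICBC"],
    "中国建设银行": ["建设银行", "建行", "CCB"],
}

def match_entities(text: str, lexicon: dict[str, list[str]]) -> list[str]:
    """Return the official KEY for every KEY or VALUE found in the text."""
    hits = []
    for key, values in lexicon.items():
        if key in text or any(v in text for v in values):
            hits.append(key)
    return hits
```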

The Stanford Named Entity Recognizer (Stanford NER), one of the well-trained supervised named entity recognition systems, is applied to label time, money, and person. It is based on linear chain Conditional Random Field (CRF) sequence models and is particularly good at identifying the three types of named entities (person, organization, location) in sentences28. We apply it to extract the start year, start month, end year, end month, person involved, and loss amount fields.

The linear chain CRF is a supervised machine learning method that casts named entity recognition as a sequence labeling problem29 and generally provides high tagging accuracy. Let Y, X be random vectors, \(\theta =\{{\theta }_{k}\}\in {{\mathbb{R}}}^{K}\) be a parameter vector, and \({\{{f}_{k}(y,y{\prime} ,{x}_{t})\}}_{k=1}^{K}\) be a set of real-valued feature functions. The linear chain CRF is a distribution p(y|x) that takes the form shown in Eq. (5)30.

$$p(y|x)=\frac{1}{Z(x)}\mathop{\prod }\limits_{t=1}^{T}\exp \{\mathop{\sum }\limits_{k=1}^{K}{\theta }_{k}{f}_{k}({y}_{t},{y}_{t-1},{x}_{t})\}$$
(5)

Where Z(x) is an instance-specific normalization function, shown in Eq. (6).

$$Z(x)=\sum _{y}\mathop{\prod }\limits_{t=1}^{T}\exp \{\mathop{\sum }\limits_{k=1}^{K}{\theta }_{k}{f}_{k}({y}_{t},{y}_{t-1},{x}_{t})\}$$
(6)

An example of Stanford NER labeling an operational risk news text is presented in Fig. 5. Time entities are labeled 'DATE', money entities 'MONEY', and person names 'PERSON'. Although Stanford NER outputs the predefined category of each entity, not all entities contribute to database construction. For example, multiple temporal entities may be extracted from one news text, but only the exposure time and end time are useful. A report may also contain multiple amounts, which may include the total loss as well as the loss per offence committed by the suspect. It would be irrational to store all extracted entities directly in the database, so we propose a rule-based named entity recognition method.

Fig. 5 Example of Stanford NER labeling result.

Specifically, we choose the earliest of all times appearing in the document as the exposure time and the latest as the end time. Since the total amount involved in an incident is usually disclosed in the news, the largest amount of money appearing in the news is taken as the loss amount. The person involved is defined directly as the 'PERSON' entities identified by Stanford NER.
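These rules can be sketched as a post-processing step over the tagger output; parse_date and parse_amount below are hypothetical, deliberately simplified normalizers, and the (token, label) input format stands in for the real Stanford NER output:

```python
import re
from datetime import datetime

def parse_date(text: str) -> datetime:
    """Hypothetical normalizer: map a Chinese date string such as
    '2015年3月' to a comparable date (real texts need richer handling)."""
    m = re.search(r"(\d{4})年(?:(\d{1,2})月)?", text)
    return datetime(int(m.group(1)), int(m.group(2) or 1), 1)

def parse_amount(text: str) -> float:
    """Hypothetical normalizer: read an amount like '500万元' in CNY."""
    m = re.search(r"([\d.]+)\s*(万|亿)?", text)
    return float(m.group(1)) * {"万": 1e4, "亿": 1e8}.get(m.group(2), 1)

def apply_rules(tagged_tokens: list[tuple[str, str]]) -> dict:
    """Rule-based selection over NER output given as (token, label)
    pairs with labels DATE / MONEY / PERSON / O."""
    dates = [t for t, lab in tagged_tokens if lab == "DATE"]
    moneys = [t for t, lab in tagged_tokens if lab == "MONEY"]
    persons = [t for t, lab in tagged_tokens if lab == "PERSON"]
    return {
        # Earliest date -> exposure time; latest date -> end time.
        "exposure_time": min(dates, key=parse_date) if dates else None,
        "end_time": max(dates, key=parse_date) if dates else None,
        # The largest amount is taken as the total loss amount.
        "loss_amount": max(moneys, key=parse_amount) if moneys else None,
        "person_involved": persons or None,
    }
```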

To test the performance of the proposed named entity recognition method, we evaluate the average accuracy five times, each time randomly sampling 100 of the aforementioned recognition results. The average recognition accuracies of the key fields are shown in Table 5. Since not all risk events disclose specific months in the text, we do not evaluate the accuracy of the start month and end month fields in Table 5. The rule-based named entity recognition performs better, with average field accuracies above 80%; for example, the average accuracy of the province occurred field is 81.70%. Although the supervised entity recognition is slightly inferior, its lowest accuracy still reaches 68.57%.

Table 5 Accuracy of the proposed named entity recognition method.

Finally, it is worth noting that some news items have different headlines but report the same event. This paper therefore also uses the consistency of four fields for further cleaning. After extracting the key information of each event, if the 'person involved', 'bank involved', 'province occurred', and 'city occurred' fields of multiple events are identical, the events are considered to describe the same incident; only one of them is kept and the rest are deleted.
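Assuming the extracted records sit in a pandas DataFrame with hypothetical column names, this deduplication reduces to:

```python
import pandas as pd

def dedupe_events(events: pd.DataFrame) -> pd.DataFrame:
    """Treat records agreeing on all four fields as the same event
    and keep only one of them."""
    keys = ["person_involved", "bank_involved",
            "province_occurred", "city_occurred"]
    return events.drop_duplicates(subset=keys, keep="first")
```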

Key information discrimination based on the semantic analysis method

Analyzing causal factors, event types, and business lines enables financial institutions to examine operational risk in more detail and to better identify which areas or activities may lead to losses31,32. This supports more effective resource allocation, reducing potential losses while controlling costs. However, these key fields are not entities embedded in the texts and so cannot be obtained directly from the news; they usually require experts to determine them through careful reading and comprehension, which demands considerable expertise and entails significant time and effort. Therefore, we explore a variety of text analysis methods based on semantic content rather than human experience to classify these three key fields automatically.

The COLD database serves as the benchmark sample for training and selecting the best classification models for causal factors, event types, and business lines, respectively. The number of benchmark samples in each category of each key field is counted in Fig. 6. The categories follow the standards recommended by the Basel II Accord. The 'No business line' category in Fig. 6(c) covers events, such as bribery, that cannot be assigned to any of the eight business lines. Because some categories contain too few samples, feature learning may be inadequate during training, leading to poor classification performance; we therefore merge the categories containing fewer events. Specifically, the causal factor field comprises 'Personnel', 'External events', and the remainder classified as 'Other factors'. The event type field comprises 'Internal fraud', 'External fraud', and 'Other types', where 'Other types' encompasses 'Execution, delivery & process', 'Damage to physical assets', 'Clients, products & business', 'Business disruption and system', and 'Employment practices and workplace safety'. The business line field comprises 'Retail banking', 'Commercial banking', and the remaining business lines classified as 'Other lines'.

Fig. 6 The statistics of three fields in the COLD database. (a) the statistics of the causal factor field; (b) the statistics of the event type field; (c) the statistics of the business line field.

The text describing an operational risk event contains the information needed to determine the above three key fields, such as word phrases or sentence structures, so assigning an event to its appropriate category can be treated as a text classification task: classification methods learn the relationship between such keywords and categories from a large amount of text, thereby allocating events to predefined categories. As most text classification algorithms operate in numerical space and cannot directly handle raw text, before applying them we cut every news text into a sequence of words and convert it into a vector of word weights, represented as the Vector Space Model (VSM). This transforms the text into numerical data that classification models can process while retaining most of the information in the text.

Assume M documents, each consisting of N words \({\boldsymbol{Doc}}=({w}_{1},{w}_{2},\cdots ,{w}_{N})\). We reform each document into a sequence composed of the single words and the combinations of each word with the following word: \({\boldsymbol{seq}}=\left({w}_{1},{w}_{2},\cdots ,{w}_{N},{w}_{1}{w}_{2},{w}_{2}{w}_{3},\cdots ,{w}_{N-1}{w}_{N}\right)\). The feature sequence of a single document is obtained by calculating the TF-IDF weight of every component, \({\boldsymbol{feature\_seq}}=\left({W}_{1},{W}_{2},\cdots ,{W}_{N},{W}_{N+1},\cdots ,{W}_{2N-1}\right)\), and the feature matrix R of all documents is defined in Eq. (7).

$$R=\left(\begin{array}{ccccccc}{W}_{1,1} & {W}_{1,2} & \cdots & {W}_{1,N} & {W}_{1,N+1} & \cdots & {W}_{1,2N-1}\\ {W}_{2,1} & {W}_{2,2} & \cdots & {W}_{2,N} & {W}_{2,N+1} & \ldots & {W}_{2,2N-1}\\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ {W}_{M,1} & {W}_{M,2} & \cdots & {W}_{M,N} & {W}_{M,N+1} & \cdots & {W}_{M,2N-1}\end{array}\right)$$
(7)

The benchmark sample includes the feature matrix R and the causal factor, event type, and business line of each corresponding operational risk event. The COLD samples are divided into training and test samples at a ratio of 8:2. To choose the best classifier for the three field extractions, we compare a set of text classifiers commonly applied in existing text classification studies19,33: K-Nearest Neighbour (KNN), Logistic Regression (LR), Random Forest (RF), Multinomial Naive Bayes (MNB), Support Vector Machine (SVM), and Multilayer Perceptron (MLP). Macro-average and Weighted-average evaluation criteria are used to analyze model performance34,35. The optimal hyperparameters are determined by grid search, and the evaluation indicators are obtained after 5-fold cross-validation. The number of neighbors for the KNN model is 8. The LR model is L2-regularized with regularization parameter C of 5. The smoothing parameter alpha of the MNB model is 1. The RF model has 30 estimators, a max depth of 94, a minimum of 94 samples per leaf node, and a minimum of 2 samples required to split an internal node. The SVM model uses a linear kernel with regularization parameter C of 10 and gamma of 0.001. The MLP model uses ReLU activation, a hidden layer size of 60, and a regularization term of 1e-4.
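For illustration, a minimal scikit-learn sketch of one such classifier, combining unigram-plus-bigram TF-IDF features (matching the construction in Eq. (7)) with the MLP hyperparameters reported above; the three sample texts and labels are illustrative stand-ins for the COLD benchmark:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Illustrative stand-ins for the benchmark: pre-segmented texts
# (space-joined Jieba tokens) and their causal-factor labels.
texts = ["银行 客户经理 挪用 资金", "ATM 机 被 盗", "系统 故障 导致 损失"]
labels = ["Personnel", "External events", "Other factors"]

clf = make_pipeline(
    # Unigram plus adjacent-bigram TF-IDF features, as in Eq. (7).
    TfidfVectorizer(ngram_range=(1, 2)),
    # Hyperparameters as reported in the text for the MLP.
    MLPClassifier(hidden_layer_sizes=(60,), activation="relu",
                  alpha=1e-4, max_iter=500),
)
clf.fit(texts, labels)
print(clf.predict(["员工 伪造 票据"]))
```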

The method with the best performance is used to extract the fields. Macro-average calculates the effectiveness for each category separately and then takes the arithmetic mean over categories as the effectiveness on the test set. The calculations of Macro-averaged Precision, Macro-averaged Recall, and Macro-averaged F1-score are shown in Eqs. (8–10).

$${\rm{Macro}} \mbox{-} \mathrm{average}\,\mathrm{Precision}:Macro\_P=\frac{1}{n}\mathop{\sum }\limits_{i=1}^{n}Precisio{n}_{i}$$
(8)
$$\mathrm{Macro} \mbox{-} \mathrm{average}\,\mathrm{Recall}:Macro\_R=\frac{1}{n}\mathop{\sum }\limits_{i=1}^{n}{Recall}_{i}$$
(9)
$$\mathrm{Macro} \mbox{-} \mathrm{averaged}\,{\rm{F}}1 \mbox{-} \mathrm{score}:Macro\_{F}_{1}=\frac{2\times Macro\_P\times Macro\_R}{Macro\_P+Macro\_R}$$
(10)

As shown in Fig. 6, the sample categories are imbalanced. Considering only the Macro-average would give equal weight to each class, capturing average per-class performance while ignoring the influence of differing sample quantities on the classification results. Therefore, we also report the Weighted-average, which weights each class by its sample quantity, allowing classes with larger sample sizes to have a greater impact on the evaluation metrics and better reflecting the models' overall performance on the entire dataset. Define wi as the proportion of the number of samples in class i to the total number of samples. Weighted-averaged Precision, Weighted-averaged Recall, and Weighted-averaged F1-score are calculated according to Eqs. (11–13).

$$\mathrm{Weighted} \mbox{-} \mathrm{average}\,\mathrm{Precision}:Weighted\_P=\mathop{\sum }\limits_{i=1}^{n}Precisio{n}_{i}\times {w}_{i}$$
(11)
$$\mathrm{Weighted} \mbox{-} \mathrm{average}\,\mathrm{Recall}:Weighted\_R=\mathop{\sum }\limits_{i=1}^{n}{Recall}_{i}\times {w}_{i}$$
(12)
$$\mathrm{Weighted} \mbox{-} \mathrm{average}\,{\rm{F}}1 \mbox{-} \mathrm{score}:Weighted\_{F}_{1}=\frac{2\times Weighted\_P\times Weighted\_R}{Weighted\_P+Weighted\_R}$$
(13)
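Both averaging schemes are available in scikit-learn; note that scikit-learn's macro and weighted F1 average the per-class F1 scores, whereas Eqs. (10) and (13) combine the averaged precision and recall, so the two conventions can differ slightly. A sketch with hypothetical labels:

```python
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical multi-class labels for one field.
y_true = ["Personnel", "Personnel", "External events", "Other factors"]
y_pred = ["Personnel", "Other factors", "External events", "Other factors"]

macro = precision_recall_fscore_support(y_true, y_pred, average="macro")
weighted = precision_recall_fscore_support(y_true, y_pred, average="weighted")
print(macro, weighted)  # precision, recall, F1, support for Eqs. (8)-(13)
```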

The classifications of causal factor, event type, and business line are three independent multi-class tasks. Table 6 reports the mean and standard deviation of model performance over 5-fold cross-validation. Through a comprehensive trade-off among the criteria, the MLP model delivers the best classification performance. For causal factors, the MLP model performs best, with an Accuracy of 0.784 and the highest values of the other evaluation indicators across the models. For event types, the MLP model has the best Accuracy, Macro-P, Weighted-P, and Weighted-R at 0.766, 0.764, 0.762, and 0.766; on Macro-R, Macro-F1, and Weighted-F1 it ranks second, with relative decrements of only 0.034, 0.044, and 0.006 compared with the best-performing RF model on those criteria. For business lines, the MLP model has the best Accuracy, Macro-P, Weighted-P, Weighted-R, and Weighted-F1 at 0.728, 0.678, 0.725, 0.728, and 0.698, and ranks second on the remaining criteria. We therefore apply the MLP model to classify causal factor, event type, and business line, respectively.

Table 6 The performance criteria of text classification models for each field.

Compared with the small datasets that took years to collect in previous studies, the automated methods proposed in this paper can collect data in real time and at larger volume. The framework can continuously crawl operational risk news from various news websites and automatically extract key information for storage in the database. Additionally, to guard against model errors in field categorization and ensure the data are reliable for operational risk research, we invited four professionals from industry and academia to manually verify each field, ensuring the completeness, authenticity, and accuracy of every record. In particular, they manually subdivided 'Other factors' in the causal factor field, 'Other types' in the event type field, and 'Other lines' in the business line field according to the definitions in the Basel II Accord; that is, the categories merged before text classification were further divided into the specific categories shown in Fig. 6. Detailed information about these four professionals was introduced in the Operational Risk News Selection section.

Data Records

Finally, 3,723 operational risk loss records from 1986 to early 2023 are collected from public sources, greatly exceeding the manual accumulation of previous decades. They form the P-COLD database, available on the figshare data repository36. Each record includes 12 fields that describe the operational risk event comprehensively yet succinctly. For privacy protection, we cannot disclose the specific case description, person involved, and bank involved fields. The description of each field in P-COLD and corresponding examples are shown in Table 7.

Table 7 Description and examples of key fields in the constructed database.

We compare the P-COLD database with other operational risk databases established in the existing literature; their time windows and numbers of loss events are summarized in Table 8. To our knowledge, P-COLD is one of the largest Chinese cross-institution external operational risk databases in the academic community. Spanning 1986 to 2023 and offering the most comprehensive fields and the largest amount of data, it is significantly larger, with a longer time window, than the others.

Table 8 The comparison of existent operational risk databases.

Technical Validation

Existing studies show that the loss amounts in operational risk databases often exhibit high-peak and fat-tail statistical features37,38. We therefore conduct statistical analysis to test the quality of the constructed database. The statistics of the loss severity are shown in Table 9. The standard deviation reaches 704,614, indicating a high degree of dispersion and significant differences among loss severities. The skewness is 54, far above the zero skewness of the normal distribution, indicating a pronounced right skew. The kurtosis reaches 3,088, much larger than the normal distribution's kurtosis of 3, showing the 'high peak' characteristic. These results indicate that the distribution of operational risk loss severity has an obvious high peak and fat tail, consistent with the common judgment of operational risk data characteristics in existing studies39.

Table 9 The statistical description of the loss amount of the P-COLD database (Unit: 10,000 CNY).

According to the existing literature, the frequency of operational risk events often follows certain mathematical distributions. We fit the loss frequency with a variety of common distributions24,40, including the Poisson, negative binomial, and geometric distributions, and use the Kolmogorov-Smirnov (KS) test to select the best-fitting distribution. For the KS test, the larger the P value, the better the distribution fits the data; the confidence level is set at 5%. The KS test results reveal that the negative binomial distribution outperforms the other two distributions for all causal factors, event types, and business lines. For space considerations, only the parameter estimates for the negative binomial distribution and the goodness-of-fit results are presented in Table 10. Because the categories obtained by subdividing 'Other factors', 'Other types', and 'Other lines' contain too few entries to support precise distribution fitting and parameter estimation, we report only the estimates from before these categories were subdivided. The P values for the 'Personnel', 'External event', and 'Other factors' causal factors are 0.125, 0.613, and 0.106; for the 'Internal fraud', 'External fraud', and 'Other types' event types, 0.053, 0.630, and 0.325; and for the 'Retail banking', 'Commercial banking', and 'Other lines' business lines, 0.141, 0.336, and 0.279, all above 5%. This indicates that the constructed database conforms to the distributions of operational risk data found in existing research.

Table 10 Best-fitting distributions for frequency and the goodness-of-fit test results.
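A sketch of this fitting and testing procedure for one category, using method-of-moments estimates of the negative binomial parameters and hypothetical yearly counts; note that the KS test is formally designed for continuous distributions, so applying it to counts, as is common in this literature, is an approximation:

```python
import numpy as np
from scipy import stats

# Hypothetical yearly event counts for one category (e.g. 'Internal fraud').
counts = np.array([3, 7, 12, 9, 15, 6, 11, 8, 14, 10])

# Method-of-moments estimates for the negative binomial (requires var > mean):
# mean = n(1-p)/p and var = n(1-p)/p^2 imply p = mean/var, n = mean*p/(1-p).
m, v = counts.mean(), counts.var(ddof=1)
p = m / v
n = m * p / (1 - p)

# KS goodness-of-fit: a p-value above 0.05 means the fit is not rejected.
d_stat, p_value = stats.kstest(counts, stats.nbinom(n, p).cdf)
print(f"KS statistic={d_stat:.3f}, p-value={p_value:.3f}")
```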

Overall, the collection of global operational risk data currently relies heavily on manual labor, resulting in a severe shortage of publicly accessible operational risk databases for risk analysis and significantly hindering the development of research in this field. In this paper, we introduce a new data collection framework that leverages text mining methods to replace the labor-intensive manual collection process: operational risk news is gathered automatically from web pages, and key information is analyzed and extracted. This process creates and expands the Public-Chinese Operational Risk Loss Database (P-COLD) for financial institutions. Each record in P-COLD includes 12 critical pieces of information, such as the start time, loss amount, and business line, providing a comprehensive description of operational risk events. With 3,723 records spanning 1986 to 2023, P-COLD has become one of the largest and most comprehensive cross-institution external operational risk databases in China. Its loss frequency can be fitted by the distributions commonly used for operational risk data and passes the KS test, so it can assist in operational risk capital calculation, dependence analysis, and institutional internal control. In addition, the proposed framework for automatic data collection is universal, and each step can be quickly replicated for collecting operational risk data in other fields such as insurance and securities.

This study still has several limitations. First, the framework could be extended to more countries: in our empirical work we use only Chinese news texts to extract operational risk events, and the framework could serve the operational risk management of other countries in the future. Second, the granularity of feature extraction can be further improved: owing to the limited amount of training data, we merged some sparsely populated categories to ensure accuracy when extracting causal factors, event types, and business lines. Despite these limitations, our research has constructed a large, publicly accessible database with accurate fields, which can significantly contribute to advancing operational risk research.