Introduction

Cascading failures and the recovery from them are among the most popular research directions in network science. Recently, percolation theory has been widely used for modeling cascading failures in interdependent networks, where failures propagate among networks due to predefined dependency links1,2,3,4,5,6,7,8,9. Overload-triggered cascades in single or coupled networks have also been the subject of much work in the past decade10,11,12,13,14,15,16,17,18,19,20,21. Besides the above-mentioned models, other models such as k-core cascades and sandpile models have also been employed for understanding failure propagation and system collapse22,23,24,25,26,27. Based on these modeling frameworks of cascading failures, different approaches for system repair have also been studied. Most of these works include rules for restoring nodes that fail during the cascading failure process28,29,30,31,32,33. For example, A. Majdandzic, et al. in 2014 presented a model where a node recovers from an internal or external failure after a fixed period of time. This model leads to an interesting phase-flipping phenomenon, as well as strong hysteresis behavior28. This model was later extended by using a randomized recovery method31. More recently, M.A. Di Muro, et al. studied a node repairing strategy for interdependent random networks, where a failed node can be repaired with a certain probability if it is a part of the current giant connected components32. A. Majdandzic, et al. further studied the cascade and node recovery model for multi-layer interacting networks and also investigated the optimal repairing strategy for a collapsed coupled system33.

Many cascading failure models exhibit the interesting phenomenon of “critical slowing down”: systems near criticality can experience a much longer cascading process (the so-called “plateau stage”), which is sensitive to noise, before a final total collapse1,34,35,36. For example, D. Zhou, et al. studied the branching process behind the critical cascading failures in interdependent networks, and showed the critical/non-critical scaling rules of the total cascade length34. In addition, G.J. Baxter, et al. studied the critical and non-critical dynamic processes in the k-core pruning model35. Recently, D. Lee, et al. presented a universal model for hybrid percolation transitions and investigated the resulting critical cascading process36. Most of these studies focused on interpreting the duration of the critical slowing down phase. Further, early warning indicators for system transitions based on critical slowing down have already been evaluated for many real systems37,38,39,40,41,42. This technique has also been used for predicting system collapse in cascading failure models. For example, B. Podobnik, et al. studied indicators to predict total collapses in a cascading failure model on random networks43.

Although the critical slowing down phenomenon has been leveraged to provide indicators of impending cascades, there is still an important open question: how to restore the system after an early warning has been recorded? In this work, we investigate this question. To this end, we systematically explore several system recovery strategies applied after observing an early warning of a total system crash. We base our work on the recently proposed model of cascading failures by Y. Yu, et al.44. This cascading failure model is an extension of the k-core cascade, where a node will be removed from the network with a probability f if it has fewer than k s connections, or it has lost more than a fraction q of its original neighbors. Further, as in ref. 43, we employ the moving standard deviation (MVSD) of the remaining system size time series as an early indicator of an impending cascade. We then compare five different node-addition based recovery strategies and study the effect of response time delay on system recovery. We find that, for homogeneous Erdős-Rényi (ER) networks, an earlier node addition leads to a larger survival ratio. However, for scale-free (SF) networks, a delayed recovery can be better in some cases. We also find, for ER networks, that it is always better to connect the newly added nodes to existing nodes in a uniformly random manner. However, for SF networks, a roulette selection based on each node’s original degree (or its reciprocal) can perform better for earlier node additions. These results provide insights on how to save a system that has been predicted to collapse.

Results

Cascading failure model and recovery strategies

In this work, we follow the KQ modeling framework of system crash introduced by Y. Yu, et al.44. A node will be removed from the system with a probability f if its current degree is smaller than a threshold k s or it has lost more than a fraction q of its original neighbors. The fraction of remaining nodes is used as a measure of the system robustness. The KQ model exhibits an interesting behavior for certain parameter values, where systems experience a slow cascading failure process in a plateau stage (pseudo-steady state) before an abrupt total collapse. In the following, we focus on cases with sudden total collapse after a pseudo-steady state. Our goal is to investigate early warning indicators and compare system recovery strategies. We focus on two cases: ER networks with 〈k〉 = 20, k s = 11, q = 0.09 and f = 0.1, and SF networks with γ = 1.8, k s = 5, q = 0.39 and f = 0.2. These parameter values are inspired by the values used by Y. Yu, et al., who in turn based their choice of parameter values on measurements from real-world systems. We show 30 realizations of the cascading failure process for both ER and SF networks in Fig. 1. S(t), t = 1, 2, … denotes the proportion of remaining nodes at time step t. For both cases, the system is near criticality and has a plateau stage (pseudo-steady state) before reaching the final state (a total collapse or surviving near the plateau). Comparing Fig. 1(a) and Fig. 1(b), we find that the ER case has a plateau stage at around \(S(t)\sim 1\), while the SF case has a plateau at around \(S(t)\sim 0.2\). The latter has a much lower plateau stage, since the heterogeneous degree distribution leads to more failures at the beginning compared to ER networks. For ER networks, as long as the mean degree is significantly larger than the threshold k s , the system will have very few failures at the early time steps. In other words, the system size does not significantly change, which results in the observed plateau stage around 1.
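
The removal rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used for the simulations: the adjacency-dictionary representation, the function name kq_step, and the synchronous update (all nodes are tested against the same snapshot before any removal takes effect) are choices made here for clarity.

```python
import random

def kq_step(adj, k0, k_s, q, f, rng=random):
    """Apply one synchronous KQ step in place; return the set of removed nodes.

    adj : dict mapping each surviving node to the set of its surviving neighbours
    k0  : dict mapping each node to its degree in the original network
    (Both containers and the synchronous update are assumptions of this sketch.)
    """
    removed = set()
    for node, nbrs in adj.items():
        k = len(nbrs)
        # Fraction of original neighbours that have already been lost.
        lost = (k0[node] - k) / k0[node] if k0[node] > 0 else 0.0
        vulnerable = k < k_s or lost > q
        # A vulnerable node leaves with probability f.
        if vulnerable and rng.random() < f:
            removed.add(node)
    # Remove the selected nodes and detach them from surviving neighbours.
    for node in removed:
        for nbr in adj.pop(node):
            if nbr in adj:
                adj[nbr].discard(node)
    return removed
```

Calling kq_step repeatedly and recording len(adj) divided by the original network size after each call gives a time series that plays the role of S(t).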

Figure 1

Examples of the cascading failure process near criticality. (a) 30 examples of the cascading failure process for ER networks with N = 1000, 〈k〉 = 20, k s  = 11, q = 0.09, and f = 0.1. (b) Similar to (a) but for SF networks with γ = 1.8, k s  = 5, q = 0.39, and f = 0.2.

In order to provide early warning indicators of a total collapse during the plateau stage, we need to capture both the beginning and the end of the plateau stage. To do this, we first define the moving standard deviation (MVSD) of S(t), MVSD(t), as the standard deviation (SD) of S(t) in time windows with length 5: S(t − 4), S(t − 3), …, S(t). For t ≤ 4, the first t values of S(t) will be used to calculate the MVSD instead. Note that the time series after the current time step t is not used, since we aim to provide early warning prediction based on historical records. We use a window length of 5 for calculating the MVSD, because some realizations for the SF network, as shown in Fig. 1(b), can reach a total collapse within 20 time steps.
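
As a minimal sketch of this definition, the function below computes MVSD(t) from a list of S(t) values using only past and current values. The text does not say whether the population or the sample standard deviation is meant; the population form (statistics.pstdev) is assumed here so that a window holding a single value is well defined. The function name moving_sd is hypothetical.

```python
from statistics import pstdev

def moving_sd(series, window=5):
    """Return [MVSD(1), ..., MVSD(T)] for series = [S(1), ..., S(T)].

    For t < window, only the first t values are used. The population
    standard deviation is an assumption of this sketch.
    """
    out = []
    for t in range(1, len(series) + 1):
        start = max(0, t - window)        # index of S(t - window + 1), clipped at S(1)
        out.append(pstdev(series[start:t]))
    return out
```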

We define the beginning of the plateau stage as Tstart = 1 for the ER network case. For the SF network case, Tstart is defined as the time step where MVSD(t) becomes smaller than 0.01 for the first time. This threshold is motivated by the observation that the MVSD will become smaller than 0.01 during the plateau stage in most cases. The end of the plateau stage, Tpred, is defined as the first time step where MVSD(Tpred) > mean(MVSD(t = Tstart, …, Tpred − 1)) + 3 · SD(MVSD(t = Tstart, …, Tpred − 1)). This definition is inspired by the fact that systems tend to have a continuously increasing SD when leaving the pseudo-steady state.
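
The rule for Tpred can be written down directly. The sketch below assumes that the MVSD series is stored as a list whose entry i holds MVSD(i + 1), and that the SD of the MVSD history is the population standard deviation; both are assumptions, as are the names predict_plateau_end and n_sigma.

```python
from statistics import mean, pstdev

def predict_plateau_end(mvsd, t_start=1, n_sigma=3.0):
    """Return T_pred (1-based), or None if the warning never fires.

    mvsd[i] is MVSD(i + 1). The warning fires at the first t > t_start with
    MVSD(t) > mean(history) + n_sigma * SD(history),
    where history = MVSD(t_start), ..., MVSD(t - 1).
    """
    for t in range(t_start + 1, len(mvsd) + 1):
        history = mvsd[t_start - 1:t - 1]
        threshold = mean(history) + n_sigma * pstdev(history)
        if mvsd[t - 1] > threshold:
            return t
    return None
```

Because the history grows with t, the threshold adapts to the noise level observed during the plateau rather than relying on a fixed cut-off.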

Following the prediction of the start and end of the plateau stage, we try to restore the system by adding N a new nodes at time step t = Tpred + Tdelay, where Tdelay ≥ 1 defines the time delay of the node addition process. Each of the additional nodes has k a connections to k a remaining nodes. If there are fewer than k a remaining nodes, all of them will be connected to each additional node. Next we discuss different strategies for wiring the newly added nodes.

“Uniformly random selection”: at time step Tpred + Tdelay, each additional node is connected to k a uniformly randomly sampled remaining nodes.

“Largest degree selection”: at time step Tpred + Tdelay, each additional node is connected to k a remaining nodes that had the largest degree values in the original network.

“Smallest degree selection”: at time step Tpred + Tdelay, each additional node is connected to k a remaining nodes that had the smallest degree values in the original network.

“Roulette selection”: at time step Tpred + Tdelay, each additional node is connected to k a randomly selected remaining nodes, and the probability that a remaining node is selected is proportional to its degree in the original network.

“Anti-roulette selection”: at time step Tpred + Tdelay, each additional node is connected to k a randomly selected remaining nodes, and the probability that a remaining node is selected is proportional to the reciprocal of its original degree.
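
The five rules above can be summarized in one hypothetical helper that picks the targets for a single new node. The function name choose_targets and the rule labels are illustrative. The roulette and anti-roulette branches are assumed to draw with replacement, so one remaining node can be drawn several times and fewer than k a distinct targets may result; original degrees are assumed to be at least 1.

```python
import random

def choose_targets(remaining, k0, k_a, rule, rng=random):
    """Pick the remaining nodes that one newly added node connects to.

    remaining : iterable of surviving nodes
    k0        : dict of degrees in the original network (assumed >= 1)
    """
    nodes = list(remaining)
    if len(nodes) <= k_a:                  # too few nodes left: connect to all
        return set(nodes)
    if rule == "uniform":
        return set(rng.sample(nodes, k_a))
    if rule == "largest":
        return set(sorted(nodes, key=lambda n: k0[n], reverse=True)[:k_a])
    if rule == "smallest":
        return set(sorted(nodes, key=lambda n: k0[n])[:k_a])
    if rule == "roulette":
        weights = [k0[n] for n in nodes]
    elif rule == "anti-roulette":
        weights = [1.0 / k0[n] for n in nodes]
    else:
        raise ValueError(f"unknown rule: {rule}")
    # k_a weighted draws with replacement; repeated draws collapse in the set.
    return set(rng.choices(nodes, weights=weights, k=k_a))
```

Note that the two deterministic rules return the same k a targets for every new node, whereas the randomized rules spread the new links over the remaining network.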

We use a threshold d for the fraction of remaining nodes, S(t), to determine if one realization of cascading failures in simulation has a total collapse. d is set to 0.5 and 0.1 for the ER and SF networks, respectively. These thresholds correspond to half the system sizes at the pseudo-steady states. For each realization with a total collapse, we repeat the node addition independently M a times, and calculate a survival ratio over these M a tests, η, which is the number of trials without total collapses divided by M a . We also find the time step, t = T d , where S(t) decreases to below the threshold d, after each trial of node addition with a total collapse. We repeat the above process for M realizations. To illustrate the above mentioned processes of total collapse prediction and mitigation via adding new nodes, we show in Fig. 2 examples for an ER network and a SF network. For both examples, we use N a  = 100 and M a  = 10. We, however, use different k a , Tdelay values depending on the network: k a  = 30, Tdelay = 6 for the ER network; and k a  = 8, Tdelay = 5 for the SF network. These parameter values were carefully chosen to ensure that we do not end up with the extreme survival ratio η of 0 or 1. For node addition, we follow the uniformly random selection rule. The ER network survived in 8 out of the 10 trials, while 9 of them survived in the SF case (see the middle and lower panels of Fig. 2(a,b)). Therefore, the survival ratios for these two examples are 0.8 and 0.9, respectively.
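
The bookkeeping for η and T d is simple enough to state as code. The sketch below assumes each trial is stored as the list of S(t) values recorded after node addition; the function names are hypothetical.

```python
def collapse_time(series, d):
    """Return the first 1-based step at which S(t) < d, or None if it never happens."""
    for t, s in enumerate(series, start=1):
        if s < d:
            return t
    return None

def survival_ratio(trials, d):
    """Fraction of trials whose S(t) never drops below the threshold d."""
    survived = sum(1 for series in trials if collapse_time(series, d) is None)
    return survived / len(trials)
```

With 8 surviving trials out of M a = 10 and d = 0.5, survival_ratio returns 0.8, matching the ER example above.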

Figure 2

Examples of the system recovery after the early warning indicator. (a) One example of system recovery for ER networks with N = 1000 and 〈k〉 = 20. The uniformly random selection rule is used. M a = 10, k s = 11, q = 0.09, f = 0.1, N a = 100, and k a = 30. The upper panel shows the variation of S(t) (red line with circles) as well as the corresponding moving SD series (blue line with crosses). The black vertical line indicates the location of Tpred. The middle panel shows all the 10 new time series (blue dashed lines) of S(t) after the node addition with Tdelay = 6. The lower panel shows the 8 trials of node addition (red lines) without total collapses among the M a = 10 trials in total. The threshold for determining a total collapse is d = 0.5. (b) Similar to (a) but for SF networks with γ = 1.8, k s = 5, q = 0.39, f = 0.2, N a = 100, k a = 8 and Tdelay = 5. For this example, there are 9 trials of node addition without total collapses within the 10 trials. The threshold for determining a total collapse is d = 0.1.

Comparisons of node addition rules

In the following, we investigate how different node addition rules impact the ability to recover a system with an impending total collapse for both ER and SF networks. We also study the role of the time delay Tdelay.

First, we focus on the ER network case with N a  = 100 and different values of k a . Fig. 3(a–e) show how the mean survival ratio 〈η〉 varies for different values of k a as we vary Tdelay, for the five different approaches of node addition. For example, according to Fig. 3(a,d,e), for the three randomized selection rules, the survival ratio decreases from 1 to around 0 as Tdelay increases or as k a decreases. However, Fig. 3(b,c) show that, for the largest degree and smallest degree selection rules, the system has much lower survival ratios. This is because all the N a k a additional links are added between N a new nodes and the k a remaining nodes with the largest or smallest original degrees. This will lead to a final state, with around N a  + k a nodes, smaller than the threshold d = 0.5. We also notice that for the roulette/anti-roulette selection, when k a becomes too large, the survival ratio tends to decrease. This can be related to the fact that for each additional node, one remaining node can be selected multiple times, which reduces the positive effect of node addition.

Figure 3

Mean survival ratio 〈η〉 as a function of Tdelay for N a  = 100 and different k a values. (a) ER networks with the uniformly random selection. N = 1000, M = 1000, M a  = 10, 〈k〉 = 20, k s  = 11, q = 0.09, and f = 0.1. The threshold for determining a total collapse is d = 0.5. (b–e) Similar to (a) but for the largest degree, smallest degree, roulette, and anti-roulette selection rules.

Figure 4(a–d) also compares the five node selection rules, but this time we check different values of Tdelay as we vary k a . For example, Fig. 4(a) shows the results for Tdelay = 1, which is an immediate system recovery, as we vary k a between 0 and 200. The uniformly random selection is evidently the best. The roulette/anti-roulette selection has similar but slightly smaller survival ratio values. According to Fig. 4(b–d), for larger Tdelay values, the uniformly random selection is always better than the roulette and anti-roulette selection rules. These results suggest that for restoring an ER network, there is no need to pick nodes to connect to based on degree.

Figure 4

Mean survival ratio 〈η〉 as a function of k a for N a  = 100. (a) Different selection rules for the ER network case with Tdelay = 1. N = 1000, M = 1000, M a  = 10, 〈k〉 = 20, k s  = 11, q = 0.09, and f = 0.1. The threshold for determining a total collapse is d = 0.5. (b–d) Similar to (a) but for Tdelay = 11, Tdelay = 21, and Tdelay = 31.

In Figs 5 and 6, we present the same as in Figs 3 and 4, but for the SF case. We consider adding N a = 100 nodes, with different k a and Tdelay values. In Fig. 5(a), we surprisingly find that for the uniformly random selection, the survival ratio η does not monotonically decrease with Tdelay, but has a peak at around Tdelay = 9 for different k a values. This means that to prevent the total collapse of an SF network, sometimes a delayed recovery can be better. As shown in Fig. 5(d,e), the roulette and anti-roulette rules behave similarly. Moreover, we find that, for an immediate node addition, the roulette rule performs better than the other two randomized rules (this will be explained later when we discuss the results in Fig. 6). Finally, as shown in Fig. 5(b,c), the largest degree and smallest degree selection rules perform much better compared to their performance in the ER network case. This is because almost all of the N a additional nodes and the k a selected remaining nodes tend to survive when k a is large enough (compared to k s ). Note that the resulting final size of around N a + k a nodes, as a fraction of N, is larger than the threshold d = 0.1, which leads to an η value of ≈1.

Figure 5

Mean survival ratio 〈η〉 as a function of Tdelay for N a  = 100 and different k a values. (a) SF networks with the uniformly random selection. N = 1000, M = 3000, M a  = 10, γ = 1.8, k s  = 5, q = 0.39, and f = 0.2. The threshold for determining a total collapse is d = 0.1. (b–e) Similar to (a) but for the largest degree, smallest degree, roulette, and anti-roulette selection rules.

Figure 6

Mean survival ratio 〈η〉 as a function of k a for N a  = 100. (a) Different selection rules for the SF network case with Tdelay = 1. N = 1000, M = 3000, M a  = 10, γ = 1.8, k s  = 5, q = 0.39, and f = 0.2. The threshold for determining a total collapse is d = 0.1. (b–d) Similar to (a) but for Tdelay = 5, Tdelay = 9, and Tdelay = 13.

The increasing and decreasing trends of the mean survival ratio in Fig. 5(a) are caused by the fact that increasing Tdelay leads to two competing effects. On the one hand, a larger Tdelay leads to a smaller remaining network before node addition, which tends to cause a smaller final system state after node addition. On the other hand, for larger Tdelay, each remaining node on average is connected to more new nodes, which results in larger degree increments for the remaining nodes. To demonstrate this, we show in Supplementary Figs 1 and 2 the distributions of S(t) and node degrees before adding new nodes, as well as at the final state after node additions, for the ER and SF cases, respectively. Supplementary Fig. 1(a) shows the PDF of S(t) before node addition for different Tdelay values. Supplementary Fig. 1(b,c) shows the PDF and the CDF of the degree values of the remaining network before adding nodes. Supplementary Fig. 1(d) shows the PDF of the final state after node addition under the uniformly random selection with N a  = 100, k a  = 100 and M a  = 10. Supplementary Fig. 2(a–d) shows the same as Supplementary Fig. 1(a–d) but for the SF network case.

We find that for the ER case, the second trend (larger degree increments) due to increasing Tdelay is weaker. Moreover, for most systems at Tdelay = 21 and Tdelay = 31, the remaining system size before node addition, plus another 100 nodes, remains below the threshold d = 0.5. Thus, having larger degree increments does not help increase the survival ratio in these cases. However, for the SF case, the remaining nodes with small degrees before adding nodes are non-negligible, even for Tdelay = 1. Therefore, having larger degree increments will be more helpful than in the ER case. For Tdelay = 1 and Tdelay = 5, the additional degree given to each remaining node is still not large enough to save them. For Tdelay = 9, thanks to the increased degree increments, most final states are not at 0, but around 0.11. This is greater than the threshold d = 0.1, which leads to a larger survival ratio η. For Tdelay = 13 and Tdelay = 17, the first trend (reduced remaining system size) dominates as in the ER case; consequently, most final system states are below the threshold d = 0.1.

Similar to Fig. 4, Fig. 6(a–d) compares the five selection rules for the SF case using different time delay values. For Tdelay = 1, the roulette selection is better than the anti-roulette or the uniformly random one. However, at Tdelay = 5, the anti-roulette is better than the other two randomized rules. When Tdelay becomes larger, the uniformly random selection becomes the best. These results present a different phenomenon compared to the ER case. To interpret these findings, we consider the degree distribution of the surviving network before the node addition is performed for the SF network case. At Tdelay = 1 (see Supplementary Fig. 2(c)), the remaining nodes that fulfil the requirements of being removed are only a small fraction of all remaining nodes. Therefore, it is more important to add links to the original hub nodes to support the connectivity of the remaining network. At Tdelay = 5, the remaining networks before adding nodes include a much larger fraction of nodes with small degrees. Consequently, the anti-roulette rule is better, since it restores more susceptible nodes. Finally, for Tdelay = 9 or Tdelay = 13, the roulette and anti-roulette selection rules are worse than the uniformly random one. This is because both original hub nodes and original nodes with small degrees tend to fulfil the requirements of node removal. These intricate effects of the time delay, Tdelay, are not observed for the ER network case, since the ER case has homogeneous degree distributions before the node addition.

The above results can be further viewed in light of the total “costs” of the recovery process. In real-world social networks, the cost of introducing one more individual (node) is mainly determined by his/her importance: it costs much more to introduce famous people into the system. Therefore, we can assume that the cost of adding a node is proportional to its degree, i.e. its number of connections to surviving nodes. This is equivalent to defining the cost of each additional node as k a , and the total cost of the system recovery as N a k a . According to the results presented in Figs 4 and 6, for recovering a homogeneous network, the uniformly random selection rule performs better, since it can reach higher survival ratios at a lower total cost (controlled by the parameter k a ). Further, for an early, an intermediate, or a late recovery of an SF network, the roulette, anti-roulette, or uniformly random selection rule, respectively, results in larger survival ratios at a lower cost.

Tradeoffs between the number of additional nodes and their degree

In this subsection, we investigate the tradeoffs between N a and k a for a given fixed total cost value. We can imagine that a larger N a tends to cause a larger final system state, which is good for system recovery. On the other hand, a larger k a leads to more robust additional nodes. Therefore, it is important to know which parameter is more critical to the survival ratio η. Note that in this subsection we only show the results for the three randomized node selection rules in order to focus on non-trivial results.

Figure 7(a–c) shows, for the ER case, how the mean survival ratio changes with N a for a fixed total cost N a k a  = 5000 and a set of Tdelay values. The survival ratio, in the uniformly random selection case, is not strongly affected by N a for different Tdelay, except for a very large N a (see Fig. 7(a)). This is because, under a fixed total cost, as N a becomes larger k a becomes smaller and eventually less than k s  = 11. For the roulette and anti-roulette selection rules, the effect of N a is similar to the uniformly random selection except for small N a values.
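
The point at which k a falls below k s under a fixed budget follows from simple arithmetic, sketched below. The helper names are hypothetical, and k a is treated as the real-valued ratio of the total cost to N a , which is an assumption about how the budget is split.

```python
def degree_under_budget(total_cost, n_a):
    """Degree k_a available to each of n_a new nodes when N_a * k_a = total_cost."""
    return total_cost / n_a

def max_nodes_above_threshold(total_cost, k_s):
    """Largest N_a for which k_a = total_cost / N_a is still at least k_s."""
    return total_cost // k_s
```

For the ER setting with a total cost of 5000 and k s = 11, max_nodes_above_threshold gives 454: adding 455 or more nodes leaves each new node with fewer than 11 links, so the new nodes themselves satisfy the removal condition from the start.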

Figure 7

Balance between N a and k a for a fixed N a k a . (a) Mean survival ratio 〈η〉 vs. N a for the uniformly random selection with different Tdelay. ER networks. N = 1000, M = 1000, 〈k〉 = 20, k s = 11, q = 0.09, and f = 0.1. N a k a = 5000. The threshold for determining a total collapse is d = 0.5. (b) Similar to (a) but for the roulette selection. (c) Similar to (a) but for the anti-roulette selection.

Figure 8(a–c) shows the same as Fig. 7 but for the SF case with a total cost N a k a  = 1200. We find that N a has a stronger impact on the mean survival ratio 〈η〉 than in the ER case. For the uniformly random selection, a very small N a is preferred at Tdelay = 1. However, the needed number of nodes rises to between 100 and 150 for Tdelay = 5 or Tdelay = 9 and it continues to rise further for Tdelay = 13 and Tdelay = 17 (see Fig. 8(a)). This means that for a more delayed system recovery, a larger N a and a smaller k a are needed. In other words, more additional nodes are needed for recovering a system with a smaller remaining size before starting the addition. The roulette and anti-roulette selection rules demonstrate a similar behavior (see Fig. 8(b,c)). These results provide suggestions for restoring near-collapse systems under a fixed total cost.

Figure 8

Balance between N a and k a for a fixed N a k a . (a) Mean survival ratio 〈η〉 vs. N a for the uniformly random selection with different Tdelay. SF networks. N = 1000, M = 1000, γ = 1.8, k s = 5, q = 0.39, and f = 0.2. N a k a = 1200. The threshold for determining a total collapse is d = 0.1. (b) Similar to (a) but for the roulette selection. (c) Similar to (a) but for the anti-roulette selection.

Discussion

In this paper, we investigate the possibility of recovering networks that exhibit early warnings of total collapse by adding nodes. To this end, we model system collapse using the recently introduced KQ cascade model and employ the moving standard deviation of the remaining network size time series as an early indicator of an impending cascade. We use five rules for regulating the wiring of the newly added nodes to existing nodes. These include three random rules: uniformly random, roulette and anti-roulette. The latter two connect a new node to a set of randomly selected existing nodes with a probability proportional and inversely proportional, respectively, to their degree in the original network. The five rules also include two deterministic rules that connect new nodes to existing nodes with the largest and smallest degrees in the original network, respectively. We find that an early addition of nodes (i.e. immediately after observing early warning signals) is always better for preventing ER networks from a total collapse. This is because ER networks are characterized by a homogeneous degree distribution. SF networks, however, can in some cases benefit from a delayed intervention, that is, from starting to add nodes after a certain time delay Tdelay. Investigating the interplay between the five connection rules and Tdelay shows that the uniformly random selection is always the best strategy for saving ER networks. For SF networks, the best wiring rule changes from roulette to anti-roulette, and finally to the uniformly random rule as Tdelay increases. This complex interplay is a product of node degree heterogeneity in SF networks. Finally, we explore the balance between the number of added nodes N a and their degree k a needed for restoring a collapsing system at a fixed cost of N a k a . We find that SF networks need to add more nodes as Tdelay increases. However, N a has minimal impact on the survival of ER networks.

Our findings provide insights into saving networks that are predicted to be approaching a total collapse. For example, the counterintuitive results for the restoration of SF networks, i.e. the positive impact of time delay, can be applied to social structures (companies) and networks with an impending cascade to prevent a total collapse. Note that many real-world social networks are known to have heterogeneous structures.

Going forward, we plan to apply the proposed network recovery framework to other sorts of cascading failure models. These include overload-based cascades10,20, which are known to exhibit a slowing down near criticality. Furthermore, while the KQ cascade and node-addition-based recovery are more related to social networks like Facebook, it will be interesting to investigate failure models and recovery scenarios that are relevant to other systems. For example, cascades based on dependencies or overloads, with recovery by reconnecting failed nodes29,30,32,33, are more applicable to systems with physical connections, such as the power grid and traffic systems.