Main

The optimal influence problem was initially introduced in the context of viral marketing1, and its solution was shown to be NP-hard4 for a generic class of linear threshold models of information spreading17,18. Indeed, finding the optimal set of influencers is a many-body problem in which the topological interactions between them play a crucial role13,14. On the other hand, there has been an abundant production of heuristic rankings to identify influential nodes and ‘superspreaders’ in networks6,7,8,9,10,11,12,19. The main problem is that heuristic methods do not optimize a global function of influence. As a consequence, there is no guarantee of their performance.

Here we address the problem of quantifying nodes’ influence by finding the optimal (that is, minimal) set of structural influencers. After defining a unified mathematical framework for both immunization and spreading, we provide its optimal solution in random networks by mapping the problem onto optimal percolation. In addition, we present CI (Collective Influence), a scalable algorithm to solve the optimization problem in large-scale real data sets. The thorough comparison with competing methods (Supplementary Information section I and ref. 20) ultimately establishes the better performance of our algorithm. By taking into account collective influence effects, our optimization theory identifies a new class of strategic influencers, called ‘weak nodes’, which outrank the hubs in the network. Thus, the top influencers are highly counterintuitive: low-degree nodes play a major broker role in the network, and despite being weakly connected, can be powerful influencers.

The problem of finding the minimal set of activated nodes17,18 to spread information to the whole network4 or to optimally immunize a network against epidemics11 can be exactly mapped onto optimal percolation (see Supplementary Information section IIB). This mapping provides the mathematical support to the intuitive relation between influence and the concept of cohesion of a network: the most influential nodes are the ones forming the minimal set that guarantees a global connection of the network5,9,10. We call this minimal set the ‘optimal influencers’ of the network. At a general level, the optimal influence problem can be stated as follows: find the minimal set of nodes which, if removed, would break down the network into many disconnected pieces. The natural measure of influence is, therefore, the size of the largest (giant) connected component as the influencers are removed from the network.

We consider a network composed of $N$ nodes tied with $M$ links with an arbitrary degree distribution. Let us suppose we remove a certain fraction $q$ of the total number of nodes. It is well known from percolation theory21 that, if we choose these nodes randomly, the network undergoes a structural collapse at a certain critical fraction where the probability of existence of the giant connected component vanishes, $G = 0$. The optimal influence problem corresponds to finding the minimum fraction $q_c$ of influencers to fragment the network: $q_c = \min\{q \in [0, 1]: G(q) = 0\}$.
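As a rough illustration of the random-removal baseline described above, the following sketch measures the fraction of nodes in the largest connected component after a randomly chosen fraction q of nodes is deleted. The helper names (`er_graph`, `giant_fraction`, `random_removal_curve`) and the small network size are assumptions made for this example and are not taken from the paper.

```python
import random
from collections import deque


def er_graph(n, mean_degree, rng):
    """Erdos-Renyi graph as adjacency lists (hypothetical helper, O(n^2))."""
    p = mean_degree / (n - 1)
    adj = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                adj[i].append(j)
                adj[j].append(i)
    return adj


def giant_fraction(adj, removed):
    """Largest connected component size, as a fraction of the original N nodes."""
    seen = set(removed)  # removed nodes are never visited
    best = 0
    for start in range(len(adj)):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            u = queue.popleft()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        best = max(best, size)
    return best / len(adj)


def random_removal_curve(adj, fractions, rng):
    """G(q) when the removed nodes are chosen uniformly at random."""
    order = list(range(len(adj)))
    rng.shuffle(order)
    n = len(adj)
    return [(q, giant_fraction(adj, set(order[: int(q * n)]))) for q in fractions]


if __name__ == "__main__":
    rng = random.Random(0)
    adj = er_graph(400, 3.5, rng)
    for q, g in random_removal_curve(adj, [0.0, 0.3, 0.6, 0.9], rng):
        print(f"q = {q:.1f}  G = {g:.3f}")
```

In a finite network G never reaches exactly zero, so any numerical estimate of the threshold needs a cutoff or a proxy such as the peak of the second-largest cluster.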

Let the vector $\mathbf{n} = (n_1, \ldots, n_N)$ represent which node is removed ($n_i = 0$, influencer) or left ($n_i = 1$, the rest) in the network ($q = 1 - \frac{1}{N}\sum_i n_i$), and consider a link from $i$ to $j$ ($i \to j$). The order parameter of the influence problem is the probability that $i$ belongs to the giant component in a modified network where $j$ is absent, $\nu_{i\to j}$ (refs 22, 23). Clearly, in the absence of a giant component we find $\{\nu_{i\to j} = 0\}$ for all $i \to j$. The stability of the solution $\{\nu_{i\to j} = 0\}$ is controlled by the largest eigenvalue $\lambda(\mathbf{n}; q)$ of the linear operator $\hat{\mathcal{M}}$ defined on the $2M \times 2M$ directed edges as $\mathcal{M}_{k\to\ell,\, i\to j} \equiv \partial\nu_{i\to j}/\partial\nu_{k\to\ell}\big|_{\{\nu_{i\to j} = 0\}}$. We find for locally tree-like random graphs (see Fig. 1a and Supplementary Information section II):

$$\mathcal{M}_{k\to\ell,\, i\to j} = n_i\, \mathcal{B}_{k\to\ell,\, i\to j}, \tag{1}$$

where $\hat{\mathcal{B}}$ is the non-backtracking matrix of the network15,24. The matrix $\mathcal{B}_{k\to\ell,\, i\to j}$ has non-zero entries only when ($k \to \ell$, $i \to j$) form a pair of consecutive non-backtracking directed edges, that is, ($k \to \ell$, $\ell \to j$) with $k \neq j$. In this case $\mathcal{B}_{k\to\ell,\, \ell\to j} = 1$ (equation (13) in Supplementary Information). Powers of the matrix $\hat{\mathcal{B}}$ count the number of non-backtracking walks of a given length in the network (Fig. 1b)24, much in the same way as powers of the adjacency matrix count the number of paths5. Operator $\hat{\mathcal{B}}$ has recently received a lot of attention thanks to its high performance in the problem of community detection25,26. We show its topological power in the problem of optimal percolation.
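The construction can be made concrete with a minimal sketch, assuming a dense list-of-lists representation that is only practical for toy graphs. It builds the occupancy-weighted non-backtracking matrix on the 2M directed edges and uses its powers to count non-backtracking walks. The function names are hypothetical, and the weighting convention (each step into an edge leaving node i is multiplied by n_i) is the one stated in the text.

```python
def directed_edges(edges):
    """Both orientations of every undirected edge: 2M directed edges."""
    return [(u, v) for u, v in edges] + [(v, u) for u, v in edges]


def nb_matrix(edges, n=None):
    """Dense matrix with entry [k->l][i->j] = n_i if l == i and j != k, else 0.

    With n=None every node is present and the plain non-backtracking
    matrix is returned.
    """
    d = directed_edges(edges)
    mat = [[0] * len(d) for _ in d]
    for a, (k, l) in enumerate(d):
        for b, (i, j) in enumerate(d):
            if l == i and j != k:  # consecutive edges, no immediate return
                mat[a][b] = 1 if n is None else n[i]
    return d, mat


def matmul(x, y):
    """Plain-Python matrix product, adequate for toy examples."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*y)] for row in x]


def count_nb_walks(edges, length, n=None):
    """Number of non-backtracking walks that traverse `length` edges."""
    d, mat = nb_matrix(edges, n)
    power = [[int(a == b) for b in range(len(d))] for a in range(len(d))]
    for _ in range(length - 1):
        power = matmul(power, mat)
    return sum(map(sum, power))


if __name__ == "__main__":
    triangle = [(0, 1), (1, 2), (2, 0)]
    path = [(0, 1), (1, 2)]
    print(count_nb_walks(triangle, 4), count_nb_walks(path, 2))
```

On a triangle every directed edge has exactly one non-backtracking continuation, so the count stays at 2M for every length, whereas on a tree the counts eventually drop to zero. This is the mechanism behind the statement that destroying all loops drives the largest eigenvalue to zero.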

Figure 1: The non-backtracking (NB) matrix and weak nodes.

a, The largest eigenvalue $\lambda$ of $\hat{\mathcal{M}}$ exemplified on a simple network. The optimal strategy for immunization and spreading minimizes $\lambda$ by removing the minimum number of nodes (optimal influencers) that destroys all the loops. Left panel, the matrix $\hat{\mathcal{M}}$ acts on the directed edges of the network. An entry of the form $\mathcal{M}_{k\to 3,\, 3\to j} = n_3\, \mathcal{B}_{k\to 3,\, 3\to j}$ encodes the occupancy ($n_3 = 1$) or vacancy ($n_3 = 0$) of node 3. In this particular case, the largest eigenvalue is $\lambda = 1$. Centre panel, non-optimal removal of a leaf, $n_4 = 0$, which does not decrease $\lambda$. Right panel, optimal removal of a loop, $n_3 = 0$, which decreases $\lambda$ to zero. b, A NB walk is a random walk that is not allowed to return back along the edge that it just traversed. We show a NB open walk ($\ell = 3$), a NB closed walk with a tail ($\ell = 4$), and a NB closed walk with no tails ($\ell = 5$). The NB walks are the building blocks of the diagrammatic expansion to calculate $\lambda$. c, Representation of the global minimum over $\mathbf{n}$ of the largest eigenvalue $\lambda$ of $\hat{\mathcal{M}}$ versus $q$. When $q \geq q_c$, the minimum is at $\lambda = 0$. Then, $G = 0$ is stable (still, non-optimal configurations exist with $\lambda > 1$ for which $G > 0$). When $q < q_c$, the minimum of the largest eigenvalue is always $\lambda > 1$, the solution $G = 0$ is unstable, and then $G > 0$. At the optimal percolation transition, the minimum is at $\mathbf{n}^*$ with $\lambda(\mathbf{n}^*, q_c) = 1$. For $q = 0$, we find $\lambda = \kappa - 1$ ($\kappa = \langle k^2 \rangle / \langle k \rangle$, where $k$ is the node degree), which is the largest eigenvalue of $\hat{\mathcal{B}}$ for random networks25 with all nodes present ($n_i = 1$). When $\lambda = 1$, the giant component is reduced to a tree plus one single loop (unicyclic graph), which is suddenly destroyed at the transition $q_c$ to become a tree, causing the abrupt fall of $\lambda$ to zero. d, Ball($i$, $\ell$) of radius $\ell$ around node $i$ is the set of nodes within distance $\ell$ from $i$, and ∂Ball is the set of nodes on the boundary. The shortest path from $i$ to $j$ is shown in red. e, Example of a weak node: a node with a small number of connections surrounded by hierarchical coronas of hubs at different levels.


Stability of the solution $\{\nu_{i\to j} = 0\}$ requires $\lambda(\mathbf{n}; q) \leq 1$. The optimal influence problem for a given $q$ ($\geq q_c$) can be rephrased as finding the optimal configuration $\mathbf{n}$ that minimizes the largest eigenvalue $\lambda(\mathbf{n}; q)$ (Fig. 1c). The optimal set $\mathbf{n}^*$ of $N q_c$ influencers is obtained when the minimum of the largest eigenvalue reaches the critical threshold:

$$\lambda(\mathbf{n}^*; q_c) = 1. \tag{2}$$

The formal mathematical mapping of the optimal influence problem to the minimization of the largest eigenvalue of the modified non-backtracking matrix for random networks, equation (2), represents our first main result.

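An example of a non-optimized solution corresponds to choosing $n_i$ at random and decoupled from the non-backtracking matrix23,27 (random percolation21, Supplementary Information section IID). In the optimized case, we seek to derandomize the selection of the set $n_i = 0$ and optimally choose them to find the best configuration $\mathbf{n}^*$ with the lowest $q_c$ according to equation (2). The eigenvalue $\lambda(\mathbf{n})$ (from now on we omit $q$ in $\lambda(\mathbf{n}; q) \equiv \lambda(\mathbf{n})$, which is always kept fixed) determines the growth rate of an arbitrary vector $\mathbf{w}_0$ with $2M$ entries after $\ell$ iterations of the matrix $\hat{\mathcal{M}}$: $|\mathbf{w}_\ell(\mathbf{n})| = |\hat{\mathcal{M}}^\ell \mathbf{w}_0| \sim e^{\ell \log \lambda(\mathbf{n})}$. The largest eigenvalue is then calculated by the power method:

$$\lambda(\mathbf{n}) = \lim_{\ell\to\infty} \left[\frac{|\mathbf{w}_\ell(\mathbf{n})|}{|\mathbf{w}_0|}\right]^{1/\ell}. \tag{3}$$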

Equation (3) is the starting point of an (infinite) perturbation series that provides the exact solution to the many-body influence problem in random networks and therefore contains all physical effects, including the collective influence. In practice, we minimize the cost energy function of influence $|\mathbf{w}_\ell(\mathbf{n})|$ in equation (3) for a finite $\ell$. The solution rapidly converges to the exact value as $\ell \to \infty$, the faster the larger the spectral gap. We find for $\ell \geq 1$, to leading order in $1/N$ (Supplementary Information section IIE):

$$|\mathbf{w}_\ell(\mathbf{n})|^2 = \sum_{i=1}^{N} (k_i - 1) \sum_{j \in \partial\mathrm{Ball}(i,\, 2\ell-1)} \left( \prod_{k \in \mathcal{P}_{2\ell-1}(i,j)} n_k \right) (k_j - 1), \tag{4}$$

where Ball($i$, $\ell$) is the set of nodes inside a ball of radius $\ell$ (defined as the shortest path) around node $i$, ∂Ball($i$, $\ell$) is the frontier of the ball, $\mathcal{P}_\ell(i, j)$ is the shortest path of length $\ell$ connecting $i$ and $j$ (Fig. 1d), and $k_i$ is the degree of node $i$.
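The power-method estimate of equation (3) can be sketched numerically for a finite number of iterations. The code below is an illustration under stated assumptions rather than the authors' implementation: the graph is a dictionary of neighbour lists, the starting vector is all ones, and the function names are hypothetical. It applies the occupancy-weighted non-backtracking operator directly on the directed edges without building the full matrix.

```python
import math


def apply_operator(adj, n, w):
    """One application of the modified non-backtracking operator.

    w is indexed by directed edges (k, l); the new value on (k, l) is
    n_l times the sum of w over edges (l, j) that do not return to k.
    """
    return {
        (k, l): n[l] * sum(w[(l, j)] for j in adj[l] if j != k)
        for (k, l) in w
    }


def power_method_lambda(adj, n, steps):
    """Finite-steps estimate (|w_steps| / |w_0|)**(1/steps) of the largest eigenvalue."""
    w = {(u, v): 1.0 for u in adj for v in adj[u]}  # assumed start: all ones
    norm0 = math.sqrt(sum(x * x for x in w.values()))
    for _ in range(steps):
        w = apply_operator(adj, n, w)
    norm = math.sqrt(sum(x * x for x in w.values()))
    return (norm / norm0) ** (1.0 / steps)


if __name__ == "__main__":
    # A triangle (0, 1, 2) with a leaf (3) attached to node 2: a unicyclic graph.
    tri_tail = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
    present = {0: 1, 1: 1, 2: 1, 3: 1}
    print(power_method_lambda(tri_tail, present, 200))
    print(power_method_lambda(tri_tail, {**present, 3: 0}, 200))  # leaf removed
    print(power_method_lambda(tri_tail, {**present, 2: 0}, 200))  # loop broken
```

On this toy graph the estimate stays close to 1 when the leaf is removed and drops to zero when a node of the loop is removed, in line with the behaviour described for Fig. 1a.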

The first collective optimization in equation (4) is $\ell = 1$. We find $|\mathbf{w}_1(\mathbf{n})|^2 = \sum_{i,j=1}^{N} A_{ij} (k_i - 1)(k_j - 1)\, n_i n_j$, where $A_{ij}$ is the adjacency matrix (equation (39) in Supplementary Information). This term is interpreted as the energy of an antiferromagnetic Ising model with random bonds in a random external field at fixed magnetization, which is an example of a pair-wise NP-complete spin-glass whose solution is found in Supplementary Information section III with the cavity method28 (Extended Data Fig. 2).

For $\ell \geq 2$, the problem can be mapped exactly to a statistical mechanical system with many-body interactions which can be recast in terms of a diagrammatic expansion, equations (41)–(49) in Supplementary Information. For example, $|\mathbf{w}_2(\mathbf{n})|^2$ leads to 4-body interactions (equation (45) in Supplementary Information), and, in general, the energy cost $|\mathbf{w}_\ell(\mathbf{n})|^2$ contains $2\ell$-body interactions. As soon as $\ell \geq 2$, the cavity method becomes much more complicated to implement and we use another suitable method, called extremal optimization (EO)29 (Supplementary Information section IV). This method estimates the true optimal value of the threshold by finite-size scaling following extrapolation to $\ell \to \infty$ (Extended Data Figs 3, 4). However, EO is not scalable to find the optimal configuration in large networks. Therefore, we develop an adaptive method, which performs excellently in practice, preserves the features of EO, and is highly scalable to present-day big data.

The idea is to remove the nodes causing the biggest drop in the energy function, equation (4). First, we define a ball of radius $\ell$ around every node (Fig. 1d). Then, we consider the nodes belonging to the frontier ∂Ball($i$, $\ell$) and assign to node $i$ the collective influence (CI) strength at level $\ell$ following equation (4):

$$\mathrm{CI}_\ell(i) = (k_i - 1) \sum_{j \in \partial\mathrm{Ball}(i,\, \ell)} (k_j - 1). \tag{5}$$

We notice that, while equation (4) is valid only for odd radii of the ball, $\mathrm{CI}_\ell(i)$ is defined also for even radii. This generalization is possible by considering an energy function for even radii analogous to equation (4), as explained in Supplementary Information section IIG. The case of one-body interaction with zero radius $\ell = 0$ (equation (59) in Supplementary Information) leads to the high-degree (HD) ranking (equation (62) in Supplementary Information)10.
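The collective influence of a single node can be computed with a breadth-first search that stops at the frontier of the ball. The sketch below assumes a dictionary of neighbour lists and hypothetical function names. It scores a node by its reduced degree multiplied by the sum of the reduced degrees on the frontier, as described in the text. The toy graph is an invented example of a low-degree node sitting between two hubs.

```python
from collections import deque


def ball_frontier(adj, i, radius, removed=frozenset()):
    """Nodes at shortest-path distance exactly `radius` from i."""
    dist = {i: 0}
    queue = deque([i])
    frontier = []
    while queue:
        u = queue.popleft()
        if dist[u] == radius:
            frontier.append(u)
            continue  # do not expand beyond the frontier
        for v in adj[u]:
            if v not in dist and v not in removed:
                dist[v] = dist[u] + 1
                queue.append(v)
    return frontier


def collective_influence(adj, i, radius, removed=frozenset()):
    """(k_i - 1) times the sum of (k_j - 1) over the frontier of the ball."""
    def degree(u):
        return sum(1 for v in adj[u] if v not in removed)

    return (degree(i) - 1) * sum(
        degree(j) - 1 for j in ball_frontier(adj, i, radius, removed)
    )


def weak_node_graph():
    """Invented toy graph: node 0 (degree 2) bridges hubs 1 and 2 (degree 6 each)."""
    adj = {0: [1, 2], 1: [0], 2: [0]}
    for hub, leaves in ((1, range(3, 8)), (2, range(8, 13))):
        for leaf in leaves:
            adj[hub].append(leaf)
            adj[leaf] = [hub]
    return adj


if __name__ == "__main__":
    g = weak_node_graph()
    for node in (0, 1):
        print(node, collective_influence(g, node, 0), collective_influence(g, node, 1))
```

At radius 0 the frontier is the node itself, so the score reduces to a function of the degree alone and reproduces the high-degree ranking. At radius 1 the bridging node of the toy graph scores higher than either hub.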

The collective influence, equation (5), is our second and most important result since it is the basis for the highly scalable and optimized CI algorithm which follows. In the beginning, all the nodes are present: $n_i = 1$ for all $i$. Then, we remove node $i^*$ with highest $\mathrm{CI}_\ell$ and set $n_{i^*} = 0$. The degree of each neighbour of $i^*$ is decreased by one, and the procedure is repeated to find the new top CI node to remove. The algorithm is terminated when the giant component is zero (see Supplementary Information section V for implementation, and Supplementary Information section VA for minimizing $G(q) \neq 0$). By increasing the radius $\ell$ of the ball we obtain better and better approximations of the optimal exact solution as $\ell \to \infty$ (for finite networks, $\ell$ does not exceed the network diameter).
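The adaptive procedure can be sketched in a few lines. This is a deliberately naive version for small graphs, not the scalable implementation of the Supplementary Information: it recomputes every score after each removal, and both the stopping cutoff `g_stop` and the early exit when all scores vanish are assumptions introduced here, since the giant component of a finite graph never reaches exactly zero.

```python
from collections import deque


def largest_component(adj, removed):
    """Size of the largest connected component among the remaining nodes."""
    seen = set(removed)
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            u = queue.popleft()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        best = max(best, size)
    return best


def ci_score(adj, i, radius, removed):
    """Collective influence of node i in the graph without the removed nodes."""
    def degree(u):
        return sum(1 for v in adj[u] if v not in removed)

    dist = {i: 0}
    queue = deque([i])
    total = 0
    while queue:
        u = queue.popleft()
        if dist[u] == radius:
            total += degree(u) - 1  # frontier node contributes k_j - 1
            continue
        for v in adj[u]:
            if v not in dist and v not in removed:
                dist[v] = dist[u] + 1
                queue.append(v)
    return (degree(i) - 1) * total


def ci_removal(adj, radius=2, g_stop=0.01):
    """Remove the top-CI node repeatedly; return the removal order."""
    removed, order = set(), []
    while largest_component(adj, removed) > g_stop * len(adj):
        scores = {i: ci_score(adj, i, radius, removed) for i in adj if i not in removed}
        best = max(scores, key=scores.get)
        if scores[best] == 0:  # assumed early exit: no loops or hubs left to target
            break
        removed.add(best)
        order.append(best)
    return order


if __name__ == "__main__":
    ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
    print(ci_removal(ring, radius=1))
```

Because the degrees are recomputed on the reduced graph, the scores of the neighbours of a removed node drop automatically, which mirrors the degree update described above. Ties are broken by dictionary order in this sketch.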

The collective influence $\mathrm{CI}_\ell$ for $\ell \geq 1$ has a rich topological content, and consequently tells us more about the role played by nodes in the network than the non-interacting high-degree hub-removal strategy at $\ell = 0$, $\mathrm{CI}_0$. The augmented information comes from the sum in the right hand side of equation (5), which is absent in the naive high-degree rank. This sum contains the contribution of the nodes living on the surface of the ball surrounding the central vertex $i$, each node weighted by the factor $k_j - 1$. This means that a node placed at the centre of a corona irradiating many links (the structure hierarchically emerging at different levels, as seen in Fig. 1e) can have a very large collective influence, even if it has a moderate or low degree. Such ‘weak nodes’ can outrank nodes with larger degree that occupy mediocre peripheral locations in the network. The commonly used word ‘weak’ in this context sounds particularly paradoxical. It is, indeed, usually used as a synonym for a low-degree node with an additional bridging property, which has resisted a quantitative formulation. We provide this definition through equation (5), according to which weak nodes are, de facto, quite strong. Paraphrasing Granovetter’s conundrum30, equation (5) quantifies the “strength of weak nodes”.

The CI-algorithm scales as $O(N \log N)$ by removing a finite fraction of nodes at each step (Supplementary Information section VB). This high scalability allows us to find top influencers in current big-data social media and the minimal set of people to immunize in large-scale populations at the country level. The applications are investigated next.

Figure 2a shows the optimal threshold $q_c$ for a random Erdős–Rényi (ER) network5 (marked by the vertical line) obtained by extrapolating the EO solution to $N \to \infty$ and $\ell \to \infty$ (Supplementary Information section IV). In the same figure we compare the optimal threshold against the heuristic centrality measures: high-degree (HD)9, high-degree adaptive (HDA), PageRank (PR)7, closeness centrality (CC)6, eigenvector centrality (EC)6, and k-core12 (see Supplementary Information section I for definitions). Supplementary Information sections VI and VII show the comparison with the remaining heuristics6,11 and the Belief Propagation method of ref. 14, respectively, which have worse computational complexity (and optimality), and cannot be applied to the network sizes used here. Remarkably, at the optimal value $q_c$ predicted by our theory, the best among the heuristic methods (HDA, PR and HD) still predict a giant component of 50–60% of the whole original network. Furthermore, the influencer threshold predicted by CI approximates very well the optimal one, and, notably, CI outperforms the other strategies. Figure 2b compares CI in scale-free (SF) networks5 against the best heuristic methods, that is, HDA and HD. In all cases, CI produces a smaller threshold and a smaller giant component (Fig. 2c).

Figure 2: Exact optimal solution and performance of CI in synthetic networks.

a, $G(q)$ in an ER network ($N = 2 \times 10^5$, $\langle k \rangle = 3.5$, error bars are s.e.m. over 20 realizations). We show the true optimal solution found with EO (‘×’ symbol), and also using CI, HDA, PR, HD, CC, EC and k-core methods. The other methods are not scalable and perform worse than HDA and are treated in Supplementary Information sections VI and VII (Extended Data Figs 8, 9). CI is close to the optimal obtained with EO in Supplementary Information section IV. Note that EO can estimate the extrapolated optimal value of $q_c$, but it cannot provide the optimal configuration for large systems. Inset, $q_c$ (obtained at the peak of the second-largest cluster) for the three best methods versus $\langle k \rangle$. b, $G(q)$ for a SF network with degree exponent $\gamma = 3$, maximum degree $k_{\max} = 10^3$, minimum degree $k_{\min} = 2$ and $N = 2 \times 10^5$ (error bars are s.e.m. over 20 realizations). Inset, $q_c$ versus $\gamma$. The continuous blue line is the HD analytical result computed in Supplementary Information section IIG (Extended Data Fig. 1b). c, Example of SF network with $\gamma = 3$ after the removal of 15% of nodes, using the three methods HD, HDA and CI. CI produces a much reduced giant component $G$ (red nodes).


As an example of an information spreading network, we consider the web of Twitter users (Supplementary Information section VIII and ref. 19). Figure 3a shows the giant component of Twitter when a fraction $q$ of its influencers is removed following CI. It is surprising that a lot of Twitter users with a large number of contacts have a mild influence on the network. This is witnessed by the fact that, when CI (at $\ell = 5$) predicts a zero giant component (and so it exhausts the number of optimal influencers), the scalable heuristic ranks (HD, HDA, PR and k-core) still give a substantial giant component of the order of 30–70% of the entire network. These heuristics also, inevitably, find a remarkably large number of (fake) influencers, which is at least 50% larger than that predicted by CI (Fig. 3b and Supplementary Information section VIII). One cause for the poor performance of the high-degree-based ranks is that most of the hubs are clustered, which gives a mediocre importance to their contacts. As a consequence, hubs are outranked by nodes with lower degree surrounded by coronas of hubs (shown in detail in Fig. 3c), that is, the weak nodes predicted by the theory (Fig. 1e).

Figure 3: Performance of CI in large-scale real social networks.

a, Giant component $G(q)$ of Twitter users19 ($N$ = 469,013) computed using CI, HDA, PR, HD and k-core strategies (other heuristics have prohibitive running times for this system size). b, Percentage of fake influencers or false positives (PFI, equation (120) in Supplementary Information) in Twitter as a function of $q$, defined as the percentage of non-optimal influencers identified by the HD algorithm in comparison with CI. Below $q_c^{\mathrm{CI}}$, PFI reaches as much as 40%, indicating the failure of HD in optimally finding the top influencers. Indeed, to obtain $G = 0$, HD has to remove a much larger number of fake influencers, which at $q_c^{\mathrm{HD}}$ reaches PFI ≈ 48%. c, An example of the many weak nodes found in Twitter. These crucial influencers were missed by all heuristic strategies. d, $G(q)$ for a social network of $1.4 \times 10^7$ mobile phone users in Mexico representing an example of big data to test the scalability and performance of the algorithm in real networks. CI immunizes this social network using half a million fewer people than the best heuristic strategy (HDA), saving 35% of the vaccine stockpile.


Finally, we simulate an immunization scheme on a personal contact network built from the phone calls performed by 14 million people in Mexico (Supplementary Information section IX). Figure 3d shows that our method saves a large number of vaccines or, equivalently, finds the smallest possible set of people to quarantine; our method therefore also outranks the scalable heuristics in large real networks. Thus, while the mapping of the influencer identification problem onto optimal percolation is strictly valid for locally tree-like random networks, our results may apply also to real loopy networks, provided the density of loops is not excessively large.

Our solution to the optimal influence problem shows its importance in that it helps to unveil hitherto hidden relations between people, as witnessed by the weak-node effect. This, in turn, is the by-product of a broader notion of influence, lifted from the individual non-interacting point of view6,7,8,9,10,11,12,19,20 to the collective sphere: influence is an emergent property of collectivity, and top influencers arise from the optimization of the complex interactions they stipulate.