Introduction

Measles virus is one of the most infectious viruses on the planet1 and a leading cause of death and disability-adjusted life-years lost2. With a basic reproduction number (i.e., the number of cases directly generated from one case in a a susceptible population) of 12–181, its transmissibility far exceeds other diseases, including SARS-CoV-2, which has a reproduction number of 2.5–3.53 and its Omicron variant, which has a reproduction number of 8.24. About 75–90% of susceptible household contacts develop the disease5,6,7. Before the introduction of measles vaccines, 95–98% of children were infected by the measles virus by age 188,9,10,11.

Sixty years after effective vaccines were licensed in 1963, measles continues to cause death and diseases in children worldwide. In 2018, the World Health Organization (WHO) reported more than 140,000 measles deaths globally, mostly among children under the age of 512. Complications from measles can occur in almost every organ13. Measles infection can also diminish previously acquired immune memory, potentially leaving individuals at risk for reinfection by previously acquired pathogens14. Studies during the 1970s and 1980s revealed that measles case-fatality rates ranged from 3 to 34%15,16,17 in low- and middle-income countries (LMICs), 10–20 times higher than high-income countries13. Although measles vaccines are highly effective with an efficacy of 97%18, outbreaks still occur in places with low vaccination coverage rates. Significant, yet inconsistent, progress has been made in measles vaccination since 2000. From 2000 to 2016, measles cases worldwide decreased from 145 to 18 cases per million, after which they increased again to 120 cases per million in 201919.

Although measles cases did decrease during the COVID-19 pandemic (to 22 cases per million in 2020)19, millions more children were susceptible to measles at the end of 2020 than in 2019. Specifically, 22.3 million children among 194 WHO member states and at least 93 million persons in 23 countries did not receive measles-containing vaccines (MCVs) because of COVID-19-related postponement of measles supplementary immunization activities (SIAs) for 202019. Measles surveillance also deteriorated during COVID-1919. In 2020, the number of measles specimens submitted was the lowest in over a decade. Many countries did not report, and few countries (32%) achieved the measles surveillance sensitivity indicator (i.e., the proportion of cases that have an imported source)20.

Increased population susceptibility and suboptimal measles surveillance portend an immediate elevated risk for measles transmission and outbreaks, threatening the already fragile progress toward regional elimination goals19. Furthermore, measles cases were not only in low-vaccination LMICs but also in high-vaccination high-income countries. In 2018, 47 of 53 Member States of the WHO European Region reported over 84,000 confirmed measles cases. Cases rose by 300% during the first 3 months of 2019 compared with the same period in 201821. Although endemic measles was declared “eliminated” from the United States22, more than 1200 confirmed cases were reported in 31 states in 201923.

The deteriorated surveillance over an increased susceptible population of one of the most infectious viruses highlights the value of real-time surveillance systems for measles. The WHO has recommended the Moving Epidemic Method (MEM) as a tool for assessing the severity of epidemics24,25. We previously applied the MEM to Google Trends search data for respiratory syncytial virus (RSV) to demonstrate the feasibility of using Google Trends as a data source for real-time monitoring of RSV outbreaks26. This approach complements existing surveillance systems to monitor disease outbreaks in real-time, especially in countries with limited or no sentinel network surveillance. An important step in validating this surveillance approach is to obtain both Google Trends search data and clinical case data to verify that these data are highly correlated and result in equivalent estimates for outbreak thresholds. In this study, we aim to explore the feasibility of extending this surveillance approach to other diseases, using measles as a worked example. Compared to previous work for RSV, which has no widespread immunization program, 81% and 71% of children had received 1 and 2 doses of measles-containing vaccines respectively in 183 WHO member states by the end of 202127. This high vaccination coverage could lead to much smaller outbreaks and potentially much weaker signals reflected on Google Trends. Consequently, other studies have found high correlation between monthly clinical case and Google Trends data over measles by summing up 3 countries’ Google Trends signals and cases for Italy, France, Germany, and Romania during 2013–2018 due to each country’s weak Google Trends signal28,29.

This study aimed to provide guidance for evaluating whether Google Trends can be applied to monitoring other diseases, such as measles. If Google Trends search data is found to be highly correlated with disease clinical case data in the context of a highly-vaccinated disease like measles, then previously published methods can be adapted to establish a pseudo-surveillance system for measles. We developed insights into what disease outbreak patterns are captured by Google Trends at both country and regional levels, how to better utilize these data, and limitations of using Google Trends to monitor disease outbreaks. We also share insights of which similarity measurements may be more suitable for this particular task. Popular performance measurements are adopted with further justification in this application area. However, those widely used performance measurements could lead to dramatically different conclusions30,31.

Methods

Correlation analysis of measles between Google Trends search data and clinical case data was performed to evaluate if Google Trends search data are highly correlated with clinical case data, even for highly vaccinated diseases like measles. If so, then the same methods from the previous study26 can be easily adapted to other diseases to establish the pseudo-surveillance system. The analysis was performed at the country level across 29 EU/EEA Member States and the UK. Japan and Germany were investigated at the regional level. With limited clinical case data, only Google Trends search data of the top 10 countries with the largest number of measles cases from October 2022 to March 2023 were evaluated.

Data

Monthly measles clinical case data for 29 EU/EEA Member States and the UK from 2016/04 to 2020/02 were collected from the European Centre for Disease Prevention and Control (ECDC) monthly measles and rubella monitoring reports32. Empty entries were filled with the floor of the average for previous and next months. Japan and Germany were selected for further investigation at the regional level, as the weekly case reports of those two countries at regional level were available. Weekly measles clinical case data in Germany from 2017 to 2019 were obtained from SurvStat database provided by Robert Koch Institut (RKI)33. Weekly measles clinical case data for Japan from 2017 to 2019 were gathered from the National Institute of Infectious Diseases (NIID)34.

Google Trends35 search data reflects how a specific search interest varies for a region over time, ranging from 100 to 0%, scaled by the highest search number that a specific search interest ever generated within the chosen time period. Weekly or monthly data points are extracted if the chosen time period is shorter or longer than 5 years, respectively. The keyword “麻疹”, in Japanese was used for Japan, and “Measles” in boèth English, as well as translations into the first language of each European country using Google Translate, were used. The keyword “Measles”, in English, was used for the top 10 countries with the largest number of measles cases from October 2022 to March 2023.

Measurement

Both Pearson’s correlation coefficient (PCC) and Spearman’s correlation coefficient (SRCC) were calculated between Google Trends and clinical case data. PCC measures the linear correlation between two sets of data, while SRCC measures the rank correlation (i.e., the statistical dependence between the rankings of two variables). Both range from − 1 to 1, with 1 indicating perfect correlation, 0 indicating no correlation, and − 1 indicating perfect anti-correlation. PCC does not imply significance of SRCC (and vice versa)36. Results of both estimators with the statistical significance levels of 0.05 and 0.01 were listed, as both statistics have been used in previous studies26,29. The Python library package SciPy37, was used to perform the correlation analyses.

Results

Outbreaks captured in Google Trends for high-income countries

The monthly number of measles cases for all 29 EU/EEA member states and the UK from 2016/04 to 2020/02 is shown in Fig. 1. For illustration purposes, among 30 countries, the top 10 countries ranked by number of total cases showed clear acute outbreak patterns in Fig. 1. Correlations between monthly Google Trends search and clinical case data of the top 10 member states and the UK by month from 2016/04 to 2020/02 are shown in Fig. 2. The results for all countries are listed in Table 1. Countries with blank results are due to: (1) The measurement is not statistically significant (p-value\(\ge\)0.05); (2) No search activities for the specified keyword were captured on Google Trends data during the selected time period. Google Trends with keywords in each country’s official language usually resulted in a higher correlation with clinical case data compared to keywords in English. A search with keywords combined in multiple languages does not necessarily result in a higher correlation.

Measles outbreaks were not captured on Google Trends for LMICs. The top 10 countries with the largest number of measles cases ranged from 68,473 (India) to 1769 (Nigeria) from October 2022 to March 202338 were investigated. Only India showed clear patterns on Google Trends.

Figure 1
figure 1

The monthly number of measles cases for 29 EU/EEA member states and the UK from 2016/04 to 2020/02. Number of total cases are indicated after the country name in the legend. Plots of 10 countries with the most number of cases are shown in lower figures in various scales to show the outbreak trends.

Figure 2
figure 2

Correlation between monthly Google Trends and clinical case data of top 10 EU/EEA member states and the UK by month from 2016/04 to 2020/02.

Table 1 Correlation between monthly Google Trends and clinical case data for 29 EU/EEA member states and the UK from 2016/04 to 2020/02.

Accurate acute outbreaks captured in Google Trends at regional level

High correlations were found between weekly Google Trends search and clinical case data. Germany and Japan were investigated at regional level. For Germany, low correlations for either the Pearson correlation coefficient (PCC) (0.25) or the Spearman’s rank correlation coefficient (SRCC) (0.37) measurements were observed at the country level, as shown in Fig. 3. At the regional level, two states were selected for illustration purposes. Lower Saxony was selected because it had the highest Google Search volumes compared to all other states. North Rhine-Westphalia was selected because it had the highest number of cases from 2017 to 2019. At the country level, the outbreak in 2018 was completely missed on Google Trends. However, it was well captured on Google Trends in regions where the outbreak occurred (e.g., North Rhine-Westphalia). Regions (e.g., Lower Saxony) without any outbreak in 2018 showed no activity on Google Trends as well. Similar observations were found in Japan. At the country level, both low correlations for PCC (0.33) and SRCC (0.37) measurements were observed from 2016 to 2019 as shown in Fig. 4. In 2017, the outbreak was not captured on Google Trends for at the country level, but it was captured on Google Trends of Yamagata, where the outbreak occurred. In 2018, although Google Trends search and clinical case data aligned well, Google Trends of big cities (e.g., Tokyo, Kyoto) also captured search volume spikes, where no outbreak happened. The outbreak was mainly in Okinawa. In 2019, the amplitude of Google Trends signals was far lower than clinical case data. This is because several cases happened in multiple regions, adding up to a high number of weekly cases at the country level. Acute outbreaks (sudden large numbers of cases within a short period of time) were captured on Google Trends in Osaka. However, there were few cases circulating around during a long period of time in Tokyo, which did not trigger a high search volume spike pattern on Google Trends.

Figure 3
figure 3

Correlation between weekly Google Trends search and clinical case data of Germany and regions with most Google searches and cases between 2017 to 2019.

Figure 4
figure 4

Correlation between weekly Google Trends search and clinical case data of Japan and regions with most cases between 2017 to 2019.

The Pearson correlation coefficient more suitable than the Spearman’s rank correlation coefficient

The Pearson correlation coefficient (PCC) seems to be more suitable than Spearman’s rank correlation coefficient (SRCC) estimation for this task. For example, for Poland, as shown in Fig. 2, using the keyword in English for Google Trends resulted in a pattern more similar to the clinical case data, leading to a higher PCC (0.77 vs. 0.32) and a lower SRCC (0.44 vs. 0.52) compared to using the “odra” keyword in Polish. For Belgium, the first spike in clinical data was completely missed in Google Trends, resulting in a low PCC (0.35), but a high SRCC (0.67). In Japan, as shown in Fig. 4, Okinawa showed perfect correlation between Google Trends search and clinical case data. However, the SRCC only yielded a low value of 0.40, while the PCC showed 0.86.

Discussion

Google Trends can complement existing surveillance systems for monitoring disease outbreaks in real-time. High correlations between Google Trends search and clinical case data were observed for measles. It is most suitable to monitor acute disease outbreaks at the regional level in high-income countries. Although these high-income countries usually have high-quality weekly case reports, we observed that weekly reports may be delayed for several weeks due to various reasons. On the other hand, Google Trends is able to provide weekly trends in real-time. It can also be used as a supplemental surveillance system for countries with limited sentinel network coverage.

Occasionally, a single keyword such as “measles” in the first language could be sufficient for identifying the clear outbreak patterns for measles on Google Trends in most countries. Adding the keyword “measles” in English may result in noisier data, which could lower the accuracy of monitoring outbreaks using Google Trends.

When estimating correlations between Google Trends search and clinical case data, the Pearson correlation coefficient seems to be more suitable than Spearman’s rank correlation coefficient for this particular task.

Previous studies have only investigated the correlations between clinical case data and Google Trends search data for measles at the country level28,29,39. For example, due to the weak signal from Google Trends data, Samaras and colleagues aggregated Google Trends data from three countries to evaluate the correlation with clinical case data29. In contrast, we evaluated correlations at the regional level and found that correlations between clinical case data and Google Trends were stronger at the regional level than the national level. Using this approach in developing a pseudo-surveillance system has greater potential to localize disease outbreaks.

Limitations

There are also limitations to using Google Trends to monitor disease outbreaks. At the country level, Google Trends does not work well in LMICs. This may be due to poor Internet infrastructure limiting Internet access, low education levels, or low healthcare coverages, limiting knowledge-seeking behaviors. In high-income countries, compared to acute outbreaks, Google Trends cannot capture prolonged outbreaks with very few cases (<10 cases/week) circulating around all the time, such as the outbreaks in Tokyo in 2019 shown in Fig. 4. This may be due to the disease being around for too long but not widespread, causing people not to worry to continue to search. Also, local signals on Google Trends may not necessarily mean local outbreaks, such as the spikes on Google Trends of Tokyo and Osaka in 2018. This may be due to searches in big cities are coming from people like news staff, healthcare officials, or researchers, whose searches are not related to local outbreaks only. However, big cities usually have alternative existing surveillance systems to confirm whether there is a local outbreak. Google Trends data are sensitive to the selection of keywords. In this paper, we’ve only used one keyword to identify trends for our preliminary investigation, which could be more prone to false alerts triggered from news that may not relate to disease outbreaks.

Conclusion

This paper investigated the adaptation and feasibility of monitoring disease outbreaks using Google Trends data in real-time, especially for countries and diseases with limited or no sentinel network surveillance system. Using measles as an extreme case, which was much less widespread due to high vaccination coverage rates and early introduction (i.e., more than 60 years ago), Google Trends was found to be a potentially useful tool for monitoring of disease outbreaks at the regional level in developed countries. These results show promising potential for Google Trends data to be used in real-time disease surveillance for many diseases, even in challenging contexts. The Pearson correlation coefficient was more suitable than Spearman’s rank correlation coefficient with respect to evaluating correlations between clinical case data and Google Trends search data.