Turning a Sparse Data Problem into a Big Data Solution

Nita Madhav, MSPH, Vice President of Data Science, Metabiota

Infectious disease epidemics can cause significant human and economic harm. The second-largest Ebola epidemic on record (with 2,309 cases reported by the Democratic Republic of the Congo Ministry of Health as of 27 May 2019) continues to rage, with no end in sight. In the United States, measles cases have reached their highest level in 25 years, according to the US Centers for Disease Control and Prevention. Zika virus caused an estimated $18 billion in economic losses in Latin America and the Caribbean, according to the United Nations. Governments and corporations continue to face disease threats and need innovative strategies to prepare for, mitigate, and manage the risk. Traditionally, epidemics pose a sparse data problem, but today’s technologies make it possible to mine big data insights even when starting out with little information.

Sparse Data

During the course of an epidemic, the data available typically include officially reported counts of cases and deaths at different time points and locations. However, many cases go undetected and unreported, so a large portion of the epidemic risk space remains unmeasured and unknown. This gap is even more acute for very rare but highly catastrophic scenarios, which can be enormously devastating for countries and companies.

While electronic data sources are extremely valuable, several challenges currently stand in the way of realizing their full potential. Electronic health records capture a wide variety of clinical and utilization data but can run afoul of privacy concerns. Similarly, cell phone location records can be invaluable for estimating how people move, an important predictor of where an outbreak might go next; however, these data often cover only a subset of the population and are typically proprietary. So while the potential exists to use many different data sources, in practice they are very difficult to obtain and use.

Synthesizing Data

Given the need to fill in the knowledge gaps in the epidemic risk space, it is necessary to convert the sparse data problem into a big data solution through synthetic data generation. The term “synthetic data” refers to data that are generated rather than directly observed, for example by complex mathematical simulation models. The term is seeing broader usage in the field of data science and is expected to gain even more traction in the future.

At Metabiota, we generate these data by feeding sparse real-world inputs, such as statistical distributions of disease-spread parameters, into computationally intensive epidemic simulation models that replicate the entire world (all 7.5 billion people) and estimate where an epidemic starts, how it spreads from person to person, and how it moves from place to place. Each simulated epidemic is tracked on a daily time step until it burns itself out or is successfully contained using intervention measures such as quarantines and vaccines. We track hundreds of thousands of simulated epidemics in this way, which provides a large dataset from which to derive insights about potential impacts, such as numbers of infections, hospitalizations, deaths, employee absences, and monetary losses.
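
To make the idea concrete, the sketch below is a minimal stochastic epidemic simulation in R on a daily time step, with disease-spread parameters drawn from assumed statistical distributions. The function name, distributions, and parameter values are illustrative assumptions for this article, not Metabiota’s production model, which is far more detailed (including spatial spread and intervention measures).

```r
# Minimal sketch of a stochastic, daily-time-step epidemic simulation.
# All names, distributions, and parameter values are illustrative
# assumptions, not Metabiota's production model.

set.seed(42)

simulate_epidemic <- function(pop_size = 1e6, max_days = 730) {
  # Draw disease-spread parameters from assumed statistical distributions
  r0            <- rlnorm(1, meanlog = log(2), sdlog = 0.3)  # basic reproduction number
  infectious_pd <- rgamma(1, shape = 5, rate = 1)            # infectious period (days)
  beta  <- r0 / infectious_pd   # daily transmission rate
  gamma <- 1 / infectious_pd    # daily recovery rate

  S <- pop_size - 1; I <- 1; R <- 0
  total_cases <- 1

  for (day in seq_len(max_days)) {
    # Chain-binomial style daily transitions
    new_inf   <- rbinom(1, S, 1 - exp(-beta * I / pop_size))
    new_recov <- rbinom(1, I, 1 - exp(-gamma))

    S <- S - new_inf
    I <- I + new_inf - new_recov
    R <- R + new_recov
    total_cases <- total_cases + new_inf

    if (I == 0) break  # the simulated epidemic has burned itself out
  }

  data.frame(total_cases = total_cases, duration_days = day, r0 = r0)
}

# Build a synthetic catalog of simulated epidemics
catalog <- do.call(rbind, replicate(1000, simulate_epidemic(), simplify = FALSE))
summary(catalog$total_cases)
```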

This rich synthetic dataset allows us to explore a wide set of realistic scenarios and provides insights about just how bad an epidemic could be, how likely an epidemic of a given size is, and which types of intervention measures are most effective. Ultimately, these insights can help countries and companies make decisions about optimal risk mitigation plans for future outbreaks.
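
As one illustration of the kind of mining this enables, the sketch below (continuing with the hypothetical `catalog` data frame from the previous example) computes exceedance probabilities and tail percentiles of epidemic size; the 10,000-case threshold is an arbitrary example.

```r
# Illustrative only: deriving risk insights from the hypothetical `catalog`
# of simulated epidemics built in the previous sketch.

# Probability that a simulated epidemic reaches at least `threshold` cases
exceedance_prob <- function(catalog, threshold) {
  mean(catalog$total_cases >= threshold)
}

# Tail severity: median, 95th, and 99th percentile epidemic sizes
quantile(catalog$total_cases, probs = c(0.50, 0.95, 0.99))

# How often do simulated epidemics exceed 10,000 cases?
exceedance_prob(catalog, threshold = 10000)
```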


Selecting the Right Tools, Infrastructure, and Resources

Synthetic data generation can be a very computationally intensive process, and a data science department running large-scale computer simulations requires a significant amount of computing resources. For example, running the tens of millions of epidemic and pandemic simulations to date has required over 90,000 compute hours and 11.4 billion I/O requests, and has produced over 100 terabytes of uncompressed data. Computing at this magnitude is made possible through high-performance cloud computing. To control costs, we utilize multiple cloud providers, select the optimal storage class based on data access frequency, and use spot/preemptible instances whenever possible.
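
As a rough illustration of why these choices matter, the back-of-envelope sketch below compares on-demand against spot pricing and tiered against single-class storage. Every rate in it is a placeholder assumption, not a quote from any cloud provider.

```r
# Back-of-envelope cost sketch. All hourly and per-GB rates below are
# placeholder assumptions for illustration, not actual cloud pricing.

compute_hours  <- 90000
on_demand_rate <- 0.40   # $/compute-hour, assumed
spot_rate      <- 0.12   # $/compute-hour, assumed; spot capacity is typically much cheaper

compute_savings <- compute_hours * (on_demand_rate - spot_rate)

output_gb       <- 100 * 1024   # ~100 TB of uncompressed simulation output
hot_fraction    <- 0.10         # assume ~10% of output is accessed frequently
hot_rate_gb_mo  <- 0.023        # $/GB-month, assumed frequent-access tier
cold_rate_gb_mo <- 0.004        # $/GB-month, assumed archive tier

tiered_storage  <- hot_fraction * output_gb * hot_rate_gb_mo +
                   (1 - hot_fraction) * output_gb * cold_rate_gb_mo
all_hot_storage <- output_gb * hot_rate_gb_mo

cat(sprintf("Compute savings from spot instances: $%s\n",
            format(compute_savings, big.mark = ",")),
    sprintf("Monthly storage, single tier: $%.0f; with tiering: $%.0f\n",
            all_hot_storage, tiered_storage), sep = "")
```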

Once the simulations are complete, we need a way to mine the output and derive insights from the massive amounts of data. To do this, we need the right team and the right tools. First, we have assembled a high-functioning team with top talent and a collaborative structure. Our data analysts do much of the initial data collection and structuring that informs the models, and our data scientists are strong in three key areas: programming, statistics, and subject-matter expertise (for example, epidemiology or actuarial science). In-house tools developed in R are used to generate and analyze the large datasets. However, it is also important to build in flexibility and remain open to experimentation so as not to get locked into a single approach. We have experimented with other approaches and tools, although R continues to win out because of its flexibility and open-source nature, and we continue to explore new data sources, models, and collaboration opportunities.
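
The sketch below shows one simple pattern for this kind of in-house R analysis: summarizing simulation output batch by batch so the full output never has to fit in memory at once. The `output/` directory layout, one-CSV-per-batch convention, and column names are hypothetical assumptions for illustration.

```r
# Minimal sketch: summarize simulation output one batch file at a time.
# The output/ directory, one-CSV-per-batch layout, and column names are
# hypothetical assumptions, not Metabiota's actual pipeline.

files <- list.files("output", pattern = "\\.csv$", full.names = TRUE)

summarise_batch <- function(path) {
  runs <- read.csv(path)
  data.frame(
    n_runs          = nrow(runs),
    mean_infections = mean(runs$infections),
    mean_deaths     = mean(runs$deaths),
    max_absences    = max(runs$employee_absences)
  )
}

batch_summaries <- do.call(rbind, lapply(files, summarise_batch))

# Combine per-batch means into an overall mean, weighted by batch size
overall_mean_infections <- with(batch_summaries,
                                sum(mean_infections * n_runs) / sum(n_runs))
overall_mean_infections
```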

Our business environment has been transformed by the ability to use large-scale computing resources and simulation modeling to generate massive quantities of simulated epidemic data. These data can be mined for deep insights about how epidemics can affect us all. With these insights, countries and companies can more effectively mitigate and manage epidemic risk and improve the world’s resilience to epidemics.
