Using Big Healthcare Data to Accelerate Medical Discovery

Marzieh Nabi, Research Scientist and Technical Lead, PARC

Metabolic syndrome, which increases the risk of heart disease, stroke and diabetes, is a medical condition that affects the health of nearly 34 percent of Americans. After more than ninety years of research, it’s now understood that the syndrome is caused by a cluster of conditions—increased blood pressure, high blood sugar, excess body fat around the waist and abnormal cholesterol levels.

"The promise of big healthcare data is set to significantly pick up the pace, kicking off a new age of intelligent medicine"

This understanding has transformed the treatment of metabolic syndrome and improved the quality of life of those affected. Yet the process of medical discovery has historically been slow: it starts with a small set of observations and proceeds through many pre-clinical and clinical trials on different patient cohorts. Heterogeneous environments, uncertainty in the original hypotheses, the passage of time and accumulating costs make it a very complex process.

The promise of big healthcare data is set to significantly pick up the pace, kicking off a new age of intelligent medicine where information from different medical resources will become integrated. When combined with clinical perspectives from medical care professionals, we will see the pace and reach of medical discovery change in ways that we can only now start to imagine.

Sharing Data Sets from Different Resources

Artificial intelligence and machine learning approaches hold the potential to reveal hidden information in biological and medical healthcare datasets. Combining observational data, medical and pharma trials data, medical literature, knowledge of key functions like metabolic and genetic pathways, and more will change the pace and outcome of medical discovery in the near future. This will lead to the development of novel diagnostic and prognostic tests as well as descriptive, predictive and prescriptive analytics that guide hypothesis generation.

With the right infrastructure in place, it will be possible to build better treatment plans, prevent disease more efficiently, explore new medications, and more. This would also drive the need for innovative business models around knowledge discovery, reshaping the healthcare landscape.

It’s clear that inpatient Electronic Medical Record (EMR) data enables scientific discovery to some extent. But imagine the impact on medical discovery if, for example, we could combine ambulatory outpatient data and readings from Quantified-Self (QS) devices with inpatient EMR data. Many inaccuracies arise when conversations between patients and physicians are transcribed, and these ultimately affect EMRs; patient-generated data sets are more accurate. Integrating these data sets with inpatient and outpatient EMRs will improve the efficiency of care delivery. Additionally, large amounts of anonymized, integrated data can be used to model population health and to develop new drugs and treatments.

While the possibilities are exciting, it’s important to note the limitations big healthcare data poses to the process of knowledge discovery. Rich healthcare datasets exist—electronic medical records, large collections of complex physiological information, medical imaging data, genomics, as well as other socio-economic and behavioral data—but it’s not easy for artificial intelligence and machine learning researchers to extract knowledge from them. Transforming the verbal exchange between patient and physician into written notes on a medical chart, and from there into the International Classification of Diseases (ICD) codes used in EMR data, is a complex process that introduces substantial coding errors. In addition, coding standards are not universal: different hospitals maintain their own coding quality standards. Medical claims are the backbone of EMR data, but they are collected for billing purposes, which brings yet another source of bias into the data.

In order to perform data-driven analysis or build causal models using these datasets, challenges such as integrating multiple data types, dealing with missing data and handling irregularly sampled data must be addressed.
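To make the irregular-sampling challenge concrete, here is a minimal sketch of one common remedy: aligning irregularly timed measurements onto a regular time grid by carrying the last observation forward. The data, function name, and grid parameters are all hypothetical illustrations, not part of the analysis described in this article.

```python
from bisect import bisect_right

def resample_forward_fill(samples, start, stop, step):
    """Align irregularly timed (time, value) samples onto a regular grid
    by carrying the last observation forward; None before the first one."""
    samples = sorted(samples)
    times = [t for t, _ in samples]
    grid = []
    t = start
    while t <= stop:
        i = bisect_right(times, t)  # number of observations at or before t
        grid.append((t, samples[i - 1][1] if i else None))
        t += step
    return grid

# Hypothetical heart-rate readings taken at irregular hours
readings = [(1, 72), (4, 80), (9, 76)]
print(resample_forward_fill(readings, 0, 10, 2))
# [(0, None), (2, 72), (4, 80), (6, 80), (8, 80), (10, 76)]
```

Last-observation-carried-forward is only one imputation strategy; for clinical data, the right choice depends on how quickly the underlying quantity can change between observations.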

Additionally, clinical perspectives from medical care professionals are required to assure that advancements in healthcare data analysis result in positive impact to eventual point-of-care and outcome-based systems.

Knowledge Discoveries in Comorbidity Analysis of Autism

Comorbidity—a patient having two or more chronic conditions—was named the 21st century challenge for healthy aging by the “White House Conference on Aging” in 2014. In developed nations, about one in four adults have at least two chronic conditions, and more than half of older adults have three or more. In the United States, the $2 trillion healthcare industry spends 71¢ of every dollar on treating individuals with comorbidities. In Medicare spending, the amount rises to 93¢ of every dollar.

Costs related to Autism Spectrum Disorder (ASD) were estimated at $268 billion in the United States during 2015. ASD refers to a group of complex brain-based developmental disorders characterized by challenges in behavior, social skills, and communication. These communication impairments, combined with an ambiguous presentation of symptoms, create a climate in which comorbidities in patients with autism are not always discovered or treated.

Clinicians who treat patients with comorbidities must be cognizant of many layers of care and complexities. An automated platform designed to integrate information from different medical resources could help clinicians better address this challenge, especially in patients with autism. It would benefit society by improving patient treatment and reducing the financial and emotional burden carried by families and caregivers.

What Big Data Allowed Me to See

As a scientist, I am always eager to get hold of interesting data sets. I recently gained access to a rich longitudinal inpatient EMR data set with more than nine million unique patients. I began researching comorbidities associated with autism, and how they evolve over time.

I considered patients from 0 to 35 years old and divided them into five-year age buckets.
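As a small illustration, the bucketing step might look like the sketch below. The exact bucket boundaries (half-open five-year ranges, with age 35 folded into the last bucket) are my assumption; the article does not specify them.

```python
def age_bucket(age, width=5, max_age=35):
    """Map an age in [0, max_age] to a five-year bucket label, e.g. 7 -> '5-9'.
    Ages equal to max_age fall into the last bucket (assumed behavior)."""
    if not 0 <= age <= max_age:
        raise ValueError("age outside study range")
    lo = min(age // width * width, max_age - width)
    return f"{lo}-{lo + width - 1}"

print(age_bucket(0))   # '0-4'
print(age_bucket(17))  # '15-19'
print(age_bucket(35))  # '30-34'
```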

The next step was to choose the methodology. I chose the Apriori algorithm, which is used for frequent itemset mining and association rule learning over transactional databases. The algorithm was introduced in 1994 and has proved practical in many different industries. We applied Apriori to our data set and were able to confirm some already-known facts about the comorbidities of autism, such as a higher prevalence of epilepsy. We observed how digestive problems in people with autism evolve from early age to adulthood, how obesity and diabetes change over time as conditions comorbid with autism, and how epilepsy and mental disorders progress in patients with autism over time. We plan to publish the results of this analysis in a scientific paper.
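A minimal, self-contained sketch of Apriori-style frequent itemset mining is shown below. It treats each hospital admission as a “basket” of diagnosis labels; the diagnosis names, the transactions, and the support threshold are invented for illustration and are not drawn from the data set described in this article.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Minimal Apriori: return frequent itemsets (frozensets) with support
    >= min_support, pruning candidates whose subsets are not frequent."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    support = lambda s: sum(s <= t for t in transactions) / n
    items = {i for t in transactions for i in t}
    current = [s for i in items if support(s := frozenset([i])) >= min_support]
    frequent = {s: support(s) for s in current}
    k = 2
    while current:
        # join step: combine frequent (k-1)-itemsets into k-item candidates
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # prune step: every (k-1)-subset of a frequent itemset must be frequent
        current = [c for c in candidates
                   if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
                   and support(c) >= min_support]
        frequent.update({c: support(c) for c in current})
        k += 1
    return frequent

# Hypothetical per-admission diagnosis "baskets"
visits = [{"autism", "epilepsy"},
          {"autism", "epilepsy", "gi_disorder"},
          {"autism", "gi_disorder"},
          {"epilepsy"}]
freq = apriori(visits, min_support=0.5)
print(sorted((sorted(s), sup) for s, sup in freq.items()))
```

On this toy data the pair {autism, epilepsy} surfaces with support 0.5, mirroring in miniature the kind of association the analysis above looked for at scale.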

Integrating different data resources—such as pharma data, clinical data, inpatient and outpatient data, quantified-self device readings, and insurance information—could add value and new perspectives to the knowledge discovery process. Because we used inpatient data sets, our analysis focused on the very sick subset of patients with autism who visited the hospital for a serious medical condition. Patients with autism appear more often in ambulatory settings than in hospitals, so more information could be captured if ambulatory data were also available.

Admittedly, the process of compiling, combining, researching, analyzing and distributing the data is complex, and the road to get there is still undetermined. But the implications of sharing data are far-reaching and can positively impact patient diagnosis and care, so the road is worth traveling.
