On Big Data & Not Being Evil

John Speakman, Senior Director, Research Information Technology, NYU Langone Medical Center

In 2013, researchers from the Massachusetts Institute of Technology described in the journal Science how they reidentified individuals in a “deidentified” genomic dataset published by the National Institutes of Health (NIH), using only publicly accessible data on the Internet. This and other well-publicized incidents suggest that “deidentified” data may never truly be so. NIH grant solicitations currently require applicants to attest that data will be “fully deidentified,” but the power of combining datasets (Big Data) has gotten ahead of the rules.
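At its core, this kind of reidentification is a linkage attack: records stripped of names can still be joined to public records on shared quasi-identifiers. The sketch below illustrates the idea in Python; the datasets, field names, and matching rule are hypothetical toy examples, not the method used in the Science study.

```python
# Illustrative linkage attack: join a "deidentified" dataset to a public one
# on shared quasi-identifiers. All data here is invented for illustration.

# "Deidentified" research records: direct identifiers removed, but
# quasi-identifiers (ZIP code, birth year, sex) retained.
deidentified = [
    {"zip": "10016", "birth_year": 1957, "sex": "M", "variant": "BRCA2 c.5946delT"},
    {"zip": "10012", "birth_year": 1984, "sex": "F", "variant": "none detected"},
]

# Publicly available records (e.g., a voter roll) carrying names
# alongside the same quasi-identifiers.
public = [
    {"name": "J. Doe", "zip": "10016", "birth_year": 1957, "sex": "M"},
    {"name": "A. Smith", "zip": "10013", "birth_year": 1990, "sex": "F"},
]

def reidentify(research, public_records):
    """Return (name, record) pairs where the quasi-identifiers match uniquely."""
    hits = []
    for r in research:
        matches = [p for p in public_records
                   if (p["zip"], p["birth_year"], p["sex"]) ==
                      (r["zip"], r["birth_year"], r["sex"])]
        if len(matches) == 1:  # a unique match links a name to the record
            hits.append((matches[0]["name"], r))
    return hits

for name, record in reidentify(deidentified, public):
    print(f"{name} -> {record['variant']}")
```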

The first wave of Big Data to hit biomedicine arrived in 2003 with the first sequencing of the human genome, which had taken a decade to complete and cost over a billion dollars. Farsighted individuals were already warning of a “tsunami of data” bearing down on ill-equipped infrastructures. The cost of sequencing has dropped exponentially in recent years, outpacing Moore’s law. In January 2014, the instrument manufacturer Illumina announced a new sequencer with the claim that it could sequence human genomes at a cost of $1,000 each, producing about 250 gigabytes of raw data per genome, with storage and analysis not included. Pundits argued over the semantics of this claim, but as those of us in academic medical centers had known for some time, the tsunami is upon us. Any sequencer assumes that the user already has a robust IT infrastructure of storage, high-performance computing resources, and network bandwidth. Storage in particular is critical: discarding data once it has been analyzed is not an option, because funding agencies and medical journals require researchers to make it available on request. Yet allocating petabytes of storage indefinitely for rarely accessed data is not a palatable option either.
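To make the scale concrete, here is a back-of-envelope estimate in Python. The 250 gigabytes per genome comes from the figure above; the annual throughput and retention period are assumptions chosen purely for illustration.

```python
# Back-of-envelope storage estimate for raw sequence data.
# Throughput and retention figures below are assumptions, not measurements.

GB_PER_GENOME = 250          # raw output per human genome, per the figure above
GENOMES_PER_YEAR = 4_000     # assumed throughput for a busy sequencing core
RETENTION_YEARS = 5          # assumed retention to satisfy data-sharing requests

total_gb = GB_PER_GENOME * GENOMES_PER_YEAR * RETENTION_YEARS
total_pb = total_gb / 1_000_000  # decimal petabytes

print(f"{total_pb:.1f} PB of raw data to keep online or near-line")
# -> 5.0 PB, before counting processed derivatives or backups
```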

Biomedicine is not waiting for us to find a resolution; it is racing ahead. Organizations such as the Institute for Systems Genetics at NYU Langone Medical Center are establishing biology production lines with the potential to generate petabyte-scale volumes of new data annually. Furthermore, healthcare is preparing for whole-genome sequencing, previously a research activity, to become a routine part of patient care. As a result, genome sequence data will be part of every patient record within the next few years.
