A Pragmatic Approach to Handle Big and Little Data Needs

Martin Leach, VP R&D IT, Alexion Pharmaceuticals, Inc. and Venk Dakshin, Sr IT Director, Enterprise Applications & Architecture, Alexion Pharmaceuticals, Inc.

I remember working in the late ’90s/early ’00s with our first Cyberstorage Systems disk array. There must have been close to a hundred standard-format spinning hard drives, amounting to 1-2 TB of usable disk space, and the cost was in the six figures. Today, fifteen years later, you can buy a 2 TB flash drive for less than $15. In 2013, while CIO of the Broad Institute of MIT and Harvard, my team managed 10 PB of spinning disk in support of world-leading genomic and basic research. Today, working at the world’s leading biotech for ultra-rare diseases, we also tackle the growing need to make data move frictionlessly through our organization. Unlike the tens of TB of data we created daily at the Broad Institute, our volume here is much smaller, but we face many of the common data challenges that still need to be wrestled and wrangled into submission.

A key challenge in building out your ‘big data’ strategy is laying the foundation while developing the use-cases that will leverage the data capabilities. At Alexion, we don’t use the term ‘Big Data’ but instead focus on ‘Enterprise Data Management & Analytics.’ Since the scope of data and data needs is so vast, encompassing all areas of the company, we are using a basic ‘IDEA’ framework (Ingest, Digest, Explore, Analyze) to lay out the needs and flow of data throughout the company.

The data value chain we have created has a standard flow from ETL and data ingestion through to providing the data to a number of end-user applications for reporting, visualization, visual analytics, and data analytics. We decided to start our data journey not with ingestion, which was already covered reasonably well, but with data mastering, to focus on reusing high-quality key data across our organization. We leveraged a cloud-based MDM platform (Reltio) with a graph data model on a Cassandra back-end, which has given us the agility and speed to build out our master data needs for several business areas. With an MDM foundation in place, but a patchwork of capabilities for data ingestion and a multitude of reporting servers for Business Intelligence, we needed to get away from the many manual steps and data wrangling required to create a seamless data flow. The challenge was where to start, given the competing needs across business units.
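
To make the mastering step more concrete, here is a minimal sketch of the kind of match-and-merge rule an MDM layer applies so that downstream consumers reuse one high-quality ‘golden’ record instead of several conflicting source records. The entity fields, thresholds, and survivorship rule are illustrative assumptions, not Reltio’s actual configuration or API.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Fuzzy string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def is_match(rec_a, rec_b, threshold=0.85):
    """Hypothetical match rule: identical email, or very similar name and city."""
    if rec_a.get("email") and rec_a.get("email") == rec_b.get("email"):
        return True
    return (similarity(rec_a["name"], rec_b["name"]) >= threshold and
            similarity(rec_a.get("city", ""), rec_b.get("city", "")) >= threshold)

def merge(records):
    """Simple survivorship: the most recently updated non-empty value wins."""
    golden = {}
    for rec in sorted(records, key=lambda r: r["updated"]):
        for key, value in rec.items():
            if value:
                golden[key] = value
    return golden

# Two source-system views of the same physician collapse into one golden record.
crm = {"name": "Dr. Jane Doe", "city": "Boston", "email": "jdoe@example.org", "updated": "2016-01-02"}
ctms = {"name": "Jane Doe, MD", "city": "Boston", "email": "jdoe@example.org", "updated": "2016-05-01"}
if is_match(crm, ctms):
    print(merge([crm, ctms]))
```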

With an existing portfolio of IT projects planned to support business needs, we took the pragmatic approach of leveraging that portfolio to focus our efforts on building out data capabilities. We narrowed our focus to several key business areas and mapped out each portfolio project’s needs as follows:

• Did a portfolio project have needs around: Ingest, Digest, Explore, or Analyze?

• Did the needs of a project map into the following data capability areas: Data Standardization/Normalization, Data Cataloging, Data Mastering, Data Lake/Data Mart, Data Cleansing and Auditing, Report Creation, Data Visualization, KPI Management, Dataset Preparation/Curation, Analytics Consulting (Data Sciences), Self-Service Enablement, Data Governance, and Data Computing?

Following this mapping of projects into capabilities, we generated a clear line of sight around funded initiatives, their data capability needs, timing, and a sense of priority around what capabilities should be built out first.
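
To illustrate how that line of sight falls out of the mapping, the sketch below scores a few hypothetical portfolio projects against IDEA stages and capability areas, then counts which capabilities are demanded most and earliest. The project names, quarters, and mappings are invented for illustration; only the stages and capability areas come from the framework above.

```python
from collections import Counter

# Hypothetical portfolio projects mapped to IDEA stages and capability areas.
portfolio = [
    {"project": "Clinical operations dashboard", "quarter": "Q1",
     "idea": ["Ingest", "Explore"],
     "capabilities": ["Data Cataloging", "Report Creation", "Data Visualization"]},
    {"project": "Field KPI pack", "quarter": "Q2",
     "idea": ["Digest", "Explore"],
     "capabilities": ["Data Mastering", "KPI Management", "Data Visualization"]},
    {"project": "Real-world evidence analysis", "quarter": "Q3",
     "idea": ["Analyze"],
     "capabilities": ["Data Lake/Data Mart", "Dataset Preparation/Curation",
                      "Analytics Consulting (Data Sciences)"]},
]

# How many funded projects need each capability, and when is it first needed?
demand = Counter(cap for p in portfolio for cap in p["capabilities"])
first_needed = {cap: min(p["quarter"] for p in portfolio if cap in p["capabilities"])
                for cap in demand}

# The most-demanded, earliest-needed capabilities are candidates to build first.
for cap, count in demand.most_common():
    print(f"{cap}: {count} project(s), first needed in {first_needed[cap]}")
```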

A key gap in our current-state architecture was the ability to easily build and manage our data catalog across the streams and sources of data we have, and, furthermore, to slice out pieces of that catalog to create ad hoc or routine data marts for reporting, visualization, or analytics. We are in the process of adopting a two-tier approach to our data catalog and data marts. Essentially, following ingestion we will deposit our data into a Hadoop-based data catalog and then slice out data on demand into a graph- or SQL-based data store for use in reporting or analytics. Spark and Hive will be key to pulling and extracting slices from our data catalog, but doing this by hand will become unwieldy over time. We will need to look at building pipeline management (e.g., StreamSets) for the consistent flows from our data catalog to our active repositories (Cassandra or a SQL database) for use in BI and analytics. There is a growing number of visual analytics tools (e.g., Arcadia Data) that can tap directly into Hadoop data catalogs, but our initial approach will be to use standard Tableau, Spotfire, or Qlik-based tools for visual analytics and BI.
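
As a minimal sketch of what one such slice might look like, the PySpark snippet below reads from a Hive-backed catalog table and lands the result in a SQL data mart over JDBC. The table, column, and connection names are placeholders, and in practice this logic would live in a managed pipeline (e.g., StreamSets) rather than an ad hoc script.

```python
from pyspark.sql import SparkSession

# Hive-enabled Spark session; assumes the cluster exposes the Hadoop/Hive catalog.
spark = (SparkSession.builder
         .appName("catalog-slice-to-mart")
         .enableHiveSupport()
         .getOrCreate())

# Slice the catalog: pull only the rows and columns a reporting mart needs.
# Table and column names are hypothetical placeholders.
slice_df = spark.sql("""
    SELECT study_id, site_id, visit_date, status
    FROM catalog.clinical_operations
    WHERE visit_date >= '2016-01-01'
""")

# Land the slice in a SQL-based data mart for the BI tools (Tableau, Spotfire, Qlik).
(slice_df.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://mart-host:5432/reporting")  # placeholder endpoint
    .option("dbtable", "reporting.clinical_ops_slice")
    .option("user", "etl_user")      # credentials would come from a secret store
    .option("password", "***")
    .mode("overwrite")
    .save())
```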

We are examining the manageability and governance of our data platform and looking at end-to-end solutions (e.g., Cognizant Big Decisions, Zaloni, Unifi, SAP) that provide not only an end-to-end platform but also an easy administration and governance layer to monitor the flow of data and the operation of the system. This becomes important when you do not have an army of data engineers who understand the idiosyncrasies of keeping your ‘Big Data’ technology stack running by feeding it command-line instructions.

As we move forward with our Enterprise Data Management and Analytics roadmap, we have picked an initial high-priority business initiative to ‘pull the thread through’ the entire value chain. We have mapped out the data flows from source to analytics/reporting and are putting in the building blocks so that data can move without friction. As we put in these building blocks, we are pressure-testing them so that additional business use-cases can also be applied and we don’t have to bring in additional technologies for differently shaped data.

The IDEA framework and the data value chain it supports work well for data that flows with frequency, or for reporting/analytics that needs to be reproducibly generated on an ongoing basis. We believe this approach can also work for ad hoc data needs, but it does come with a setup time and cost to get the initial data flowing through the value chain. To overcome this, we have enabled a small set of our colleagues in IT and various business areas (typically business analytics/data science groups) with specialized research computing capabilities, where they can fire off container-based applications for data computing on our cloud-based computing platform. In other cases, data scientists have been enabled to perform desktop computing, manipulating and visualizing data with tools like the IPython/Jupyter or Beaker data science notebooks. The latter two examples are focused on the specialized data scientists and analysts we have, but they are the beginning of the self-service capabilities that we hope to bring to other areas as our next-generation data platform matures.
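
The self-service work this enables is mostly ordinary notebook analysis against an extract from the data platform. A minimal sketch of such a notebook cell is shown below; the file name and columns are placeholder assumptions rather than an actual dataset.

```python
# Typical notebook cell: pull an extract from the data platform, reshape it,
# and produce a quick visual without waiting on a formal BI report.
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder extract; in practice this would come from the data mart or catalog.
df = pd.read_csv("clinical_ops_slice.csv", parse_dates=["visit_date"])

# A simple self-service question: how many site visits per month, by status?
monthly = (df
           .groupby([df["visit_date"].dt.to_period("M"), "status"])
           .size()
           .unstack(fill_value=0))

monthly.plot(kind="bar", stacked=True, figsize=(10, 4),
             title="Site visits per month by status")
plt.tight_layout()
plt.show()
```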

Summing up: to really move the data sciences and big data agendas forward in an organization, you need an organized portfolio of work to drive them. You will still need data engineers and data scientists to organize the information coming in, intelligently extract it, and serve it up to data scientists, BI experts, and report builders. At the end of the day, clean data moving rapidly through to the end-user visualization, self-service analytics, or reporting tools is all our end-users are really looking for.
