
A Pragmatic Approach to Handling Big and Little Data Needs


Martin Leach, VP R&D IT, and Venk Dakshin, Sr IT Director, Enterprise Applications & Architecture, Alexion Pharmaceuticals, Inc.
I remember working in the late '90s/early '00s with our first Cyberstorage Systems disk array. There must have been close to a hundred standard-format spinning hard drives amounting to 1-2 TB of usable disk space, and the cost was in the six figures. Today, fifteen years later, you can buy a 2 TB flash drive for less than $15. In 2013, while CIO of the Broad Institute of MIT and Harvard, my team managed 10 PB of spinning disk in support of world-leading genomic and basic research. Today, working at the world's leading biotech for ultra-rare diseases, we also tackle the growing need to make data move frictionlessly through our organization. Unlike the tens of TB of data we would create on a daily basis at the Broad Institute, our volume is much smaller, but we face many of the common challenges around data that really need to be wrestled and wrangled into submission.
A key challenge in building out your 'big data' strategy is laying the foundation while developing the use-cases that will leverage the data capabilities. At Alexion, we don't use the term 'Big Data' but instead focus on 'Enterprise Data Management & Analytics.' Since the scope of data and data needs is so vast, encompassing all areas of the company, we are using a basic 'IDEA' framework (Ingest, Digest, Explore, Analyze) to lay out the needs and flow of data throughout the company.
The data value chain we have created has a standard flow from ETL and data ingestion through to providing the data to a number of end-user applications for reporting, visualization, visual analytics, and data analytics. We decided to start our data journey not with ingestion, as this was covered reasonably well, but with data mastering, to focus on reusing high-quality key data across our organization. We leveraged a cloud-based MDM platform (Reltio) with a graph data model on a Cassandra back-end, which has given us the agility and speed to build out our master data needs for several business areas. With an MDM foundation in place, but a patchwork of capabilities for data ingestion and a multitude of reporting servers for business intelligence, we needed to get away from the many manual steps and data wrangling required to create a seamless data flow. The challenge was where to start, given the competing needs across business units.
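To make the mastering step concrete, here is a toy sketch of the match-and-merge logic at the heart of any MDM effort. This is a generic Python illustration, not Reltio's actual matching engine; the records and the crude normalization rule are invented for the example.

```python
# Toy illustration of match-and-merge in master data management.
# Generic sketch only -- real platforms use far richer probabilistic
# matching and survivorship rules. Records below are invented.

def normalize(name: str) -> str:
    """Crude normalization so trivially different spellings match."""
    return "".join(ch for ch in name.lower() if ch.isalnum())

source_records = [
    {"source": "CRM", "name": "Acme Pharma, Inc.", "city": "Boston"},
    {"source": "ERP", "name": "ACME PHARMA INC",   "city": "Boston"},
    {"source": "CRM", "name": "Lakeside Clinic",   "city": "Chicago"},
]

# Group records that share a normalized key into one master entity;
# keep the first-seen name as the surviving value for simplicity.
masters = {}
for rec in source_records:
    key = (normalize(rec["name"]), rec["city"])
    master = masters.setdefault(key, {"sources": [], "name": rec["name"]})
    master["sources"].append(rec["source"])

for master in masters.values():
    print(master["name"], "<-", master["sources"])
```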
With an existing portfolio of IT projects planned to support business needs, we took the pragmatic approach of using that portfolio to focus our efforts on building out the data capabilities. We narrowed in on several key business areas and mapped out portfolio project needs as follows:
• Did a portfolio project have needs around: Ingest, Digest, Explore, or Analyze?
• Did the needs of a project map into the following Data Capability areas: Data standardization/normalization, Data Cataloging, Data Mastering, Data Lake/Data Mart, Data Cleansing and Auditing, Report Creation, Data Visualization, KPI management, Dataset Preparation/Curation, Analytics Consulting (Data Sciences), Self-service enablement, Data Governance, Data computing.
Following this mapping of projects into capabilities, we generated a clear line of sight into funded initiatives, their data capability needs, and timing, along with a sense of priority around which capabilities should be built out first; a minimal sketch of this tallying is shown below.
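The sketch below shows one way such a portfolio tally could work, assuming invented project names and capability tags purely for illustration:

```python
# Minimal sketch of the portfolio-to-capability mapping described above.
# Project names, phase tags, and capability labels are illustrative,
# not Alexion's actual portfolio.
from collections import Counter

# Each portfolio project is tagged with the IDEA phases it touches and
# the data capability areas it needs.
portfolio = [
    {"project": "Commercial KPI dashboard",
     "phases": ["Ingest", "Digest", "Explore"],
     "capabilities": ["Data Mastering", "Report Creation",
                      "KPI management"]},
    {"project": "Clinical data mart",
     "phases": ["Ingest", "Digest"],
     "capabilities": ["Data Lake/Data Mart", "Data Cataloging",
                      "Data Cleansing and Auditing"]},
    {"project": "R&D analytics pilot",
     "phases": ["Explore", "Analyze"],
     "capabilities": ["Dataset Preparation/Curation",
                      "Analytics Consulting (Data Sciences)",
                      "Data Lake/Data Mart"]},
]

# Tally how many funded projects need each capability; the most-demanded
# capabilities are candidates to build out first.
demand = Counter(c for p in portfolio for c in p["capabilities"])
for capability, count in demand.most_common():
    print(f"{count} project(s) need: {capability}")
```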
A key gap in our current-state architecture was the ability to easily build and manage our data catalog, given the streams and sources of data we have, and furthermore, the ability to slice out pieces of our data catalog to create ad hoc or routine data marts for reporting, visualization, or analytics. We are in the process of adopting a two-tier approach to our data catalog and data marts. Essentially, following ingestion we will deposit our data into a Hadoop-based data catalog, then slice out data on demand into a graph- or SQL-based data store for use in reporting or analytics. Spark and Hive will be key to pulling and extracting slices from our data catalog, but hand-built extraction will become unwieldy over time. We will need to look at building pipeline management (e.g., StreamSets) for the consistent flows from our data catalog to our active repositories (Cassandra or a SQL database) for use in BI and analytics. There are a growing number of visual analytics tools (e.g., Arcadia Data) that can tap directly into Hadoop data catalogs, but our initial approach will be to use standard Tableau, Spotfire, or Qlik-based tools for visual analytics and BI.
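For a sense of what that slice-and-deposit step might look like, here is a minimal PySpark sketch; the Hive table, columns, and JDBC connection details are illustrative placeholders, not our actual schema.

```python
# Minimal sketch of slicing a data mart out of a Hive-backed catalog
# and depositing it into a SQL store. Table names, columns, and the
# JDBC URL are illustrative placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("catalog-slice")
         .enableHiveSupport()   # read from the Hive-based data catalog
         .getOrCreate())

# Slice out only the rows and columns a reporting use-case needs.
slice_df = spark.sql("""
    SELECT region, product_line, report_month, metric_value
    FROM catalog.commercial_metrics
    WHERE report_month >= '2017-01'
""")

# Deposit the slice into a SQL-based data mart for BI tools to query.
(slice_df.write
 .format("jdbc")
 .option("url", "jdbc:postgresql://marts.example.com:5432/bi")
 .option("dbtable", "marts.commercial_metrics_slice")
 .option("user", "etl_user")
 .option("password", "***")   # credentials would come from secure config
 .mode("overwrite")
 .save())
```

A dedicated pipeline tool such as StreamSets would replace hand-run jobs like this with managed, monitored flows, which is why we see ad hoc Spark/Hive extraction becoming unwieldy over time.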
We are examining the manageability and governance of our data platform and looking at end-to-end solutions (e.g., Cognizant Big Decisions, Zaloni, Unifi, SAP) that also provide an easy administration and governance layer to monitor data flow and system operation. This becomes important when you do not have an army of data engineers who understand the idiosyncrasies of keeping your 'big data' technology stack running by feeding it command-line instructions.
As we move forward with our Enterprise Data Management and Analytics roadmap, we have picked an initial high-priority business initiative to 'pull the thread through' the entire value chain. We have mapped out the data flows from source to analytics/reporting and are putting in the building blocks so that data can move without friction. As we put in these building blocks, we are pressure-testing them so that additional business use-cases can also be applied and we don't have to bring in additional technologies for differently shaped data.
The IDEA framework and the data value chain it supports work well for data that flows with frequency, or for reporting/analytics that needs to be reproducibly generated on an ongoing basis. We believe this approach can also work for ad hoc data needs, but it does come with a setup time and cost to get the initial data flowing through the value chain. To overcome this, we have enabled a small set of our colleagues in IT and various business areas (typically business analytics/data science groups) with specialized research computing capabilities, where they can fire off container-based applications for data computing on our cloud-based computing platform. In other cases, data scientists have been enabled to perform desktop computing, doing data manipulation and visualization with tools like the IPython/Jupyter or Beaker data science notebooks. The latter two examples are focused on the specialized data scientists and analysts we have, but they are the beginning of some self-service capabilities that we hope to bring to other areas as our next-generation data platform matures.
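For a flavor of that notebook-based self-service, here is a small Jupyter-style snippet; the file path and column names are invented for illustration.

```python
# Sketch of the kind of notebook-based analysis mentioned above
# (IPython/Jupyter). The CSV path and column names are illustrative
# placeholders for a curated dataset slice.
import pandas as pd
import matplotlib.pyplot as plt

# Pull a curated dataset slice prepared from the data mart.
df = pd.read_csv("curated/commercial_metrics_slice.csv")

# Simple manipulation: monthly totals per product line.
monthly = (df.groupby(["report_month", "product_line"])["metric_value"]
             .sum()
             .unstack("product_line"))

# Visualization an analyst can iterate on interactively in the notebook.
monthly.plot(kind="line", title="Metric by product line over time")
plt.xlabel("Report month")
plt.ylabel("Metric value")
plt.show()
```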
Summing up: to really move your data science and big data agendas forward in an organization, you need an organized portfolio of work to drive them. You will still need data engineers to organize the information coming in, and to intelligently extract it and serve it up to data scientists, BI experts, and report builders. At the end of the day, clean data moving rapidly through to end-user visualization, self-service analytics, or reporting tools is all our end-users are really looking for.
