
Big Data: An Evolutionary Perspective on Data Warehouse Architecture


Moises J. Nascimento, Chief Data Architect, PayPal
The challenge of developing an enterprise data system that is able to meet millisecond transaction response times—and, simultaneously, integrate data fast enough for near real-time analysis across multiple data platforms—can be overwhelming. Most companies struggle with multiple data silos built across platforms that blend transactional database systems, data warehousing, Big Data systems, NoSQL, in-memory stores and message buses, dragging years of technical debt in their wake.
What is the way out? Let’s first take a look at how we got here.
From Mainframe to Data Warehousing
In the early 90’s, when I was around seventeen years old, I started working as an intern on the database architecture team at a large car factory in Brazil. We were using the “state of the art”–an IBM mainframe running IMS and DB2 (hierarchical and relational) databases. On the mainframe, we had a very consistent and mature data management system, with data models, physical schemas, metadata and standardization. However, reporting capabilities were very limited.
At the time, delivery of month-end reports meant loading stacks of boxes—containing the printouts—into a pickup truck. The reports were delivered to an army of analysts, who would laboriously enter this data into Excel so that they could run analyses, build aggregations and do graphical work.
When client/server became an alternative to mainframe, I started working as an Oracle DBA. Though I would expand and leverage the same data architecture principles from the mainframe, the new architecture presented a major challenge: how to integrate data across platforms.
“In the early days, the common challenges were on the physical design, ETL architecture, network latency, unstable storage systems, and database servers that were just starting to add features like parallel processing”
The answer to that was Data Warehousing; I started building a marketing and financial database that would bring all the data back together. During those early days, the common challenges were on the physical design, ETL architecture, network latency, unstable storage systems, and database servers that were just starting to add features like parallel processing.
While most DW literature would make references to data mining and unstructured data management, there was no technology available outside the mainframe to process massive amounts of data. Yet, even as relational EDW databases matured, they were not designed to deal with unstructured data sets. While we attempted to use creative solutions, the growth of data and its use would soon reveal that the EDW model was not going to scale or make sense from a cost perspective.
The EDW limitation worsened with the growth of e-commerce, when the amount of unstructured data started to increase: Web traffic data, logs, and data from social networks. We clearly needed a new massively parallel processing paradigm.
Hadoop: A Data Warehouse Architect’s Dream Come True
Data Warehouse architecture helped us address many data management challenges in the context of a largely distributed database environment. However, unstructured data management, as well as scientific data processing and mining, constituted a major gap.
When I started researching Hadoop, I was really excited by the possibilities of addressing these gaps. However, with the new technology came the hype, and all of a sudden it looked like Data Warehousing was a thing of the past. What the new data professionals missed is that to successfully harness the power of the vast amount of unstructured data, it was critical that the core company transactional data be integrated and modeled together—in order to add context to the unstructured data.
Therefore, in order to be able to architect a data system that leverages both EDW and Hadoop, we need to revisit some old EDW facts that are no longer true:
1) All data must be in one monolithic EDW server.
2) Analytical data duplication and redundancy is bad.
3) EDW is a downstream system.
From the “Warehouse” to a “Store” Next to You!
Before we cover my view on the next generation of data systems, let me talk a bit about how I see the Hadoop ecosystem today. The system is in its initial phase of evolution, and I believe the paradigm will continue to evolve to perform DW RDBMS-like functions on a much cheaper, open source, shared-nothing architecture. Solutions like HBase, Impala and Drill are proof of this trend, but we are not there yet.
Therefore, considering the current state of maturity of the Hadoop ecosystem, and applying the core Data Warehousing and Data Architecture functions, let’s look at the traditional role of the EDW and how it can integrate with Hadoop in an effective architecture that considers storage and access patterns to rationalize data across the platform and throughout its lifecycle.
In the new architecture, we expand all layers to create a flexible, on-time (online, near real-time, batch) and democratic data platform without losing control over data governance, quality and the source of record. To achieve that, we combine the Operational Data Store (ODS) with the core EDW layer, where we achieve a lower-latency repository for all source-of-record data and core metrics, so that the data gets distributed for all the different kinds of analytical usage. We also achieve lower latency by bringing data streaming, real-time analytics engines (such as Storm) and Hadoop into the Integration layer to perform all data processing closer to the source, with controlled performance and SLAs. The three EDW facts I mentioned earlier could then be rewritten in this new architecture:
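The integration layer described above is essentially a fan-out: every incoming event reaches both a low-latency streaming path and a batch landing zone, so near real-time metrics and the batch source of record never diverge. The following is a minimal Python sketch of that idea under my own assumptions; the class and method names are illustrative and do not correspond to any specific product such as Storm or Hadoop.

```python
# Illustrative sketch: one ingestion point fans each event out to a
# streaming path (low-latency aggregates) and a batch path (raw landing).

class StreamingPath:
    """Stands in for a real-time engine (e.g. a Storm topology)."""
    def __init__(self):
        self.metrics = {}  # running aggregates, available immediately

    def process(self, event):
        key = event["type"]
        self.metrics[key] = self.metrics.get(key, 0) + event["amount"]

class BatchPath:
    """Stands in for the Hadoop / core-EDW landing zone."""
    def __init__(self):
        self.landed = []  # raw events retained for later batch processing

    def process(self, event):
        self.landed.append(event)

class IntegrationLayer:
    """Single entry point: every event reaches all registered paths."""
    def __init__(self, *paths):
        self.paths = paths

    def ingest(self, event):
        for path in self.paths:
            path.process(event)

streaming, batch = StreamingPath(), BatchPath()
layer = IntegrationLayer(streaming, batch)
for e in [{"type": "payment", "amount": 10},
          {"type": "payment", "amount": 5},
          {"type": "refund", "amount": 3}]:
    layer.ingest(e)

print(streaming.metrics)  # near real-time aggregates per event type
print(len(batch.landed))  # raw events kept for batch/EDW processing
```

Because both paths consume the same stream at the same point, the streaming view is always a derived summary of exactly what the batch store will later process, which is what keeps the SLA-controlled processing "closer to the source."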
1) Data is rationalized and stored in an integrated data platform dedicated to real-time streaming, batch and core EDW metrics processing.
2) Data is replicated and distributed into the end-user RDBMS and Hadoop environments while, at the same time, maintaining the EDW concept of a single source of truth.
3) The EDW and ODS concepts merge and become an enterprise data store and source of record for the core data sets and metrics. The EDW becomes an integration environment without ad hoc querying.
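The second point, replicating freely while keeping one source of truth, can be sketched as a canonical store that pushes read-only copies to downstream environments. This is a hypothetical illustration of the pattern, not an implementation; real systems would use replication tooling rather than in-process copies, and all names below are my own.

```python
# Illustrative sketch: writes go only to the source of record, which
# pushes copies to downstream analytical environments (RDBMS marts,
# Hadoop datasets). Replicas never originate data.
import copy

class SourceOfRecord:
    """Canonical integrated store: the only place writes are allowed."""
    def __init__(self):
        self._data = {}
        self.replicas = []

    def register(self, replica):
        self.replicas.append(replica)
        for key, value in self._data.items():  # initial sync of history
            replica.receive(key, copy.deepcopy(value))

    def write(self, key, value):
        self._data[key] = value
        for replica in self.replicas:  # push-based replication
            replica.receive(key, copy.deepcopy(value))

class ReadReplica:
    """Downstream copy (an RDBMS mart or a Hadoop dataset): read-only."""
    def __init__(self, name):
        self.name, self.data = name, {}

    def receive(self, key, value):
        self.data[key] = value

sor = SourceOfRecord()
mart = ReadReplica("finance_mart")   # hypothetical RDBMS mart
lake = ReadReplica("hadoop_lake")    # hypothetical Hadoop dataset
sor.register(mart)
sor.register(lake)
sor.write("daily_tpv", 1_000_000)

# Both downstream copies now hold the same value, but neither is the
# source of record: redundancy without losing the single source of truth.
```

The design choice is that duplication stops being "bad" precisely because every copy is derived from, and reconcilable against, one authoritative store.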
In conclusion, when we leverage the strengths of data warehousing architecture, Big Data technologies and cloud computing principles, we can build a data platform where data and insights can be delivered as a service.
