
The Other Five V's of Big Data: An Updated Paradigm


Richard Mallah, Director of Unstructured and Big Data Analytics, Cambridge Semantics
By this point, anyone vaguely familiar with big data knows about its associated "five V's": volume, velocity, variety, veracity, and value. That is only half of the big data story, however, and only half of the V's. Industry is just starting to appreciate that management and analysis of big data today also require upfront cognizance of its verbality, verbosity, versatility, viscosity, and visibility. Understand these, and you'll understand why the pitfalls of big data management manifest today, and how to prevent them tomorrow.
Verbality refers to the great majority of data being unstructured, text-like artifacts. It is commonly estimated that unstructured data makes up 80% or more of enterprise data. News, internal emails, call recordings, research reports, presentation decks, customer communications, patents, social media content, and documents of all kinds are highly relevant to your organization as data.
Verbosity means that within unstructured, semi-structured, and structured data alike, there is a lot of redundancy, often accounting for the majority of raw volume. Understanding how to quickly disentangle the meaning you care about from its redundancies is important for processing efficiency, but even more important for supporting the value and versatility dimensions, that is, the reuse of the data.
Versatility of data reflects how useful the data is in different scenarios, and in applications for different sets of stakeholders, despite invariably having been created for one particular purpose. Understanding its quality, provenance, meaning, and context is key to this.
Viscosity is the ease or difficulty with which data can flow to other use cases that would leverage its versatility. Highly viscous data has a lot of internal friction stemming from bespoke, though hopefully internally consistent, representations that, at minimum, require high-touch interpretation, transformation, and integration.
Visibility of data is the access control around it, whether completely blocking access for unauthorized groups, or conditionally allowing certain groups to see a summary of the data's existence or description without exposing its sensitive details. The latter empowers self-service discovery of data sets by groups who can then request access through proper governance channels.
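As a rough illustration of the three visibility levels described above (full access, summary-only discovery, and complete blocking), here is a minimal Python sketch. The `Dataset` class, the `view` function, the group names, and the sample data are all hypothetical and are not drawn from any particular product.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    name: str
    description: str
    rows: list
    # Hypothetical access lists: groups that may read the rows, and groups
    # that may only learn that the dataset exists.
    full_access: set = field(default_factory=set)
    summary_access: set = field(default_factory=set)


def view(dataset, group):
    """Return what `group` is allowed to see of `dataset`."""
    if group in dataset.full_access:
        return {
            "name": dataset.name,
            "description": dataset.description,
            "rows": dataset.rows,
        }
    if group in dataset.summary_access:
        # Summary only: enough to discover the dataset and request access,
        # without exposing the sensitive details.
        return {
            "name": dataset.name,
            "description": dataset.description,
            "row_count": len(dataset.rows),
        }
    return None  # completely hidden from this group


salaries = Dataset(
    name="salaries_2016",
    description="Employee compensation by department",
    rows=[{"employee": "E1", "salary": 100}, {"employee": "E2", "salary": 120}],
    full_access={"hr"},
    summary_access={"finance"},
)

print(view(salaries, "hr"))
print(view(salaries, "finance"))
print(view(salaries, "marketing"))
```

In this sketch the finance group can see that the dataset exists and how large it is, which is enough to request access through governance channels, while the marketing group sees nothing at all.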
In contrast to most big data platforms, one emerging big data paradigm that addresses these five dimensions just as well as the more established five is the multistructured smart data lake, often just called a smart data lake (SDL). Let's break down what that means.
Multistructured means the data is any mix of structured, semi-structured, and unstructured, with multiple different, sometimes overlapping, views or schemas of the data. First-generation data lakes tried to address the question of where to put the reams of incoming data, encouraging a late-binding paradigm in theory, but doing nothing to facilitate that binding later. They were just a small step up from shared drives and HDFS: the upfront cost is low, but the total cost is mostly incurred by repeatedly trying to interpret, cleanse, transform, and extract information from the data. Data lakes automatically store a massive amount of information in near-line storage, from which users can draw slices. The emerging generation of SDLs takes it much further. Smart data is self-describing data: quite literally, the semantics, or business meaning, of the data are encoded along with the data itself using a set of universal standards. Once data is semantically elevated via these linked data technologies, it can be more easily discovered, integrated, and reused.
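To make "self-describing" concrete, here is a minimal Python sketch in which the schema statements live in the same collection as the data they describe. It uses plain tuples rather than a real RDF library. The `ex:` names and the sample customer are hypothetical, while `rdf:type`, `rdfs:Class`, `rdfs:label`, and `rdfs:domain` are standard RDF and RDFS terms.

```python
# Each statement is a (subject, predicate, object) triple.
statements = [
    # Schema: the business meaning of the data, stored as data.
    ("ex:Customer", "rdf:type", "rdfs:Class"),
    ("ex:Customer", "rdfs:label", "Customer"),
    ("ex:accountOpened", "rdfs:label", "Date the account was opened"),
    ("ex:accountOpened", "rdfs:domain", "ex:Customer"),
    # Instance data.
    ("ex:cust42", "rdf:type", "ex:Customer"),
    ("ex:cust42", "ex:accountOpened", "2015-03-01"),
]


def objects(subject, predicate):
    """Return every object stated for the given subject and predicate."""
    return [o for s, p, o in statements if s == subject and p == predicate]


def describe(subject):
    """Render a subject's properties using labels found in the data itself."""
    described = {}
    for s, p, o in statements:
        if s != subject or p == "rdf:type":
            continue
        labels = objects(p, "rdfs:label")
        # Fall back to the raw predicate name when no label is stated.
        described[labels[0] if labels else p] = o
    return described


print(describe("ex:cust42"))
```

Because the labels travel with the data, a consumer who has never seen this dataset can still render `ex:accountOpened` as a human-readable field without an external data dictionary.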
SDLs stand in contrast with the rigid end of the spectrum: highly curated systems that are inefficient at intake, maintenance, and data reuse, such as DAMs, CMSs, RDBMSs, warehouses, point-to-point integrations, custom applications, and FTEs manually finding and combining data. These are all high-touch, narrow-usage scenarios. With SDLs, you don't need to know what questions you'll be asking in the future. SDLs plot a middle course of control, providing a framework with which IT can map datasets into the semantic format that unlocks their benefits. Here's how the SDL addresses all ten dimensions of big data:
There are no limits to the size or volume of datasets. Scaling to teraquads and beyond (where a quad is an RDF semantic statement) is par for the course.
High-velocity, parallelizable assimilation of streaming data into semantic formats empowers downstream use. As flows of information grow, there's less time to analyze them, yet it's critical to identify value soon enough that it still matters. By automatically curating in a first pass, SDLs let you throttle the speed at which you triage new information without letting it fall on the floor.
The wide variety of data formats and structures is harmonized semantically into a map of the meaning of, and relationships between, types of information.
Determining veracity, which is particularly important for data quality efforts and general QoS, is aided by maintaining the provenance of data back to its source and through all the transformation steps it has undergone. This helps consumers trust that the information is complete and accurate.
The use of data as information comes from contextualizing and mapping the value of each piece of information, focusing attention on the most valuable types of information, and making all information more valuable by inducting it into the network effect.
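The idea of keeping provenance back to the source and through every transformation step can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the `Record` class, the `transform` helper, the step names, and the `crm_export.csv` source name are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    value: str
    source: str        # where the value originally came from
    steps: tuple = ()  # names of the transformations applied so far, in order


def transform(record, step_name, fn):
    """Apply `fn` to the value and append `step_name` to the lineage."""
    return Record(
        value=fn(record.value),
        source=record.source,
        steps=record.steps + (step_name,),
    )


raw = Record(value="  ACME corp. ", source="crm_export.csv")
clean = transform(raw, "strip_whitespace", str.strip)
clean = transform(clean, "uppercase", str.upper)

print(clean.value)   # ACME CORP.
print(clean.source)  # crm_export.csv
print(clean.steps)   # ('strip_whitespace', 'uppercase')
```

Each transformation returns a new immutable record, so the original value and its lineage remain available for anyone who needs to audit how a figure was derived.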
Whether freeform database fields, Weibos, or technical PDFs, SDLs leverage verbality through automatic semantic annotation, finding entities, relationships, properties, concepts, classifications, and tags upfront. This enables users to find relevant assets for specific needs and analyze them at the time and depth needed.
Excessive verbosity is abated for discovery by automatically de-duplicating entities, providing aggregate views, creating summary networks, and annotating key points within incoming data.
Versatility is unlocked when parties can find data, understand it, and use it, all of which flow naturally from the semantic models. Elevating ingested data once supports multiple follow-on projects that find that data useful, and speeds data integration for each.
The viscosity of data is thinned significantly by the meaningful business ontology describing the source data.
ACLs defining visibility down to individual entities are driven by any mix of attributes of the data, and because this metadata is itself data, those rules can flow to arbitrary downstream systems. Disparate parties can apply discovery to find and use datasets when requirements arise, searching datasets by content, concept, or association, using facets and filters, and viewing sample data, star ratings, descriptions, or colleagues' comments.
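The entity de-duplication mentioned above as a way to tame verbosity can be illustrated with a deliberately naive Python sketch. The name-normalization rule here is a simple stand-in chosen for illustration, not how any SDL product resolves entities, and the company names are made up.

```python
import re
from collections import defaultdict

# Hypothetical list of suffixes to ignore when comparing company names.
SUFFIXES = {"inc", "corp", "corporation", "llc", "ltd"}


def normalize(name):
    """Lowercase, drop punctuation, and strip common company suffixes."""
    words = re.sub(r"[^\w\s]", "", name.lower()).split()
    return " ".join(w for w in words if w not in SUFFIXES)


def deduplicate(mentions):
    """Group raw mentions that normalize to the same entity key."""
    groups = defaultdict(list)
    for mention in mentions:
        groups[normalize(mention)].append(mention)
    return dict(groups)


mentions = ["Acme Corp.", "ACME Corporation", "acme", "Globex, Inc.", "Globex"]
entities = deduplicate(mentions)

for key, variants in entities.items():
    print(key, variants)
```

Five raw mentions collapse to two entities, which is the kind of reduction that lets discovery and aggregate views work over meaning rather than raw volume.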
Today, meeting the needs of more enterprise stakeholders requires annotated repositories with understood meaning, broad findability, clear provenance, ease of reuse, and flexible access control. The emerging SDL paradigm meets these new V's of big data head on.
