The Other Five V's of Big Data: An Updated Paradigm
By this point, anyone vaguely familiar with big data knows about its associated "five V's": volume, velocity, variety, veracity, and value. That is only half of the big data story, however, and only half of the V's. Industry is just starting to appreciate that management and analysis of big data today also require upfront cognizance of its verbality, verbosity, versatility, viscosity, and visibility. Understand these, and you'll understand why the pitfalls of big data management manifest today, and how to prevent them tomorrow.
Verbality refers to the fact that the great majority of data consists of unstructured, text-like artifacts. A commonly cited estimate is that unstructured data makes up 80% or more of enterprise data. News, internal emails, call recordings, research reports, presentation decks, customer communications, patents, social media content, and documents of all kinds are highly relevant to your organization as data.
Verbosity means that within unstructured, semi-structured, and structured data alike, there is a lot of redundancy, often accounting for the majority of raw volume. Understanding how to quickly disentangle the meaning you care about from its redundancies is important for efficiency of processing, but even more important for supporting the value and versatility dimensions, that is, the reuse of the data.
Versatility of data reflects how useful the data is in different scenarios, and in applications for different sets of stakeholders, despite invariably having been created for a certain purpose. Understanding its quality, provenance, meaning, and context is key to this.
Viscosity is the ease or difficulty with which data can flow to other use cases that would leverage its versatility. Highly viscous data has a lot of internal friction stemming from bespoke, though hopefully internally consistent, representations that, at minimum, require high-touch interpretation, transformation, and integration.
Visibility of data is the access control around it, whether completely blocking access to unauthorized groups, or conditionally allowing certain groups to see some summary of the existence or description of the data without exposing its sensitive details. The latter empowers self-service discovery of data sets by groups who can then request access through proper governance channels.
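The tiered visibility described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any particular product's access model: the Dataset fields, the group names, and the view_for function are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    name: str
    description: str
    rows: list
    full_access: set = field(default_factory=set)     # groups allowed to read the data
    summary_access: set = field(default_factory=set)  # groups allowed to know it exists


def view_for(dataset: Dataset, group: str):
    """Return what a group may see: everything, a summary, or nothing."""
    if group in dataset.full_access:
        return {
            "name": dataset.name,
            "description": dataset.description,
            "rows": dataset.rows,
        }
    if group in dataset.summary_access:
        # Expose existence and description only, never the sensitive rows.
        return {
            "name": dataset.name,
            "description": dataset.description,
            "row_count": len(dataset.rows),
        }
    return None  # completely blocked


# Hypothetical dataset and groups, for illustration only.
claims = Dataset(
    name="claims_2023",
    description="Customer insurance claims (made-up example)",
    rows=[{"customer": "A. Smith", "amount": 1200},
          {"customer": "B. Jones", "amount": 450}],
    full_access={"claims_team"},
    summary_access={"marketing"},
)

print(view_for(claims, "claims_team"))  # full record
print(view_for(claims, "marketing"))    # summary only
print(view_for(claims, "contractors"))  # None
```

A group that sees only the summary learns enough to decide whether to request access, which is the self-service discovery path described above.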
In contrast to most big data platforms, one emerging big data paradigm that addresses these five dimensions just as well as the more established five is the multistructured smart data lake, often just called a smart data lake (SDL). Let's break down what that means.
Multistructured means data is any mix of structured, semi-structured, and unstructured, with multiple different, sometimes overlapping, views or schemas of the data. First-generation data lakes tried to address the question of where to put the reams of incoming data, encouraging a late-binding paradigm in theory, but doing nothing to facilitate that binding later. They were just a small step up from shared drives and HDFS: there's low upfront cost, with total cost mostly incurred by repeatedly trying to interpret, cleanse, transform, and extract information from the data. Data lakes automatically store a massive amount of information in near-line storage, from which users can draw slices. The emerging generation of SDLs takes it much further. Smart data is self-describing data: quite literally, the semantics, or business meaning, of the data are encoded along with the data itself using a set of universal standards. Once data is semantically elevated via these linked data technologies, it can be more easily discovered, integrated, and reused.
SDLs stand in contrast with the rigid end of the spectrum: highly curated systems that are inefficient at intake, maintenance, and data reuse, such as DAMs, CMSs, RDBMSs, warehouses, point-to-point integrations, custom applications, and FTEs manually finding and combining data, all of them high-touch, narrow-usage scenarios. With SDLs, you don't need to know what questions you'll be asking in the future. SDLs plot a middle course of control, providing a framework with which IT can map datasets into the semantic format that unlocks their benefits. Here's how the SDL addresses all ten dimensions of big data:
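To make "self-describing" concrete, here is a minimal sketch in plain Python, assuming statements are stored as subject-predicate-object triples in the spirit of RDF. The rdf: and rdfs: terms are standard vocabulary, but they are used here as plain strings; the ex: identifiers and the helper functions are made up for illustration, and a real system would use a proper triple store.

```python
# Each statement is a (subject, predicate, object) triple.
# The "ex:" identifiers are hypothetical.
triples = [
    # Schema-level statements: what the terms mean.
    ("ex:Customer", "rdf:type", "rdfs:Class"),
    ("ex:email", "rdfs:label", "email address"),
    ("ex:email", "rdfs:domain", "ex:Customer"),
    # Instance-level statements: the data itself.
    ("ex:cust42", "rdf:type", "ex:Customer"),
    ("ex:cust42", "ex:email", "pat@example.com"),
]


def objects(subject, predicate):
    """Return every object stated for the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def describe(subject):
    """Render a subject's properties using labels stored alongside the data."""
    lines = []
    for s, p, o in triples:
        if s != subject or p == "rdf:type":
            continue
        label = objects(p, "rdfs:label")
        # Fall back to the raw predicate when no label has been stated.
        lines.append(f"{label[0] if label else p}: {o}")
    return lines


print(objects("ex:cust42", "rdf:type"))
print(describe("ex:cust42"))
```

Because the meaning of ex:email travels in the same structure as the value it describes, a consumer that has never seen this dataset can still interpret it, which is the late binding that first-generation lakes left to the reader.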
There are no limits to the sizes or volume of datasets. Scaling to teraquads and beyond (where a quad is an RDF semantic statement) is par for the course. High-velocity, parallelizable assimilation of streaming data into semantic formats empowers downstream use. As flows of information grow, there's less time to analyze them, yet it's critical to identify value soon enough that it still matters. By curating automatically in a first pass, SDLs let you control the pace at which you triage new information without letting any of it fall on the floor. The wide variety of data formats and structures is harmonized semantically into a map of the meaning of, and relationships between, types of information. Particularly important for data quality efforts and general QoS, determining veracity is aided by maintaining the provenance of data back to its source and through every transformation step it has since undergone. This helps consumers trust that the information is complete and accurate. The value of data as information comes from contextualizing and mapping the value of each piece of information, focusing attention on the most valuable types of information, and making all information more valuable by inducting it into the network effect.
Whether the input is freeform database fields, Weibos, or technical PDFs, SDLs leverage verbality through automatic semantic annotation, finding entities, relationships, properties, concepts, classifications, and tags upfront. This enables users to find relevant assets for specific needs and analyze them at the time and depth needed. Excessive verbosity is reduced for discovery by automatically de-duplicating entities, providing aggregate views, creating summary networks, and annotating key points within incoming data. Versatility is unlocked when parties can find data, understand it, and use it, all of which flows naturally from the semantic models. Elevating ingested data once supports multiple follow-on projects that find that data useful, and speeds data integration for each. The viscosity of data is thinned significantly by the meaningful business ontology describing source data. ACLs defining visibility down to individual entities are driven by any mix of attributes of the data, and with this metadata being data itself, those rules can flow to arbitrary downstream systems. Disparate parties will apply discovery to find and use datasets when requirements arise, searching datasets by content, concept, or association, using facets and filters, and viewing sample data, star ratings, descriptions, or colleagues' comments.
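One common way to carry provenance with each statement is to extend a triple with a fourth element naming the graph it belongs to, giving a quad. The sketch below assumes that convention and uses the graph name to record the source or transformation step that produced each statement. All identifiers and values are invented for illustration.

```python
from collections import defaultdict

# (subject, predicate, object, graph): the fourth element names the graph,
# used here to record which source or step asserted the statement.
quads = [
    ("ex:acme", "ex:revenue", "12.5M", "src:crm_export"),
    ("ex:acme", "ex:revenue", "12.5M", "src:annual_report"),
    ("ex:acme", "ex:headcount", "340", "src:hr_feed"),
    ("ex:acme", "ex:headcount", "355", "step:manual_correction"),
]


def provenance(subject, predicate):
    """Group the asserted values for a fact by the graphs that assert them."""
    by_value = defaultdict(list)
    for s, p, o, g in quads:
        if s == subject and p == predicate:
            by_value[o].append(g)
    return dict(by_value)


def is_contested(subject, predicate):
    """A fact is contested when its sources disagree on the value."""
    return len(provenance(subject, predicate)) > 1


print(provenance("ex:acme", "ex:revenue"))
print(is_contested("ex:acme", "ex:headcount"))
```

Two independent sources agreeing on a value raises confidence in it, while a disagreement flags the fact for review, and in both cases the consumer can trace every value back to where it came from.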
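Entity de-duplication, one of the verbosity reducers mentioned above, can be illustrated with a deliberately simple sketch. It assumes that normalizing names (lowercasing, stripping punctuation, dropping a few company suffixes) is enough to match mentions. Production systems typically rely on much richer matching, and the suffix list and sample mentions here are made up.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately short list of suffixes to ignore.
SUFFIXES = {"inc", "corp", "corporation", "ltd", "llc"}


def normalize(name: str) -> str:
    """Lowercase, strip punctuation, and drop common company suffixes."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.lower())
    words = [w for w in cleaned.split() if w not in SUFFIXES]
    return " ".join(words)


def deduplicate(mentions):
    """Group raw mentions under one canonical key per entity."""
    groups = defaultdict(list)
    for mention in mentions:
        groups[normalize(mention)].append(mention)
    return dict(groups)


mentions = ["Acme Corp.", "ACME Corporation", "acme, inc", "Globex LLC", "Globex"]
entities = deduplicate(mentions)

for key, variants in entities.items():
    print(key, "<-", variants)
```

Five raw mentions collapse into two entities, so a search or aggregate view shows each organization once while the original variants remain available.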
Today, meeting the needs of more enterprise stakeholders requires annotated repositories with understood meaning, broad findability, clear provenance, ease of reuse, and flexible access control. The emerging SDL paradigm meets these new V's of big data head on.