Why Your Big Data Lake Might Be a Croc
The growing hype surrounding data lakes is causing substantial confusion in the information management space, according to a report last year by Gartner, Inc. Several vendors are marketing data lakes as an essential component to capitalize on Big Data opportunities, but there is little alignment between vendors about what comprises a data lake, or how to get value from it.
Furthermore, although there is a plethora of technologies available, with more being added each day, companies are still worried about being swamped by the cost and effort required to retrain their staff. Many are scaling back their efforts under a wave of concerns, including data quality and compliance.
But first, what exactly is a data lake? A simple definition would be a large storage repository that holds data in its native format until it is needed. Hadoop is often associated with data lakes, since many vendors recommend it as a means of low-cost storage and processing for big data volumes. Much of the depth of the data lake debate stems from the trade-off between the structured schemas of traditional data warehouses and marts, where data is shaped before it is stored, and the schema-less or schema-on-read approach of a lake, where structure is applied only when the data is queried.
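To make the schema trade-off concrete, here is a minimal sketch in Python. It is an illustration only, not how any particular warehouse or Hadoop product works: the sample records, the field names, and the two helper functions (write_with_schema and read_with_schema) are all hypothetical.

```python
import json

# Hypothetical raw records as they might land in a lake: stored as-is.
RAW_EVENTS = [
    '{"customer": "acme", "amount": "120.50", "channel": "web"}',
    '{"customer": "globex", "amount": 75}',
    '{"cust_name": "initech", "total": "n/a"}',
]


def write_with_schema(raw_lines):
    """Schema-on-write (warehouse style): shape records before storing them.

    Records that do not fit the fixed (customer, amount) schema are rejected
    up front, and fields outside the schema are dropped.
    """
    table, rejected = [], []
    for line in raw_lines:
        record = json.loads(line)
        try:
            table.append({
                "customer": str(record["customer"]),
                "amount": float(record["amount"]),
            })
        except (KeyError, ValueError, TypeError):
            rejected.append(line)
    return table, rejected


def read_with_schema(raw_lines, field, cast):
    """Schema-on-read (lake style): keep everything raw, interpret at query time.

    Each query decides which field it wants and how to cast it, and skips
    records that do not fit that particular reading.
    """
    values = []
    for line in raw_lines:
        record = json.loads(line)
        try:
            values.append(cast(record[field]))
        except (KeyError, ValueError, TypeError):
            continue
    return values


table, rejected = write_with_schema(RAW_EVENTS)
amounts = read_with_schema(RAW_EVENTS, "amount", float)
channels = read_with_schema(RAW_EVENTS, "channel", str)
```

In this toy case the warehouse-style load gives every later reader the same clean table but discards the "channel" field and the third record, while the lake-style read keeps everything and leaves each query to do its own interpretation, which is the repeatability concern raised by lake skeptics.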
You Say Lake, I Say Swamp
There are sound arguments from both pools. Those cautioning against data lakes see them as a dumping ground, lacking data quality and form, and believe that data swamps will form unless there is some master data management (MDM) focus and rigor. They claim that each dam lake must be filtered, creating a crystal clear reservoir. They are also concerned about the lack of repeatability or reuse, where each individual or group has to go back to the well and re-interpret the data each time.
On the other side of the channel, fans of data lakes point to the speed of mixing and blending together data of different flavors, as well as the flexibility of relating data and uncovering answers to questions, inventing a new cocktail each time. As Christopher Columbus showed, there are hidden worlds waiting to be discovered, and that can’t possibly happen if the boat is stuck at the dock. There are also those who tout self-service access to data at any time, freeing day-to-day business users from the anchor of back-end IT efforts before data can be used.
I Want to Make It Drinkable; You Just Want to Go Fish
It comes down to a balance between two goals: the governance, security and reliability of the data, which is the responsibility of IT data environmentalists, and the timely derivation of relevant insights, which is the hope of business fishermen.
IT teams have been told that, at tsunami data volumes, they couldn’t possibly hope to keep up. Furthermore, the argument goes, most of the information is transactional or in some cases machine generated, so no purification is needed: just manage your reference and profile data using traditional MDM tools, and help preserve the quality level of the data lake when needed.
One of the hottest job titles most often associated with a data lake happens to be the “data scientist” who is tasked with using analytic tools and languages to go trawl for patterns and insights in a sea of numbers and facts. This is far removed from a frontline business user, such as an individual marketing or sales person out in the field.
Access to Rod and Reel Does Not a Bouillabaisse Make
Beyond data quality concerns, there’s a disconnect between a data scientist angling for insights using a standalone analytics tool and the poor frontline business user who is being schooled by the competition and looking for help swimming through increasingly rough waters.
Business users work in separate moored applications such as CRM, ERP, HR and financials that provide the streams of data that feed the lake. These legacy applications and the myriad integrations and tools around them seem overboard now, but each was created for its own siloed purpose. Many end-users question why, in the age of splashy consumer apps such as Facebook and LinkedIn, they are still stuck with enterprise applications that require labor-intensive manual data entry, surfing between applications to get the complete view they need, and having to net out complex patterns of information.
If the data lake is good for anything, it should produce relevant insights and recommended actions specific to business users’ daily operations, so they can significantly improve productivity and outcomes.
Believe Me, the World is Round
When Christopher Columbus proposed to reach the Indies by sailing west from Spain, he knew that the Earth was round. But he never closed the loop himself; it took the Magellan expedition, decades later, to circumnavigate the globe. Given the amount of data that can be mixed into a lake, and the technology we have at our disposal, it’s remarkable that we can glean insights but still have to guess as to how downstream actions and outcomes map back. We don’t know if it was truly the lake or just a fluke.
Surprisingly, this disconnect seems to be okay with both IT environmentalists and business fishermen. But imagine if IT could source quality streams of data into the lake using data-as-a-service, and track how that data was being consumed by the business as users fished for insights. Business users could also take action and provide feedback on the quality of their catch, signaling to IT what it needs to do to refine the lake itself. Instead, the two are not in the same boat, each spending time on technology deep dives, hoping to overcome the barriers they face.
Creating a big data lake is no small delta; companies have already poured billions into their lakes, and spent countless months getting everyone on board. Data scientists and Hadoop experts are needed for the long haul, with expensive consultants offshore and MDM coast guards on call. Make no mistake, a lake could be harboring a croc or three beneath the surface, and most companies would be wise to proceed with caution, lest it cost them more than just an arm and a leg.