Why IT Operations Is Becoming a Big Data Problem
We as an industry are at an inflection point, of the type that comes around every decade or so, in how enterprises manage their IT operations. We hear constantly of skunkworks DevOps teams and shadow IT efforts, but these are really symptoms of a larger, darker trend that no one wants to admit: the typical enterprise IT department no longer has control of its own datacenters. Caught between commitments to the business, increased competition from the public cloud, and a perpetual fear of a security breach, IT is now the bottleneck for nearly all business initiatives and modernization efforts. For the most part, this unfortunate state is accepted as a cost of doing business. Enterprises that figure out how to regain that control most quickly will enjoy a significant advantage over their competitors.
How did we get here?
Since the advent of the modern computer age in the late 1970s, we’ve seen three major shifts in the way that IT has operated. In the 1970s and early 80s, most applications ran on monolithic mainframes or minicomputers. Monitoring in this environment was dominated by the tools provided by the vendors of those platforms. This world was relatively simple, as the number of users, applications, and systems under management was a mere fraction of what we see today. The first major change came with the mass adoption of the client-server model in the late 80s and early 90s. As systems slowly became more and more decoupled, more flexible, cross-vendor monitoring options became necessary. The challenge then was beginning to understand the network and its impact on monitoring. This was primarily a conceptual challenge, not necessarily a technical one. The dominant player that grew out of this era was clearly BMC Patrol, and it would remain dominant until the next major shift in application delivery, the web.
In the late 90s, as the Internet really took hold with the general public, the dominant application delivery paradigm shifted again, this time to web-based applications. This era was defined by new levels of complexity at both the network and application layers. From this we saw vendors such as Quest Software emerge, focused not on monitoring individual components such as databases, applications, and systems in isolation, but on treating them as part of a greater whole. This model worked very well until it was disrupted by the web’s own success: more and more users came online, mobile applications proliferated, and virtualization was being implemented full steam ahead. Existing tools started to creak under the weight of user demands and application complexity, and the Big Data nature of monitoring modern infrastructures became apparent.
Given that Big Data (at least as we think of it now) wasn’t even a concept in the early 2000s, the first reaction was not aimed at Big Data directly. It was a move toward a different level of abstraction in monitoring: a focus on log data rather than system metrics as the fundamental basis. This led to the development of Splunk, which came out with an idea that was innovative at the time: a centralized storage and search infrastructure for log data. In 2004, the Splunk founders didn’t have the luxury of leveraging open source Big Data components, so they built their own proprietary technology. Over the ensuing decade they built an empire on that engine and their unit of abstraction, the log.
Now a decade in, we are seeing people move away from Splunk and the log search engine clones that it spawned, driven away by a multitude of factors:
• Lack of analytic capability
• Siloing of data
• Unpredictability of pricing model
• High TCO, especially at the high end, where non-commodity hardware is frequently seen
• Limited scale and arbitrary limitations on use of the customer’s own data
In the interim, we saw the dawn of the Big Data movement, which allows us to truly rethink the problem from scratch. In my experience at Cloudera, Splunk offload/log analytics was the second most frequent use case. Big Data is famously (or infamously) defined by what’s come to be known as the 3 Vs: velocity, variety, and volume. IT operations data now exhibits all three, so for IT shops to regain the level of control that smooth enterprise operations require, their tooling has to cope with all three.
One theme to call out explicitly is that search as the primary metaphor for IT triage is truly broken. Short-term memory can hold about seven objects; IT environments have thousands of servers and who knows how many containers and hypervisors, not to mention the interconnections between them. Without purpose-built software to guide users in the right direction, a junior site reliability engineer who is paged in the middle of the night with a ticket stating “The website is slow” would struggle to figure out where to start debugging, let alone maintain any semblance of an acceptable MTTR. The problem is that the search metaphor, by definition, assumes a priori tribal knowledge of the application and its underlying components, which is an impossibility in most Fortune 500 enterprises these days.
If we continue to frame the problem by examining the reasons organizations are citing as they move away from Splunk, a solution emerges:
• Ingestion of tens or hundreds of terabytes, and eventually even petabytes, of data per day, covering not only logs but also APM, network data, and system metrics (Flume, Logstash)
• Multi-datacenter support with reliable data
• Single store of record for all those data types to avoid silos (HDFS)
• Search (Solr/Elasticsearch)
• SQL (Impala/Drill/Hive/SparkSQL)
• Ad hoc dashboards to do simple correlations
• Publish-subscribe model for downstream data consumption (Kafka)
• Out-of-the-box visualizations
• Direct programmatic access to data in open formats with tools like R and SAS
• Real-time, general-purpose anomaly detection on all metrics in the system
• Real-time stream processing for custom analytics
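To make the anomaly detection item above a little more concrete, here is a minimal Python sketch of one possible approach: a rolling z-score kept separately for each metric. This is an illustration only, not how Rocana or any other product mentioned here actually works. The class name, function name, metric names, window size, and threshold are all assumptions chosen for the example.

```python
from collections import deque
from math import sqrt


class RollingZScoreDetector:
    """Hypothetical detector: flags values far from a rolling window of recent values."""

    def __init__(self, window=30, threshold=3.0, min_samples=5):
        self.values = deque(maxlen=window)  # most recent observations only
        self.threshold = threshold          # z-score above which we flag a value
        self.min_samples = min_samples      # do not score until we have some history

    def observe(self, value):
        """Return True if `value` looks anomalous relative to the current window."""
        anomalous = False
        n = len(self.values)
        if n >= self.min_samples:
            mean = sum(self.values) / n
            std = sqrt(sum((v - mean) ** 2 for v in self.values) / n)
            if std > 0:
                anomalous = abs(value - mean) / std > self.threshold
            else:
                # A perfectly flat history: any change at all is flagged.
                anomalous = value != mean
        # Score first, then record, so a spike is not compared against itself.
        self.values.append(value)
        return anomalous


def detect(stream, **detector_options):
    """Scan (metric_name, value) pairs, keeping one detector per metric name."""
    detectors = {}
    anomalies = []
    for name, value in stream:
        if name not in detectors:
            detectors[name] = RollingZScoreDetector(**detector_options)
        if detectors[name].observe(value):
            anomalies.append((name, value))
    return anomalies


# Usage with made-up readings: one latency spike among otherwise steady values.
readings = [100, 102, 98, 101, 99, 100, 103, 97, 500, 101]
stream = [("web.latency_ms", v) for v in readings]
print(detect(stream))  # [('web.latency_ms', 500)]
```

Because the detector knows nothing about what a metric means, the same code can run over every metric in the system, which is the sense in which the bullet calls for "general-purpose" detection. A production system would need something more robust than a z-score, for example to handle seasonality and slow trends.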
Based on our experience at Cloudera, and on implementing similar systems many times with professional services, my cofounders and I have built this as a shrink-wrapped system at Rocana. Whether you choose our solution or a DIY option, the way IT operations are run needs to change. Continuing to run IT operations as-is virtually guarantees failure as we enter the Big Data era of operations.