Big Data and Machine Learning
With machine learning you just dump all your data into a fancy algorithm and everything gets sorted out. Right? Not quite. Analysts will tell you that they typically spend about four-fifths of a project munging data, not analyzing it. And that percentage doesn’t necessarily change when it comes to machine learning. To get big data and machine learning right, it’s critical to understand what type of analysis fits your data. How much data you have, whether the data is numeric or categorical, whether or not there is a pre-defined outcome, and whether time sequences are involved are just a few of the factors that will drive both the machine learning techniques you need and the ways you need to structure your data. Understanding the choices, the obstacles and the potential barriers to success is critical if you’re going to help your organization use data science and machine learning effectively.
It’s probably necessary to start with what we mean by machine learning. Machine learning analytic tools have been around for quite some time and are routinely used by analysts – almost certainly so in every large organization. In fact, almost every statistical analysis technique used for predictive analytics is an example of machine learning. Techniques like regression, decision-trees and clustering analysis are all “machine-learning” even though they’ve been in common use for decades.
Let’s take a closer look at linear regression – probably the most common and basic machine learning technique in today’s world. Linear regression has been used to tackle a huge array of problems and make useful predictions about them. In its simplest form, linear regression is nothing more or less than drawing a straight line through a set of points so as to minimize the sum of the squared vertical distances between the line and each point.
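To make the "line through the points" idea concrete, here is a minimal sketch of a one-variable least squares fit in plain Python. The function name `fit_line` and the spend and sales numbers are made up for illustration; a real project would normally use a statistics package instead.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept.

    Picks the line that minimizes the sum of squared vertical
    distances between the line and the points.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y vary together, and how x varies on its own
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: media spend vs. sales (illustrative values only)
spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_line(spend, sales)
# slope is about 1.97 and intercept about 1.09 for this data
predicted_sales = slope * 6.0 + intercept  # prediction for a new spend level
```

Once the slope and intercept are known, a prediction is just a point on the line, which is why regression models are so easy to inspect.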
Although not every machine learning technique is as easy to understand or transparent as regression, none of them is magic either. They all have roots in processes very similar to what I’ve described for regression analysis.
So what are the key factors when considering whether machine learning is right for you and which flavor is most appropriate? Here are some factors to think about:
How much data you have
There are two types of data problems: having too much and having too little. Both crop up quite a bit. Statistical analysis techniques can be expensive to run, which can make for very poor performance even on high-powered modern systems like Hadoop clusters. That’s why it’s common to train machine learning models on samples of the data. Less solvable are situations where you don’t have enough data. Machine learning isn’t magic and, in most cases, it needs a significant amount of data to create a model. If you have, for example, 12 months of GRP data from your media buys, you don’t have enough data to create a reliable model. Twelve data points just aren’t enough. In addition, the more fine-grained your technique, the more data you’ll need. If you’re trying to create a deep learning model of visitor behavior you’ll need, at minimum, thousands of rows of training data and possibly much more.
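Training on a sample is simple to sketch. The snippet below assumes the rows fit in a Python list and uses a hypothetical helper, `sample_rows`, with an illustrative 10% fraction; on a real cluster the sampling would usually happen inside the data platform itself.

```python
import random


def sample_rows(rows, fraction, seed=0):
    """Return a simple random sample of rows, drawn without replacement."""
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    k = max(1, round(len(rows) * fraction))
    return rng.sample(rows, k)


# Stand-in for a large table: 1,000 row ids (illustrative only)
all_rows = list(range(1000))
training_rows = sample_rows(all_rows, fraction=0.10)
# training_rows holds 100 distinct rows to train on
```

Fixing the seed is a design choice: it lets you re-run the same model on the same sample when you are comparing techniques.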
Supervised or Unsupervised
Supervised machine learning techniques require data that tells them what the right answer is. Suppose, for example, you want to predict which customers will purchase your products. If you have a data set that includes behaviors of customers who did and did not purchase and you have a variable that flags those who made a purchase, you can use supervised learning. The model will figure out which factors are most predictive of that outcome. If you don’t have that flag, then supervised techniques won’t work. Not every machine learning problem requires a training set with the “right” answer. Suppose you want to understand what types of customers you have. There are machine learning techniques that can “cluster” the data into logical groupings that you can then use for segmentation. This type of unsupervised learning is ideal for open-ended problems. It’s also important to know that you can often “create” a training set, for example by having people label a sample of the data by hand.
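The difference is easiest to see side by side. In this rough sketch, the supervised function needs the purchase flag and the unsupervised one never sees it. Both are deliberately tiny stand-ins (a one-variable threshold rule and a one-dimensional k-means), and the visit counts and flags are invented for illustration.

```python
def fit_stump(values, labels):
    """Supervised: find the threshold on one variable that best separates
    purchasers (label 1) from non-purchasers (label 0)."""
    best_threshold, best_correct = None, -1
    for t in sorted(set(values)):
        # Predict "purchase" when the value is at or above the threshold
        correct = sum((v >= t) == bool(y) for v, y in zip(values, labels))
        if correct > best_correct:
            best_threshold, best_correct = t, correct
    return best_threshold


def kmeans_1d(values, k=2, iterations=10):
    """Unsupervised: group values into k clusters without any labels."""
    # Spread the starting centers across the sorted values
    centers = sorted(values)[:: max(1, len(values) // k)][:k]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Move each center to the mean of its group (keep it if the group is empty)
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers


# Hypothetical data: site visits per customer, plus a purchase flag
visits = [1, 2, 2, 3, 8, 9, 10, 12]
purchased = [0, 0, 0, 0, 1, 1, 1, 1]

threshold = fit_stump(visits, purchased)  # uses the flag -> 8
segments = kmeans_1d(visits, k=2)         # ignores the flag -> [2.0, 9.75]
```

The supervised result answers a specific question (who buys?), while the cluster centers simply describe two groups that you still have to interpret.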
Categorical or Quantitative
The two major types of traditional analytic data are categorical (things like gender, company, zip code) and quantitative – variables that represent a quantity of something. Usually this is as simple as whether or not a variable is a number, but that’s not always the case. Zip code, for example, is a categorical variable even though it’s a number. You don’t have more zippiness because you live in 90450 than 10211! There are different statistical techniques for handling categorical data. Some tools can work with either categorical or quantitative data. Linear regression, on the other hand, is really built for quantitative analysis: it predicts a number, and categorical inputs have to be converted into numeric indicator variables before it can use them.
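One common way to hand a categorical variable to a numeric technique is to turn each category into its own 0/1 indicator column (often called one-hot or dummy encoding). Here is a minimal sketch; `one_hot` is a hypothetical helper, and the zip codes are just the ones from the example above.

```python
def one_hot(values):
    """Encode a categorical column as 0/1 indicator columns.

    Returns the sorted category names and one row of indicators per value.
    """
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# Zip codes are kept as strings because they are labels, not quantities
zip_codes = ["90450", "10211", "90450"]
categories, encoded = one_hot(zip_codes)
# categories -> ['10211', '90450']
# encoded    -> [[0, 1], [1, 0], [0, 1]]
```

Note that the encoding carries no notion of one zip code being "bigger" than another, which is exactly the point.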
Sequenced or Flat
Big data isn’t just bigger than traditional data; it’s often structured differently. In particular, a lot of big data problems are hard because understanding the data requires an understanding of the sequence or timing of events. Many traditional machine learning techniques like regression and clustering don’t handle this type of data well. For data with sequence or internal time orderings, techniques like Markov chains or deep learning are usually a better fit.
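As a small illustration of what a sequence-aware technique looks at, the sketch below estimates first-order Markov chain transition probabilities: for each event, how often each other event comes next. The function name and the web sessions are invented for the example.

```python
from collections import Counter, defaultdict


def transition_probabilities(sequences):
    """Estimate P(next event | current event) from ordered event sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        # Count each adjacent pair (current event, next event)
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1
    return {
        state: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for state, followers in counts.items()
    }


# Hypothetical web sessions; the order of events is the signal
sessions = [
    ["home", "product", "cart", "purchase"],
    ["home", "product", "exit"],
    ["home", "search", "product", "cart", "exit"],
]

probs = transition_probabilities(sessions)
# probs["cart"] -> {'purchase': 0.5, 'exit': 0.5}
```

A flat table of "pages viewed" would lose this information entirely, because it records that a cart view happened but not what followed it.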
Complex Structure or Columnar Data
Most enterprise data is stored in nice, neat columns. That’s great for traditional machine learning techniques – it’s exactly what tools like regression and clustering expect. But if your input data is a topological map, a geographic survey, a video stream or a podcast, then you’re largely out of luck with those techniques. Deep learning techniques stack many layers of artificial neurons into a single network, with each layer working on the output of the one before it, to analyze really complex data shapes and patterns. Making sense of data with a complex structure is where deep learning techniques have really shone these past few years. For straightforward machine learning problems, these techniques are harder to use and not necessarily better. But if you have to decode complex data structures or patterns, they’re invaluable. That’s why deep learning techniques have become so prominent in applications like voice and facial recognition, image classification, and in complex pattern problems like playing Go.
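The "stacking" idea can be shown with a toy forward pass, where each layer takes the previous layer's output as its input. This is only a sketch: the weights below are hand-picked for illustration rather than learned, and real deep learning models have far more layers and are built with dedicated frameworks.

```python
def relu(vector):
    """Simple non-linearity: negative values become zero."""
    return [max(0.0, v) for v in vector]


def dense(inputs, weights, biases):
    """One fully connected layer: each output is a weighted sum of all inputs."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Pass the input through each (weights, biases) layer in turn."""
    activation = inputs
    for i, (weights, biases) in enumerate(layers):
        activation = dense(activation, weights, biases)
        if i < len(layers) - 1:  # no non-linearity after the final layer
            activation = relu(activation)
    return activation


# Hand-picked weights, for illustration only
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),  # layer 1: 2 inputs -> 2 hidden units
    ([[1.0, 2.0]], [0.1]),                    # layer 2: 2 hidden units -> 1 output
]

output = forward([2.0, 1.0], layers)  # a single number, roughly 4.1
```

Training a real network means finding those weights automatically from data, which is where the large data and compute requirements come from.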
There isn’t one “best” machine learning technique. How much data you have, whether you have a “right” answer, and how your data is structured all make an important difference in both the potential for machine learning and the appropriate technique to use. The newest and fanciest machine learning techniques for big data don’t change this – they just extend the reach of machine learning into domains where the previous generation of tools couldn’t go.