Data Prep: Time to Smarten Up Your Data
Data preparation is the manipulation of data into a form suitable for further processing and analysis. It’s a demanding and labor-intensive stage that involves many different kinds of tasks and cannot be fully automated. It’s estimated that data preparation accounts for 60 to 80 percent of the time spent on data mining projects.
"The true significance of data preparation lies in the early insights that will later manifest themselves in smarter, more contextual model results"
The technical preparation of the data is only part of this phase. Obviously, it’s mandatory to have clean and pretty data—to arrange the data in specific formats, to fill missing data entries, and reach some minimal data quality. Otherwise—garbage in, garbage out. But the true significance of data preparation lies in the early insights that will later manifest themselves in smarter, more contextual model results, and in greater model transparency. Without careful and thoughtful analysis, a data prep project might as well not occur.
One of the key contributions that can be made to the data during prep is smart transformations to the key predictors (independent variables). We sometimes forget that we want to exhaust the potential of the variables. This is a golden opportunity to create value, and if we fail to see this we might miss highly important aspects, and also lose meaningful prediction power.
In their default form, many of our variables have some given, limited level of information within them. Our job is to make them more informative. In other words, we want to squeeze more information out of them. We want to help them tell the story they want to tell but sometimes have a limited ability to do so on their own.
One common way to achieve this is by using data transformations. Transformations are a very effective instrument by which we can help the data become more informative and tell its story.
I’ll try to illustrate this point with an example from e-commerce. We often take the number of orders as a predictor of LTV. But it's common knowledge that the gap between the first and second orders has by far the largest impact and reflects a much more critical stage in predicting a customer's future activity than the gap between orders 21 and 22, or 101 and 102. The absolute difference in each case is the same (one order), but the first gap tells us much more than the later ones.
In this case, we can make the data far more effective by applying a logarithmic transformation to such a field.
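As a minimal sketch of this idea (the order counts below are hypothetical), a log transform compresses the scale so that the jump from one order to two carries far more weight than the jump from 101 to 102:

```python
import numpy as np

# Hypothetical cumulative order counts for a customer over time
order_counts = np.array([1, 2, 21, 22, 101, 102])

# On the raw scale, consecutive pairs differ by the same amount: 1 order
raw_gaps = np.diff(order_counts)

# On the log scale, early gaps dominate: log(2) - log(1) ~ 0.69,
# while log(102) - log(101) ~ 0.01
log_gaps = np.diff(np.log(order_counts))

print(raw_gaps)   # the 1 -> 2 and 101 -> 102 gaps look identical
print(log_gaps)   # the 1 -> 2 gap is roughly 70x larger
```

The model now sees the first-to-second-order transition as the big event it actually is, without any extra feature engineering.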
Moreover, just because a variable naturally has some specific scale (such as USD, days, etc.) it doesn’t mean that we should stick with that scale or that that scale is necessarily the most informative one. I often find that common sense and basic business understanding can be translated into some simple arithmetic actions that eventually make the difference.
As another example, we can combine several variables that have different scales into one factor that reflects all of them, as in RFM segmentation: three continuous variables (recency, frequency, and monetary value) become one discrete variable, which can bring much more efficiency and decrease redundancy.
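A simple sketch of RFM scoring might look like the following (the customer data, the 1–3 scoring scale, and the column names are all illustrative assumptions, not a standard):

```python
import pandas as pd

# Toy customer data; all values are hypothetical
df = pd.DataFrame({
    "customer_id":  [1, 2, 3, 4, 5],
    "recency_days": [5, 40, 120, 10, 300],              # days since last order
    "frequency":    [20, 5, 2, 12, 1],                  # number of orders
    "monetary":     [900.0, 150.0, 40.0, 600.0, 10.0],  # total spend
})

# Score each dimension 1-3 by rank. Lower recency is better,
# so its labels run in reverse.
df["r"] = pd.qcut(df["recency_days"].rank(method="first"), 3, labels=[3, 2, 1]).astype(int)
df["f"] = pd.qcut(df["frequency"].rank(method="first"), 3, labels=[1, 2, 3]).astype(int)
df["m"] = pd.qcut(df["monetary"].rank(method="first"), 3, labels=[1, 2, 3]).astype(int)

# One discrete segment label in place of three continuous columns
df["rfm"] = df["r"].astype(str) + df["f"].astype(str) + df["m"].astype(str)

print(df[["customer_id", "rfm"]])
```

Here a recent, frequent, high-spending customer ends up in segment "333" and a lapsed one-time buyer in "111", so downstream models (or business rules) can work with a handful of interpretable segments instead of three raw scales.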
In both instances, the key is stepping back from the data, looking at the big picture, and integrating business considerations into the data. A well-known saying is that “although we often hear that data speak for themselves, their voices can be soft.” Smart transformation during data prep can make your data’s voice loud and clear.
Room of Influence
To sum up, data prep shouldn’t be viewed as a burden but rather as an essential golden opportunity for smartening up your data. Beyond the technical aspects, which are mandatory, one important challenge is to make the data as valuable and informative as possible. This is where the data scientist has considerable room for influence, and where their abilities are put to the test.
Do this process well and you can save time and spare model complexity later. Often, good transformations that rely on a solid understanding of the data will let you solve the given problem with simpler models, and will save many iterations and minor calibrations. So invest time and energy here, and it will pay off!