Big Data: More Grey than Black or White
Let’s face it: we take digital everything for granted, often accepting insights at face value without questioning underlying assumptions. Recently, I had a great dining experience based on an “unfavorable” online restaurant review citing “very spicy food.” I love spicy food. A filtered “favorable” search would have excluded this choice. The seemingly objective “unfavorable” rating was in reality more nuanced, in a way that could be interpreted only by reading the review behind the rating. If we struggle with this sort of data dichotomy, imagine the plight of developing an algorithm to scan millions of pieces of nuanced data that come from multiple sources. We increasingly trust such algorithms every day, sometimes with chilling implications.
In many ways, we inappropriately act as if outcomes are either right or wrong, even though we know that the world around us doesn’t behave like that (“does this luggage contain anything dangerous?”). There is no defensible premise that we can simply scale our black-and-white approach to data from years ago to address the dynamic nature of data that is really neither black nor white.
Computable Data: Sound Waves and Singing Angels
Our brains create computable data (i.e., data “consumable by an algorithm”) just by listening to music. The sound arrives as nothing more than pressure waves; it is the brain’s interpretation of those waves that makes music “computable.” The brain also curates, drawing on past experiences, learned and innate responses, and many other pieces of data to derive higher-order meaning. But it all starts with making data computable.
The first basic computational scenario to consider is when the data is ingested for the first time from the “real world,” like taking a picture. Light striking the lens of a digital camera is raw data. The optics in the lens transform that data first, although it is still in analog form. As the data gets digitized internally, it becomes computable (insert the sound of singing angels here). The magic of color correction, image detection, and light balancing comes to us courtesy of algorithms processing ingested data while simultaneously combining other previously curated information (for example, attaching metadata such as time, date, and location).
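To make the camera scenario concrete, here is a minimal sketch of the two steps just described: digitizing an analog reading, then attaching metadata. Everything in it is an assumption for illustration only: the `digitize` helper, the four brightness readings, the 8-bit depth, and the metadata values are all made up, and a real camera pipeline is far more involved.

```python
from datetime import datetime, timezone


def digitize(analog_values, levels=256):
    """Quantize analog readings in [0.0, 1.0] to integers in [0, levels - 1]."""
    top = levels - 1
    return [min(top, max(0, round(v * top))) for v in analog_values]


# Hypothetical analog brightness readings from four sensor sites.
analog = [0.0, 0.25, 0.5, 1.0]

# Step 1: ingestion. Continuous values become discrete, computable numbers.
pixels = digitize(analog)

# Step 2: curation. Attach context as metadata (illustrative values only).
image = {
    "pixels": pixels,
    "captured_at": datetime(2015, 6, 1, 12, 0, tzinfo=timezone.utc).isoformat(),
    "location": "unknown",
}

print(image)
```

Note that the quantized values no longer carry the exact analog readings. Some information is lost the moment data becomes computable, and the algorithms downstream only ever see the digitized version.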
One of the most overlooked truths in today’s world is the rise in the availability of computable data. Failure to consider the new kinds of questions we can ask with all of these new types of data could be the single greatest mistake we can make.
Unstructured Data: Playgrounds and Adverbs
We have become numb to the dramatically increasing amount of data available to an enterprise. Many organizations are struggling to deal with the quantity and variety of information they already have, let alone considering new sources of data. It is, therefore, especially tempting to ignore data that is not conveniently packaged (i.e. digitized, with metadata), considering it “unstructured.” A great way to ease into the phenomenon can be seen by looking at a playground.
Few things initially look more unstructured than a school playground during recess. Closer inspection reveals the first hint at underlying organization and governance: a playground monitor watches to make sure that certain things do or don’t happen. Further inspection reveals lines and numbers painted on the ground, implying some sort of game rules. One might notice children of different ages, genders, or cultures exhibiting different inferred social norms. All of a sudden, what seemed to be unstructured seems a little more structured. This is a great analogy for much “unstructured” data.
To make classic “unstructured data” such as social media text more computable, we might inspect the text and decompose it via entity extraction (extracting nouns, verbs, and adverbs), sentiment analysis (attributing the mood), and language detection (ascertaining the primary language). Like a brain processing music or a camera processing an image, there will be information loss. The trick is to understand the implications of that loss and to do something consistent with the extracted information. The curation step, combining the extracted information with other previously computed information, is the key to deriving meaningful insight.
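The decomposition and curation steps above can be sketched in a few lines. This is a toy under loud assumptions, not how production text analytics works: the word lists, the stopword-counting language guess, the word-counting sentiment score, and the `curate` helper are all hypothetical stand-ins for trained models, and the review text is invented to echo the restaurant example.

```python
import re

# Hypothetical, deliberately tiny lexicons; real systems use trained models.
NEGATIVE = {"unfavorable", "bad", "bland", "slow"}
POSITIVE = {"great", "love", "friendly", "delicious"}
STOPWORDS = {
    "en": {"the", "and", "was", "is", "very"},
    "es": {"el", "la", "y", "es", "muy"},
}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-záéíóúñü']+", text.lower())


def detect_language(tokens):
    """Guess the language whose stopwords appear most often."""
    counts = {
        lang: sum(t in words for t in tokens)
        for lang, words in STOPWORDS.items()
    }
    return max(counts, key=counts.get)


def sentiment(tokens):
    """Positive word count minus negative word count."""
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)


def extract(text):
    """Decompose raw text into computable features."""
    tokens = tokenize(text)
    return {
        "tokens": tokens,
        "language": detect_language(tokens),
        "sentiment": sentiment(tokens),
    }


def curate(features, liked_terms):
    """Combine extracted features with previously known preferences."""
    matches = [t for t in features["tokens"] if t in liked_terms]
    return {**features, "matches_preferences": matches}


review = "The service was slow and the food was very spicy"
features = extract(review)
curated = curate(features, liked_terms={"spicy"})
print(curated)
```

On its own, the sentiment score marks this review as negative. Only the curation step, which brings in a reader's known liking for spicy food, surfaces the nuance that the score discards.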
Unstructured data may be an oxymoron. In almost all cases, there is some treatment that can be applied to derive some higher-order meaning from a collection of non-random data.
Methods: Standard Deviations and Standards for Deviation
Transforming data into computable information and inferring structure allows one to consider nuance via methods to create, transform, or understand data. Common methods include measuring central tendency (for example, the mean) or dispersion (for example, the standard deviation). Another common method is regression, which uses past (longitudinal) data to estimate a relationship in the form of a prediction equation. These methods underpin much of today’s interpretation of numerical (digital) data. They are, however, notoriously dangerous when considering the types of new data and variation underlying subjective decisions such as whether or not a piece of luggage contains something “dangerous,” especially since the underlying behaviors change as they are being observed over time.
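For readers who want to see these classic methods side by side, here is a minimal sketch using only the Python standard library. The data (bags screened in each of five past periods) and the `fit_line` helper are hypothetical, chosen so the arithmetic is easy to follow; the regression is a plain ordinary least squares fit of a straight line.

```python
from statistics import mean, pstdev


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical past (longitudinal) data: bags screened in five periods.
periods = [1, 2, 3, 4, 5]
bags = [110, 120, 130, 140, 150]

center = mean(bags)    # central tendency
spread = pstdev(bags)  # dispersion (population standard deviation)

# Regression: estimate a prediction equation from the past data.
slope, intercept = fit_line(periods, bags)
forecast = slope * 6 + intercept  # extrapolate to period 6

print(center, round(spread, 2), slope, intercept, forecast)
```

The forecast is only as good as the assumption that period 6 behaves like periods 1 through 5. That is exactly the assumption that breaks down when the behavior being measured changes while it is being observed.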
The good news is that there are many other methods that can help us look at this kind of question in a scientific way. Examples include machine learning and heuristic evaluation. There are also emerging technologies, such as quantum computing, that will allow us to ingest, organize, manipulate and understand data that is not binary.
The challenge of problem formulation is to always question why the method we select is the best method for the question at hand and the data available.
Computable data, treatment of unstructured data, and method selection are only three considerations, but three very important ones, when addressing nuance in data. The journey to understanding data in richer, deeper ways is daunting, but enormously exciting and rewarding. We sit at the cusp of a new era, where exciting new types of data will be used in ways that we have yet to understand. There is no better time to be exploring a world of data that is neither black nor white.