The Curator's Dilemma
Back around the year 2000, I was running a long-term multi-client research program on the future of the (at the time) rapidly growing digital economy. In response to a participant’s question, I had a team working on estimating how much of human experience was being captured and stored digitally. There were about 6 billion people in the world at the time, with maybe 1 billion “online” in some way. Remember that this was essentially before smartphones and tablets, so there were lots of documents, emails, texts and images online, but little video. The web was only 5 years old. Google was two.
Using some fairly comprehensive secondary research on data traffic and stored data volumes, some simulation modelling and a few (reasonable, I think) assumptions about levels of use, we came up with a number: around 1 percent.
With close to 3.5 billion people online now (and a lot more devices recording what those people are doing), I wonder what the number would be today.
Back then, 99 percent of human experience was either being lost or was confined to human memory (not a great long-term storage mechanism, nor one known for its accuracy or speed of recall). Roll the calendar forward to 2016 and we are creating and storing much more information (I recently saw an estimate of over 1 TB a year for every online individual, plus another 1 TB of associated data such as log files, account information and so on). With 4 billion online users by 2018, that’s a lot of data. No wonder it’s called “Big”!
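As a rough back-of-the-envelope check on those figures, the short Python sketch below multiplies the quoted per-user estimate by the projected user count. The decimal unit definitions and the function name yearly_bytes are assumptions made for illustration; the inputs are simply the estimates quoted above, not measured values.

```python
# Decimal storage units (assumed here: 1 TB = 10^12 bytes, 1 ZB = 10^21 bytes).
TB = 10**12
ZB = 10**21


def yearly_bytes(users, personal_tb=1, associated_tb=1):
    """Bytes created per year if every online user generates the quoted
    1 TB of personal data plus 1 TB of associated data."""
    return users * (personal_tb + associated_tb) * TB


# 4 billion online users, as projected in the text for 2018.
total = yearly_bytes(4_000_000_000)
print(total / ZB, "zettabytes per year")  # 8.0 zettabytes per year
```

On those assumptions the total comes to about 8 zettabytes a year, before counting any duplicate copies.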
And it’s growing every year as we collect more of each individual’s experience and add more connected users and more smart devices. By 2035, connected cars alone will add more data each year than the entire digital accumulation up to 2010. Data accumulation is starting to follow a power law of growth, and humans don’t do well at anticipating the future of power-law phenomena, which are generally rare in nature and catastrophic when they do occur.
When we were doing the original research, we estimated that at the rate of growth we were seeing, if we used just 1 physical atom to store each digital “bit” of data and there were no duplicate copies, we’d run out of atoms sometime before 2020, and as far as I can tell, we’re still on track to do so. Using those atoms costs money—and the atoms can’t be used for anything else without losing the data they represent. So left alone, big data will eventually, and literally, “eat the entire world” or at least the entire technology budget.
But the total of human digital experience isn’t as simple as the sum of each individual’s activity history. There are masses of duplication: intersecting and overlapping viewpoints, and “shared” experiences where viewpoints may differ but the view is the same for everyone. As challenging as it may seem (and will be), sooner or later we’re going to have to start reducing the amount of duplication in the stored data (or get more stored bits per atom).
And then there are all the things that don’t change much or at all from moment to moment (or day to day, or year to year). If the view is always the same, we can “edit” it out of the data and replace it with a “tag” that links each view to the first (that’s essentially how compression software works to reduce the size of large files). If we do this well, we can reduce the size of the stored data by over 80 percent and still keep enough to recreate every scene as it actually happened from the viewpoint of everyone who was involved.
[OK, for the math purists among you, I know some things get larger when processed by “lossless” compression algorithms, but in the real world, where there’s a lot of “well behaved” and static data, the 80 percent reduction percentage is a pretty good target]
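To make the “tag” idea concrete, here is a minimal Python sketch of one way to do it: each distinct view is stored once, and every repeat is replaced by a short tag pointing at the stored copy. Using a SHA-256 hash as the tag is an assumption chosen for illustration, as are the function names and the sample data. Real compression and deduplication software is considerably more sophisticated.

```python
import hashlib


def deduplicate(views):
    """Store each distinct view once; represent every view by a tag."""
    store = {}  # tag -> view bytes (one copy per distinct view)
    tags = []   # one tag per original view, in original order
    for view in views:
        tag = hashlib.sha256(view).hexdigest()
        store.setdefault(tag, view)  # keep only the first copy
        tags.append(tag)
    return store, tags


def reconstruct(store, tags):
    """Rebuild the original sequence of views from the tags."""
    return [store[tag] for tag in tags]


# Hypothetical sample: a scene that is mostly unchanged from moment to moment.
views = [b"empty street"] * 8 + [b"car passes", b"dog crosses"]
store, tags = deduplicate(views)

assert reconstruct(store, tags) == views  # nothing was lost
print(len(store), "distinct views kept out of", len(views))
```

In this toy case only 3 of the 10 views need to be stored in full. The tags themselves take space too, so the real saving depends on how large each view is compared with its tag.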
So we can probably push out the day when we will run out of atoms to store our bits by a couple of decades (maybe), but sooner or later we are going to run out. At which point “curation” strategies will become really important. Just what should be kept? Who and what gets edited out and essentially forgotten? If this seems unfair or unreasonable, remember that throughout history to date almost everything that happened has been forgotten. Only a tiny fraction of the total of human experience made it from generation to generation–especially before the invention of the printing press. It had to be pretty important to get remembered–and even so, plenty of pretty important things weren’t. Many great ideas were lost and had to be rediscovered–and it’s probable that some remain lost to this day.
So curation will matter. And so will who gets to be a curator, because curation decisions all too often depend on your view of what’s important.
And then there’s the time factor. The closer we get to recording all of human experience, the less time there is to go back and review what we recorded. Today, we can use the huge gaps in the total recorded and stored experience to watch what happened to others–real or imagined. But at close to 100 percent experience capture, there’ll be no time to do so. We will be living only going forward. And if we can’t ever go back and review the past (because by doing so we will miss being part of the present), why bother to record everything in the first place?
Finally, there’s entropy—which you can think of as the propensity for organized things to self-randomize over time. The more bits we store, the more bits will be randomly flipping from one to zero or vice versa, unless we watch them to make sure they don’t (we have to keep adding orderliness to the total system to counter the inclination to randomness). But the more bits we store, the more time we need to check for errors and the less time we have to do so. At some point, we’re going to be doing damage just with the checking process, which is also part of the entropic environment. Eventually, if the curators don’t delete you, entropy will—even if you’re important.
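“Watching” the bits usually means storing a small checksum next to the data and recomputing it from time to time. The Python sketch below is a simplified illustration under that assumption: it uses CRC-32 from the standard zlib module, and the helper names and the sample record are hypothetical.

```python
import zlib


def store_block(data: bytes):
    """Keep the data together with a checksum computed at write time."""
    return bytearray(data), zlib.crc32(data)


def flip_bit(block: bytearray, bit_index: int) -> None:
    """Simulate entropy: flip a single bit in place."""
    block[bit_index // 8] ^= 1 << (bit_index % 8)


def is_intact(block: bytearray, checksum: int) -> bool:
    """Re-read the whole block and compare against the stored checksum."""
    return zlib.crc32(bytes(block)) == checksum


block, checksum = store_block(b"a record worth keeping")
assert is_intact(block, checksum)      # freshly written: no damage

flip_bit(block, 13)                    # one stray bit flip
assert not is_intact(block, checksum)  # the check notices it
```

Note that is_intact has to read every byte of the block, so the cost of checking grows in step with the amount of data stored, which is exactly the squeeze described above. Detecting a flipped bit is also not the same as repairing it: that takes a redundant copy or an error-correcting code, which means storing still more bits.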
The Big Data frenzy we are experiencing today is just the tip of the iceberg. It’s going to get bigger, more complicated and more difficult to deal with. Not a pretty picture, even with all the claimed benefits.
And always remember Sturgeon’s Law: 90 percent of everything is, in general, crud. Which specific 90 percent depends on your point of view. Better start training as a curator, so you get to decide which points of view matter.
John Parkinson is an affiliate partner at Waterstone Management Group in Chicago. He has been a global business and technology executive and a strategist for more than 35 years, having served in both senior operating and advisory roles. John has served as a Chief Technology Officer and in similar capacities for large global companies including TransUnion, Cap Gemini, AXIS Capital and Ernst & Young Consulting.