Friday, February 02, 2024

You can't burn data

As the concept of digital transformation takes root, you may frequently hear comparisons between data and oil.  After all, both are abundant commodities that can create value.  This comparison was strong enough to lead Wired magazine to call data the new oil in an article some years ago.

On the surface, this comparison seems to make some sense.  Both data and oil are commodities, and both exist in large volumes.  Both have the ability to create incredible wealth when harnessed appropriately.  Both can be used for good purposes or misused.  Both data and oil have interesting issues and side effects: pollution in the case of oil and loss of privacy in the case of data.

But the more you consider the two, the faster the analogy breaks down.  The analogy works at the surface, but when you think carefully about the value propositions and the issues associated with oil and with data, you'll see some really interesting similarities and some stark differences.  We gain value from oil by burning it to generate heat, light or kinetic energy.  The conversion of data to value simply isn't that straightforward.

Let's look at some interesting similarities first.


I'm not a data scientist, but I did start my career developing enterprise systems and worrying about the value and quality of data that flowed through those systems.  Data quality and data volume may not seem like interesting concepts, but when we talk about using data to "transform" a business or process, the data needs to be of the highest quality if decisions are to be based on the data.

Most businesses have a range of data types, data quality levels and data management practices, none of which leads to high-quality data.  And, as stated previously, without good-quality data, you cannot automate or transform anything.  So, all the talk of digital transformation is just talk until companies normalize, standardize and simplify their data and ensure it is of the highest quality and veracity.

Another problem many companies face is the Volume and Velocity of data.  And yes, I am capitalizing the Vs because a good way to think about data is in Vs:  Veracity, Volume, Variety and so on.  Veracity, meaning truthfulness or data quality, is obviously key, especially when training a machine learning algorithm.  However, veracity is difficult to achieve if the data comes in a variety of types and the sheer volume of data is so high that people and machines cannot determine what data is useful and what data isn't.

In other words, most companies need to refine their data, in much the same way that oil coming out of the ground needs to be refined.  Venezuela has perhaps the largest oil reserves on the planet, but since most of its oil is thick and loaded with sulfur, the oil there is of less value and needs more refining in order to be used effectively.  Oil straight from the ground can be refined into a number of different grades of fuel we consume, from bunker oil for ships to gasoline for cars to home heating oil and many other types.  But raw oil isn't all that useful until it is refined, and sweet crude from Texas is easier to turn into marketable products than Venezuelan oil.  In the same way, some data is more valuable, and easier to put to work, than other kinds of data.  As the saying often attributed to Einstein goes, not everything that can be counted counts.

Like oil, data needs to be managed, refined, combined and prepared to be used effectively.  Few firms have a strong handle on what it takes to manage data and ensure that the data is of high quality when it goes into a system, and remains high quality as it is used, combined with other data and manipulated.  When you stop to consider that the most common reporting mechanism in companies is Excel, where any cell can be changed or any calculation can be written incorrectly, there are thousands of opportunities for even high-quality data to be misused or reported incorrectly.


While it is a simple metaphor to compare oil to data, it is also grossly misleading, for several reasons.

First, oil is a non-renewable commodity.  That is, there is only so much oil in the ground, and it is getting more difficult to extract what's left.  On the other hand, data is getting created every day, by billions of people, about millions of topics, in hundreds of types.  Data is not only renewable, but it is almost inexhaustible, limited only by human creativity and need.  Each month we create as much data as was created cumulatively from the dawn of writing until last year.  The problem isn't a shortage of data; the problem is the volume and veracity of the data we are creating.  With so much data generated from so many platforms, how do we know what data is useful and meaningful, and what data is created merely to distract or confuse?  I can imagine it will soon be possible, if it is not already, to create AIs specifically for the purpose of creating seemingly valid information that has little or no basis in reality, to confuse or distort economic projections or scientific inquiry.

Second, while oil was the basis for carving up nation-states in the Middle East after the First World War, data will be the dividing line in the future.  Oil to a great extent is a monopoly based on accidental or purposeful geography.  Some countries - Saudi Arabia, Venezuela and others - have lots of oil, while others - Japan for example - have little or no oil reserves.  Japan has been a major success for a country with limited natural resources such as oil, but data is unlike oil.  Anyone can create data, and almost all of us do so every day.  Increasingly, it's not countries or geographies that control data, but companies.  One could easily say that Meta is the Saudi Arabia of data, since it has so much data about so many individuals.  Nations, which once coveted oil, are just waking up to the value of data, realizing how much power data provides and wondering how virtual companies like Meta and Google have managed to hoover up so much data and become so powerful.  Why is Mark Zuckerberg advising Congress on data?  Because Meta controls more data and has more insight than most agencies in the US Government.

Third, oil has a provenance and a supply chain, for the most part.  That is, we talk about "sweet Texas" crude or Saudi oil or Venezuelan oil.  While oil is a commodity, it typically has a provenance that indicates its value, and further there is a value chain associated with oil.  Oil moves from a driller through a pipeline or other distributor to a refiner and then on to another distributor, a wholesaler and a retailer.  In other words, there are specific value-added steps that turn crude oil into the marketable commodity (such as retail gasoline) that we consume.  Data does not necessarily come with a provenance and does not need a supply chain to reach a consumer.  Anyone can post a data set that they've created, create their own research or surveys, and make that data accessible to almost anyone, instantaneously, on the internet.

Next, oil is a commodity, and certain grades of oil are priced in global markets, regardless of where they are drilled.  Data, on the other hand, is often not quantifiable as to its value or price, and the same data set can be more valuable or less valuable, depending on where it is, who has it and who needs it.  A list of names can be very important, if the list is a list of spies in a foreign country, or conversely very unimportant if it is a grocery list for a suburban family.  But if a company could compile thousands of grocery shopping lists for families in similar circumstances, that compendium of data would become valuable to grocery chains and the brands that sell on grocery store shelves.  The value of data is in the eye of the beholder, and it varies with a number of factors.

Finally, since oil is a commodity, it is easy to acquire in a specific grade at a specific price.  Data, on the other hand, is difficult to aggregate, difficult to grade and almost impossible to price.  I recently acquired a list of prospect companies and executives that promised to be up to date and highly accurate.  As I scanned through just the first fifty to one hundred names, I quickly found several mistakes or omissions that should have been easy to identify.  It is hard to keep many types of data accurate and fresh, and difficult to validate data without human intervention, which makes almost all data stores suspect to some degree and in need of constant evaluation and refreshing.

Thanks for the analogy, what does this all mean?

In the end, the analogy between oil and data falls apart.  Oil is a standard commodity that for the most part we burn to gain heat or kinetic energy.  We transform oil into one or two potential outcomes, at specified temperatures and pressures.

Data, on the other hand, is not a commodity, is perishable, is constantly renewing and generative, and is difficult to price or establish a value for.  We also find it difficult to create real quality metrics for data.  The same data sets have different values at different times to different users.

What this means is that the talk of digital transformation - using data to fuel a new era of economic growth, in the same way that oil spawned economic growth and benefits in the 20th century - is optimistic, or true in some circumstances but not in others.  Companies that want to transform themselves to be more effective and to create new revenue streams based on data must first address the questions of veracity of data, volumes of data and varieties of data.  In a time when analysts suggest that less than 10% of the data that companies currently possess is being used to create value or insight, what leaps of technological advancement and data quality improvement are necessary to base a new economic model on driving competitive advantage from data?  After all, we aren't converting a commodity into heat, light or kinetic energy.  We are transforming trillions of data sets from millions of data sources into useful insight, which companies can act on.

This means that the benefits of data will be widely but unevenly distributed, highly beneficial in some industries or market segments or niches, and almost impossible to understand in other segments of the economy, for decades yet to come.  Those companies and industries that cleanse their data, understand the variety and volumes of data, and can make sense of the data to convert it into useful insights that indicate future actions will benefit to an enormous degree.  The rich will get far richer.

Many industries will spend billions of dollars trying to get the same benefits, but will lack the basics - good, high-quality data, understanding which data matters, being able to capture and use the most effective and useful data, and being able to convert the data into beneficial actions. 

posted by Jeffrey Phillips at 8:20 AM

