Thought Piece

Cooking with big data: Why you can’t just press download

Working with big data is not the same as downloading data from the Office for National Statistics (ONS). Perhaps the most common misconception I’ve come across since joining The Data City is the expectation that it is. Let me set out here why, and what this means for your analysis.

Let’s go shopping 

The analogy I like to use for the difference between ‘traditional’ and ‘big’ data is a trip to the supermarket. Going to a website like the ONS’s and downloading aggregated statistics is like going straight to the ready meal aisle. The data is, or so the user assumes, pretty much ready to go. 
 
Using big data, on the other hand, is much more like going to the fresh fruit and vegetable aisle. There’s work to do to turn it into a meal on the dinner table. 
 
There are upsides and downsides to both. So please forgive me for stretching the supermarket analogy to the extreme to show you what they are.

ONS and government data is ‘oven ready’

Official statistics have three main benefits: 

  • The data is ready to go. You download it and start your analysis straight away. Someone else has done the cooking, and you just have to pop the ready meal in the microwave. 
  • Definitions in the data are fixed. Sector definitions, for example, don’t change through time. You can be pretty sure that when you go back to buy the same ready meal again it is going to taste the same. 
  • It’s free at the point of use, being funded by the taxpayer.

But official statistics also bring with them the following downsides:

  • The creation of this data is a bit of a black box. You don’t really know what has gone into it. What is the quality of the ingredients in your meal, for example?
  • Definitions of the data are fixed. While this allows you to compare apples with apples (the benefit above), the evolution of the economy means that those definitions may become less fit for purpose. A cookbook today looks very different to one from thirty years ago, because tastes have changed, the range of ingredients is broader, and healthy eating is a far bigger consideration.

With big data you’ve got to do the cooking yourself (but it’s worth it!)

In contrast, the benefits of big data are: 

  • Flexibility. You can cut and cross-tabulate the data in many ways to shape the output the way you need it, rather than having to work with pre-defined outputs. So you can cut the peas from your recipe if you don’t like them. 
  • It’s real-time. Often you’re working with data that was generated in recent days, not some time ago. You get to pick fresh produce. 
  • It’s dynamic. You can flex definitions to reflect changes in the economy or people’s behaviours. So you can amend your recipe if you think you can improve the taste. 
  • You can see where the problems are. All data can have problems, whether it has an official statistics stamp or not. But with big data you can usually go and find where that problem is and correct it; it isn’t a black box. You can find the mouldy carrot and put it in the bin. (There’s a small sketch of what this looks like in practice after this list.)
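
To make the first and last of these points concrete, here is a minimal sketch in Python using pandas. The records and column names are invented for illustration; they are not our actual schema or pipeline.

```python
import pandas as pd

# Hypothetical company-level records; the columns are invented for
# illustration, not The Data City's actual schema.
companies = pd.DataFrame({
    "name":     ["Acme AI", "Leeds Robotics", "Widget Co", "Acme AI"],
    "sector":   ["AI", "Robotics", "Manufacturing", "AI"],
    "region":   ["Yorkshire", "Yorkshire", "North West", "Yorkshire"],
    "turnover": [1.2, 0.8, -5.0, 1.2],  # GBP millions; note the impossible negative
})

# Flexibility: cross-tabulate whichever way the question demands,
# rather than accepting a pre-defined output.
print(pd.crosstab(companies["sector"], companies["region"]))

# Spotting problems: the 'mouldy carrot' is visible in the microdata,
# so it can be found and binned rather than left hidden in an aggregate.
print(companies[companies["turnover"] < 0])
companies = companies.drop_duplicates()
companies = companies[companies["turnover"] >= 0]
```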

So, what’s the downside?

Like cooking a good-quality meal, working with big data requires time and expertise. Creating aggregated statistics from big data can take a long time and a lot of preparation just to reach the equivalent of pressing ‘download’ on a statistics website. The sketch below gives a flavour of those steps.
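
This is a rough illustration only: it walks through three typical steps (cleaning, classifying, aggregating) on invented records, with a toy keyword rule standing in for a real classification model. Every name in it is hypothetical.

```python
import pandas as pd

# Invented raw records: duplicates and gaps included on purpose.
raw = pd.DataFrame({
    "company_id":  [1, 2, 2, 3, 4],
    "region":      ["Yorkshire", "Yorkshire", "Yorkshire", "North West", None],
    "description": ["software consultancy", "robot maker", "robot maker",
                    "software testing", "bakery"],
})

# Step 1: clean. Drop duplicate and incomplete records.
clean = raw.drop_duplicates(subset="company_id").dropna()

# Step 2: classify. A toy keyword rule stands in for a real model.
clean["sector"] = clean["description"].str.contains("software").map(
    {True: "Tech", False: "Other"}
)

# Step 3: aggregate. Only now is the 'ready meal' on the table:
# counts by region and sector, the kind of output you would otherwise
# simply download.
print(clean.groupby(["region", "sector"]).size().unstack(fill_value=0))
```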

Using big data requires a slightly different mindset 

There are three implications of this for users of big data. The first is to view big data as a rich source of information that you can craft to help answer your research questions, rather than as a place to take answers off the shelf. We are currently working on making a version of our data look more like this (with the benefit that you can also look under the hood if you’d like to), but it is impossible to do this for every single user need. You need to do some of the cooking. 

The second is to become more comfortable with flexibility. For years we have been conditioned to think that sectors don’t evolve through time and that companies fit neatly into one sector or another, so the thought of a dataset where neither holds causes concern. Being able to compare through time is important, but we should embrace change and flexibility rather than seeing them as downsides. Why? Because they reflect the world around us. And that is ultimately why we collect data in the first place. 

The third is that being able to spot and correct mistakes is a good thing. No-one wants avoidable mistakes in their product, but when you deal with millions of data points, some are inevitable. If we accept that mistakes also lurk in the microdata of official datasets, which they most certainly do, then being able to spot and correct them is undoubtedly an improvement. 

And of course, if you’re ever in doubt about this, come and talk to us.

About the author