Big data – i.e. the large-scale collection, combination, interpretation and analysis of a diverse set of data sources in order to derive useful information – is taking the world by storm. It is considered the cure-all for a diverse set of hard societal problems. Although in specific instances the proper collection and analysis of data can be tremendously useful and worthwhile, there are many limits and pitfalls to the application of big data that one should be aware of.

Correlation is not the same as causality.

Combining and analysing data from disparate sources allows one, using the appropriate set of algorithms, to discern patterns in the data. It is tempting to think about the relationships discerned in these patterns in terms of causality: the fact that something happens while something else also happens means that the one event is caused by the other. As a consequence one might be tempted to think that by reducing the frequency of the first event (through some kind of intervention that depends on the type of event), the frequency of the other event will automatically decrease as well. This is wrong. Patterns only reveal correlations, and not (necessarily) causality.

The classical example to illustrate the difference runs as follows. The number of Catholic churches in a city correlates strongly with the number of crimes committed in that city. That does not mean that crime is caused by being Catholic (of either the perpetrator or the victim). The real explanation is simply that a big city has more inhabitants and hence more Catholics (and hence more Catholic churches) as well as more criminals. Demolishing churches (the intervention) will clearly not reduce crime.
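The church example can be reproduced with a small simulation. The sketch below is only an illustration under made-up assumptions: the town sizes, the rates of churches and crimes per inhabitant, and the noise ranges are all hypothetical. Both quantities depend on population only, never on each other, yet the raw counts correlate strongly. The correlation disappears once you divide by population.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical towns with between 5,000 and 500,000 inhabitants.
populations = [rng.randint(5_000, 500_000) for _ in range(500)]

# Assumed rates: churches and crimes each scale with population,
# with independent noise. Neither one influences the other.
churches = [p / 5_000 * rng.uniform(0.7, 1.3) for p in populations]
crimes = [p / 100 * rng.uniform(0.7, 1.3) for p in populations]

raw = pearson(churches, crimes)
per_capita = pearson(
    [c / p for c, p in zip(churches, populations)],
    [k / p for k, p in zip(crimes, populations)],
)

print(f"correlation of raw counts:       {raw:.2f}")
print(f"correlation of per-capita rates: {per_capita:.2f}")
```

Population is the hidden common cause here. Controlling for it (in this case simply by dividing by the number of inhabitants) removes the apparent relationship, which is why an intervention on churches would do nothing for crime.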

Start with problems that need solving, not with data you happen to have.

Applications of big data often get the order wrong. Instead of first trying to understand a particular problem, then determining whether it is a problem at all, and finally deciding how the problem can best be dealt with (it may not be solvable at all), big data turns the process around. The big data approach starts with collecting and analysing large sets of data. Depending on the patterns that emerge, one then looks at the type of problems these patterns might help solve.

This is problematic for several reasons. First of all, the decision to tackle a problem then depends on whether data is available to solve it, instead of data being collected to solve a problem that is deemed important. To paraphrase Ybo Buruma, it’s like looking for a crime committed by a known criminal instead of looking for a criminal that committed a known crime. Problems are prioritised based on the availability of data, and not on the intrinsic severity of the problem itself.

Moreover, some perceived problems are not even problems to begin with. Many aspects of human life are simply complex and inefficient (cooking, social life, politics) by definition. For example, there are more efficient ways to declare one team a winner over another than to have them play soccer for 90 minutes in a stadium crammed with spectators…

(Evgeny Morozov calls this reversal of order ‘solutionism’ in his book “To save everything, click here”.)

Convenience bias.

This reversal of order is particularly worrying if you realise that there is a real risk that instead of collecting the most relevant data for the problem at hand, people will settle for data that is easy (in terms of time, money or other resources) to collect. This leads to something I call convenience bias: people use the data that is most convenient for them to collect.

In fact, instead of collecting new data, people may prefer to use data that is already available. This creates additional issues if that data was collected for a different purpose, and therefore may be biased or simply inapplicable to the problem at hand. People fill in forms and answer questions depending on the purpose for which they think the question is asked. For example, if you ask people about their drinking behaviour, they may underestimate it for a health-related survey. To take another, silly example: if you used the answers to the question on the U.S. Customs Declaration (Form 6059-B) that asks whether people have recently been in close contact with livestock, you would severely underestimate the amount of livestock in the Netherlands ;-)

Data collection is always biased.

The previous pitfall is actually a very particular instance of a much larger problem: data is never neutral. That’s because the data collection process is always biased. In fact it is biased in at least three different ways.

First of all, the decision which data to gather is not neutral. The decision is made based on the interests of those who decide or have the power to influence the decision. A pharmaceutical company looking for data to show that a new medicine is safe to bring to the market will, once it is convinced the medicine works and has no significant side effects, have few incentives to collect data that shows otherwise. Such decisions are partly political and often based on existing power structures. For example, we collect huge amounts of data to combat social welfare fraud, but little data to combat, say, fraud by bankers or other high-ranking financial executives, even though the damage they do is an order of magnitude higher.

Once the decision which data to gather is made, the next question becomes from which sources the data will be sampled. It is well known in statistics that any bias in the sample population will show as a bias in the data collected, making it harder to draw generally applicable conclusions that extend beyond the peculiarities of the sample population. For example, if data is collected through smart phones, one should realise that not all people have smart phones… As Zeynep Tufekci writes:

“Even with 10 million subjects, your findings could be generally inapplicable. It all depends on who those 10 million were and how they were selected.”
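Tufekci's point can be made concrete with a toy simulation. The sketch below is only an illustration: the population, the way smartphone ownership declines with age, and the "hours online" measure are all hypothetical assumptions of mine. It compares a very large sample that only contains smartphone owners with a tiny sample drawn uniformly at random.

```python
import random

rng = random.Random(1)  # fixed seed so the run is repeatable

# Hypothetical population in which age drives both smartphone
# ownership and the quantity we want to measure (hours online per day).
population = []
for _ in range(200_000):
    age = rng.randint(18, 90)
    owns_phone = rng.random() < max(0.1, 1.0 - (age - 18) / 80)
    hours_online = max(0.0, 6.0 - 0.05 * (age - 18) + rng.gauss(0, 1))
    population.append((owns_phone, hours_online))


def mean(values):
    return sum(values) / len(values)


true_mean = mean([hours for _, hours in population])

# Large but biased sample: everybody we can reach through a smartphone.
phone_sample = [hours for owns, hours in population if owns]

# Small but unbiased sample: 500 people drawn uniformly at random.
random_sample = [hours for _, hours in rng.sample(population, 500)]

print(f"true mean:            {true_mean:.2f}")
print(f"phone sample  (n={len(phone_sample)}): {mean(phone_sample):.2f}")
print(f"random sample (n={len(random_sample)}): {mean(random_sample):.2f}")
```

Under these assumptions the smartphone sample is hundreds of times larger than the random one, yet it is systematically further from the true mean. Adding more smartphone owners would not help, because the error comes from who is in the sample, not from how many.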

Finally, once the sample population is determined, the question becomes how to extract the data from the sample. If you try to collect data through app usage of people in your sample, the design of the app may have a hidden influence on the data you collect. Say you are trying to determine whether people think it is important to adjust their privacy settings, and you do so by measuring how many people successfully change their privacy settings in your app. If your app happens to make it hard to successfully change the settings, you may (perhaps erroneously) conclude that your users do not care about privacy.

If you collect data by sampling sensors, the time at which you take the samples may affect the results. For example, if you want to measure how congested the roads are, you will not get very reliable data if you sample traffic patterns once a day at noon… If you ask people to fill in a questionnaire, what questions do you ask them, and in which order? All these aspects influence the values you measure or the answers you get, again influencing the final outcome.
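The traffic example is easy to sketch. The hourly counts below are invented for illustration (they are not measurements), with a morning and an evening rush hour. A single reading at noon completely misses both peaks.

```python
# Hypothetical vehicles per hour on one road during a single day.
# The list index is the hour of the day (0 = midnight, 12 = noon).
hourly_traffic = [
    50, 40, 30, 30, 60, 200, 700, 1500, 1800, 1100, 700, 600,
    650, 600, 650, 900, 1500, 1900, 1400, 800, 500, 300, 150, 80,
]

noon_reading = hourly_traffic[12]
daily_peak = max(hourly_traffic)
peak_hour = hourly_traffic.index(daily_peak)

print(f"reading at noon: {noon_reading} vehicles per hour")
print(f"daily peak:      {daily_peak} vehicles per hour at {peak_hour}:00")
```

With these made-up numbers the noon reading is less than half of the rush-hour peak, so anyone sizing roads on the noon figure alone would badly underestimate congestion.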

In other words, data is biased by what you collect, from whom and how.

You can lie with big data.

If Darrell Huff were still alive, he could write a sequel covering big data to his famous book “How to lie with statistics”. You can lie with big data, just as you can with statistics. This should come as no surprise, as big data is just statistics in disguise (except, perhaps, that it deals with much larger and more diverse data sets). In fact this becomes rather obvious if you realise that all the risks of bias discussed above can be exploited maliciously to arrive at a predetermined conclusion.

Big data intermediaries become powerful.

By its very nature big data deals with large volumes of data that require complex processing before the desired information can be derived from them. This requires large yet easily accessible data storage systems and powerful data processing equipment. It also requires specifically designed algorithms to distill and interpret (perhaps visualise) the desired information. As a result, big data is a highly specialised business that only a few companies and institutions are capable of delivering.

These big data intermediaries collect, combine, interpret and process the data on behalf of their clients (say, a government) to deliver the information desired by the client. They do so using their own, secret, algorithms and processes. Such intermediaries have a huge influence on the outcome, as it depends so much on the way the data is sampled and processed (as we discussed above). This position will give them tremendous power. Because of the sheer volume of data involved and the hidden ways in which it is processed, it is virtually impossible to independently verify the results, provided you even get access to the underlying raw data to begin with. This makes big data applications essentially unaccountable and unverifiable.

The question is: what do you do when the answer you get is 42?…

Garbage in equals garbage out.

When using big data it is important to realise that the answers you get are only as good as the underlying data you have at your disposal. If you put garbage in, you get garbage out. It is therefore crucial to ensure the integrity of the data you collect, and that any risk of bias has been mitigated or will be adjusted for later on in the process.

Big data may set new norms.

When viewing the world through big data glasses, you see the world in a different light. This may in fact be a good thing (in general I believe any new point of view allows one to learn something new), but it does carry certain risks as well. In particular, big data has some problematic normative aspects.

First of all, big data assumes the norm that it is alright (perhaps even mandatory) to collect large amounts of personal data to optimise or personalise a service.

Secondly (and as already argued above) the big data view may only perceive a problem when there is data that can solve it. In other words, problems for which there is no data are no problems. But this actually redefines the norm of what a problem is: it is no longer defined by having a question, having difficulty understanding something, or sensing distress or vexation.

Thirdly, big data shapes how problems can and should be solved. In his book “To save everything, click here” Evgeny Morozov gives the example of BinCam, which records the contents of your trash can, can post the images on Facebook, and allows your friends to comment when you throw away food or fail to separate your garbage. This way caring for the environment is transformed into a game in which you collect more points to advance to a next ‘level’ of environmentalism… The old norm ‘caring’ is replaced by the new norm ‘winning’. What happens when the game becomes boring, or hard to get better at, or if a new more exciting game comes along?

Moreover, big data is also normative in that outcomes of big data analytics may shape what is considered normal and what is not. If such analysis shows a strong correlation between (to return to the example above) churches and crime, it may affect how people perceive churches. Similarly, when applying big data to personalise services or to filter content, this redefines the notion of ‘appropriate’ content or service in a particular context. If many people prefer not to see offensive pictures in their newsfeed, the news aggregator may decide to not show offensive pictures to all of its subscribers.

Information gets lost in translation.

As mentioned earlier, big data applications typically rely on a large variety of data. The data that is processed is typically not obtained directly from the source. It has been preprocessed by several intermediate parties that each interpret, translate, combine and enrich the data based on their own (peculiar) understanding of the data and based on the information processing systems and databases they happen to work with. If the chain of preprocessing is long, the information may in the end get lost in translation.
