What does "Data acquisition, cleaning and representativeness" research challenge mean for me?

In the current Big Policy Canvas Roadmap, a number of research challenges have been presented; they are the outcome of an extensive review of the literature and consultation with experts.

Of particular interest to me is the Research Challenge on Data acquisition, cleaning and representativeness.

The large majority of big data, from the most common, such as social media and search engine data, to transactions at self-checkouts in hotels or supermarkets, are generated for different and specific purposes. They are not designed by a researcher who elicits their collection with a theoretical framework of reference and an analytical strategy already in mind. Surveys, by contrast, are designed data-harvesting instruments. Survey designers are experts in the art of eliciting the types of values that allow the processes that generated them to be inferred, and that contribute in pre-understood ways to the statistical modelling and sample selection controls that will be used to model them. Surveys are deliberately designed to tame the effects of multiply entangled correlations. Big data, on the other hand, are just a large conglomerate of such correlations; very often they are not carefully designed at all. Twitter and big national surveys have both been used to analyse public opinion, but their data are different, and so what they can reveal about public opinion differs in each case. From this point of view, the debate between big data enthusiasts and sceptics should be formulated differently: there are social research questions and issues for which big data are exciting, and others for which ‘traditional’ social scientific methods are still more reliable and useful.

Therefore, one of the first characteristics of big data, highly relevant for the social scientist, is their ‘organic’ nature, in contrast with data ‘designed’ for research. Data are becoming a cheap commodity simply because society has created systems that automatically track transactions of all sorts. For example, Internet search engines build datasets with every entry, Twitter generates tweet data continuously, traffic cameras digitally count cars, scanners record purchases, and Internet sites capture and store mouse clicks. Collectively, human society is assembling massive amounts of behavioural data on a massive number of its behaviours. If we think of these processes as an ecosystem, it is self-measuring at an increasingly broad scope and scale. Indeed, we might label these data as ‘organic’, a now-natural feature of this ecosystem. Big data are thus considered ‘organic’: they are created by different actors in the context of producing or delivering goods or services, not for research. This contrasts with ‘designed’ data, those that are collected when we design experiments, questionnaires, focus groups, etc., and that do not exist until they are collected.

Researchers are not entirely new to this context of data use. There is a longstanding tradition of secondary data analysis, but there are some crucial differences as well. Secondary data in the social sciences generally indicate the reuse of existing datasets collected either by official institutions or by other researchers. While these datasets might not be collected for research purposes (though they often are), they are usually publicly accessible, and their methodological features are quite transparent. Some datasets are of exceptionally high quality, for example those from academic or research institutions as well as governmental organisations; others might be less reliable.

Common to big data is the idea of repurposing. Data that were initially collected for other aims are repurposed for new, specific research goals set by the secondary analyst. The difference is that for big data, especially those collected by private companies, the lack of transparency about how the data are collected or coded is a problem that digital social researchers have to face. Repurposing data requires a good understanding of the context in which they were generated in the first place. In other words, these data are not ‘natural’: they are the outcome of design decisions and socioeconomic processes, and are therefore created with certain goals and trade-offs. The task is to find a balance between identifying the weaknesses of the repurposed data and, at the same time, finding their strengths. A good practice for social scientists, one that applies not only to big data, is to think about the ideal dataset for their research and then compare it with what is available. This will make salient the problems and opportunities of what is available.