Data and Big Data in the Sciences
Formal Metadata
Number of Parts | 2
License | CC Attribution 3.0 Germany: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers | 10.5446/44031 (DOI)
Transcript: English (auto-generated)
00:05
What I'm going to try to do in this talk is to give a relatively broad perspective on the discussions that have been going on about big data, especially in scientific research, looking at
00:20
this from the perspective of the field where I work, which is philosophy of science. So I will start with this graph. It doesn't really matter what exactly is going on in this graph; I'm just going to show it to you as an example of the kind of graphs, pictures and artworks that I think
00:41
we've been seeing quite a lot in the last few years. So this graph in particular shows you the volume of data stored in data centres worldwide from 2015 up to 2021, and what we can see here is a rather staggering increase in the amount of data stored. The reason why I say it
01:07
doesn't really matter what exactly is going on in this graph is that I'm just going to use it as an example, because I think we've all encountered many graphs of this sort on news websites, on TV, in
01:21
articles and so on, showing a staggering increase in the amount of data, in this case stored, but it could also be shared, collected, uploaded, whatever. And this is the kind of phenomenon that is usually discussed in terms of big data. The interesting thing, though, is that,
01:43
associated with this idea of an increase in volumes of data, we've been seeing a number of rather bold claims attached to this phenomenon. For instance, this is a front cover of The Economist, one of the most important
02:01
weekly magazines in the world, arguably, from a couple of years ago, basically gesturing to the idea that data is the new oil, right, that data is the world's most important resource right now. And then more broadly what we've seen is claims like the one we can see here in this
02:25
book, arguing that big data is going to bring about a revolution, and that this revolution is going to affect many different areas. For instance, here we have farmers, doctors, insurance agents, arguably very different types of jobs, but
02:41
we've seen this popping up in many areas, in reports, books, articles etc. So in the context of this talk I'm going to take a little bit of a step back and try to answer a couple of questions. What is big data, what is big scientific data, why is it
03:07
something that we should care about, and why should we discuss it from a philosophical perspective? What do we gain from looking at this from specific philosophical perspectives? I'm going to do this by first
03:23
of all providing what is hopefully going to be a little bit of disambiguation of this whole idea of big data. Then I'm going to look at the kind of research that has been going on in various areas of academia on this
03:41
topic of big data, and then I'm going to zoom in on the research that has been done in philosophy of science in the last few years, including the research that I have done in the context of my PhD on this idea of big data in the sciences. So to start off, where does big data as a term come from?
04:05
Actually, if you look at the history of the term, it started to be used from around the mid-90s to the early 2000s in the computing industry, and at the time the idea of big data was associated with three traits, with
04:23
three categories. Data was big data insofar as there was a large volume of data, insofar as that large volume of data was collected at a very fast rate, and insofar as that large and rapidly produced data was actually of
04:45
a wide variety in terms of types, the kind of stuff that it is about, etc. And this has come to be known as the three V's definition of big data, which actually still comes up pretty much everywhere today, so
05:02
from the 90s it's still with us, and especially, I would argue, in the business sector, the presentation of big data and how it is going to change your industry or whatever is often in these terms of the three V's. The thing is, you know, when
05:20
academic researchers started looking into this, things got a little bit more complicated, as usually happens, in the sense that people started to point out that this definition is a rather problematic one, to the point that Kitchin and McArdle, two sociologists, argued that the three V's
05:42
is actually a meme, something that has led to great confusion about what big data actually is and how it should be defined, to the point that Kitchin and McArdle also say that if you
06:01
look at the uses of big data as a term, we can see many other dimensions, many other traits that are usually attached to big data. At the same time they also say that from a review of the literature they can find at least 26 different definitions of big data. So it's kind of complicated and confused, and to top this off,
06:26
people have also pointed out that as a term big data can be problematic because it's always going to be a relational term. What is big in a specific context might not be in another one, right? So what
06:41
is big data in a specific sector of industry might not actually be in another one. Right, so at the same time, as I said, what I think is quite interesting about big data is the kind of rhetoric that we've seen forming around this idea of an increase in data, which it's not
07:03
quite clear exactly how to define. In particular, this rhetoric has associated the increase in data with some kind of revolutionary tale, with tales of disruption, paradigm change, etc., to an extent, I would argue, extending narratives of
07:24
disruption and change from the high-tech industry to other areas of research, in the sense that, as I showed you in the slide earlier, we've seen this idea of a revolution due to big data applied in many areas of research,
07:43
industry, business, politics, etc., in science and society more generally, to the point that big data has become a bit of a buzzword, I would argue, and maybe increasingly now it's
08:02
something like AI that is this kind of buzzword. And I'm just going to show you a bit of a silly thing, which I love very much though. There's someone who set up a tweet bot which searches for tweets and substitutes the words big data with Batman.
08:22
This is not me. I haven't done this, but I wish I had. It would be one of my great achievements. But this is to show that, of course, when you see it with Batman in place of big data it's going to look silly, right? It also shows that maybe with big data it would have looked a little bit weird too, right,
08:43
to show you this idea that big data has become a bit of a buzzword really. Right, but as I said, what I have mostly been interested in in my own work is that these tales of revolution, this rhetoric of big data, has
09:04
interestingly been applied quite a bit to the scientific area, meaning that this idea of a big data revolution is something we increasingly see in the sciences as well. And here I'm just
09:21
going to mention a couple of rather famous examples of this. This is a book from a few years ago now, where Viktor Mayer-Schönberger and Kenneth Cukier talk about the big data revolution and talk about many
09:41
things. But I'm just going to quote something from them here, where they say that essentially, thanks to big data, we're going to have more and more correlations in our data sets, but that is not really going to be a problem. Actually this is going to be much better, because in many cases
10:00
correlations will be enough. We won't have to worry about why these two things are correlated. We won't have to worry about the typical issues that correlations might have, such as spurious correlations, confounding, etc. And for anyone interested in science this is quite a bold claim, because
10:24
"correlation isn't causation" is kind of science 101, in a way, right? And then I'm going to show you a possibly even more famous example here: Chris Anderson, at the time editor-in-chief of Wired, arguing that basically as a consequence of big
10:45
data we're going to see the death of theory. We won't have to interpret the data anymore. We won't need to do any modeling or theoretical interpretation of the data. The data is going to speak for itself and it's going to be fine. Right. OK. So
11:03
what I've done so far is problematize a little bit the concept of big data and show you what I think are some of the more interesting traits of a certain rhetoric that has surrounded big data. As a result of this rhetoric, though, we've seen a
11:22
kind of response in academic research and, I would say, the rise of an interdisciplinary and quite broad debate, which has come to be known as data studies, where basically the typical claims of
11:40
the big data rhetoric have been quite closely analyzed and assessed for their validity, essentially. And the output of much of this debate has been to push back against some of the elements of the big data
12:03
rhetoric. And here I'm just going to mention a few examples. For instance, scholars have argued against the idea that having bigger, more data is going to lead to more objective research. Others have also argued that big data doesn't
12:23
really give you a complete description of a phenomenon, and there are probably always going to be sampling issues, right, issues of representation in your data set. And also that it's not as if, as a consequence of big data, we won't have to deal
12:41
with any ethical considerations. At the same time, zooming in on this data studies debate, as a consequence of the focus of quite a bit of the big data rhetoric on the sciences, philosophers of science have stepped
13:02
into this debate. And when I talk about philosophy of science, I think it's often difficult to define your own discipline, but I'm going to do my best by being very vague. Philosophy of science is a subfield of academic philosophy which is, in general terms, interested in
13:22
studying what science is, what for instance differentiates science from non-science or pseudoscience, how science changes and progresses, the kind of role that science plays in society, and how we can improve scientific research. And it only
13:42
makes sense that philosophers of science have started being interested in these claims of the big data rhetoric about scientific research, because they take issue with what the scientific method is, which is one of the standard topics of philosophy of science. And then at the same time one of the standard
14:01
topics, traditional topics, of philosophy of science is how science changes, scientific revolutions, et cetera. So arguing that big data science is a scientific revolution is immediately something interesting for philosophers of science. And so philosophers of science have looked at this
14:20
issue as well. And I would quote some rather friends. Basically their reaction has been: well, is this really a scientific revolution? Well, maybe, but possibly not as clearly as proponents of this rhetoric surrounding big data
14:40
would argue. More particularly, philosophers of science have looked at specific cases of research involving large volumes of data and have, for instance, argued that in many cases big data doesn't really bring about a new epistemology, a new overall epistemology of
15:01
scientific research. Philosophers have also argued that big data science, data-intensive science, can't really be theory-free, at least not as simply as the rhetoric would say, and that relying only on correlations can also be
15:21
problematic. And then more generally, philosophers and historians of science have also argued that in many areas the use of large data sets might not actually be such a historical novelty. So quite a bit of critical reflection
15:41
on this. So I would argue that this has been kind of a first phase of the debate on big data science, kind of like a pars destruens, to quote Bacon here, another friend from the
16:02
1500s, who argued that basically research should always have a pars destruens and a pars construens, you know, the critical perspective but also the more constructive perspective. And I would argue that especially in philosophy people have started off with this critique of certain
16:20
elements of the big data rhetoric, but I think now we've moved a little bit beyond that part of the debate and more towards a constructive perspective on data, meaning that basically data has become an interesting element to study from a philosophical perspective. And I'm going to mention a few possible questions that come up in this area.
16:44
One, which is possibly the most fundamental question, is: what actually is data? What is scientific data? What counts as data? You know, we see very different types of data in scientific research. So can we give an
17:01
account of what data is and, relatedly, what kind of role does data play in scientific epistemology? People have looked at the role of hypotheses, theories, models, etc. What epistemic role does data play in this epistemology of
17:23
science? And how can scientific data be best used in scientific research? In the next few slides I'm mostly going to focus on this question but I'm going to talk about these other things as well because I think they're very much
17:41
related, right? Okay, I'm going to start with what is arguably a familiar picture. Well, you'll tell me whether it's a familiar and intuitive picture. I would say that generally when we think of data we tend to think of
18:02
data in these terms. First of all, data is a given, something that is there independently of our action, and this is tied to the very etymology of the word data, right, from the Latin datum, what is there, what is given, right? But at the same time data is
18:22
not just there, it's not just a given, but it's something that tells you something about the world in very broad terms. It's something that has a representational content. It's something that has information built in, as it were. And then at the same time a certain traditional view of data, I would say, argues that this given that
18:45
tells you something about the world tells you something about the world independently of how you look at it, independently of the context. It has this fixed informational content. It's going to tell you something about the world independently of whether you look at it today or
19:02
tomorrow, something like that, to make it simple. In a way, and I'm stretching the metaphor here a little bit, data is kind of like a mirror in this traditional view, right: it tells you something about the world and is there to tell you
19:21
something about the world. I'm stretching the metaphor here, as you probably realize. Well, the thing is, though, that I'm going to argue that this traditional picture doesn't really work if you look at the ways in which data is used in scientific research. And I'm going to talk about this from the perspective of
19:41
the kind of research that I've done in my PhD. In my PhD I've looked at the use of data in a data-intensive context in biomedical research, in particular in research looking at relations between exposure and disease, so epidemiology essentially, where
20:02
people essentially look, at the population level, at the relation between, say, exposure to air pollution and the development of some sort of disease; it could be for instance cardiovascular disease, lung cancer, whatever. And I've looked at this because I think it's a very interesting area of research where
20:22
actually you have a large volume of data that comes from very different areas, you know. You have data more traditionally from epidemiology, so for instance blood samples that are collected at a population level, but you also have data from molecular biology: these blood samples analyzed at the molecular level.
20:43
You have data from environmental science on air pollution. You also have data from, for instance, wearable devices, right? You could do kind of like an experiment with that. And then this is a very, I would say, a very important and definitely highly funded area of research
21:02
from, for instance, the EU, because it's supposed to give results and evidence for very important policy issues like public health, right? And then I've also looked at this because these uses of data have been coupled with some changes at the theoretical level. So
21:22
it's been interesting for me to look at in that regard. The thing is, though, to come back to the traditional picture: does the traditional picture work in this context? I would argue it doesn't really, because basically, you know, first
21:41
of all, and this is probably going to ring true to anyone dealing with information management, it's difficult to see how data could be a given, in the sense that data has to be manipulated quite a bit. So in the case I looked at, it has to be, you know, collected,
22:02
archived, stored, labeled; it has to travel from one site to another. So it's difficult to see how data could be a given insofar as it has to be very much manipulated in order to be used as evidence in
22:21
the first place. At the same time, data is used to represent many different types of phenomena. For instance, in the context of this biomedical research, we could think of using data on your diet as a way of studying the kind of stuff that you're exposed
22:42
to externally, right? But actually, that data about, for instance, how much pasta you eat in one week is going to become diet data, data about your exposure, only in the specific context of the research that these people in biomedical
23:03
research do, because actually that data could be used for many other things. It could be used for a study of cultural practices, for instance, I don't know, culinary cultural practices in Hanover; it could be used to study your socioeconomic status, right?
23:21
What you eat is very much correlated with the kind of socioeconomic status that you have. So actually, data is used to represent various things. Data doesn't really have this fixed informational content, doesn't really have this fixed representational content. The representation of the data, the kind of stuff the data
23:42
represents, is going to be a consequence of specific contexts, of specific questions that you ask the data, as it were, of specific perspectives from which you look at the data. And I'm getting at the third point here, meaning that
24:02
data has to be interpreted, and the kind of theories, the kind of perspective you're coming from, the kind of material expertise, the kind of financial and organizational infrastructure that surrounds it, is going to be crucial for interpretation, in the sense that, for instance,
24:22
the data about the amount of pasta that you eat is going to be diet data in a study of exposure on the basis of specific theoretical commitments, on the basis of the theoretical commitment that diet is, for instance, an element of external exposure. So this is to say that
24:42
basically data doesn't really have this representational content that is fixed and independent of how you look at it. Data is always going to be contextual to specific questions and research. And as a consequence of analyses of this kind,
25:04
philosophers of science, and I'm here quoting work by a philosopher of science, in particular of biology, Sabina Leonelli, have argued for moving beyond this traditional mirror, representational view of data towards what she calls a relational
25:21
approach to data, where basically the idea would be that the value of data, what data can represent, how data can be meaningful to specific research contexts, is not really in the data itself, in its material components and features,
25:40
but rather in the interpretations, the processing, the kind of questions that we ask. So the value of the data is going to be determined by the kind of processing that we do with the data and the purposes of that processing. And so how can this be interesting, how can this
26:03
be useful when looking at the role of data, and maybe also when looking at this approach as a way of making the best out of big data? I would just mention a few things here. As a result of this different view on data,
26:21
data basically is never really raw, because data is always going to be an artifact, right? In the use of data for something, we're always going to have specific interpretations, there are going to be values,
26:41
judgments that are made, there are going to be biases. So the extraction of value and the extraction of meaning from data is not going to be neutral. And this runs counter to certain aspects of the big data rhetoric. At the same time, data in this perspective, being
27:01
a relational entity, needs alignment with other specific components of research, needs alignment with infrastructure, for instance, and the kind of work that, for instance, libraries do is very important in that regard. And this means that basically data mining is extremely important,
27:20
but data stewardship, right, the collection, the basic research, the infrastructure that surrounds the data, is also going to be extremely important. And this runs counter to some of the investments that we've seen in the last few years, also in a research context. And then lastly, since the significance of the data
27:43
arises from the relations between data and, say, the specific theories that we use to interpret the data, these relations should be made as transparent as possible. It is very important to know where the data comes from, how the data has been processed, labeled, etc.
28:01
And we see, for instance, that in many community databases in the sciences, there's a very big focus on metadata and labeling as a way of making sure that it would be possible to use the data as evidence for many research contexts.
28:25
OK, I'm going to conclude, and I'm looking forward to questions from you. What I've tried to do is provide a little bit of disambiguation between data and big data,
28:40
cautioning against some of the elements of the big data rhetoric, suggesting that maybe tales of revolution may be a little exaggerated, and arguing for a different approach to data as possibly the best way of approaching data
29:00
epistemology and making the most of big data. And I'm going to conclude here by going full circle with something that I showed you right at the start of my talk, this idea of data as the new oil, right? And actually, you know, if you look at it, in a way
29:23
it could be a little bit like oil, but arguably not really in the sense that The Economist wanted to convey here. Basically, data is not really like oil because it's not really like a natural resource. It's not something that is there independently of what we do.
29:44
But it's kind of like oil in the sense that the value of data has to be extracted from the data, and the extraction of value is going to determine whether the use of the data is actually good or bad. Kind of like with oil: extraction of oil can also be bad.
30:02
Extraction of stuff from data can also be bad. It's not necessarily positive and revolutionary. So thanks very much.