Preliminary analysis of crowdsourced sound data with FOSS
Formal Metadata

Title: Preliminary analysis of crowdsourced sound data with FOSS
Series: FOSDEM 2023 (part 425 of 542)
License: CC Attribution 2.0 Belgium: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifier: 10.5446/61909 (DOI)
Transcript: English (auto-generated)
00:06
Okay, thank you. Thank you for coming to this presentation. I'm Nicolas Roelandt from Gustave Eiffel University, and I will be presenting some research we did on crowdsourced sound data.
00:26
Some analysis we did with three open source software tools. I will be presenting work done by myself, Pierre Roman, and Ludovic Moison from Gustave Eiffel University. So, traffic noise is a major health concern.
00:44
In Western Europe, the World Health Organization estimates that we lose one million healthy life years each year. In France, we have estimated the social cost,
01:01
so the cost to the community, at 147.1 billion euros per year. So it has a monetary cost, but also a cost on people and their health.
01:23
So the big question is how we can find where noise is problematic. And so, of course, we can't have direct measure everywhere. We can't put a microphone everywhere.
01:41
It would be a cost nightmare, a logistics nightmare, and a privacy nightmare. Of course, it's not possible. So the traditional way is to simulate the noise from traffic counts. So we put counters on roads
02:02
and estimate the vehicle traffic. We do that on trains, on planes, on rail tracks, and we simulate from those traffic counts. And we produce this kind of map with, for example, NoiseModelling, which is an application we developed with the UMRAE laboratory
02:28
that can compute noise maps from these counts. And this is a legal requirement from the European Commission. Another way, which the UMRAE, working on environmental acoustics, explores,
02:45
is not to simulate, but to get actual data, real data, from contributors using a smartphone application you can install. It works on smartphones.
03:02
It's available on Android, and it's free software. It measures several things, like your position and the sound spectrum. Not the full spectrum, just the third-octave bands, so you can't understand what people are saying if someone is speaking,
03:26
but you can detect that someone is speaking. You also have the sound level and some other kinds of information. So it's part of a bigger project, the Noise-Planet project.
03:41
So we have this NoiseModelling application that generates noise maps from open source geodata, mostly French geodata and OpenStreetMap. So when you use StreetComplete to say, okay, this is grass and this is macadam, we use that data to generate more precise sound maps,
04:05
and NoiseCapture to measure and share sound environments. All this data is stored in a spatial data infrastructure called OnoMap. There are also some community maps made by the users. This is a map of all the recordings we have over nearly five years.
04:27
So you can see it's worldwide. It's just not only France or Europe, it's worldwide. So the question was, what can we do with all this data we collect?
04:42
So there was an extraction in 2021 of the first three years of data collection. The data is still being collected, but this extract contains 260,000 tracks.
05:02
These tracks are recordings from all over the world, with the sound spectrum, like I said, GPS localization, and also some tags the contributor can provide. It's under an open database license, so it's free to use.
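As a rough illustration, one track in this extract could be modeled as follows. This is a sketch only: the field names are hypothetical and do not reflect the actual schema of the NoiseCapture dump.

```python
from dataclasses import dataclass, field

# Hypothetical model of one NoiseCapture track as described in the talk:
# third-octave sound spectrum, GPS localization, duration, and optional
# user-provided tags. Field names are illustrative, not the real schema.

@dataclass
class Track:
    track_id: str
    spectrum_db: list[float]     # third-octave band levels, in dB
    latitude: float
    longitude: float
    duration_s: float
    tags: list[str] = field(default_factory=list)  # e.g. ["road", "chatting"]

t = Track("abc123", [42.0, 45.5, 40.1], 48.85, 2.35, 30.0, ["road"])
print("road" in t.tags)  # True
```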
05:23
So the question is how we can characterize the sound environment of the user at the moment of the recording, using the collected data. We thought of two possibilities. One is from the sound spectrum we record, and that is an ongoing analysis.
05:42
It's not the easiest way to do it, because we have to find patterns in the recordings, and we have to use machine learning to detect these patterns across all this data. So that is still ongoing, but there is an easier way,
06:04
and this is the way I used: the tags provided by the contributors. So in this subset, like I said, there are 260,000 tracks.
06:21
Half of them have tags, so we can use just that half. About 50,000 are outdoors and are not tests: we want to work on outdoor sound environments, so we discard indoor and test-tagged tracks.
06:44
We also removed the very short ones, less than five seconds, to discard tracks that might be accidental recordings.
07:00
And for this preliminary work we also restricted ourselves to France, because we are French and it's easier for us to understand what's happening. That leaves nearly 12,000 tracks. And like I said, road noise is a major concern,
07:20
and it appears directly in our data, because the most frequent tag is road. In maybe a third of our subset, there is road noise.
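The filtering steps described above can be sketched like this. The analysis in the talk is done with SQL and R; this Python sketch only illustrates the logic, and the column and tag names are assumptions, not the actual schema.

```python
# Sketch of the track-filtering steps described in the talk.
# Field and tag names ("tags", "duration_s", "country") are assumptions.

def keep_track(track: dict) -> bool:
    """Keep tracks that are tagged, outdoor, not tests, at least
    5 seconds long, and recorded in France."""
    tags = track.get("tags") or []
    if not tags:                        # only half the tracks carry tags
        return False
    if "indoor" in tags or "test" in tags:
        return False                    # outdoor, non-test tracks only
    if track.get("duration_s", 0) < 5:  # drop likely accidental recordings
        return False
    return track.get("country") == "FR"  # preliminary work: France only

tracks = [
    {"tags": ["road"], "duration_s": 30, "country": "FR"},
    {"tags": [], "duration_s": 30, "country": "FR"},        # no tags
    {"tags": ["test"], "duration_s": 30, "country": "FR"},  # test track
    {"tags": ["road"], "duration_s": 3, "country": "FR"},   # too short
    {"tags": ["road"], "duration_s": 30, "country": "DE"},  # not France
]
subset = [t for t in tracks if keep_track(t)]
print(len(subset))  # only the first track survives
```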
07:41
The second one is chatting, and we also have tags like wind, animals, sounds, works. There are 12 different tags the user can provide. So we used a quite simple toolkit to analyze the data.
08:02
First is the PostgreSQL database, because the data is provided as a PostgreSQL dump. So in order to access it, you have to rebuild the database. The other tool we use is R,
08:21
because in the team we are mostly R users; we also have Python, but we are more familiar with R. So, two tools, simple, yes? Actually not really, because in R we also use a lot of packages, like the tidyverse, the sf package for geospatial data,
08:44
and so on. And all those packages use dependencies like pandoc, markdown, and reveal.js. This presentation is actually made with R and reveal.js.
09:00
We also use geospatial libraries like PROJ, GEOS, and GDAL, and those are dependencies that are not handled by R directly; we just call them. So what did we find in this dataset? Let's talk about results.
09:21
We've got some interesting things. The first thing we looked at was the animal tag, because we know that bird song can be heard mostly in the first hour before dawn.
09:44
This is a well-known dynamic in ornithology, and we can hear it in the sound environment. And we actually found it too. So in this graph, the left part
10:00
is the time before sunrise on the day of recording, and the right part is the hours after. So we found this actual dynamic of birds singing one hour before dawn. That was a good sign.
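A minimal sketch of that sunrise-relative binning, assuming each recording already carries a precomputed sunrise time (the talk mentions the suncalc R package for computing sunrise from a date and position; the data here is made up):

```python
from datetime import datetime, timedelta
from collections import Counter

# Sketch: bin animal-tagged recordings by whole hours relative to sunrise.
# Negative offsets are hours before sunrise, so the pre-dawn bird chorus
# should show up as a spike in the -1 bin.

def hour_offset(recorded_at: datetime, sunrise: datetime) -> int:
    """Whole-hour offset of a recording relative to sunrise (floor division,
    so anything in the hour before sunrise lands in bin -1)."""
    delta = recorded_at - sunrise
    return int(delta.total_seconds() // 3600)

sunrise = datetime(2021, 5, 1, 6, 30)
recordings = [
    sunrise - timedelta(minutes=40),   # within the hour before dawn
    sunrise - timedelta(minutes=10),   # also pre-dawn
    sunrise + timedelta(hours=2),      # later in the morning
]
histogram = Counter(hour_offset(t, sunrise) for t in recordings)
print(histogram[-1])  # 2 recordings fall in the hour before sunrise
```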
10:21
We also found peaks of road noise between 8 and 10 a.m. and, I think, 6 and 8 p.m. And we can say it looks very much like commuters' behavior.
10:44
But we can't directly link it to that; we can only say it's very similar. So we looked at physical events in the environment of the contributor, and we found a very good correlation between the wind force
11:05
and the presence of the wind tag in the dataset. So that works very well. We also did that with rainfall, and there the correlation is not as strong.
11:23
It might be user bias: maybe if the rainfall is too light, the user doesn't hear the rain or doesn't think to add a tag about it. And it might also be a spatial issue,
11:40
because the mean distance to the nearest weather station is 16 kilometers. So local conditions might differ between the weather station and the user at the moment of the recording. Not as strong, then, but we did find the signal in the data.
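The wind comparison boils down to correlating a continuous measurement with a binary tag. The actual analysis is done in R; this stdlib-only Pearson correlation (equivalent to a point-biserial correlation when one variable is binary) just illustrates the idea, with made-up values:

```python
import math

# Sketch: correlate measured wind force with presence of the "wind" tag.
# All data values below are invented for illustration.

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

wind_force = [0.5, 1.0, 2.0, 4.0, 6.0, 8.0]   # e.g. m/s at nearest station
has_wind_tag = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]  # 1 if contributor tagged "wind"
r = pearson(wind_force, has_wind_tag)
print(round(r, 2))  # strong positive correlation
```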
12:02
I'm not the first one to speak about reproducible science here, and it really is an issue. For this study, we have some good points: the data is already available, and we made the source code available. All the SQL scripts to rebuild the database and the tables we use are available.
12:29
The R notebooks we made are also available, and the setup is broadly documented. But there are also bad points to assess.
12:41
Some notebooks grew very large, and we went very deep into the analysis in the exploratory files. In the end, it was very hard to reproduce, even within our team. We were actually able to do it, but for someone coming from outside, it might be difficult to understand.
13:06
So it needs some refactoring, a little more commenting, more explanation. And there is also a lack of information on the software environments.
13:21
That makes it very hard to reuse and reproduce. So what could we have used for better tooling? Since we use R, you can use renv, which is an R package for reproducible environments.
13:44
It's like a virtual environment. It works well, but it works well just for R, and we use other software: PostgreSQL, GEOS, PROJ, GDAL. So it's not perfect.
14:03
Docker might be something helpful, but like Simon said before, it's not perfect for reproducibility. And Guix has been on my mind for a year now,
14:24
telling myself, OK, I need to work on that. I think it would be a good solution. I won't talk too much about it, because there was a talk by Simon Tournier just two talks before; go watch it. I think it might be a very good solution.
14:41
In conclusion: we can use crowdsourced data for science. Even for something quirky like sound environments, we can use it for science. This particular data set is usable, so you can access it and find new things.
15:03
We don't have every answer; not every question can be answered with this data set. But it's quite fun to play with it and find things: oh, we can find the birds. I do believe that free software is key for reproducible science.
15:22
We can't do reproducible science with proprietary software; it's not possible. Reproducible science is hard to achieve. You have to think about it as early as possible, before starting your project, because once you are too far along, you have to refactor things, and that can be very tricky.
15:45
Maybe it's because I'm working on this more sound- and physics-related study, but sometimes I work with economists, I work with geographers.
16:02
They are not very keen on technology and computers in general. So sometimes you need someone, maybe an engineer, in the team who can handle this reproducibility part.
16:22
So you need to get the skills: either you acquire them yourself, or you bring someone into the team who can do it for you. And notebooks are not enough. Notebooks are great for communicating and exploring things,
16:40
but they are not good enough for reproducible science. There is a link to the data set. Please go check noiseplanet.org: you can navigate the map, actually see tracks, and click on things to see what was recorded.
17:01
Thank you for your attention. You can reach me by email or on Mastodon. This presentation is available here, and everything is accessible on GitHub. Thank you very much.
17:25
That leaves us a bit of time for questions, so please feel free to take them, repeat them, and then answer them. In the graph with the birds, you had sort of a dip at zero.
17:42
Is that a statistical artifact? Do you have an explanation for that? Being exactly at the top of the fitted line? So the question was about this particular graph: why there is a low point at zero while the peak is just above zero.
18:06
It's because it's smoothed a little: you can see there is a peak just before, the line is just smoothing it, and there is a little shifting.
18:22
And you ask why there is a low there? I don't know, I'm not sure. Yeah, please. Just jumping in on the same question: because this is crowdsourced data, it's obviously influenced by the users who collect the data for us.
18:42
How do you factor in, or eliminate, this source of variance, where underlying human behavior could affect the results? For example, sunrise time: people who get woken by birds before sunrise
19:01
would be very annoyed and would record more, while people who wake up at normal times are too busy to even make a recording, so you have a bias in the data. Okay. So the question was: this is crowdsourced data, provided by people willing to provide it,
19:24
and there is a bias, of course, because you may be angry at birds waking you up in the morning, or angry at traffic noise. And actually, we don't assess that; we take the data as it is.
19:41
Maybe there will be some work on it, though I'm not part of that part of the project, and we hope there is so much data that it will smooth out the bias. But of course it's biased, like OpenStreetMap data:
20:02
someone makes a decision to say, okay, I will record this, for a good or bad reason, or to prove a point: okay, where I live it's too noisy, I'll make a recording. But it's very hard to assess this kind of information.
20:25
We don't know why people record tracks: maybe it's a pleasant environment and they want to share it, or it's not so good and they want to document that. I hope that answers your question. Yeah, please.
20:40
Yeah, so I just wanted to ask: I think wind is pretty hard to incorporate, because when somebody records, they're probably recording without a pop filter, which makes the wind sound really loud even when it isn't. Somebody holds up a phone and records, the wind blows straight into it,
21:01
and it's really, really loud then in decibels, but it actually isn't. So do you calculate these sorts of things out, or subtract something from these wind recordings when you keep them in the dataset? Okay, the question was about the wind recordings and the fact that a smartphone doesn't have a pop shield
21:23
to protect the microphone from the wind. Actually, I'm not an acoustician, I'm more a GIS engineer, so I don't have the exact answer for that. But I do believe that when you are using your microphone,
21:43
when you are talking on a smartphone, nowadays it can protect you a little bit from the noise. But I'm not sure, to be honest. Yeah, please. For building the data capture, like...
22:05
The subset? Yeah, did you build the data capture tool where people are inputting data, right? Or how was that built, and did you make sure that people could use it in a way that... Like, how did you make sure that people were comfortable
22:20
using it in the situations that you needed recordings for? So the question was... Can you simplify it? So I'm interested in what choices you made in order to have the thing look and function
22:41
how it did to capture the data, and, again, the bias: if people are not able to use it or don't like using it, does that also bias the data? Ah, okay. So the question was about how we built the analysis
23:03
and how we built it. If you are not able to use R to build the... Sorry. Actually, we had to make choices, and we are more comfortable with R.
23:23
So there is a bias, of course. And we also have some libraries, like suncalc, for example, that make life much simpler for us. We give it a date and a position, and it gives you the sunrise and...
23:45
Sunrise and sunset. Thank you. Sunset time, for example. So it makes life easier for us. But of course there is a bias. Even when we built the application, there is, of course, a bias.
24:02
But I wasn't part of the team that built the application. It's focused on what we want to get, but it's available to everyone, so do whatever you want with it.
24:22
Thank you. We have more time, maybe? On your first slide, you had a really big number for the social cost, only in France. It seems quite egregiously big. Do you know what is included in the social cost?
24:43
What are the costs that are incorporated into this number? Hmm, it's a huge report. ADEME is a French agency, an environmental agency; it works on noise pollution, but also air pollution and things like that.
25:00
Sorry, I didn't repeat the question. The question was about the social cost, the amount, and how it is constructed. I only read the report quickly. The social cost is mostly about health issues,
25:21
lack of sleep, and stress related to noise, and how these affect people and their health, and how worse health
25:41
is a cost for society, because you have more anxiety. I think this is expressed in terms of GDP. Sorry, we should switch. Thank you very much.