An approach for real-time validation of the location of biodiversity observations contributed in a citizen science project
Formal Metadata

Title: An approach for real-time validation of the location of biodiversity observations contributed in a citizen science project
Number of Parts: 351
License: CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/68912 (DOI)
Production Year: 2022
Transcript: English (auto-generated)
00:01
So, good afternoon, everybody. I'm Mariam Lotfiyan. I'm a postdoctoral researcher at the University of Applied Sciences Western Switzerland. In my research I'm mainly focused on citizen science projects, data quality in citizen science projects, and public participation.
00:22
And today I'm going to present to you part of our project related to biodiversity data validation in our citizen science project. First, let's look together a little bit at what citizen science is. I'm sure the majority of you already know about it. But citizen
00:44
science is defined as the participation of the public in scientific projects. And initially it was mainly focused on environmental projects and biodiversity projects. But with time citizen science has expanded to different areas and now there are citizen science projects in really different fields. Of course, still the majority are focused on biodiversity, but there are
01:05
other areas in which citizen science is involved. Also, the number of publications on citizen science projects has increased considerably. This increase in the number of projects has resulted in large amounts of data being collected for citizen science projects, which
01:24
can be useful to address some of the challenges in new or existing projects. But this availability of large amounts of data in citizen science projects has also made citizen science a good partner for machine learning algorithms, because usually one of
01:42
the big challenges for machine learning algorithms is the lack of enough labeled data. So nowadays a lot of algorithms get their training data by asking citizens to label data and to collect data, so that the algorithms can be trained. This is of
02:03
course very interesting, but what is the benefit for citizen science projects and citizen scientists themselves? Can machine learning algorithms help in addressing some of the challenges in citizen science projects, in particular the two main challenges of motivating
02:20
the public to contribute to citizen science projects, so public engagement and also validating the data that is collected from the public. So having this introduction in mind, the objective of this research is to see how the integration of machine learning algorithms
02:40
in citizen science projects can, on the one hand, simplify data validation and, on the other hand, increase public engagement in citizen science projects. In particular, we aim at providing real-time feedback to the participants with the goal of increasing their motivation to
03:01
continue contributing to citizen science projects; by automating data validation, we aim at simplifying data validation tasks in citizen science projects and improving data quality; and we also aim at evaluating the benefits and challenges that might arise as a result of this integration. So data validation in citizen science projects
03:26
is mainly done by expert review. Of course we cannot ignore the important role of experts in citizen science projects, but relying on expert review as the main approach of data validation can have its disadvantages. For instance, of course it can be really time-consuming
03:45
because there are large amounts of data to be verified, but also the time gap between when participants make their contribution and when the expert validates the data and gives feedback can be really large, and that can demotivate
04:03
participants if they don't receive anything about what they have contributed, or if they receive feedback a really long time after their contribution. That is why we thought of implementing this biodiversity citizen science project called BioSens-CS,
04:20
which is implemented in the Django framework with the goal of encouraging the public to collect biodiversity data, but with the additional functionality of automatically validating the observations that are contributed and providing real-time feedback to the participants.
04:44
The approach of this real-time data filtering and feedback generation works as follows: when a participant makes a contribution to the BioSens-CS application, the observation passes through the automatic filtering of the application. If it passes successfully,
05:03
it is classified as a valid observation; otherwise it is flagged as an unusual observation. In this case, the participant receives feedback with information on why the observation was flagged as unusual, and given this information the participant can decide
05:22
either to modify the observation and resubmit it, or to confirm the observation as it is; in that case, it is the expert who does the final validation of the observation. So basically in this approach, on one hand, we are trying to reduce the amount of data
05:43
that has to be validated by experts, accelerating and simplifying this data validation step in citizen science projects; on the other hand, by providing real-time feedback to the participants, we are aiming at keeping them
06:03
motivated and engaged in the project. Also, by providing informative feedback, we aim at giving the participants a learning opportunity while making contributions; for instance, they can learn about species habitat characteristics while contributing.
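A minimal sketch of this accept-or-flag decision flow, with hypothetical names and dummy checks (this is not the actual BioSens-CS code):

```python
# Illustrative sketch of the accept-or-flag flow; the checks and statuses are
# hypothetical and not taken from the actual BioSens-CS code.

def handle_new_observation(obs, checks):
    """Run the automatic filters and return a status plus real-time feedback."""
    reasons = [msg for check, msg in checks if not check(obs)]
    if not reasons:
        return "valid", {"message": "Observation accepted."}
    # Flagged: the participant sees why, and can modify and resubmit, or confirm.
    return "flagged", {"message": "Unusual observation", "reasons": reasons}

def on_confirm(status):
    """A confirmed flagged observation goes to an expert for final validation."""
    return "expert_review" if status == "flagged" else status

# Example with dummy date, image, and location checks:
checks = [
    (lambda o: o["date_ok"], "date looks implausible"),
    (lambda o: o["image_ok"], "image could not be validated"),
    (lambda o: o["location_prob"] >= 0.5, "low probability of this species at this location"),
]
status, feedback = handle_new_observation(
    {"date_ok": True, "image_ok": True, "location_prob": 0.2}, checks)
print(status, feedback["reasons"])   # flagged ['low probability of this species at this location']
```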
06:24
So for the filters that we applied in BioSens-CS, we did three types of filtering: date filtering, image validation, and location validation. However, in this presentation I'm focusing only on the third one, so I'm not going through the other two. And for the
06:46
location validation, what we did was to use species distribution modeling to validate the location of biodiversity observations, in such a way that we use species observation data along with environmental variables to train machine learning algorithms, and we use the
07:06
final generated models to validate the location of newly added observations. So as I mentioned, for the input data we used species observations; in this case, we considered only bird observations,
07:23
and for the environmental features we used elevation, CORINE land cover, and NDVI from the Swiss Data Cube. For the bird dataset, we used the eBird platform. I'm sure most of you know this platform, but just in case, eBird is a citizen science platform to collect bird
07:45
observations all over the world, and the data are free and open. We obtained data that were already validated by experts, for Switzerland, and we filtered the data in such a way that for each species we had at least 100 observations.
08:06
So we selected only those species that had at least 100 observations already collected in the eBird dataset. As a result of this filtering, we ended up with 101 species selected. The observations from eBird only record where a species was observed,
08:26
so-called presence points, but for training our algorithms we needed both classes, presence and absence (where the species is not observed). Of course, obtaining true absences is complicated, and that is why we generated artificial absences, or pseudo-absences. So for
08:48
each species dataset, we generated 5,000 pseudo-absences in such a way that we considered a distance of five kilometers from the presence points and randomly generated 5,000 absence
09:02
points, or pseudo-absences. For instance, here we can see the dataset for one of the common species, which already had about 3,000 presence points, and we generated 5,000 random points for this species.
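A minimal sketch of this pseudo-absence sampling, assuming presence points in projected coordinates (meters) and a rectangular study extent; the 5 km minimum distance and the 5,000-point target come from the talk, while the array names, extent, and synthetic presence cluster are illustrative:

```python
import numpy as np
from scipy.spatial import cKDTree

def generate_pseudo_absences(presence_xy, extent, n_points=5000, min_dist=5000.0, seed=0):
    """Randomly sample points at least `min_dist` meters away from every presence point."""
    rng = np.random.default_rng(seed)
    xmin, ymin, xmax, ymax = extent
    tree = cKDTree(presence_xy)                # fast nearest-presence distance lookup
    absences = []
    while len(absences) < n_points:
        # Draw candidates in batches, keep only those far enough from all presences.
        cand = np.column_stack([rng.uniform(xmin, xmax, 10_000),
                                rng.uniform(ymin, ymax, 10_000)])
        dist, _ = tree.query(cand)
        absences.extend(cand[dist >= min_dist].tolist())
    return np.array(absences[:n_points])

# Synthetic example: a cluster of ~3,000 presence points inside a 200 km x 200 km extent.
rng = np.random.default_rng(1)
presence = np.column_stack([rng.uniform(2.60e6, 2.62e6, 3000),
                            rng.uniform(1.18e6, 1.20e6, 3000)])
pseudo_abs = generate_pseudo_absences(presence, extent=(2.5e6, 1.1e6, 2.7e6, 1.3e6))
print(pseudo_abs.shape)    # (5000, 2)
```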
09:27
The next step was to compute the environmental variables around each observation point, presence or absence. So we considered a neighborhood of two square kilometers around each point and computed the environmental variables, such as landscape proportions, for instance the ratio of artificial surfaces or the ratio of mixed forest, average
09:45
elevation, average slope, and so on; in total, we had 19 environmental features.
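A minimal sketch of how such neighborhood features could be computed from rasters with rasterio, showing only two of the 19 features (land-cover class proportions and mean elevation); the file paths, class codes, and the square window of roughly 1.4 km per side (about 2 km²) are assumptions:

```python
import numpy as np
import rasterio
from rasterio.windows import Window

def neighborhood_features(x, y, landcover_path="landcover.tif", dem_path="dem.tif",
                          class_codes=(1, 2, 3, 4, 5), side_m=1414):
    """Land-cover proportions and mean elevation in a ~2 km^2 square around point (x, y)."""
    feats = {}
    with rasterio.open(landcover_path) as lc:
        row, col = lc.index(x, y)                    # pixel containing the point
        half = int(side_m / lc.res[0] / 2)           # half the window size, in pixels
        window = Window(col - half, row - half, 2 * half, 2 * half)
        classes = lc.read(1, window=window)
        for code in class_codes:                     # fixed class list keeps feature order stable
            feats[f"landcover_{code}"] = float(np.mean(classes == code))
    with rasterio.open(dem_path) as dem:
        row, col = dem.index(x, y)
        half = int(side_m / dem.res[0] / 2)
        window = Window(col - half, row - half, 2 * half, 2 * half)
        feats["mean_elevation"] = float(dem.read(1, window=window).mean())
    return feats
```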
10:04
So we had our environmental features plus our species labels (presence or absence), ready to train our algorithms, but one step was left: cross-validation. Because spatial autocorrelation exists between the environmental variables, it's important to think about spatial cross-validation. We used a package in R called blockCV, which takes into account the range of spatial autocorrelation
10:21
among the variables and proposes a block size. We used a block size of 50 square kilometers for our study area, assigned five random folds to these spatial blocks, and generated the folds for each species. Then we had our final dataset ready to use for training our algorithms.
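The talk uses the R package blockCV for this step; purely as an illustration, a rough Python equivalent could assign each point to a grid block and give each block one of five random folds. Reading the block size as 50 km per side is one interpretation of the "50 square kilometer" blocks mentioned above, and the rest below is illustrative:

```python
import numpy as np

def assign_spatial_folds(xy, block_size=50_000, n_folds=5, seed=42):
    """Assign every point (projected coordinates in meters) to a fold via square spatial blocks."""
    rng = np.random.default_rng(seed)
    # Each point falls into exactly one block of block_size x block_size meters.
    block_ids = (np.floor(xy[:, 0] / block_size).astype(int) * 100_000
                 + np.floor(xy[:, 1] / block_size).astype(int))
    unique_blocks = np.unique(block_ids)
    # Randomly distribute whole blocks over the folds, so nearby points share a fold.
    block_to_fold = dict(zip(unique_blocks, rng.integers(0, n_folds, unique_blocks.size)))
    return np.array([block_to_fold[b] for b in block_ids])

# Example: fold labels for the combined presence / pseudo-absence coordinates.
rng = np.random.default_rng(2)
xy = np.column_stack([rng.uniform(2.5e6, 2.7e6, 8000), rng.uniform(1.1e6, 1.3e6, 8000)])
folds = assign_spatial_folds(xy)
print(np.bincount(folds))    # number of points per fold
```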
10:44
We trained four algorithms: naive Bayes, random forest, balanced random forest, and neural network, and we then compared the performance of these algorithms for all of the 101 species that I mentioned we had filtered.
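Purely as an illustration, the per-species training and spatial-fold AUC comparison could look like the sketch below, using scikit-learn and imbalanced-learn estimators as stand-ins for the four models mentioned; the feature matrix `X`, labels `y`, and fold vector `folds` are assumed to come from the previous steps, and the hyperparameters are placeholders, not the values used in the study:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score
from imblearn.ensemble import BalancedRandomForestClassifier

MODELS = {
    "naive_bayes": GaussianNB(),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "balanced_rf": BalancedRandomForestClassifier(n_estimators=200, random_state=0),
    "neural_net": MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0),
}

def spatial_cv_auc(model, X, y, folds):
    """Mean AUC over the spatial folds: train on the other folds, test on the held-out one."""
    aucs = []
    for k in np.unique(folds):
        train, test = folds != k, folds == k
        model.fit(X[train], y[train])
        prob = model.predict_proba(X[test])[:, 1]    # probability of the presence class
        aucs.append(roc_auc_score(y[test], prob))
    return float(np.mean(aucs))

# Per species: scores = {name: spatial_cv_auc(m, X, y, folds) for name, m in MODELS.items()}
```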
11:05
So we compared the performance of these four algorithms for all the species, and the box plots here show the variation in their performance. We can see that neural network has a higher median AUC compared to the other three algorithms.
11:25
However, for certain species it performed really poorly, while on the other hand balanced random forest performed relatively well for all the species. We also observed that when the performance was less than 70% for some of the species,
11:47
balanced random forest was the one that always performed better compared to the other three algorithms of naive Bayes, random forest, and neural network. That is why we decided to use
12:00
the models trained using this algorithm to validate the location of the observations added to our BioSens-CS application. As a result of these species distribution models, we obtained a binary classification map and also a map of the probability
12:29
of occurrence of the species over Switzerland.
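As an illustration only, such probability and binary maps could be produced by applying the chosen model to a regular grid of cell centers; `feature_fn` stands for a neighborhood-feature function like the one sketched earlier, and the 0.5 threshold for the binary map is an assumption:

```python
import numpy as np

def predict_species_maps(model, grid_xy, feature_fn, threshold=0.5):
    """Probability-of-occurrence and binary presence values for a grid of cell centers."""
    X_grid = np.array([list(feature_fn(x, y).values()) for x, y in grid_xy])
    prob = model.predict_proba(X_grid)[:, 1]      # probability of occurrence per cell
    binary = (prob >= threshold).astype(int)      # binary classification map
    return prob, binary

# Reshaping `prob` and `binary` to the grid's rows and columns gives the two maps.
```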
12:42
So up to here, we trained our algorithms, compared their performance, and chose balanced random forest. But how do we use that now in BioSens-CS to validate new observations? We actually implemented an API using Flask, which is called biolocation. Using this biolocation API, every time an observation is
13:01
made to the BioSens-CS platform, the name of the species plus the location is sent to biolocation, the model trained for that particular species is loaded, and based on the location of the species, the environmental variables around that location
13:21
are computed, and the model predicts the probability of observing the species in that particular location, and this probability is then sent to the participant in the form of real-time feedback with information of this probability, also information of the
13:40
species habitat characteristics. So here we have two forms of real-time feedback. In the first one, if the probability of observing the species was higher than 50%, we simply gave the feedback to the participant in the form of
14:00
additional information; otherwise, the participant needed to confirm whether the observation was added correctly or not. Moreover, we also proposed user-centered suggestions, in such a way that if a participant didn't know what to observe, they could get the top five most probable species that could be observed around their location.
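A minimal sketch of what such a biolocation endpoint could look like in Flask, assuming per-species models saved with joblib and a neighborhood-feature function like the one sketched earlier; the route name, payload fields, model paths, and helper import are hypothetical, while the 50% threshold follows the talk:

```python
import joblib
import numpy as np
from flask import Flask, request, jsonify

# Hypothetical import: the neighborhood-feature function sketched earlier.
from features import neighborhood_features

app = Flask(__name__)

@app.route("/biolocation", methods=["POST"])
def validate_location():
    payload = request.get_json()                      # {"species": ..., "x": ..., "y": ...}
    species, x, y = payload["species"], payload["x"], payload["y"]

    model = joblib.load(f"models/{species}.joblib")   # model trained for that species
    feats = neighborhood_features(x, y)               # environmental variables around the point
    prob = model.predict_proba(np.array([list(feats.values())]))[0, 1]

    return jsonify({
        "species": species,
        "probability": float(prob),
        # Above 50%: purely informative feedback; below: ask the participant to confirm or modify.
        "action": "info" if prob >= 0.5 else "confirm_or_modify",
    })

# app.run(port=5000)  # start the biolocation service
```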
14:23
So we also did an experiment with this application to see how participants think about this real-time feedback and how the automatic filtering works. The experiment was a three-week experiment, so we did a pilot
14:50
test of the app over three weeks. During these three weeks, 200 people visited the application; however, only 36 people registered. Among the 36 people, only 14 contributed, and also among
15:04
these 14 people, some were really active and some contributed from time to time. This is a very common pattern when it comes to citizen science and VGI projects, and we know from the well-known case of OpenStreetMap that this pattern of participation is usually
15:24
the same when it comes to such projects. Moreover, during this three-week experiment, 230 observations were collected, the majority of which, of course, were bird observations, but other types of species were also collected. We also afterwards sent
15:45
a questionnaire to our participants to get their opinion about the interface of the application and how user-friendly it was, and, more important for us, what they thought about the real-time feedback generated, how useful it was,
16:01
or whether it actually increased their motivation to continue contributing to the project. We found a high correlation with the number of contributions: people who contributed more rated the feedback as more useful. Although
16:21
this correlation was not statistically significant, probably because our sample size was quite small, it was still a preliminary result showing the positive impact of this real-time feedback. As a result of this auto-filtering, only 24 observations were flagged
16:41
out of 230, and we also saw that there is a correlation between the number of flagged observations and the total number of observations. We checked that for each participant, and it was a statistically significant correlation; that is to say, participants who contributed more had a lower number of flagged observations. Maybe that means that
17:06
people over time actually got to learn from their contributions and thus contributed higher quality data.
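As an illustration of the kind of per-participant check described here, a rank correlation test could be run as follows; the numbers are placeholders, not the study's data:

```python
from scipy.stats import spearmanr

# Placeholder values, one entry per participant (NOT the study's data):
total_observations = [52, 31, 18, 14, 11, 9, 7, 6, 5, 4, 3, 2, 2, 1]
flagged_observations = [2, 1, 3, 2, 1, 2, 1, 2, 1, 1, 0, 1, 0, 0]

rho, p_value = spearmanr(total_observations, flagged_observations)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
```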
17:26
To conclude, the objective of this research was to see how this combination of machine learning and citizen science can help improve data quality, simplify data validation, and increase public engagement. To do that, we implemented the BioSens-CS application, which automatically validates the location of biodiversity observations using
17:43
species distribution models. We also obtained the result that the more active the participants were, the more useful they found the feedback, and the more motivating it was to continue
18:01
with their contributions. And as we observed, the number of flagged observations was 24 out of 230, so this approach simplifies data validation by reducing the number of flagged observations that need to be validated later by experts.
18:24
Some future work that we need to do is to also apply this approach to other organisms besides birds, and to see how we can expand this approach of combining machine learning and
18:41
citizen science to other citizen science projects besides biodiversity. Another point is that in our application we mainly focused on sustaining participation by giving real-time feedback to people who were actually contributing, but it would be interesting to see how the integration of AI in citizen science in general
19:04
can promote initial engagement and bring more people in at the initial step to contribute to citizen science projects. And finally, what is very important for us is to investigate further the possible challenges and benefits that might arise as a result
19:23
of this combination of AI and citizen science, focusing on different aspects of user engagement, data quality, ethics, and so on. So that is basically it. Thank you so much, and I'll be happy to answer questions. Thank you. I will start with some online questions,
19:50
and later there are more questions from the room that I will ask you as well. The first question is: can the user download the model to their mobile so that they can receive feedback
20:00
when there is no mobile network? For the moment, no. It works only if there is a mobile data connection, but that is something we need to think about later on, because if they are contributing in a forest or somewhere without a connection, it's important to get this real-time feedback. But not for the moment.
20:23
Okay. Second question: do you think that your method can be applied to other species where there are fewer existing observations available, so rarer species? That's a good question. Well, it was also a challenge we had: even for our models, for some species
20:47
with a low number of observations available, the models didn't really perform well. Still, when we generated these random points, balanced random forest performed better
21:03
because it considers under-sampling and reducing the majority class, like the absences, and it performed better, but we saw that some other algorithms performed really poorly.
21:20
However, one thing we think could be useful for rare species is, instead of generating only 5,000 points, trying to capture as much of the environment as we can, so maybe 100,000 pseudo-absence points, and then seeing if that would help for
21:42
species that really have a low number of observations. So this is something we need to try, to see if we get higher performance for such cases. And then, very related, do you think it's also
22:02
suitable for other species? So now you talked mainly about birds, but what about, say, insects? Again, a very good question. We are also trying to do that. We need a dataset. For birds it was easy; probably Tom there, he is working with birds, so he knows that it's
22:22
easier to get bird observations than datasets for other organisms. But we are now going to start collaborating with an institute in Switzerland that collects all types of observations. They are willing to provide us with the dataset they have, their participants have not been active for some time, and they are willing to maybe integrate
22:43
this in their platform and see if their participants become more active, but it also benefits us to get their dataset for other organisms. Before moving on, there's one more question in the chat, but does anyone in the room have a question you want to ask now, or did you already ask it in the chat? Of course,
23:03
that's also possible, but if anyone wants to raise their hand, then let's have some interaction with the room as well. Go ahead. Are there any features you would like to add to the current software, like species determination keys?
23:28
And I have a second question: what would be, or could be, the interest of specialized ecological institutes in this kind of VGI approach? Sorry, I didn't understand your first question, could you please repeat it?
23:44
Is that about new features you could add to the application, for instance a species determination key, and whether that could raise participants' interest and involvement?
24:01
To add new features to the application in order to, for instance... Yeah, such as trying to help the volunteer determine the species. Well, again, this is something we are thinking of doing in the future. For the moment, we are identifying species based on
24:27
their environmental characteristics; based on those, we say, okay, it is highly probable that this is the species or not. But we would like to
24:40
have maybe an ensemble algorithm, one model trained on images and one on our environmental variables, to also increase this possibility of species identification for volunteers. And I forgot the second question.
25:06
It was about the interest for specialists, like you have this VGI approach, and what could be the interest for a kind of specialized center of competence... Well, I think mainly for them, maybe it would be that, because usually these types of
25:24
biodiversity data are nowadays collected from citizen scientists, so I think the interest for them would be to keep their community engaged, because, as I mentioned, this institute has a large community, but they said that only some of them are contributing and a large part of them are not active
25:44
anymore, so I think the interest for them was actually to see how they can bring them back. Another point is that they have a lot of data that is not validated, because having experts check it requires resources and people to do that,
26:03
financial and human resources, which they don't have that much of at the moment, so they have lots of data that are not validated. This was another point of interest for them in maybe integrating this approach in their platform. I hope I could answer. Thank you. Thank you. Well, I will ask one more question.
26:25
If you want to leave now for a different session for the next presentation, please go ahead, but for us to fill the time, we will just go through one more question, and that question was: is giving the suggestion of the top five species in the area
26:42
not biasing the outcome? And after you did that, did you get different types of results? Well, in the pilot test that we did, most of the participants actually
27:01
contributed only the species that they observed, so the suggestion was not used that much; in any case we put it there, but it was not used that much. But I think sometimes for non-expert ecologists, let's say, or non-bird watchers, it could be useful,
27:22
because they would observe a species at some point and not know the name, so using the suggestions they could click on a name, get the information about the species with the image and everything, and say, ah, this is that species, and then add it directly as an observation. But as to whether it could be
27:44
biasing the collected observations, that's a good point; we didn't think about it that much, but to be honest, most people used it as a source of help and then continued with the normal way of adding their observations. Thank you. Does someone want to have the honor of asking
28:02
the last question? Otherwise, I have one last question. You said you compared the different machine learning algorithms, like naive Bayes and so on; did you also compare them to some kind of baseline method that just randomly says valid or not valid, to see how much better they are? I compared with MaxEnt, which is, if that's your question about comparing with a baseline,
28:27
I compared the results with MaxEnt, to be honest not for all the species, but for certain species where performance was not good for some algorithms, I compared the results with MaxEnt, and
28:41
the results from balanced random forest were very close to what I got from MaxEnt, so this was the comparison that I did with a baseline, let's say a standard. Thank you so much. Thanks to you.