
An approach for real-time validation of the location of biodiversity observations contributed in a citizen science project


Formal Metadata

Title
An approach for real-time validation of the location of biodiversity observations contributed in a citizen science project
Number of Parts
351
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Production Year
2022

Content Metadata

Abstract
Motivation: Because of technological advancements, public participation in scientific projects, known as citizen science, has grown significantly in recent years (Schade and Tsinaraki 2016; Land-Zandstra et al. 2016). Contributors to citizen science projects are very diverse, coming from a variety of areas of expertise, age groups, cultures, and so on, and thus the data they contribute should be validated before being used in any scientific analysis. Experts typically validate citizen science data, but this is a time-consuming process. One disadvantage is that volunteers do not receive feedback on their contributions and may become demotivated to continue contributing in the future. Therefore, a method for (semi-)automating the validation of citizen science data is critical. One avenue that researchers are now focusing on is the use of machine learning (ML) algorithms to validate citizen science data.

Methodology: We developed a citizen science project with the goal of collecting and automatically validating biodiversity observations while also providing participants with real-time feedback. We implemented the application with the Django framework and a PostgreSQL/PostGIS database for data preservation. In general, the focus of biodiversity citizen science applications is on automatically identifying or validating species images, with less emphasis on automatically validating the location of observations. Our application's focus, aside from image and date validation (Lotfian et al. 2019), is on automatically validating the location of biodiversity observations based on the environmental variables surrounding the observation point. In this project, we generated species distribution models using various machine learning algorithms (Random Forest, Balanced Random Forest, Deep Neural Network, and Naive Bayes) and used the models to validate the location of a newly added observation. After comparing the performance of the various algorithms, we chose the one with the best performance to use in our real-time location validation application. We developed an API, built with the Flask framework, that validates new observations using the trained models of the chosen algorithm. The API uses the location and species name as parameters to predict the likelihood of observing a species (for the time being, a bird species) in a given neighborhood. The model prediction, as well as information on species habitat characteristics, is then communicated to participants in the form of real-time feedback. The API has three endpoints: a POST request that takes the species name and location of observation and returns the model prediction for the probability of observing the species in a 1 km neighborhood around the location of observation; a GET request that takes the location of observation and returns the top five species likely to be observed in a 1 km neighborhood around the location of observation; and a GET request that returns the species common names in English.

User experiment: A user experiment was carried out to investigate the impact of automatic feedback on simplifying the validation task and improving data quality, as well as the impact of real-time feedback on sustaining participation. Furthermore, a questionnaire was distributed to volunteers, who were asked for their feedback on the application interface as well as about the impact of real-time feedback on their motivation to continue contributing to the application.

Results: The results are divided into two parts: first, the performance of the machine learning algorithms and their comparison, and second, the results of testing the application through the user experiment. We used the AUC metric to compare the performance of the machine learning algorithms, and the results showed that while the DNN had a higher median AUC (0.86) than the other three algorithms, its performance was very poor for some species (below 0.6). Balanced Random Forest (median AUC 0.82) performed relatively better for all species in comparison to the other three algorithms. Furthermore, for some species where the other three algorithms performed poorly (AUC less than 70%), Balanced RF outperformed the others. The user experiment provided preliminary findings that support the combination of citizen science and machine learning. According to the findings of the user experiment, participants with a higher number of contributions found real-time feedback more useful for learning about biodiversity and stated that it increased their motivation to contribute to the project. Besides that, as a result of automatic data validation, only 10% of observations were flagged for expert verification, resulting in a faster validation process and improved data quality by combining human and machine power.

Why it should be considered: Data validation and long-term participation have always been two of the most difficult challenges in citizen science and VGI (volunteered geographic information) projects. Various studies have been conducted on biodiversity data validation, focusing primarily on observation images with automatic species identification; however, not enough attention has been paid to observation location validation, particularly automatic location validation taking into account species habitat characteristics. Furthermore, to the best of our knowledge, the combination of machine learning and citizen science for sustaining participation by providing real-time, user-centered, machine-generated feedback to participants has so far received little attention, and therefore our work is new, original, and completely coherent with the vision of community citizen science, where scientists and citizen scientists are supposed to learn from each other.
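To make the endpoint description above concrete, here is a minimal Flask sketch of such a service. The route paths, field names, and the two prediction helpers are assumptions for illustration only, not the project's actual implementation.

# Minimal sketch of the three endpoints described in the abstract.
# Route paths, field names, and the helper functions are hypothetical.
from flask import Flask, request, jsonify

app = Flask(__name__)

COMMON_NAMES = {"Turdus merula": "Common Blackbird"}  # illustrative entry


def predict_probability(species, lat, lon):
    """Placeholder: load the per-species model and score a 1 km neighborhood."""
    raise NotImplementedError


def top_species(lat, lon, k=5):
    """Placeholder: rank all species models around the given point."""
    raise NotImplementedError


@app.post("/observations/validate")
def validate_observation():
    data = request.get_json()
    prob = predict_probability(data["species"], data["lat"], data["lon"])
    return jsonify({"species": data["species"], "probability": prob})


@app.get("/suggestions")
def suggestions():
    lat, lon = float(request.args["lat"]), float(request.args["lon"])
    return jsonify({"top_species": top_species(lat, lon)})


@app.get("/species/common-names")
def common_names():
    return jsonify(COMMON_NAMES)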
Transcript: English (auto-generated)
So, good afternoon, everybody. I'm Maryam Lotfian. I'm a postdoctoral researcher at the University of Applied Sciences Western Switzerland. In my research I'm mainly focused on citizen science projects, data quality in citizen science, and public participation.
And today I'm going to present part of our project related to biodiversity data validation in our citizen science project. First, let's look together a little bit at what citizen science is. I'm sure the majority of you already know about it. But citizen
science is defined as the participation of the public in scientific projects. And initially it was mainly focused on environmental projects and biodiversity projects. But with time citizen science has expanded to different areas and now there are citizen science projects in really different fields. Of course, still the majority are focused on biodiversity, but there are
other areas that citizen science is involved. Also the number of publications in citizen science projects has increased considerably. This increase in the number of projects has resulted in large amounts of data being collected for citizen science projects, which
can be useful to address some of the challenges in new or existing projects. But this availability of large amounts of data in citizen science projects has also made citizen science a good partner for machine learning algorithms, because usually one of
the big challenges for machine learning algorithms is the lack of enough labeled data. So nowadays a lot of algorithms get their training data by asking citizens to label data and to collect data, so that the algorithms can be trained. So of
course very interesting, but what is the benefit for citizen science projects and citizen scientists themselves? Can machine learning algorithms help in addressing some of the challenges in citizen science projects, in particular the two main challenges of motivating
the public to contribute to citizen science projects, so public engagement and also validating the data that is collected from the public. So having this introduction in mind, the objective of this research is to see how the integration of machine learning algorithms
in citizen science projects can, on one hand, simplify data validation and also increase public engagement for citizen science projects. But in particular, if we aim at providing real-time feedback to the participants with the aim of increasing their motivations to
continue their contribution to citizen science projects. But also, by automating data validation, we aim at simplifying data validation tasks in citizen science projects and improving data quality, and we aim at evaluating the benefits and challenges that might arise as a result of this integration. So data validation in citizen science projects
is mainly done by expert review. Of course we cannot ignore the important role of experts in citizen science projects, but relying on expert review as the main approach of data validation can have its disadvantages. For instance, of course it can be really time-consuming
because there are large amounts of data to be verified, but also the time gap between when participants make their contribution until when the expert validates the data and gives feedback to the participants can be really large and that can demotivate
participants if they don't receive anything about what they have contributed, or if they receive feedback a really long time after their contribution. So that is why we thought of implementing this biodiversity citizen science project called BioSense-ES, which is implemented with the Django framework, with the goal of encouraging the public to collect biodiversity data, but with the additional functionality of automatically validating the observations that are contributed and also providing real-time feedback to the participants.
The approach of this real-time data filtering and feedback generation works as follows: when a participant makes a contribution to the BioSense-ES application, the observation passes through the application's automatic filtering. If it passes successfully, it is classified as a valid observation; otherwise, it is flagged as an unusual observation. In this case, the participant will receive feedback with information on why the observation was flagged as unusual, and given this information, the participant can decide either to modify the observation and resubmit, or to confirm the observation as it is; in that case, it is the expert who does the final validation of the observation. So basically, in this approach, on the one hand we are trying to reduce the amount of data that has to be validated by experts, so accelerating this process and improving and simplifying this data validation step in citizen science projects; on the other hand, by providing real-time feedback to the participants, we are aiming at keeping them motivated and engaged in the project. Also, by providing them informative feedback, we aim at giving the participants a learning opportunity while making contributions; for instance, they can learn about species habitat characteristics while contributing.
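As a rough illustration of this valid/flagged decision and feedback loop, the small sketch below combines the results of the individual filters into a single outcome; the threshold and the feedback messages are assumptions, not the application's actual wording.

def classify_observation(date_ok: bool, image_ok: bool, location_prob: float,
                         threshold: float = 0.5) -> dict:
    """Combine the filter results into a valid/flagged decision with feedback."""
    if not date_ok:
        return {"status": "flagged", "feedback": "date outside the species' usual season"}
    if not image_ok:
        return {"status": "flagged", "feedback": "image does not seem to match the reported species"}
    if location_prob < threshold:
        return {"status": "flagged", "feedback": "unlikely habitat around the reported location"}
    # Valid observations are stored directly; flagged ones go back to the
    # participant, who can edit and resubmit or confirm and leave the final
    # decision to an expert.
    return {"status": "valid"}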
So for the filters that we applied in BioSense-ES, we had three types of filter: date filtering, image validation, and location validation. However, in this presentation I'm focusing only on the third one, so I'm not going through the other two. For the location validation, what we did was use species distribution modeling to validate the location of biodiversity observations, in such a way that we use the species observation data along with environmental variables to feed and train machine learning algorithms, and then use the final generated models to validate the location of newly added observations. So as I mentioned, for the input data we used species observations; in this case, we considered only bird observations. For the environmental features, we used elevation, CORINE land cover, and NDVI from the Swiss Data Cube. For the bird dataset, we used the eBird platform. I'm sure most of you know this platform, but just in case: eBird is a citizen science platform to collect bird observations all over the world, and the data is free and open. We obtained data that are already validated by experts, we obtained the data for Switzerland, and we filtered the data in such a way that for each species we had at least 100 observations. So we selected only those species that had at least 100 observations already collected in the eBird dataset, and as a result of this filtering, we ended up having 101 species selected.
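As an illustration of this species-selection step, a short pandas filter is enough; the file name and column name below are assumptions about how the eBird export might look.

import pandas as pd

obs = pd.read_csv("ebird_switzerland.csv")            # hypothetical export file
counts = obs["species"].value_counts()
kept = counts[counts >= 100].index                    # species with at least 100 records
obs = obs[obs["species"].isin(kept)]
print(f"{len(kept)} species retained")                # 101 species in this study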
For these observations from eBird, we only have points where the species was observed, so-called presence points, and for training our algorithms we needed both classes, presence and absence, i.e. also where the species is not observed. But of course, obtaining true absences is complicated, and that is why we generated artificial absences, or pseudo-absences. So for each species dataset, we generated 5,000 pseudo-absences in such a way that we considered a distance of five kilometers from the presence points and randomly generated 5,000 absence points, or pseudo-absences. For instance, here we can see the dataset for one of the common species, which already had around 3,000 presence points, and we generated 5,000 random points for this species.
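A possible way to generate such pseudo-absences is sketched below, assuming the presence points are already in a projected coordinate system in metres; the simple rejection loop only illustrates the idea of keeping candidates at least five kilometres away from every presence point.

import numpy as np
from scipy.spatial import cKDTree

def pseudo_absences(presence_xy, bounds, n=5000, min_dist=5000, seed=0):
    """Draw n random points inside bounds=(xmin, ymin, xmax, ymax) that lie
    at least min_dist metres away from every presence point."""
    rng = np.random.default_rng(seed)
    tree = cKDTree(presence_xy)
    xmin, ymin, xmax, ymax = bounds
    kept = []
    while len(kept) < n:
        cand = rng.uniform([xmin, ymin], [xmax, ymax], size=(n, 2))
        dist, _ = tree.query(cand)                    # distance to nearest presence
        kept.extend(cand[dist >= min_dist])
    return np.asarray(kept[:n])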
The next step was to compute the environmental variables around each observation point, presence or absence. We considered a neighborhood of two square kilometers around each point and computed the environmental variables, such as landscape proportions, for instance the ratio of artificial surfaces or the ratio of mixed forest, the average elevation, the average slope, and so on; in total, we had 19 environmental features.
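The neighborhood features can be illustrated with a small window statistic over pre-loaded rasters; the class codes, window size, and the few features shown below are assumptions, not the exact 19 variables used in the study.

import numpy as np

def neighborhood_features(landcover, elevation, row, col, half=7):
    """Land-cover proportions and mean elevation in a square window centred
    on pixel (row, col); landcover and elevation are NumPy arrays."""
    lc = landcover[row - half:row + half + 1, col - half:col + half + 1]
    el = elevation[row - half:row + half + 1, col - half:col + half + 1]
    total = lc.size
    return {
        "prop_artificial":   np.count_nonzero(lc == 1) / total,  # class codes assumed
        "prop_mixed_forest": np.count_nonzero(lc == 2) / total,
        "mean_elevation":    float(el.mean()),
    }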
So we had our environmental features plus our species labels, presence and absence, which were ready to train our algorithms, but one step was left: cross-validation. Because spatial autocorrelation exists between the environmental variables, it's important to think about spatial cross-validation. We used a package in R called blockCV, which takes into account the range of spatial autocorrelation among the variables and proposes a block size; we used a block size of 50 square kilometers for our study area, and we assigned five random folds to these spatial blocks. For each species, we generated the folds, and then we had our final dataset ready to use and train our algorithms.
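The talk uses the R package blockCV for this step; purely as an illustration of the idea, the Python sketch below tiles the study area into square blocks of roughly 50 km² and assigns each block, and therefore every point inside it, to one of five random folds.

import numpy as np

def spatial_folds(xy, block_size=7071.0, n_folds=5, seed=0):
    """xy: (n, 2) projected coordinates in metres; block_size ~ sqrt(50 km^2)."""
    rng = np.random.default_rng(seed)
    ij = np.floor(xy / block_size).astype(np.int64)             # block column/row per point
    block_id = ij[:, 0] * 1_000_000 + ij[:, 1]                  # one id per block (small indices assumed)
    blocks, inverse = np.unique(block_id, return_inverse=True)
    fold_of_block = rng.integers(0, n_folds, size=len(blocks))  # random fold per block
    return fold_of_block[inverse]                               # fold label per point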
We trained four algorithms: naive Bayes, random forest, balanced random forest, and neural network, and we then compared the performance of these algorithms for all the 101 species that we had filtered. So we compared the performance of all four algorithms for all the species, and we have here the box plots that show the variation in the performance of these four algorithms. We can see that the neural network has a higher median AUC compared to the other three algorithms. However, for certain species it performed really poorly, and on the other hand, balanced random forest performed relatively better for all the species. Also, for some species where the performance was less than 70%, balanced random forest was the one that always performed better compared to the other three algorithms of naive Bayes, random forest, and neural network. That is why we decided to use the models trained with this algorithm to validate the location of the observations added to our BioSense-ES application. As a result, these species distribution models generate a binary classification map and also a map of the probability of occurrence of the species over Switzerland.
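Putting the previous steps together, the model fitting and AUC comparison could look roughly like the sketch below, assuming X and y are NumPy arrays of features and presence/absence labels and folds comes from the spatial blocking step; the hyperparameters are placeholders.

import numpy as np
from imblearn.ensemble import BalancedRandomForestClassifier
from sklearn.metrics import roc_auc_score

def blocked_auc(X, y, folds):
    """Median AUC over spatially blocked folds for one species."""
    scores = []
    for k in np.unique(folds):
        train, test = folds != k, folds == k
        model = BalancedRandomForestClassifier(n_estimators=500, random_state=0)
        model.fit(X[train], y[train])
        proba = model.predict_proba(X[test])[:, 1]
        scores.append(roc_auc_score(y[test], proba))
    return float(np.median(scores))

The same loop can be repeated with the other three classifiers to reproduce the per-species comparison described in the talk.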
So up to here, we trained our algorithms, we compared their performance, and we chose balanced random forest, but how do we use that now in BioSense-ES to validate new observations? We actually implemented an API using Flask, which is called biolocation. Using this biolocation API, every time an observation is made on the BioSense-ES platform, the name of the species plus the location is sent to biolocation, and then the model trained for that particular species is loaded. Based on the location of the observation, the environmental variables around that location are computed, and the model predicts the probability of observing the species in that particular location. This probability is then sent to the participant in the form of real-time feedback, together with information about the species' habitat characteristics. So here we have two forms of real-time feedback. First, if the probability of observing the species was higher than 50%, we just give this feedback to the participant as additional information; otherwise, the participant needs to confirm whether the observation was added correctly or not. Moreover, we also proposed user-centered suggestions, in such a way that if a participant didn't know what to observe, they could also get the top five most probable species that could be observed around their location.
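On the application side, the call to the validation service and the 50% feedback rule might look like the sketch below; the base URL, route paths, field names, and the example species and coordinates match the hypothetical Flask sketch given after the abstract and are not the project's real endpoints.

import requests

API = "http://localhost:5000"  # hypothetical deployment of the validation API

resp = requests.post(f"{API}/observations/validate",
                     json={"species": "Turdus merula", "lat": 46.52, "lon": 6.63})
prob = resp.json()["probability"]

if prob >= 0.5:
    print(f"Observation accepted; habitat probability {prob:.0%}.")
else:
    print("Unusual location for this species - please confirm or edit your observation.")

# User-centered suggestion: the five most likely species around the same point.
top = requests.get(f"{API}/suggestions", params={"lat": 46.52, "lon": 6.63}).json()
print(top["top_species"])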
We also did an experiment with this application to see what participants think about this real-time feedback and how this automatic filtering works. The experiment was a three-week pilot test of the app. During these three weeks, 200 people visited the application, however only 36 people registered. Among the 36 people, only 14 contributed, and also among these 14 people, some were really active and some were contributing from time to time, which is a very common pattern when it comes to citizen science and VGI projects; we know the well-known case of OpenStreetMap, where this pattern of participation is usually the same for such projects. Moreover, during this three-week experiment, 230 observations were collected, the majority of which were of course bird observations, but other types of species were also collected during this experiment. We also afterwards sent
a questionnaire to our participants to see their opinion about, of course, the interface of the application, how user-friendly it was, and also more important for us, it was how they thought about this real-time feedback generated, how useful it was,
or whether it actually increased their motivation to continue contributing to the project. We found a high correlation between the number of contributions and the rated usefulness of the feedback: people who contributed more rated the feedback as more useful. Although
this correlation was not statistically significant, because probably our sample size was quite small, but still it was a preliminary result for us to see the positive impact of this real-time feedback. As a result of this auto-filtering, only 24 observations were flagged
out of 230, and we also saw that there is a correlation between the number of flagged observations and the total number of observations. We checked that for each participant, and it was a statistically significant correlation. That is to say, participants who contributed more had a lower number of flagged observations. Maybe that means that people over time actually learned from their contributions and thus contributed higher-quality data. To conclude, the objective of this research was to see how this combination of machine learning and citizen science can help improve data quality, simplify data validation, and increase public engagement. To do that, we implemented the BioSense-ES application, which automatically validates the location of biodiversity observations using
species distribution models. And also, we obtained this result that the more active the participants, the more useful they found the feedback, and also more motivating to continue
with their contribution. And as we observed, the number of flagged observations was 24 out of 230. So that is to say that this approach simplifies data validation by reducing the number of flagged observations that need to be validated later by experts.
Some future work that we need to do is to also try to perform this approach for other organisms besides birds, and to see how we can expand this approach of combining machine learning,
citizen science, also to other citizen science projects besides biodiversity. Another point is that in our application, we mainly focused on sustaining participation by giving real-time feedback to people who were already actively contributing, but it would be interesting to see how this integration of AI in general in citizen science can promote initial engagement and bring more people in at the initial step to contribute to citizen science projects. And finally, what is very important for us is to investigate more, and to focus also on the possible challenges, and also the benefits, that might arise as a result of this combination of AI and citizen science, focusing on different aspects of user engagement, data quality, ethics, and so on. So that is basically it. Thank you so much, and we'll be happy to answer questions. Thank you. I will start with some online questions,
then later there are more questions in the room I will ask you also. The first question is, can the user download the model to their mobile such that they can receive feedback
when there is no mobile network? For the moment, no. It works only if there is a mobile data connection, but that is something that we need to think about later on, because if they are contributing in a forest or somewhere without a connection, it's important to get this real-time feedback. But not for the moment.
Okay. Second question: do you think that your methods can be applied to other species where there are fewer available existing observations, so more rare species? That's a good question. Well, it's also a challenge that we had: even for our models, for some species where there is a low amount of data available, the models didn't really perform well. Still, when we generated these random points, balanced random forest performed better because it considers under-sampling, reducing the majority class, like the absences; but we saw that some other algorithms performed really poorly.
However, one thing that we think can be useful for rare species is, instead of generating, for instance, only 5,000 points, trying to capture as much of the environment as we can, so maybe something like 100,000 pseudo-absence points, and then to see if that would help for species that really have a low number of observations. So this is something that we need to try, to see if we get higher performance for such cases. And then, very related: do you think it's also
suitable for other species, so now you talk mainly about birds, but for like insects? Again, very good question. We are also trying to do that. We need a data set. So for birds, it was easy, probably Tom there, he is working with birds, so he knows that it's
easier to get bird observations than datasets for other organisms, but we are now going to start collaborating with an institute in Switzerland that collects all types of observations. They are willing to provide us with the dataset that they have; their participants have not been active for some time, and they are willing to maybe integrate this in their platform and see if their participants will get more active, and it also benefits us to get their dataset for other organisms. Before moving on, there's one more question in the chat, but does anyone in the room have a question you want to ask now, or did you already ask it in the chat? Of course,
that's also possible, but if anyone wants to raise their hand, then let's have some interaction with the room also. Go ahead. Are there any features you would like to add to the current software, like species determination keys? And, as a second question, what would be, or could be, the interest of specialized ecological institutes in this VGI approach? Sorry, I didn't understand your first question, could you please repeat it?
Is that about new features you could add to the application, for instance a species determination key, and whether it could raise the interest and involvement of participants?
To add new features to the application in order to, for instance... Yeah, such as trying to help the volunteer determine the species. Well, again, this is something that we are thinking of doing in the future. For the moment, we are identifying species based on their environmental characteristics: based on the environmental characteristics, we say, okay, it is highly probable that this is the species, or not. But we would like to have maybe an ensemble algorithm, with one model trained on images and one on our environmental variables, and then maybe also increase this possibility of species identification for volunteers. And I forgot the second question.
Is that the interest for specialists, like with the VGI approach, and what could be the interest for a kind of specialized center of competence? Well, I think mainly for them, maybe it would be that, because usually these types of biodiversity data are nowadays collected from citizen scientists, so I think the interest for them would be to maybe keep their community engaged. Because sometimes, as I mentioned with this institute, they have a large community, but they said that only some of them are contributing, and a large part of them are not active anymore, so I think the interest for them, actually, was to see how they can bring them back. Another point is that they have a lot of data that is not validated, because to have experts check this they also need resources, they need people to do that, financial and human resources, and they don't have those resources at the moment. So they have lots of data that are not validated, and this was another point of interest for them, to maybe integrate this approach in their platform. I hope I could answer. Thank you. Thank you. Well, I will ask one more question.
I think it's also, if you want to leave now for a different session for the next presentation, please go ahead, but to fill the time we will just go through one more question, and that question was: is giving the suggestion of the top five species in the area,
is that not biasing the outcome? And after you did that, did you get different types of results? Well, in the pilot test that we did, most of the participants actually contributed only the species that they observed, so the suggestion was not used that much, but in any case we put it there. I think sometimes for non-experts, let's say non-ecologists or non-bird watchers, it could be useful, because they were observing a species at some point and didn't know the name, so using the suggestions they could click on that name, get the information about the species with the image and everything, and say, ah, this is that species, and then add it directly as an observation they had made. But whether it could be biasing the collected observations, that's a good point; I mean, we didn't think about it that much, but to be honest, most people used it as a source of help and then continued with the normal way of adding their observations. Thank you. Does someone want to have the honor of asking
the last question? Otherwise, I have one last question. You said you compared the different machine learning algorithms, like Naive Bayes and everything; did you also compare them to some kind of baseline method, where it just randomly says valid or not valid, to see how much better they are? If the question is whether I compared with a baseline: I compared the results with MaxEnt, to be honest not for all the species, but for certain species where the performance was not good for some algorithms. I compared the results with MaxEnt, and the results from balanced random forest were very close to what I got from MaxEnt, so this was the comparison that I did with a baseline, let's say a standard. Thank you so much. Thanks to you.