Linking geospatial free and open-source technologies with big data in biodiversity research
Formal Metadata
Number of Parts | 295
License | CC Attribution 3.0 Germany: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers | 10.5446/43313 (DOI)
Transcript: English (auto-generated)
00:07
Cool. So biogeography combines all aspects of life, the diversity of life, with maps. Basically, the definition of biogeography is the study of the distribution of species across space and through geological time.
00:23
Okay, so life combined with maps, which relates directly to biodiversity, the variety and variability of life on Earth. Biodiversity has many faces. Some of you may know this, others may not. Starting from genes, okay, the genetic level,
00:40
which is the code that we all contain, that all life contains, in order to adapt and evolve under environmental change. This is a typical example of a phylogeographic study, as we call them, showing the distribution of different genetic groups in Europe for this nice, beautiful flower, from one of my previous studies when I was still a PhD student. So it shows the distribution on a map
01:02
of the genetic diversity, okay, genetic variation. Another level, the most commonly used and most popular one out there, I would say, is species richness, okay, the diversity of species on Earth. This is a map showing the number of mammal species that exist in each pixel here, in each grid cell,
01:23
starting from very low numbers at higher latitudes, as expected, and reaching the top levels in the tropics. Okay, this is a common pattern in diversity of life, so around the tropics, we have many more species, not only for mammals, but for many different organisms. Okay, and this is a higher level of biodiversity,
01:42
ecosystems or biomes, and this is the distribution of biomes on Earth, okay. From savannas to tropical rainforests to tundra and temperate forests, everything is there. So we have three different levels, starting from genes up to biomes, okay. What's common here is the maps,
02:01
the use of maps in biodiversity research, okay. So let's have a look at what type of data we use to create these maps and to map biodiversity on Earth. So for example, for genetic diversity, it makes sense that we need to use genes,
02:21
we need to use DNA. We start by sampling, okay, out in the field. These are some cool pictures I took in Mongolia, where I was doing my field work, collecting tissues, collecting samples, depositing the samples in museums or herbaria, okay, where other people can find them as well, or even collecting samples from herbaria, because there are huge collections out there in the world
02:42
that are waiting for biologists like me to exploit them. Then from this part, we go to DNA sequencing, modern technologies that produce thousands or millions of genes and genetic variants for living organisms. And all of this can be deposited in open databases. One of the famous ones is GenBank from NCBI.
03:03
This database alone contains data for more than 300,000 organisms and more than eight million accessions, okay. So we're talking about big data that is easily accessible to researchers like me to download and do whatever we wish with it, and probably to address major biological questions.
03:24
One use of these databases and this data, again from one of my own studies: collecting data throughout Europe in this case, analyzing all this data, thousands of genetic variants as we said, plotting them on a map and then coming up with models and predictions of how this diversity
03:41
may affect the survival of the species in the face of global change, in the face of climate change, okay. So useful data to address very, very timely questions. The other level, the next level, okay. We started from genes, we jump to species diversity, how do we create these maps,
04:00
these biodiversity maps, species richness maps? Of course, we still have to start from sampling. We have to observe the species out there, okay. Thousands of researchers, thousands of biologists, try to collect samples and data from different species. They deposit them in databases, and this is one of the greatest databases that we have, from the IUCN, the International Union for Conservation of Nature.
04:22
They have data for more than 80,000 species: polygons, shapefiles basically, that researchers can use to create those nice maps and to address questions as well. This is another very cool database. It's called GBIF, and it contains more than one billion georeferenced occurrences of species, okay.
04:42
So people collect data, even citizens collect data, they can upload the data into this database and researchers or anyone basically can download this data and analyze them, okay. So we're talking about truly big data out there that are easily accessible to people.
05:01
These are the types of questions that we can address with this data, okay. So this is how we create a typical species richness map: we have all these polygons and points, we overlay a regular grid, and we count the number of species in each cell. Okay, we come up with these rather depressing maps, I would say, because what we see here
05:21
is the remaining wild populations out there, okay. The estimates of the remaining wild populations. Red areas indicate regions with the largest decrease in wild populations, okay. And another depressing map as well is the species richness loss since the 1500s, okay.
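The grid-overlay counting described a moment ago can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the workflow used in the talk: the input is assumed to be point occurrences as `(species, lat, lon)` tuples, the function name `species_richness` and the sample records are hypothetical, and range polygons such as the IUCN shapefiles would need a GIS library instead.

```python
import math
from collections import defaultdict


def species_richness(occurrences, cell_size=1.0):
    """Count distinct species per grid cell.

    occurrences: iterable of (species, lat, lon) tuples (assumed input format).
    cell_size: cell edge length in decimal degrees.
    Returns {(row, col): species_count}, with rows and columns indexed
    from the corner at latitude -90, longitude -180.
    """
    cells = defaultdict(set)
    for species, lat, lon in occurrences:
        row = math.floor((lat + 90.0) / cell_size)
        col = math.floor((lon + 180.0) / cell_size)
        cells[(row, col)].add(species)  # a set, so repeat sightings count once
    return {cell: len(names) for cell, names in cells.items()}


# Hypothetical occurrence points, for illustration only.
records = [
    ("Canis lupus", 45.2, 25.1),
    ("Canis lupus", 45.7, 25.9),  # same species, same 1-degree cell
    ("Apodemus flavicollis", 45.4, 25.5),
    ("Ursus arctos", 46.3, 25.2),
]
richness = species_richness(records)
```

Using a set per cell is the design choice that matters here: richness is the number of distinct species, not the number of records, so heavily sampled cells are not inflated by repeat observations.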
05:41
So these maps show species richness, species diversity through time, okay. Depressing maps but this is the reality, okay. We're losing a lot of wildlife out there, okay. Without sometimes even having the chance to record it. And the third level is the ecosystem level or the biome diversity level.
06:00
Satellites come to the rescue there. We use a lot of satellite data nowadays to map vegetation, and climate as well, and to create these nice maps of biome or ecosystem diversity. And again, the type of questions that we can address with this type of data is still depressing: the biome shifts or ecosystem changes due to climate change. Some estimates here
06:23
coming from a scientific publication, we're losing a lot, a lot is moving around the earth due to climate change. So depressing facts but we're able to address them using this type of data, okay. Retrieving data from open databases,
06:41
plotting and analyzing them, okay. So just a summary of the data that we use. We're talking about thousands and sometimes even billions of records, accessions, pixels in the case of satellites, okay, waiting for us to analyze. But unfortunately, in biology, or in biodiversity research,
07:01
those data are mostly incomplete, okay. So for many aspects of life, we have no clue how it's distributed on the planet, apart from what we see from satellites, for example, where the image is clear. Genetic diversity, for example, requires a lot of sampling, a lot of effort, sequencing effort, et cetera. So we don't have this type of information for the whole world.
07:21
So we have incomplete data in biology and we also have, as we've seen, diverse data, starting from genetic databases up to satellites, okay. These are very diverse data sources and we need to have tools, we need to come up with tools and pipelines to analyze all this diversity of data, okay, big data.
07:41
Fortunately, we have some, okay. And this is something that I want to stress, the open source and free software out there. I mainly focus on Python because I'm a Python lover myself, and we do have tools to retrieve and analyze this type of data. So we have a nice cool package here,
08:02
it's called Biopython, with which you can do everything from retrieving genetic information from databases up to actually analyzing it and coming up with summary statistics of genetic diversity, for example. It's very cool, and many people are working on this framework right now. We have software
08:22
that one could use to retrieve these millions of occurrences of species around the globe. There is other cool software for when we don't have coordinates, for example, for certain sampling locations, only a description of the place: we can use this type of software, of libraries,
08:41
to actually get lat and long for the data that we're using. Okay, either by using Google Maps or Google Earth or GeoNames, another cool database that relates locations and location names to lat/long. So we have these tools that allow us to retrieve data and put them all together in order to create
09:03
those nice maps and come up with the depressing facts of life on Earth. We can also analyze them very effectively with the use of cool software, and we all know why we're here, okay. Because basically there are two main frameworks
09:21
that help a lot of researchers do their job in terms of spatial analysis, okay. One is GDAL, and of course QGIS, and other libraries that sit on top of different open source and free software, which make our life much easier, okay. Myself, I use exclusively these tools to do all the analysis that I do,
09:42
but we also have other solutions for vector data, like Shapely, Fiona, and OGR. These are mostly Python based and, of course, open source and free software. Finally, the last step is to visualize the information that we're retrieving and analyzing, and again, there are nice tools out there
10:00
that help us do the job. Okay, Matplotlib is a Python library for static visualization, while this very, very cool tool, D3.js, written in JavaScript, helps us do interactive stuff, and I will demonstrate some of this stuff now. I mentioned before that for many aspects of biodiversity,
10:23
we have, I mean, we are in the very early stage of mapping it, of understanding its distribution on earth. Okay, for species diversity, especially for mammals, for example, a very well-studied group, we have the maps, we know where they occur. For many other, like fish, for example,
10:40
we also have an idea. For genetic diversity, though, because it's at a very low level, and it's something that cannot be observed directly when you go out in the field, but it requires a lot of effort to retrieve this type of data, we're still at the very early stages of revealing the global patterns of diversity, okay, at this level. Why do we care about genetic diversity, okay?
11:03
Non-evolutionary biologists, or people with no biology background, may not be aware of how important genetic diversity is, but actually it's the fundamental level of biodiversity. So without genetic diversity, none of the other levels of biodiversity can exist. Okay, we need this type of variation
11:21
in order to create the rest of the diversity of life. Okay, it's actually mutations in the DNA that allow people to evolve, that allow species to diversify and create new life forms. Okay, and of course it's included in many, many different initiatives and strategies. We had a mention in the plenary talk about GEO BON,
11:41
and you see genetic diversity is the basic component there. It's a major conservation target as well, so we need to preserve it. We need to do something about the loss of it. And another story here that tells us that genetic diversity is one of the fundamental boundaries
12:00
that we should not exceed in life. So we're approaching a point in time where we're losing a lot of wildlife, a lot of levels of biodiversity are being extinguished. So genetic diversity is one part of it, and with red color here, it's indicated that it may be beyond the zone of uncertainty.
12:22
Its importance is also shown here. This is a graph of the number of scientific publications coming out over the last years, so you see an exponential growth of papers, publications that refer to genetic diversity, yeah. So we started doing something about it, at least monitoring it, but we're still missing a global perspective.
12:45
Okay, so you saw the species richness maps, you saw the ecosystem diversity, we have it there and we know how it changes. But look, this is one of the first maps ever created, from our lab a couple of years ago, published in a big journal. First of all, you see the gaps that we have in our knowledge, okay,
13:02
of genetic diversity. And even the cells that we have right now cover at most, at least for mammals, 15 to 20% of the taxonomic diversity there. So if a cell, for example, has in total, let's say, 100 species, we may have data only for 10 of them,
13:22
or 15 maximum, okay. So there are a lot of things still to know. We can't preserve it, we can't do anything about it, without having a clue of how it's distributed on Earth. So my job now, for the last year, is to collect more data and try to come up with predictions of genetic diversity, to try to update this map
13:41
and predict its distribution somehow, okay. Because actually collecting data for it, it requires, as we said, a lot of effort and we don't have this type of data available now. This is an updated map. Black points show the distribution of a particular gene.
14:01
Okay, so focusing on a particular gene. Here, just to showcase our approach, it is a mitochondrial gene. Mitochondrial DNA is one part of the genome; the mitochondrion is an organelle in the cell that is responsible for the production of energy in life. And this gene is contained in the mitochondrion,
14:21
and this is what we have so far for terrestrial mammals, okay, for this gene. It's obvious here that for regions of the world like Europe, North America, and Japan especially, we do have data. We don't have data for very important regions of the world like the Amazon, which is burning now. So you can imagine.
14:40
We haven't recorded it, but we're losing it. So this is again a depressing fact, but it's the reality, okay. And this is how these data are distributed in the mammalian tree of life. I don't know if you are familiar with phylogenetic trees. It's a way of showing the evolutionary relationships in life. This is the tree of life for mammals, and these are the main orders of mammals,
15:01
and this is how the data are distributed in the tree, at least. We have a good distribution of data across the tree of life. Okay, and again, to create these basic maps, we use GeoNames, this huge genetic database, and the tools that I just presented to obtain this data and georeference them,
15:21
so put them on a map. Okay, how do we go about creating a map of genetic diversity? Okay, so we saw these maps. We have the distribution of species, and let's assume that each of those dots, the colored dots, represents one sequence. Okay, so the red ones would represent a genetic sequence,
15:42
a gene basically for a rodent. A blue one would represent a sequence for a carnivore, okay, a wolf in this case. What we do is collect all this data, align the sequences together, and count the number of mutations here, okay, the changes, which tell us how diverse these genes are. We get the summary and average it. We do the same for the other species.
16:02
We average them, and what we have is an average estimate of genetic diversity in each cell. Okay, grid cell. And of course to do that, we still use open source and free software. Okay. Okay, so yeah.
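The align, count and average procedure just described can be sketched as follows. The talk does not name the exact statistic, so this sketch assumes the mean proportion of differing sites over all pairs of already-aligned sequences (a simple form of nucleotide diversity), and it treats gaps and ambiguous bases like any other character. The function names and the toy sequences are hypothetical.

```python
from itertools import combinations
from statistics import mean


def pairwise_diversity(aligned_seqs):
    """Mean proportion of differing sites over all pairs of aligned sequences."""
    if len(aligned_seqs) < 2:
        return None  # diversity is undefined for a single sequence
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("sequences must be aligned to the same length")
    proportions = [
        sum(a != b for a, b in zip(s1, s2)) / length
        for s1, s2 in combinations(aligned_seqs, 2)
    ]
    return mean(proportions)


def cell_diversity(seqs_by_species):
    """Average the per-species diversity values found in one grid cell."""
    values = [pairwise_diversity(seqs) for seqs in seqs_by_species.values()]
    values = [v for v in values if v is not None]
    return mean(values) if values else None


# Toy aligned sequences for one grid cell, for illustration only.
cell = {
    "rodent": ["ACGTACGT", "ACGTACGA", "ACGTACGT"],
    "wolf": ["TTGACCAA", "TTGACCAA"],
}
rodent_pi = pairwise_diversity(cell["rodent"])
cell_pi = cell_diversity(cell)
```

Averaging within each species first and only then across species keeps a species with many sequences from dominating the cell estimate.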
16:20
So apart from mapping it, and we still have incomplete data, we need to come up with ways of predicting it at least. So is there anything that can correlate with genetic diversity that could help us create these complete maps, these full maps? Okay, of course genetic diversity relies a lot on randomness, okay, it's not something, a fixed thing that we can easily predict.
16:40
But at least from biological hypotheses, we have some expectations, okay. For example, and I won't go into the hypotheses behind that, we would expect that species diversity would correlate well with genetic diversity in space. Okay, we would expect that past climate stability, especially after the Last Glacial Maximum, for example, the last 20,000 years,
17:01
would also correlate well with current patterns of genetic diversity. Okay, the more stable the climate, the more it allows for new diversity to arise. So we would expect also these things to correlate. And lastly, human footprint. Okay, we know humans have a huge impact on life,
17:20
and we're responsible for a lot of extinctions, so we would expect less genetic diversity wherever there is a lot of human activity. So we have three major expectations, and we can use this type of data to see if at least they correlate well with the data that we have so far. Okay, let's jump directly to that. I think I don't have a lot of time, but I will present a nice framework
17:41
that I created for visualizing all this information that I just presented. Okay, so what we have here, my mouse doesn't move quite fast, so I will just go back to the presentation. I think I will present a video there
18:01
instead of doing it interactively. Okay, we truly believe in data visualization and data sharing in our lab, so that's why I created this framework using the D3.js library, an interactive framework that allows us to explore different aspects of genetic diversity, at least based on the data that we have so far. Okay, so this is at the latitudinal band scale,
18:22
genetic diversity across latitudinal bands. What we have here is the correlations between the predictors, as we mentioned. We can change predictors here and get also a sense of the predictors that we're using, the correlates. We can change scales. We can go to the grid cell scale and this is the updated map that we have so far.
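Checking whether a predictor such as species richness tracks genetic diversity across grid cells comes down to a correlation coefficient. The sketch below computes a plain Pearson correlation with the standard library only. The talk does not say which coefficient the framework uses, so Pearson is an assumption here, and the per-cell values are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (spread_x * spread_y)


# Hypothetical values, one entry per grid cell.
richness_per_cell = [12, 30, 45, 60, 80]
diversity_per_cell = [0.010, 0.018, 0.022, 0.035, 0.041]
r = pearson(richness_per_cell, diversity_per_cell)
```

The same function can be applied to the other two expectations (past climate stability and human footprint) by swapping in the corresponding per-cell predictor values. Neighbouring cells are not independent, so a coefficient like this is better read as a descriptive summary than as a formal test.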
18:41
Okay, we can filter the data based on data availability. We end up with less data but more meaningful predictions of genetic diversity. With this framework we can change genes, of course. This is for another mitochondrial gene, with less data, and there is some standard functionality like zoom-in, drag, et cetera.
19:00
Okay, this is a simplification of things. Of course, we can make it much more complicated. Okay, but this will also allow users to retrieve the data that we're using and analyze it in different ways if they wish. Okay, it's not published yet because we're waiting for the publication of the article first. But once the publication is out, the framework will be out as well.
19:21
Okay, so the basic thing that we're trying to do is combine the current patterns of genetic diversity, what we have so far, with species richness, for example, in this case, identify correlates, build models based on these relationships and maybe come up someday with a nice map of genetic diversity on Earth
19:40
like the maps that we have for species richness, for example, or ecosystem diversity out there. Okay, so this is the ultimate goal and we believe that's very important to have. To conclude, of course, free and open source technologies are available in biodiversity research. I've demonstrated some examples
20:01
and I think the more we develop, the better we are able to retrieve and analyze all this big data that is out there and to come up with nice biological interpretations of life. There is still an ever-growing need for efficient tools. Okay, there are tools out there, but we still need the integration of these tools.
20:22
Okay, common frameworks that will allow users in a user-friendly way to retrieve and analyze data. Okay, I believe and we believe that interactive data visualization and sharing increases the transparency in scientific research. Okay, we believe in, apart from open source software, we believe in open data as well and having that on the web
20:41
and we will see many talks focusing on that. And finally, apart from research, I think these tools are very important for education. This is a nice picture from a lab that we had this year at the university teaching students how to use QGIS, for example, in creating biodiversity maps.
21:01
Okay, and students were very excited because QGIS, for example, is a very nice and very user-friendly tool for students as well. And with that, I would like to thank you all for being here and, of course, the community, the foundation that gave me the opportunity to be here and talk to you, and all the different platforms and tools
21:22
that allow us to do all this cool, slightly depressing but still cool work out there. Thank you. Five minutes for questions. Anyone? Yes.
21:41
You're recorded. Do you assign some kind of trustworthiness levels to the different kinds of data you have? So that's part of the model or does everything weigh the same? It's part of each step. So filtering, for example, yeah. Filtering, it's a fundamental part of each step of data retrieval, okay. So you have to evaluate how reliable the data are,
22:04
the type of information that they contain. One thing that I forgot to mention, for example: out of the sequences that we retrieved from GenBank, only 30% for the particular gene, for example, contain location information, okay. 70% of them don't have any information that would allow us to attach coordinates to them and plot them on the map, okay.
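The split between sequences that carry usable location information and those that do not can be sketched like this. The sketch assumes that coordinates arrive as a text field in the "degrees N|S degrees E|W" form used by the GenBank `lat_lon` qualifier. The record dictionaries, their keys and the accession-like IDs are hypothetical, and no database is queried.

```python
import re

# Matches strings such as "47.94 N 28.12 W".
LAT_LON = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s*([NS])\s+(\d+(?:\.\d+)?)\s*([EW])\s*$"
)


def parse_lat_lon(value):
    """Return (lat, lon) in signed decimal degrees, or None if unusable."""
    if not value:
        return None
    match = LAT_LON.match(value)
    if match is None:
        return None  # free text such as a country name cannot be mapped
    lat = float(match.group(1)) * (1 if match.group(2) == "N" else -1)
    lon = float(match.group(3)) * (1 if match.group(4) == "E" else -1)
    if abs(lat) > 90 or abs(lon) > 180:
        return None
    return lat, lon


# Hypothetical sequence records, for illustration only.
sequence_records = [
    {"id": "SEQ0001", "lat_lon": "47.94 N 28.12 W"},
    {"id": "SEQ0002", "lat_lon": None},
    {"id": "SEQ0003", "lat_lon": "12.5 S 45.0 E"},
    {"id": "SEQ0004", "lat_lon": "Mongolia"},
]
parsed = {rec["id"]: parse_lat_lon(rec["lat_lon"]) for rec in sequence_records}
usable = {key: value for key, value in parsed.items() if value is not None}
share_georeferenced = len(usable) / len(sequence_records)
```

Records that fail to parse are not necessarily lost: a place description can still be passed to a gazetteer such as GeoNames, as mentioned earlier in the talk, at the cost of lower positional accuracy.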
22:22
And even for those that do contain the information, this information may still not be very reliable at some point, okay. For GBIF data, species occurrences, you need to double check the sources of this data again, whether they're reliable or not. So I think the filtering step is something fundamental and one needs to build pipelines,
22:40
unfortunately custom pipelines for that so far, but some of the tools that I presented also have automated pipelines for filtering out, let's call them, unreliable data.
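A custom filtering step of the kind mentioned in this answer might look like the sketch below. The rules are illustrative assumptions, not the speaker's pipeline: drop records with missing coordinates, coordinates outside the valid range, the (0, 0) placeholder position, and exact duplicates of the same species at the same rounded position. The record keys `species`, `lat` and `lon` are hypothetical.

```python
def is_plausible(record):
    """Basic sanity checks on the coordinates of one occurrence record."""
    lat, lon = record.get("lat"), record.get("lon")
    if lat is None or lon is None:
        return False
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return False
    if lat == 0 and lon == 0:
        return False  # (0, 0) is frequently a placeholder, not a real site
    return True


def filter_occurrences(records):
    """Keep plausible records and drop duplicate species/position pairs."""
    seen = set()
    kept = []
    for rec in records:
        if not is_plausible(rec):
            continue
        key = (rec["species"], round(rec["lat"], 4), round(rec["lon"], 4))
        if key in seen:
            continue
        seen.add(key)
        kept.append(rec)
    return kept


# Hypothetical occurrence records, for illustration only.
raw = [
    {"species": "Canis lupus", "lat": 45.2, "lon": 25.1},
    {"species": "Canis lupus", "lat": 45.2, "lon": 25.1},  # duplicate
    {"species": "Ursus arctos", "lat": 0, "lon": 0},  # placeholder position
    {"species": "Lynx lynx", "lat": 123.0, "lon": 25.0},  # impossible latitude
    {"species": "Lynx lynx", "lat": 46.0, "lon": None},  # missing longitude
    {"species": "Lynx lynx", "lat": 46.0, "lon": 24.5},
]
clean = filter_occurrences(raw)
```

Real pipelines usually add checks that need external data, such as whether a point falls in the sea for a terrestrial species or sits exactly on a country centroid. Those are left out here because they cannot be shown without a reference dataset.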