Open Source Spatial Tools for Biodiversity and Environmental Data
Formal Metadata

Title: Open Source Spatial Tools for Biodiversity and Environmental Data
Title of Series: FOSS4G SotM Oceania 2018
Number of Parts: 50
License: CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/40856 (DOI)
Language: English
Transcript: English (auto-generated)
00:03
Okay, thanks everyone. I guess that's the introduction. I'm an analyst developer and I've been with the Atlas now for about four years. I first developed ZoaTrack, which is a platform for animal tracking data, and both ZoaTrack and I were taken into the fold of
00:24
the Atlas some time ago, and I've wheedled my way into this management position, and that's the capacity in which I speak to you today. So I work with all of our ecological analysis tools.
00:40
So, a quick introduction to the ALA. Who here has heard of or used the ALA? Okay, so we're kind of all friends. The ALA is a big database of plant and animal information. We're an aggregator of biodiversity data: we pull it together from multiple sources and then we make it freely available for reuse. So we're funded by
01:06
NCRIS, the National Collaborative Research Infrastructure Strategy. The important words there are "collaborative infrastructure", which means that we're driven by an open source software development strategy and open data policies. We're hosted by CSIRO.
01:23
There's about 30 of us. Most of our staff are based in Canberra at Black Mountain, and there's a handful of us littered around Melbourne. We're the Australian node of GBIF, which you saw before in Jane's talk. GBIF has several international nodes, and we're partnered with a whole heap of museums
01:44
and collections people. So the original idea for the Atlas came from all the museums and collections wanting a central database so that they could look up their species and location information. That was round about 10 years ago. We're still partnered closely with
02:00
all of those guys; I'm based at the Melbourne Museum myself. Our open source software development has been really successful. We've got countries all over the world now picking up our software and using it: about a dozen, and there's about ten more in negotiation. Just last
02:20
week Austria came on board, and Sweden signed up recently as well. We're working on building that Living Atlases community, basically, so that it doesn't just become a whole heap of forks of our software; we can end up with a common code base and get the good work that those other countries are doing flowing back to us.
02:45
So my talk will go through these elements here on the left-hand side: data capture; processing, what we do to the data; some of our discovery tools; and mostly our data analysis and visualization, a couple of our visualization platforms. Just quickly, with data capture, we have
03:05
a data management team that sits there and pulls data in from all sorts of different places. We have automated and manual loads. We speak Darwin Core. Quick show of hands for who knows what Darwin Core is. It's more or less a biodiversity data standard, so for us that means
03:24
it's a set of 186 terms where we can tell people "make these your file headers", and then we roughly know and understand what those things mean. So it contains things like scientific name, or decimal latitude and decimal longitude.
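A minimal sketch of what a Darwin Core-style occurrence file might look like; the column names are real Darwin Core terms, while the record itself is invented for illustration:

```csv
scientificName,decimalLatitude,decimalLongitude,eventDate,basisOfRecord
Eucalyptus camaldulensis,-37.81,144.96,2018-11-20,HumanObservation
```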
03:43
We've also got other platforms that pull in data, non-occurrence data you might say, or some occurrence data. We've got a BioCollect platform for supporting citizen science and field data collection, a Profiles app that gives us more descriptive information about species, and ZoaTrack, my application for managing and visualizing
04:01
animal tracking data. We support DigiVol as well, which is a digitization and transcription platform. Now, down this way: our data processing. I just wanted to take you through the sorts of things we do to that data once it comes in. We run it through a massive engine.
04:23
We augment each record with a whole heap of information about taxonomy, about the environment and the spatial context, and then we run a whole heap of data quality tests on that data. So our first stop is taxonomy. We've got a piece of software that we built called
04:41
the Large Taxon Collider (the guy who built it is a quantum physicist), and we work with all of the different Australian taxonomic authorities to try to come up with one big, unique list of all Australian species. It handles things like updates and merges and synonyms, and all
05:03
of those sorts of things that happen in taxonomic names, and for those of you who aren't aware of what a bloodbath taxonomy can be, it's quite an exercise. So we get that taxonomic information and we come up with what we think is the right scientific
05:21
name for that organism, and we actually add the whole taxonomic tree to the record, basically, so it's easy to look up. We host around about 500 spatial layers: different environmental layers, contextual layers. I had no idea how to describe all of these on a slide, so I'm sorry about how much information is there, but an important part of our data processing is that we take each location and intersect it with each and every one of those layers. We grab the value back and put it up next to the record so that we've got it easy to use later on.
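As a rough sketch of that intersect step seen from the outside, using the public single-point intersect service (the endpoint path and the layer ids here are assumptions to check against the spatial portal's documentation):

```python
import requests

# Sketch: ask the ALA spatial service which layer values a coordinate
# falls in. Layer ids and the endpoint path are assumptions; see
# spatial.ala.org.au for the live service documentation.
layers = "cl22,el874"        # example contextual + environmental layer ids
lat, lon = -37.81, 144.96    # illustrative coordinate

resp = requests.get(
    f"https://spatial.ala.org.au/ws/intersect/{layers}/{lat}/{lon}",
    timeout=30,
)
resp.raise_for_status()

# One result object per requested layer, carrying the sampled value.
for hit in resp.json():
    print(hit.get("layername"), "->", hit.get("value"))
```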
06:03
And then we run our data quality tests. We have around about a hundred data quality tests, and they're in the process of being internationally standardised for biodiversity through the Taxonomic Databases Working Group, the Darwin Core people. So Jane alluded before to the idea that there are lots of data quality issues within the Atlas, and that's entirely true, because we rarely actually throw data out and we don't set ourselves up to be the judges of what makes a good record. Instead we try to run these tests so that we can help people assess the fitness for purpose of a record for their use. We run these hundred data quality
06:44
tests so that people can then pick and choose what might be useful for their purpose. So species distribution modelling obviously needs really high quality records, and you would do quite a bit of filtering before you found the right data set for such a
07:01
scientifically important goal. But if you were just mucking around with data you might just want everything. So those data quality tests do things like checking names, and checking whether a record is where we would expect that species to be, among other things,
07:22
and I think around half of them are location-based tests. So what we've got in the end can be a record that's up to a thousand fields wide, and we put that into a Cassandra database and index it with Solr.
07:41
So we've got a lot of information that's come along with our original record. Most people, I don't know if it's most people, use our web front end for navigating our data. We call it the Biodiversity Information Explorer. You can search species and locations, or you can go in through your collections or by data set. Probably our biggest strength is
08:07
our web API. The Atlas is a service-oriented architecture: we have our database out the back, which we call Biocache, and we've got lots and lots of different front ends, not just our
08:22
main front end but also other front ends like the Australasian Virtual Herbarium. The thing is that our database, and the web service layer that sits in the middle to support all of our infrastructure, is also publicly available. So we publish our API and anyone can use any of the
08:41
tools that we use internally. Our API is at api.ala.org.au, and these are just the groupings, here on this slide, of the different services that we've got. We've got around about a hundred, I think, that are exposed.
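As a hedged sketch of what hitting those services looks like (the biocache host, paths, parameter names and the assertion name are assumptions to verify against api.ala.org.au):

```python
import requests

# Sketch: search occurrence records through the public web services,
# excluding records that failed a (hypothetical) location quality test.
BASE = "https://biocache.ala.org.au/ws"

params = {
    "q": 'taxon_name:"Eucalyptus camaldulensis"',  # river red gum
    "fq": "-assertions:coordinatesOutOfRange",     # assumed test/field name
    "pageSize": 5,
}
resp = requests.get(f"{BASE}/occurrences/search", params=params, timeout=30)
resp.raise_for_status()
data = resp.json()

print("total records:", data.get("totalRecords"))
for occ in data.get("occurrences", []):
    print(occ.get("scientificName"),
          occ.get("decimalLatitude"), occ.get("decimalLongitude"))
```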
09:10
The spatial portal, at spatial.ala.org.au, is our visualization and analytics tool for dealing with all this data. Its purpose is to manipulate, analyze, display, import and export spatial data. It's got every tool under the sun I can think of to be able
09:26
to work with species, with areas, the layers and those dozen or so facets. So you can come up with visualizations like this one: all the river red gum occurrence data colored by the type of
09:48
observation that they are, like the specimens from the museums as opposed to human observations. There we've colored the NVIS subgroup that those occurrences lie in and
10:04
overlaid it with a temperature layer. So those are the sorts of visualizations that we can do within the spatial portal. It's got lots of other tools out the back end. There are scatter plot
10:20
analyses for working with the continuous variables that come in through the environmental layers. There are crosstabs for discrete variable analysis. There's prediction software in there, MaxEnt, and we have identified around about 17
10:47
analysis tools in the spatial portal. Then we have ALA4R. ALA4R was written round about 2014 by a chap called Ben Raymond at the
11:02
Australian Antarctic Division. ALA4R hits those web services and puts the results in a nice SpatialPointsDataFrame so that they're right there to use with tools like leaflet and ggplot, and it's really trivial to get that going if you're an R programmer.
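ALA4R itself is an R package; as a rough Python analogue of what it does under the hood (call the web services, hand back spatially typed points), assuming the same hypothetical endpoint and field names as the earlier example:

```python
import geopandas as gpd
import pandas as pd
import requests

# Sketch: fetch occurrences and wrap them in a GeoDataFrame, roughly
# what ALA4R's SpatialPointsDataFrame gives R users.
resp = requests.get(
    "https://biocache.ala.org.au/ws/occurrences/search",
    params={"q": 'taxon_name:"Eucalyptus camaldulensis"', "pageSize": 100},
    timeout=30,
)
df = pd.DataFrame(resp.json().get("occurrences", []))

gdf = gpd.GeoDataFrame(
    df,
    geometry=gpd.points_from_xy(df["decimalLongitude"], df["decimalLatitude"]),
    crs="EPSG:4326",  # the services return WGS84 decimal degrees
)
print(gdf[["scientificName", "geometry"]].head())
```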
11:25
I think it covers the more well-used API services. So just for my last couple of words, I wanted to talk to you about some of the
11:43
issues that we're having at the moment around spatial stuff. We're coming up on 10 years old: next year we'll be celebrating 10 years of being live, which is pretty cool. And I guess... pardon? I think it was originally FOSS4G that was here; was it in Melbourne before?
12:07
So we got a lot of help, I hear, last time we came, with dynamically producing tiles and those sorts of services, so we are appreciative of the conference. And we're sort of turning from this innovative, startup-y sort of culture into a business-as-usual (BAU) house, and
12:26
we've got a lot of work to do. We haven't looked outward a lot, I think, and we have to look into things like the OGC W*S services and what's going on there, making sure that we're keeping up with new tech. Probably a bigger problem at the moment, though, is our 500
12:46
spatial layers. They come from all different agencies. We want 200 more; we've got 200 more waiting in the pipeline. They're all different: they all have different licensing arrangements, different metadata, different coverage, different styles, scales, all sorts of
13:04
things. I'm sure there are a few of you in this room who are familiar with those sorts of problems, and I guess we're wondering who else is dealing with them, and whether there's a call for a central agency that could manage these sorts of things and then help us with web services like the ones we run to intersect those layers and send back
13:28
a value. So we have those services ourselves; we produce the service. You send a lat-long and your layer name, or you can get values back for all 500 layers if you like, and we have a batch version of the same,
13:44
and we feel it would be great if someone else could do that. So you need scalable infrastructure, standardized layers, standardized vocabularies, all that sort of boring stuff that isn't such sexy work to do but has such great returns.
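A client-side sketch of that batching idea, simply looping the single-point intersect call from earlier (the real batch service has its own protocol, which isn't reproduced here; layer id and coordinates are illustrative assumptions):

```python
import requests

# Sketch: sample one layer for many coordinates by looping the
# single-point intersect service. A hosted batch service would
# replace this loop with one request.
points = [(-37.81, 144.96), (-35.28, 149.13), (-33.87, 151.21)]

for lat, lon in points:
    url = f"https://spatial.ala.org.au/ws/intersect/cl22/{lat}/{lon}"
    values = requests.get(url, timeout=30).json()
    print((lat, lon), [v.get("value") for v in values])
```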
14:04
So that's it from me. Thanks very much everyone, and yeah, are there any questions? [Audience] Yes, a bit off topic, but have you been using your sightings and species database
14:22
to do, like, machine learning? [Speaker] Yes. Well, we're teaming up with iNaturalist, who are in a really exciting place with deep learning: they've been working with Google and all those sorts of people to do species identification on images. So that's great
14:43
for well-known species, but once you get into the tail, where you're only going to get a few images, it gets harder and harder. But yes, we're really looking at deep learning and machine learning and trying to work out where we can get to. [Audience] It would be great, you know, for people snapping photos wondering what species it is. [Speaker] Yeah, well, check out iNaturalist; it's got a
15:05
really great species suggestion functionality, and we're creating an Australian node of iNaturalist building on that project. Cool, any questions? Okay, thanks very much.