
GeoMesa: Scalable Geospatial Analytics


Formal Metadata

Title
GeoMesa: Scalable Geospatial Analytics
Number of Parts
14
Author
Anthony Fox
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Language
English
Production Year
2014
Production Place
Washington, DC

Content Metadata

Abstract
The proliferation of smart phones with embedded geolocation sensors has led to an explosion of geospatial data in all domains. Every mobile app now asks users to enable location services and generates copious geotagged data. Existing solutions for managing this data rely on traditional geospatial RDBMS platforms. GeoMesa is an open source, scalable spatio-temporal index built on top of the Accumulo distributed column family database that provides efficient OGC standards-based access and query capabilities over very large datasets. GeoMesa provides WMS or WFS services over HTTP for data access as well as an API based on GeoTools. Spatial analytics in GeoMesa can leverage Hadoop to perform computations in parallel on a cloud. Sensitive personal information inherent in consumer geolocated data can be protected using Accumulo's cell-level security. This talk will cover the indexing structure in GeoMesa and how it enables scalable geospatial analytics in a cloud platform.

Transcript: English (auto-generated)

Hi everyone. My name is Anthony Fox. I work for a company out of Charlottesville, Virginia called Commonwealth Computer Research. And we developed GeoMesa, which we open-sourced about a year ago and connected with the LocationTech team. And now GeoMesa is under the LocationTech
umbrella. I'm going to talk today about GeoMesa, what it is, how it came to be. I'm also going to talk a little bit about distributed databases, which is what GeoMesa is. And I'm going to dive into how we do a little bit of our indexing to enable geospatial data in a non-relational database. And eventually I'm going to get to a few analytics
that I'll show how we leverage GeoMesa to build on top of. So first of all, what is GeoMesa? Many of you have either been directed by your customers to move to a cloud, or some of you have an actual justification and a need to move to a cloud. In our case, it's a little bit of both. We were directed to migrate an analytic that had intense computational requirements to a cloud-based system.
And the tools that we had come to rely on, the rich geospatial functionality that's available in PostGIS, was just not available to us on the cloud. So we developed enough of the geospatial capability that we needed to support our analytic, and then quickly
realized that it was useful on its own. So we decoupled it from the analytic, and we open sourced it. So GeoMesa is the result of that. It is a distributed spatiotemporal database. In particular, it's built on the Accumulo column family database, which I'm going to go into more details about. But more importantly, the goal of GeoMesa
is to be a runtime-only dependency of projects. So it implements the standard GeoTools data store API. And it also exposes data in these distributed databases via standardized OGC services that we expose through GeoServer plugins. The point being that you should never have to import a GeoMesa class into your application. You should just have to import the relevant GeoTools interfaces and work with those directly, and the geospatial computations are transparently executed on the cloud. Again, it's LocationTech open source.
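
As a rough sketch of what that looks like from application code (the connection parameter keys and the "gdelt" type name here are illustrative placeholders, not necessarily the exact names GeoMesa expects):

```scala
import org.geotools.data.DataStoreFinder
import org.geotools.filter.text.ecql.ECQL

object GeoToolsClientSketch extends App {
  // Connection parameters for a GeoMesa Accumulo data store; the key names
  // are assumptions for illustration, not a definitive list.
  val params = new java.util.HashMap[String, java.io.Serializable]()
  params.put("instanceId", "myCloud")
  params.put("zookeepers", "zoo1,zoo2,zoo3")
  params.put("user", "user")
  params.put("password", "secret")
  params.put("tableName", "geomesa_catalog")

  // From here on, only GeoTools interfaces appear; no GeoMesa classes.
  val dataStore = DataStoreFinder.getDataStore(params)

  // A spatio-temporal predicate in (E)CQL; the scan itself runs on the cloud.
  val filter = ECQL.toFilter(
    "BBOX(geom, -78.6, 37.9, -78.3, 38.2) AND " +
    "dtg DURING 2014-01-01T00:00:00Z/2014-01-31T23:59:59Z")

  val features = dataStore.getFeatureSource("gdelt").getFeatures(filter).features()
  while (features.hasNext) println(features.next().getID)
  features.close()
}
```
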
Some of the visualizations you're seeing here are associated with the GDELT dataset. GDELT is the Global Database of Events, Language, and Tone. It's put out by UPenn and UT Austin, I believe. And it's 250 million geocoded events since 1979, about 100 gigs of uncompressed data. And just to give you some performance numbers, we can ingest that data into the system on a small virtualized cloud in approximately 15 minutes. And then we can do analytics against it. These are visualizations against that dataset. So a little bit of justification for cloud-based geo. Why would you need this? Well, you don't have to look very far to see high-velocity spatiotemporal data. Twitter does about 100,000 to 150,000 tweets per second.
A small percentage of those are geotagged, but a small percentage of a very large number is still a reasonably large number. Foursquare claims to do 1 million check-ins a day. Everybody's doing geolocated clickstream, so marketers and advertisers are very interested in geolocated clickstreams. I'm going to show you how you can actually correlate
geolocated clickstreams via an analytic that we've developed. Satellite imagery, pretty obvious. Vehicle and traffic sensors, so FAA and car traffic sensors, generate tons of high-velocity data. So you have a need for a distributed
database because you have more data than can fit on a single machine's disk or than can be processed within a single shared memory space. So the customer dictates that you use a column family database. Google published a paper in about 2006 or 2007 on a system called Bigtable, and quickly after that, a number of derivative databases started to crop up. Amongst those derivative databases are HBase, which is built directly on the Bigtable model, Cassandra, and Accumulo. Accumulo focuses on security, so it does high-resolution cell-level security, which we're starting to support in GeoMesa.
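
To make the cell-level security point concrete, here is a minimal sketch of writing a value to Accumulo with a visibility expression attached to the cell (the instance, table, and label names are made up for illustration; GeoMesa normally handles the writing for you):

```scala
import org.apache.accumulo.core.client.{BatchWriterConfig, ZooKeeperInstance}
import org.apache.accumulo.core.client.security.tokens.PasswordToken
import org.apache.accumulo.core.data.{Mutation, Value}
import org.apache.accumulo.core.security.ColumnVisibility
import org.apache.hadoop.io.Text

object CellSecuritySketch extends App {
  val connector = new ZooKeeperInstance("myCloud", "zoo1,zoo2,zoo3")
    .getConnector("user", new PasswordToken("secret"))

  val writer = connector.createBatchWriter("checkins", new BatchWriterConfig)

  // Each cell carries its own visibility expression; only scanners whose
  // authorizations satisfy "geo&analyst" will ever see this value.
  val mutation = new Mutation(new Text("row-0001"))
  mutation.put(new Text("attr"), new Text("location"), new ColumnVisibility("geo&analyst"),
    new Value("POINT(-77.03 38.89)".getBytes("UTF-8")))
  writer.addMutation(mutation)
  writer.close()
}
```
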
These distributed databases, particularly these key-value stores, have very flexible schemas, so it's easy to get up and going with them, but they're not schema-less. Your application always imposes a schema on the data. You have to understand your data and you have to do something with it. What this means is that the complexity of query planning is pushed up into your application layer, so you have to make some interesting trade-offs where you have a potentially multi-tenant database and each application is dictating how it does scans and how it uses resources. So you have to balance flexible schemas against potentially conflicting data access patterns.
It's horizontally scalable. This is a very nice feature, and I'm going to talk about how we leverage that. In particular, Accumulo has the notion of tablet servers. A single table is spread across many tablet servers, and we can actually take advantage of the processing and I/O bandwidth of multiple tablet servers. Accumulo will handle failover and balancing and rebalancing and all the nitty-gritty distributed complexity. But here's another trade-off: in PostGIS, you have very nice and sophisticated R-trees and quad-trees and your traditional spatial indexes, which work incredibly well on reasonably sized datasets. Well, when you go to one of these column family databases, you don't have many indexes. You have to design your table itself so that it acts as your index. The only index that's actually available to you is an implicit lexicographic ordering of the keys in your key-value store.
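
In other words, the row key is the index. A toy sketch of the consequence, using a plain sorted map to stand in for the key-value store (the key layout is invented for illustration):

```scala
import scala.collection.immutable.TreeMap

object KeyOrderSketch extends App {
  // A store like Accumulo keeps keys in lexicographic order. If the row key
  // starts with a spatial code (here a geohash), everything in one cell is a
  // contiguous block of keys, and a "spatial query" is just a range scan.
  val store = TreeMap(
    "dqcjq_2014-05-01T12:00_event42" -> "...",
    "dqcjq_2014-05-02T08:30_event57" -> "...",
    "dqcjr_2014-05-01T09:15_event99" -> "...",
    "9q8yy_2014-05-03T10:00_event13" -> "..."
  )

  // Everything inside the geohash cell "dqcjq": one contiguous key range.
  store.range("dqcjq", "dqcjr").keys.foreach(println)
}
```
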
So this is a critical element that has to be thought through very carefully when pushing geospatial data, because obviously geospatial data is multi-dimensional. When you add time, you're looking at three-dimensional data. So how do we go about leveraging distributed databases for the qualities they have that are advantageous? Well, we can do partitioning very easily. Partitioning almost comes for free, whereas with relational databases you have to do a lot of work at the application layer to do partitioning. With distributed databases, partitioning is effectively given to you, and we're able to distribute queries across multiple machines. So these are concurrent queries by different applications or users, and they can be spread or load-balanced across these resources. We use striping to distribute computations within a single query. So a single query can actually be split up across all of the tablet servers. So if you have 100 tablet servers, then you can potentially have 100 processors executing your query. This makes some of the very large datasets operational and interactive. The trade-off you're making with that is that when you stripe the data across all of your resources, you incur a cost when you have concurrent queries from many users, because all of your resources are being brought to bear on every single query. So we have a flexible element of the indexing structure that allows you to tune the amount of parallelism you're going to get per table.
Accumulo has an extension point called server-side iterators, which we leverage in GeoMesa to provide geospatial and CQL and ECQL query semantics at the data, rather than as secondary scans. We can also embed custom analytics, and I'm going to show some examples of that, inside these server-side iterators, so they essentially become ad hoc, interactive, MapReduce-like computations. That's the last bullet point.
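
For a sense of what that extension point looks like, here is a bare-bones Accumulo filtering iterator. This is not GeoMesa's implementation; the bounding-box test and key decoding are schematic stand-ins for the CQL evaluation GeoMesa does server-side:

```scala
import org.apache.accumulo.core.data.{Key, Value}
import org.apache.accumulo.core.iterators.Filter

// Runs inside each tablet server, right next to the data: entries that fail
// the predicate are dropped before anything crosses the network.
class BBoxFilterSketch extends Filter {

  override def accept(k: Key, v: Value): Boolean = {
    val (lon, lat) = decodeLonLat(k)
    lon >= -78.6 && lon <= -78.3 && lat >= 37.9 && lat <= 38.2
  }

  // Placeholder: a real iterator would decode the geometry from the key or
  // value, and would read the query window from iterator options at scan time
  // instead of hard-coding it.
  private def decodeLonLat(k: Key): (Double, Double) = (0.0, 0.0)
}
```
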
Okay, so I only have about five minutes left, so I'm going to go through this really quickly at this point. The key to working with key-value data stores that have an implicit lexicographic ordering is to use space-filling curves. Space-filling curves are a one-dimensional data structure that allows you to project multiple dimensions into a linear space. I have a nice little video here that took me a while, so I'm going to play this one real quick. This shows how the space-filling curve fills up the space that we're interested in filling. Obviously, the z-axis is time, but what you see in that second front bar right there is that striping. Each portion of the data is striped across all of the tablets in a structured way to process these queries. This is how complex polygons and line strings are handled: you essentially decompose the polygon into multi-resolution geohashes. Geohash is the implementation of space-filling curves that we're using; it's a z-order curve.
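
The underlying trick is bit interleaving. A toy sketch of a z-order value (real geohashes encode the interleaved bits as base-32 characters, and GeoMesa also folds time into the key, but the idea is the same):

```scala
object ZOrderSketch extends App {
  // Map lon/lat to fixed-width integer cells, then interleave their bits.
  // Nearby points tend to share a long common prefix, so they sort near each
  // other in the lexicographic key space of the underlying key-value store.
  def zValue(lon: Double, lat: Double, bitsPerDim: Int = 16): Long = {
    val x = ((lon + 180.0) / 360.0 * (1L << bitsPerDim)).toLong min ((1L << bitsPerDim) - 1)
    val y = ((lat +  90.0) / 180.0 * (1L << bitsPerDim)).toLong min ((1L << bitsPerDim) - 1)
    (0 until bitsPerDim).foldLeft(0L) { (acc, i) =>
      acc | (((x >> i) & 1L) << (2 * i + 1)) | (((y >> i) & 1L) << (2 * i))
    }
  }

  // Two nearby points in Washington, DC share most of their leading bits ...
  println(zValue(-77.036, 38.897).toBinaryString)
  println(zValue(-77.009, 38.890).toBinaryString)
  // ... while a point in Denver diverges much earlier.
  println(zValue(-104.99, 39.74).toBinaryString)
}
```
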
Then you store each decomposed geohash as an element in your index space. Query planning amounts to computing the stripes that are candidates for results of your query.
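
A sketch of that planning step for a simple bounding box, reusing the same z-order idea: enumerate the coarse cells that cover the query window and merge adjacent cell values into contiguous scan ranges. A real planner also works across multiple resolutions and decomposes arbitrary polygons, which this toy version skips:

```scala
object QueryPlanSketch extends App {
  val bits = 8  // a coarse resolution: 2^8 cells per dimension

  private def cell(v: Double, min: Double, max: Double): Long =
    (((v - min) / (max - min)) * (1L << bits)).toLong min ((1L << bits) - 1)

  private def interleave(x: Long, y: Long): Long =
    (0 until bits).foldLeft(0L) { (acc, i) =>
      acc | (((x >> i) & 1L) << (2 * i + 1)) | (((y >> i) & 1L) << (2 * i))
    }

  // All z-values whose cells intersect the box, merged into contiguous
  // (start, end) ranges that could be handed to the scanner as stripes.
  def planRanges(lonMin: Double, latMin: Double, lonMax: Double, latMax: Double): Seq[(Long, Long)] = {
    val zs = for {
      x <- cell(lonMin, -180, 180) to cell(lonMax, -180, 180)
      y <- cell(latMin,  -90,  90) to cell(latMax,  -90,  90)
    } yield interleave(x, y)

    zs.sorted.foldLeft(List.empty[(Long, Long)]) {
      case ((start, end) :: tail, z) if z == end + 1 => (start, z) :: tail // extend current range
      case (ranges, z)                               => (z, z) :: ranges   // start a new range
    }.reverse
  }

  // Candidate scan ranges for a box around the Washington, DC area.
  planRanges(-78.0, 38.0, -76.0, 40.0).foreach(println)
}
```
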
Five minutes left, so I'm going to get to analytics real quick. The analytics that we've developed, we've implemented them all as WPS services. They're deployed in GeoServer, so they're discoverable. One of the analytics in particular, this one, is a spatiotemporal prediction analytic. It predicts events in space and time. One of the interesting things about this analytic is that this is Santiago, Chile, and these are robbery events in Santiago; we're predicting robbery events. Actually, it's hard to see, but just in that little area below the red, below the seven, I believe, is a stadium. We recently transitioned from a linear model to a non-linear model that's able to detect that the threat from robberies is higher around the stadium than inside of the stadium, which was kind of exciting. All of that is implemented as server-side iterators.
Any type of associative computation can be cast as a MapReduce job. This isn't native MapReduce; it's MapReduce implemented within iterators. For instance, density computations can be done very rapidly by computing a sparse density matrix or any kind of transformation matrix. Within each tablet server, you have a hundred workers that are brought to bear on your computation, and the reduce side applies the associative operation, which in the case of densities is just a summation. But you can express many different types of computations this way, which is kind of nice.
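
A small sketch of that density idea: each tablet-side worker emits a sparse map of grid-cell counts for the slice of data it scans, and because summation is associative, the partial maps can be merged in any order on the reduce side (the grid resolution and key layout here are arbitrary):

```scala
object DensitySketch extends App {
  type SparseDensity = Map[(Int, Int), Long]   // (gridX, gridY) -> count

  val gridSize = 256

  // Map side (conceptually inside a server-side iterator): bin each point
  // scanned on this tablet into a coarse grid cell.
  def partialDensity(points: Seq[(Double, Double)]): SparseDensity =
    points
      .map { case (lon, lat) =>
        (((lon + 180) / 360 * gridSize).toInt, ((lat + 90) / 180 * gridSize).toInt)
      }
      .groupBy(identity)
      .map { case (cell, hits) => cell -> hits.size.toLong }

  // Reduce side: summation is associative and commutative, so partial maps
  // from different tablet servers can be combined in any order.
  def merge(a: SparseDensity, b: SparseDensity): SparseDensity =
    b.foldLeft(a) { case (acc, (cell, n)) => acc.updated(cell, acc.getOrElse(cell, 0L) + n) }

  val fromTablet1 = partialDensity(Seq((-77.03, 38.89), (-77.02, 38.90)))
  val fromTablet2 = partialDensity(Seq((-77.03, 38.89), (-118.24, 34.05)))
  println(merge(fromTablet1, fromTablet2))
}
```
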
Another interesting analytic that we've developed as a WPS, implemented within the server-side iterators, is this interpolated time-space query. Basically, you would like to see who you might have interacted with on a trip that you've made. It's time interpolation as well as space interpolation. We like to think of it as tweeting the New Jersey Turnpike: you have a person who is tweeting as they travel on the New Jersey Turnpike, and you would like to know who might be on the Megabus with them or who they might have stopped and interacted with. So you have to do a complex series of queries that shift time through each gap, and you have to interpolate a track based on the road network or some other kind of underlying layer. So possible interactions are the result of this analytic, executed via WPS, and with gap-filling and snap-to-road tracks, we can get more information.
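
A rough sketch of the interpolation part: given sparse track points, fill the time gaps by linear interpolation and turn each interpolated fix into a small space-time window that could be issued as a query. This uses straight-line interpolation instead of snapping to a road network, and the buffer sizes are arbitrary:

```scala
object TrackQuerySketch extends App {
  case class Fix(lon: Double, lat: Double, millis: Long)
  case class SpaceTimeWindow(lonMin: Double, latMin: Double, lonMax: Double, latMax: Double,
                             startMillis: Long, endMillis: Long)

  // Linearly interpolate positions every stepMillis between consecutive fixes,
  // then buffer each interpolated position in space and in time.
  def windows(track: Seq[Fix], stepMillis: Long,
              bufferDeg: Double, bufferMillis: Long): Seq[SpaceTimeWindow] =
    track.sliding(2).flatMap {
      case Seq(a, b) =>
        (a.millis until b.millis by stepMillis).map { t =>
          val f   = (t - a.millis).toDouble / (b.millis - a.millis)
          val lon = a.lon + f * (b.lon - a.lon)
          val lat = a.lat + f * (b.lat - a.lat)
          SpaceTimeWindow(lon - bufferDeg, lat - bufferDeg, lon + bufferDeg, lat + bufferDeg,
                          t - bufferMillis, t + bufferMillis)
        }
      case _ => Seq.empty
    }.toSeq

  // Two tweets an hour apart on the turnpike: the windows in between are where
  // we would look for possible interactions.
  val track = Seq(Fix(-74.55, 40.35, 0L), Fix(-74.17, 40.70, 3600000L))
  windows(track, stepMillis = 600000L, bufferDeg = 0.01, bufferMillis = 300000L).foreach(println)
}
```
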
A quick roadmap about GeoMesa: we're pushing for a 1.0 release in June. We are implementing full PKI-based authentication and authorization so that we integrate with the cell-level security of Accumulo. We have already implemented Avro binary encoding and relational projections that allow us to subset the data and rapidly return just what we need for each query.
In the fall, we're looking at integrating deeply with GeoServer and the Hadoop ecosystem. So WPSs will be executed across a variety of compute paradigms like Storm and MapReduce and Spark,
and tile caching can be pushed into HDFS or within Accumulo. And we're also looking at doing data and query statistics and how those might improve our query planning performance. Quick couple of links here.
The LocationTech.org website lists some information about GeoMesa. We have a GeoMesa.org website that has tutorials and other demonstrations, and there's a user's mailing list and a dev mailing list. And I'm happy to take any questions. I don't know if I have any time.