GeoMesa: Scalable Geospatial Analytics
This is a modal window.
The media could not be loaded, either because the server or network failed or because the format is not supported.
Formal Metadata
Title |
| |
Title of Series | ||
Number of Parts | 14 | |
Author | ||
License | CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor. | |
Identifiers | 10.5446/15339 (DOI) | |
Publisher | ||
Release Date | ||
Language | ||
Production Year | 2014 | |
Production Place | Washington, DC |
Content Metadata
Subject Area | ||
Genre | ||
Abstract |
|
LocationTech Summit 20145 / 14
9
12
00:00
Zeitliches DatenbanksystemStandard deviationOpen sourceFunctional (mathematics)Event horizonGeometryPoint (geometry)Point cloudSocial classData compressionVisualization (computer graphics)BitResultantOpen sourceStandard deviationData storage deviceFamilyPhysical systemCartesian coordinate systemInterface (computing)DatabasePlug-in (computing)Projective planeUniform resource locatorService (economics)Analytic setCloud computingNeuroinformatikMereologyAudiovisualisierungNumberWordComputer scientistMultiplication signCASE <Informatik>Natural languageRelational database
03:24
TwitterSatelliteReal numberInformationVelocityDistribution (mathematics)Numbering schemeQuery languagePlanningAnwendungsschichtScalabilityKey (cryptography)Price indexDatabaseServer (computing)SpacetimeCurveServer (computing)MultiplicationTablet computerData structureKolmogorov complexityChemical equationDimensional analysisTable (information)Subject indexingImage resolutionQuery languageInformation securityParallel portEndliche ModelltheorieNumbering schemePoint (geometry)Order (biology)Distribution (mathematics)Term (mathematics)DatabaseSpacetimeDerivation (linguistics)Vector spaceNumberPort scannerPower (physics)Data storage deviceSemiconductor memoryDistributed computingRelational databaseMiniDiscKey (cryptography)Physical systemCartesian coordinate systemCoprocessorWeb serviceInteractive televisionNeuroinformatikElectric generatorProcess (computing)Streaming mediaSingle-precision floating-point formatSemantics (computer science)FamilyVirtual machineMathematicsStructural loadExtension (kinesiology)Element (mathematics)SatelliteSmoothingThree-dimensional spaceMobile WebConcurrency (computer science)Different (Kate Ryan album)Ocean currentVideoconferencingMultiplication signLevel (video gaming)Analytic setMagnetic stripe card2 (number)Operator (mathematics)TwitterSet (mathematics)PlanningSoftware developerBand matrixPartition (number theory)Space-filling curvePattern languageShared memoryComputer animationProgram flowchart
10:43
SpacetimeCurveMultiplicationImplementationElement (mathematics)Kolmogorov complexityPolygonOrder (biology)Subject indexingSpacetimeGeometryImage resolutionSpring (hydrology)Hash functionLine (geometry)Query languageSpace-filling curveString (computer science)Diagram
11:09
Query languagePlanningServer (computing)Process (computing)Web serviceWorld Wide Web ConsortiumGeometryPartition (number theory)Magnetic stripe cardDensity matrixSparse matrixAssociative propertyMilitary operationClient (computing)SpacetimeInterpolationData bufferInformation securityDisintegrationEndliche ModelltheorieAuthenticationAuthorizationCodierung <Programmierung>Binary fileAxonometric projectionSubject indexingStatisticsSuite (music)Parallel portLevel (video gaming)Electronic mailing listLink (knot theory)Query languageResultantMultiplication signMagnetic stripe cardNeuroinformatikComputer fileInterpolationAnalytic setEvent horizonCASE <Informatik>Instance (computer science)SpacetimeSoftwarePopulation densityWebsiteAssociative propertyProgramming paradigmInformationReduction of orderKolmogorov complexityElectronic mailing listInformation securityPredictabilityServer (computing)Series (mathematics)GeometryDensity matrixOperator (mathematics)Endliche ModelltheorieAlgebraWeb serviceSubsetType theoryState of matterTablet computerUniform resource locatorLink (knot theory)Service (economics)Interactive televisionTransformation (genetics)AreaGroup actionFIESTA <Programm>AuthorizationIterationEmailDifferent (Kate Ryan album)AuthenticationProcess (computing)Binary fileDarstellungsmatrixCache (computing)Projective planeTheory of relativityVariety (linguistics)Linear regressionTrailExpressionSparse matrixTesselationStatistics
Transcript: English(auto-generated)
00:00
Hi everyone. My name is Anthony Fox. I work for a company out of Charlottesville, Virginia called Commonwealth Computer Research. And we developed GeoMesa, which we open-sourced about a year ago and connected with the LocationTech team. And now GeoMesa is under the LocationTech
00:20
umbrella. I'm going to talk today about GeoMesa, what it is, how it came to be. I'm also going to talk a little bit about distributed databases, which is what GeoMesa is. And I'm going to dive into how we do a little bit of our indexing to enable geospatial data in a non-relational database. And eventually I'm going to get to a few analytics
00:46
that I'll show how we leverage GeoMesa to build on top of. So first of all, what is GeoMesa? Many of you have either been, by your customers,
01:02
been dictated to move to a cloud, or some of you have an actual justification and a need to move to a cloud. In our case, it's a little bit of both. We were directed to migrate an analytic that had intense computational requirements to a cloud-based system.
01:23
And the tools that we had come to rely on, the rich geospatial functionality that's available in PostGIS, was just not available to us on the cloud. So we developed enough of the geospatial capability that we needed to support our analytic, and then quickly
01:42
realized that it was useful on its own. So we decoupled it from the analytic, and we open sourced it. So GeoMesa is the result of that. It is a distributed spatiotemporal database. In particular, it's built on the Accumulo column family database, which I'm going to go into more details about. But more importantly, the goal of GeoMesa
02:02
is to be a runtime-only dependency of projects. So it implements all the standard GeoTools data store API. And it also exposes data in these distributed databases via standardized services like OGC that we expose via GeoServer plugins.
02:24
The point being that you should never have to import a GeoMesa class into your application. You should just have to import the relevant GeoTools interfaces and work with those directly. And the geospatial computations are transparently executed on the cloud. Again, it's location tech open source. Some of the visualizations
02:46
you're seeing here are associated with the GDELT dataset. GDELT is the global database of events, language, and tone. It's put out by UPenn and UT Austin, I believe. And it's 250 million geocoded events since 1979, about 100 gigs of
03:07
uncompressed data. And just to give you some performance numbers, we can ingest that data into the system on a small virtualized cloud in approximately 15 minutes. And then we can do analytics against them. And these are visualizations against
03:21
that dataset. So a little bit of justification about cloud-based geo. Why would you need this? Well, you don't have to look very far to see high-velocity spatiotemporal data. Twitter does about 100 to 150,000 tweets per second.
03:43
A small percentage of those are geotagged, but a small percentage of a very large number is still a reasonably large number. Foursquare claims to do 1 million check-ins a day. Everybody's doing geolocated clickstream, so marketers and advertisers are very interested in geolocated clickstreams. I'm going to show you how you can actually correlate
04:04
geolocated clickstreams via an analytic that we've developed. Satellite imagery, pretty obvious. Vehicle and traffic sensors, so FAA and car traffic sensors, generate tons of high-velocity data. So you have a need for a distributed
04:25
database because you have more data that can fit on a single machine's disk or can be processed within a single shared memory space. So the customer dictates that you use a column family database. So Google published a paper in about
04:44
2006 or 2007 on a system called Bigtable, and quickly after that, a number of derivative databases started to crop up. Amongst those derivative databases are HBase. It's built directly on the Bigtable model.
05:02
Cassandra and Accumulo. Accumulo focuses on security, so it does high-resolution cell-level security, which we're starting to support in GeoMesa. These distributed databases, particularly these key-value stores, have
05:23
very flexible schemas, so it's easy to get up and going with them, but it's not schema-less. Your application always imposes a schema on the data. You have to understand your data and you have to do something with it. What this means is that the complexity of query planning is pushed up into your application layer, so you have to make these
05:43
interesting trade-offs where you have potentially a multi-tenancy database and each application is dictating how they're doing scans and how they're using resources. So you have to balance between flexible schemas and potentially conflicting data access patterns. It's horizontally scalable. This is a very nice
06:07
feature, and I'm going to talk about how we leverage that. In particular, Accumulo has the notion of tablet servers. A single table is spread across many tablet servers, and we can actually take advantage of the processing and I.O.
06:25
bandwidth of multiple tablet servers. Accumulo will handle failover and balancing and rebalancing and all the nitty-gritty distributed complexity. But here's another trade-off, which is that in PostGIS, you have very nice
06:44
and sophisticated R-trees and quad-trees and your traditional spatial indexes, which work incredibly well on relatively sized datasets. Well, when you go to one of these column family databases, you don't have many
07:03
indexes. You have to design your table such that your table has a secondary index. The only index that's actually available to you is an implicit lexicographic ordering of the keys in your key value store. So this is a critical element that has to be thought through very carefully when pushing
07:22
geospatial data, because obviously geospatial data is multi-dimensional. When you add time, you're looking at three-dimensional data. So how do we go about leveraging distributed databases for the qualities that they have that are advantageous? Well, we can do partitioning very easily. Partitioning
07:43
almost comes for free, whereas with relational databases you have to do a lot of work at the application layer to do partitioning. With distributed databases, partitioning is effectively given to you, and we're able to distribute queries across multiple machines. So these are concurrent queries by different
08:04
applications or users, and they can be spread or load-balanced across these resources. We use striping to distribute computations within a single query. So a single query, we can actually split it up into all of the tablet servers. So if you have 100 tablet servers,
08:22
then you can potentially have 100 processors that are executing on your query. So this makes some of the very large data sets operational and interactive. The trade-off that you're making with that is that when you stripe the data
08:43
across all of your resources, you incur a cost when you have concurrent queries for many users, because all of your resources are being brought to bear on every single query. So we have a flexible element of the indexing structure that allows you to tune the amount
09:03
of parallelism you're going to get per table. Accumulo has an extension point called server-side iterators, which we leverage in GeoMESA to provide geospatial and CQL and eCQL query semantics at the data, rather
09:23
than as secondary scans. We can also embed custom analytics, I'm going to show some examples of that, inside these server-side iterators, so they essentially become ad hoc and interactive, MapReduce-like computations. That's the last bullet point. Okay, so I only have
09:42
about five minutes left, so I'm going to go through this really quickly at this point. The key to working with key-value data stores that have an implicit lexicographic ordering is to use space-filling curves. Space-filling curves are a one-dimensional data structure that allows you to project multi-dimensions
10:02
into a linear space. I have a nice little video here that took me a while, so I'm going to play this one real quick. This shows how the space-filling curve fills up the space that we're interested in filling. Obviously, the z-axis is time, but what you see in that second
10:22
front bar right there is that striping. Each portion of the data is striped across all of the tablets in a structured way to process these
10:41
queries. This is complex polygons and line strings. You essentially decompose the polygon into multi-resolution geo-hashes. Geo-hash is the implementation of space-filling curves that we're using. It's a z-order curve.
11:01
Then you store each decomposed geo-hash as an element in your index space. Query planning amounts to computing the stripes that are candidates for results of your query. Five minutes left, so I'm going to get to analytics real quick.
11:23
The analytics that we've developed, we've implemented them all as WPS services. They're deployed in GeoServer, so they're discoverable. Some of the analytics in particular, this one is a spatiotemporal prediction analytic. It predicts events in space and time. One of the interesting things about this analytic is
11:42
that it's Santiago, Chile, and it's robbery events in Santiago. We're predicting robbery events. Actually, it's hard to see, but just in that little area below the red, below the seven, I believe, is a stadium. We recently
12:02
transitioned from a linear model to a non-linear model that's able to detect that the threat from robberies is higher around the stadium than inside of the stadium, which was kind of exciting. All of that's implemented as server-side iterators in this. Any type of
12:23
associative computation can be cast as a MapReduce job. This isn't native MapReduce. It's MapReduce implemented within iterators. For instance, density computations can be done very rapidly by computing a sparse density matrix or any kind of transformation
12:43
matrix. Within each tablet server, you have a hundred workers that are brought to bear on your computation, and the reduce side applies the associative operation, which in the case of densities is just a summation. But you can express many different types of
13:03
computations this way, which is kind of nice. Another interesting analytic that we've developed as a WPS that's implemented within the server-side iterators is this interpolated time-space query. Basically, you would like to see who you might have interacted with on a trip that you've made.
13:23
It's time interpolation as well as space interpolation. We like to think of it as tweeting the New Jersey Turnpike, so you have a person that is tweeting as they're traveling on the New Jersey Turnpike, and you would like to know who might be on the megabus with them
13:41
or who they might have stopped and interacted with. So you have to do a complex series of queries that shift time through each gap, and you have to interpolate a track based on the road network or some other kind of underlying layer. So
14:01
possible interactions is the result of this analytic that's executed by WPS, and with gap-filling and snap-to-road tracks, we can get more information. A quick roadmap about GeoMesa. We're pushing for a 1.0
14:20
release in June. We are implementing full PKI-based authentication and authorization so that we integrate with the cell-level security of Accumulo. We have already implemented Avro binary encoding and relational projections that allow us to subset the data and rapidly return just what we need for each query.
14:41
In the fall, we're looking at integrating deeply with GeoServer and the Hadoop ecosystem. So WPSs will be executed across a variety of compute paradigms like Storm and MapReduce and Spark,
15:01
and tile caching can be pushed into HDFS or within Accumulo. And we're also looking at doing data and query statistics and how those might improve our query planning performance. Quick couple of links here.
15:22
The LocationTech.org website lists some information about GeoMesa. We have a GeoMesa.org website that has tutorials and other demonstrations, and there's a user's mailing list and a dev mailing list. And I'm happy to take any questions. I don't know if I have any time.