Accelerating GeoSpatial Data Analytics With Pivotal Greenplum Database
Seoul, South Korea

As a typical big data application, geospatial analysis has been receiving extensive attention from both academia and industry. As they collect massive volumes of geospatial data, more and more manufacturers and research institutions find that analysis of geospatial data on existing legacy architectures does not scale. The reason is typically twofold. On the one hand, extending traditional databases to support modern, complex geospatial data analytics is rather challenging. On the other hand, integrating emerging techniques from other big data applications into traditional databases may suffer from compatibility issues, resulting in poor performance or even painful debugging tasks. Specifically, most of today's general-purpose relational databases (e.g., Oracle and Microsoft SQL Server, together with their geospatial components) are designed primarily as OLTP systems. Their shared-disk or shared-everything architectures are optimized for high-throughput transaction execution at the expense of analytical query performance. In contrast to the existing relational database systems, Pivotal offers the Greenplum Database (GPDB), an extensible relational database platform that uses a shared-nothing, massively parallel processing (MPP) architecture to vastly accelerate online analytical processing (OLAP) over geospatial big data. Even better, GPDB can seamlessly integrate in-database analytical processing with our extended analytics stacks, such as heterogeneous Hadoop environments and in-memory data grids. Recent reports from Gartner scored Pivotal GPDB highly on data warehousing and analytics. We design and develop geospatial analytics toolkits on GPDB in terms of three aspects. First, we migrate the latest PostGIS project into GPDB so that GPDB is able to run as a spatial database system for regular GIS users.
Second, we extend the spatial component with various types of advanced geospatial functions, such as geospatial group-by, similarity search, and network-constrained scenarios. Third, we are working to support associative retrieval of data across geospatial and other data domains, i.e., queries involving both geospatial information and non-spatial information, such as RDF (known as GeoSPARQL queries), text (known as spatial keyword search), and time (known as trajectory search). Above all, we aim to engage the full breadth of big data developers in geospatial analytics. This talk will briefly introduce (1) the architecture of Pivotal GPDB, which provides automatic high-performance parallelization of geospatial data loading and data processing, (2) GPDB's extensive and growing library of in-database geospatial analytic functions, and (3) the capability to build a comprehensive geospatial data analytics platform around Pivotal GPDB. I will provide examples of how data science teams can transform billions of geo-tagged customer records to tackle the real-world problem of identity resolution in one minute. I will also discuss our plan to make Pivotal Greenplum Database open source in the coming quarters.
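To make the shared-nothing idea in the abstract concrete, here is a minimal Python sketch of a geospatial group-by run in scatter/gather style: rows are distributed across segments by a key, each segment aggregates only its own rows, and a coordinator merges the partial results. This is an illustration under stated assumptions, not GPDB's implementation: the function names, the modulo distribution rule, the number of segments, and the one-degree grid cell are all hypothetical.

```python
import math
from collections import Counter
from typing import Iterable, List, Tuple

Point = Tuple[int, float, float]  # (record id, longitude, latitude)

NUM_SEGMENTS = 4     # hypothetical number of segments
CELL_SIZE_DEG = 1.0  # hypothetical grid cell size, in degrees


def distribute(points: Iterable[Point], num_segments: int) -> List[List[Point]]:
    """Assign every row to exactly one segment by its key (here: id modulo N)."""
    segments: List[List[Point]] = [[] for _ in range(num_segments)]
    for point in points:
        segments[point[0] % num_segments].append(point)
    return segments


def grid_cell(lon: float, lat: float, size: float) -> Tuple[int, int]:
    """Map a coordinate to the integer index of the grid cell containing it."""
    return (math.floor(lon / size), math.floor(lat / size))


def segment_group_by(rows: List[Point], size: float) -> Counter:
    """Partial aggregate computed by one segment over its local rows only."""
    return Counter(grid_cell(lon, lat, size) for _, lon, lat in rows)


def parallel_group_by(points: Iterable[Point],
                      num_segments: int = NUM_SEGMENTS,
                      size: float = CELL_SIZE_DEG) -> Counter:
    """Scatter rows to segments, aggregate locally, then merge the partials."""
    segments = distribute(points, num_segments)
    # In a real MPP system these partial aggregates would run concurrently on
    # separate hosts; a plain loop keeps the sketch self-contained.
    partials = [segment_group_by(rows, size) for rows in segments]
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts per key
    return total
```

For example, calling `parallel_group_by` on two points near Seoul and one near Beijing yields a count of two for the Seoul cell and one for the Beijing cell, regardless of which segments the rows landed on. The point of the design is that no segment needs to see another segment's rows until the final, much smaller merge step.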
is some new cases yeah that's within and question so what we have few so you would you

OK, thank you very much, that was right on time. Are there any questions from the floor?

1st in training on, and I've got a couple. The polymorphic storage is really interesting, right, and you're combining the row store and the column store in the same implementation. What kind of queries and what kind of geospatial data does that really perform well on, from your experience?

So, data partitions?

No, you were talking about what early on your trying but article NoSQL, but I was asking particularly about the row store versus the column store, which you get in a single polymorphic storage, which I guess you implement together, and depending on the query or depending on the data you decide which way to go. How does it work exactly?

of this the way I given that develop a lot of implication tool a Beijing intensity and they each day to do with Janero Boletín of 5 you will hundreds of megabases IA will crater to it a separate table for each day and then fullerene for for the nearest of a mean 4 than and they have will still Uriel and therefore the sum of good code a bit how we'll just push it a poor profiles of it because you know that the database uh and uh 60 464 evoking a basis of the best the size of full of relational database you full of Saturday's Toby unit will partitioning through a small piece

Yeah, and one last thing I was going to ask, which you answered at the end, is about the in-memory storage you're working on. So most of the work you're doing is just in memory, or you're running up this is that, and what was the speed difference?

this uh ethical about pool parcel of free memory wiser than and why is that gym of fire wiser spatial if table that yeah a sphere is a good question because we can't store the height in the memory that so we no way considered will go dope the solution at least where a lot of the programmer will specify which data after the beep p in the in their memory if so you'll you'll that is if you come and kid it's depend on the program but you know that the programmer is the which kind there tussling with that so we consider tool use some uh catches strategy it will be due because the later in what

If there's nothing else, then I'd like to thank you again for your talk.