Trajectory: a novel geospatial data model of Pivotal Greenplum database - TIB AV-Portal

Formal Metadata

Title	Trajectory: a novel geospatial data model of Pivotal Greenplum database
Alternative Title	Trajectory: a novel geospatial data model of GPDB
Title of Series	FOSDEM 2016
Part Number	7
Number of Parts	110
License	CC Attribution 2.0 Belgium: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers	10.5446/30918 (DOI)

Content Metadata

Subject Area	Computer Science
Genre	Conference/Talk


Transcript: English (auto-generated)

00:05

Hello, everyone. It's my pleasure to introduce our work on this database. It is very similar to the last talk: we are trying to bring geospatial data management into the database. In our case it is a traditional database, Greenplum.

00:22

Greenplum comes from Pivotal. In this talk, I will first give a quick overview of the database. Then I'd like to introduce our plans for geospatial support, especially for trajectories. I want to introduce the database a little first, because it is related to our design of trajectory.

00:49

The Pivotal people really like open source. We have released many products under the Apache and BSD licenses.

01:00

For example, some of you may have used Redis and Spring. We have now released several other data management systems. If you want to do real-time processing, you can use GemFire. If you want to run interactive queries, you can use Greenplum and HAWQ.

01:23

And if you want to run batch jobs, Greenplum is another choice. Okay. In our database suite, Greenplum has been in use for more than 10 years.

01:41

And now more than once. This database is usually used in enterprises. It was open-sourced five months ago, and you can now find it on GitHub. I recently looked at the development activity of this database and found many commits.

02:03

We get about 20 to 40 commits every week, so it is very active. If you are interested in this database, you can start with the sandbox, which you can download from the website.

02:25

We also provide a tutorial that you can follow step by step.

02:41

Actually, we have made a lot of changes. More than 10 years ago, we developed Greenplum based on PostgreSQL 8.2, so it is on an old kernel. Now we are upgrading to a newer kernel.

03:01

But we need more time. By the end of this year, we may be able to finish it and update the kernel from 8.2 to PostgreSQL 9. I know that many people are talking about the geospatial part.

03:20

One group comes from the database community, and another group comes from earth science. They work on mapping and on buildings. But when it comes to data, it is hard to bridge the two areas. When I was a postgraduate student, I attended some geospatial conferences, and I found that the people from the database side,

03:44

they know little about GIS, but they work on practical questions in their research. So let's skip this page on the kernel of Greenplum.

04:07

Here is an example of how to query in Greenplum. If we want to retrieve the price of a beer in Brussels,

04:22

we can submit a simple query like this. Greenplum is an MPP (massively parallel processing) database cluster. We need to collect the data from different segments, so we need an interconnect between the segments.

04:44

We need to make sure the source is put. It depends on the optimizer, which can generate the query plan very smartly and reduce the traffic between the segments.

05:01

Also, there are several ways to store data in the database. For example, we can keep hot data in row-oriented tables, cold data in column-oriented tables, and rarely used data in the file system.

05:29

If you are interested in the database, we have published many papers. Let's move on to geospatial. We now have plans to develop geospatial support in this open-source database.

05:45

We have integrated geometry and geography in Greenplum. Now we are working on raster and trajectory. I will get to this later.

06:01

For geometry, I think many of us are familiar with it, so I will skip it. The same goes for geography: we can run queries in the database in SQL on points, polygons, and so on.

06:21

We can also retrieve the relationships between the data. I think I need to say something about indexes. In Greenplum, we use the GiST index to support two-dimensional data types.

06:41

It was developed by Oleg and others. But I think it is not easy to use. Two months ago, I talked with a data scientist in Japan who had a simple query that took a very, very long time, maybe more than 10 minutes, to run.

07:03

I don't know why. He has two tables. One is a big table with millions of rows. The other is a small table with only 2,000 rows. If he wants to do a spatial join,

07:21

and he creates an index on both tables, joins them, and runs the query, it still takes more than 10 minutes. So we need to use indexes very carefully. In my opinion, there are many cases where we need to be careful about creating an index.

07:40

For example, if the data is updated very frequently, the index pages will split and move very frequently, and the database is so busy with this that it has no time to deal with queries. So instead, you can load all the data together,

08:00

and then create the index: drop the index, load the data, and create the index again. That will be much, much faster. Also, if you have a lot of redundant data, I mean the values in the column are similar or equal,

08:20

then the index cannot be used efficiently. Also, in some cases the function cost is hard to estimate, so the query planner may generate a plan that is not that good, and you may need to control the query plan yourself.
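The bulk-load advice in this passage (drop the index, load everything, then rebuild the index) can be sketched as follows. This is only an illustration under stated assumptions: it uses SQLite from the Python standard library as a stand-in for Greenplum, and the table name `points` and index name `idx_points_xy` are hypothetical.

```python
import sqlite3


def bulk_load(conn, rows):
    """Drop the index, load all rows in one batch, then rebuild the index."""
    cur = conn.cursor()
    # Dropping the index first avoids maintaining it row by row during the load.
    cur.execute("DROP INDEX IF EXISTS idx_points_xy")
    cur.executemany("INSERT INTO points (x, y) VALUES (?, ?)", rows)
    # Rebuild the index once, after all the data is in place.
    cur.execute("CREATE INDEX idx_points_xy ON points (x, y)")
    conn.commit()


# Hypothetical table with an existing index (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (x REAL, y REAL)")
conn.execute("CREATE INDEX idx_points_xy ON points (x, y)")

bulk_load(conn, [(float(i), float(i) * 2.0) for i in range(1000)])

row_count = conn.execute("SELECT COUNT(*) FROM points").fetchone()[0]
index_names = [row[1] for row in conn.execute("PRAGMA index_list('points')")]
```

The same order of operations applies to any database where index maintenance dominates the load time; how much it helps depends on the engine and the data.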

08:41

So for my part, I don't trust the plan generated by default by the query optimizer. Okay, here is something interesting. Several months ago I was watching television,

09:01

and they said a ship went from China to the North Pole and then to the South Pole, and the shape of the route was just like the number eight. I thought it was a strange way to go. Why would people follow this twisted route? I googled it and found that the route

09:23

is just like the red one. It's very interesting. So we need to make sure projection is very fast. When we implement geospatial support in an MPP engine, this is hard to achieve, because the SRID, I mean the spatial reference ID,

09:45

is looked up very frequently inside functions. If the SRID data is stored on separate segments, we need to retrieve it frequently from other segments, which means accessing the data over the network,

10:04

so the function will be very slow. We therefore need to make many modifications to support SRIDs in an MPP engine. We have also sponsored two universities;

10:21

they are doing research on this. For example, one is working on replicated tables: we copy the table to every segment to make the lookups faster. Also, we now support raster data in a branch.

10:43

Take the example of an island in the South China Sea. Two years ago, this island was very small, about the size of two houses, with just one building on it,

11:01

but now you can see it is very big, and they can build an airport on it. In this case, geometry is not available, so we need a raster to describe the island.

11:24

We can do some interesting things there. We can analyze data by intersecting raster data with geometry data. Below is an example. On the left is the temperature distribution,

11:41

and we can intersect it with geometry data and analyze it with a simple SQL query. Also, for data like point clouds, we can analyze it with raster functions,

12:04

but I think in the future we need to support point clouds in a native way, not through raster. Okay, let's talk about trajectories. I've seen many people ask about them

12:21

because we use them every day. A trajectory is the spatial location of a moving object over time. For example, everyone uses a smartphone, and it can record your location as time goes by.
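The definition above can be sketched as a time-ordered list of samples. This is a minimal illustration, not the data model from the talk: the `Sample` type, the planar `x`/`y` coordinates, and the use of linear interpolation to estimate a position between two recorded samples are all assumptions.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    t: float  # timestamp, e.g. seconds since the start of the trip
    x: float  # position, planar coordinates assumed
    y: float


def position_at(samples, t):
    """Estimate the position at time t by linear interpolation.

    `samples` must be sorted by time. Returns None outside the recorded span.
    """
    if not samples or t < samples[0].t or t > samples[-1].t:
        return None
    times = [s.t for s in samples]
    i = bisect_right(times, t)
    if i == len(samples):
        # t is exactly the last recorded timestamp.
        return (samples[-1].x, samples[-1].y)
    a, b = samples[i - 1], samples[i]
    f = (t - a.t) / (b.t - a.t)
    return (a.x + f * (b.x - a.x), a.y + f * (b.y - a.y))


track = [Sample(0, 0.0, 0.0), Sample(10, 10.0, 0.0), Sample(20, 10.0, 10.0)]
```

Extra per-sample attributes such as speed or heading could be added as further fields on `Sample`.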

12:42

In mathematics, it is a continuous function, but in practice it is just a set of sampled points. It has two basic dimensions, time and position, plus other dimensions like speed

13:02

and direction. The next question is where we can get trajectories. The first source is taxis. I know that in China, most traffic information

13:21

is generated from taxis. The company gathers GPS data from the taxis and generates the traffic information, so the company is nervous to see the update. We can also use navigation devices and put GPS equipment on animals.

13:42

The cat pit is used in Australia. Trajectories also come from satellites and from check-in data, from the VIXR. Some multimedia data has geotags. For example, if you record video with a Sony camera

14:01

or take a picture of a place with GPS enabled and upload it to a website, the company can view the media and can also create a video. Then there is Wi-Fi.

14:22

I think Wi-Fi is a good data source for locating objects. It is very easy to capture the behavior of objects with it. For example, I know many web map services try to capture your behavior mainly based on Wi-Fi.

14:44

For example, if you stay connected to one Wi-Fi network for a long time, that means you are staying in the office or at home. And if your SSID changes very frequently, you are probably walking through a public area.

15:02

This is very useful for sending you relevant messages, and it is also very cheap. Another interesting thing is that if you are using a cell phone, it will send out your location.

15:21

It is easy to find out your frequent locations from your iPhone, so the iPhone is the best way to find out about your private life. Here is data about taxis in Beijing.

15:41

From this data, we can learn how to build a smart city with trajectories, if we analyze it well. Actually, we have done some research with Microsoft Research, in which we tried to find traffic congestion

16:01

caused by the road network design. We found that in some months, traffic jams were detected very frequently in this area, but the next year they were hard to find. Then we found the reason: they had built

16:22

a subway across this region. So this can help us find solutions to improve traffic. Are you using any machine learning? Yes. First, we need to cluster the data to find popular regions

16:42

and then use frequent-pattern and association-rule mining to find patterns between nearby regions, to find out which transfer patterns are very hard for people. So we need to use many methods.

17:05

This means trajectories come not only from GPS but also from other data, like the pictures and messages we upload to Twitter. We can also collect them from RFID, NFC, and other sensors.

17:26

The credit card is another source for detecting your location. For example, my wife used to monitor my behavior by reading the credit card bill very carefully.

17:42

Trajectory data can describe movement in free Euclidean space, and it can also be described in a constrained space. For example, we can map our GPS data onto the road network.

18:02

We call that a network-constrained trajectory. That means we can represent the data in different spaces. Before we developed geospatial support in Greenplum, we had done a lot of research on trajectories, for example on prediction.

18:23

Every morning you drive from home to your office, leaving at 8 o'clock and arriving at the office around 8:30. That is your pattern. If you give me more than one month of your data,

18:41

I can detect this pattern. Tomorrow is Monday, so if you start your car, I know you will go to your office with very high probability. This is called prediction. We have also tried to do some semantic analysis.
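The prediction idea described above (learn a commuting pattern from a month of trips, then guess the destination when the car starts) can be sketched with a simple frequency count. This is a deliberately naive stand-in, not the method used in the speaker's research; the `(weekday, start_hour, destination)` trip format and the function names are assumptions.

```python
from collections import Counter, defaultdict


def train(trips):
    """Count destinations per (weekday, start_hour) slot.

    `trips` is an iterable of (weekday, start_hour, destination) tuples.
    """
    model = defaultdict(Counter)
    for weekday, hour, destination in trips:
        model[(weekday, hour)][destination] += 1
    return model


def predict(model, weekday, hour):
    """Return (most likely destination, estimated probability) or None."""
    counts = model.get((weekday, hour))
    if not counts:
        return None
    destination, n = counts.most_common(1)[0]
    return destination, n / sum(counts.values())


# Hypothetical history: five Monday (weekday 0) trips starting at 8 o'clock.
history = [(0, 8, "office")] * 4 + [(0, 8, "gym")]
model = train(history)
monday_guess = predict(model, 0, 8)
```

A real predictor would also have to handle noisy start times and unseen slots, which this sketch ignores.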

19:03

GPS data is hard to read, so we need to translate it into text, like: you stayed at home for four hours and then at your office for another four hours.
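Translating raw samples into a readable summary like the one just described can be sketched as follows. The named places, their coordinates, and the fixed matching radius are hypothetical values chosen for illustration; a real system would need a proper geocoding or point-of-interest step.

```python
import math

# Hypothetical named places (planar coordinates) and matching radius.
PLACES = {"home": (0.0, 0.0), "office": (5.0, 5.0)}
RADIUS = 0.5


def label(x, y):
    """Return the name of the place within RADIUS of (x, y), or None."""
    for name, (px, py) in PLACES.items():
        if math.hypot(x - px, y - py) <= RADIUS:
            return name
    return None


def describe(samples):
    """Turn (hour, x, y) samples, sorted by hour, into readable stay sentences."""
    stays = []  # each entry is [place, first_hour, last_hour]
    open_place = None
    for hour, x, y in samples:
        place = label(x, y)
        if place is not None and place == open_place:
            stays[-1][2] = hour  # extend the current stay
        elif place is not None:
            stays.append([place, hour, hour])  # start a new stay
        open_place = place
    return [
        f"stayed at {place} for {end - start:g} hours"
        for place, start, end in stays
        if end > start
    ]


day = [(0, 0.0, 0.0), (2, 0.1, 0.0), (4, 0.0, 0.1),
       (4.5, 2.5, 2.5), (5, 5.0, 5.0), (7, 5.1, 5.0), (9, 5.0, 5.0)]
summary = describe(day)
```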

19:26

What we found is that many people have worked on this and developed prototypes, but none of them is implemented in a general-purpose database.

19:42

You really don't want to install two databases: one general-purpose database, like Greenplum, Oracle, or PostgreSQL, plus one of the prototypes. So a good idea is to develop trajectory support

20:02

as a component of a general-purpose database. The good news is that PostGIS started doing some work on trajectories about three months ago. So far they have implemented only one function,

20:24

called closest point of approach. It is a classic algorithm for trajectories. So we are trying to develop trajectory support with some easy SQL APIs for analyzing the data.
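For readers unfamiliar with closest point of approach, the core calculation can be sketched for two objects that each move in a straight line at constant speed over the same time interval. This is a textbook formulation under those assumptions, not the PostGIS or Greenplum implementation; the function name and argument layout are made up for this sketch, and a full trajectory would apply it segment by segment.

```python
import math


def closest_point_of_approach(p0, p1, q0, q1, t0, t1):
    """Return (time, distance) at which two moving objects are closest.

    Object P moves linearly from p0 to p1, and object Q from q0 to q1,
    both over the interval [t0, t1], with t1 > t0. Points are (x, y) tuples.
    """
    duration = t1 - t0
    # Position of P relative to Q at t0, and relative velocity.
    dx, dy = p0[0] - q0[0], p0[1] - q0[1]
    vx = ((p1[0] - p0[0]) - (q1[0] - q0[0])) / duration
    vy = ((p1[1] - p0[1]) - (q1[1] - q0[1])) / duration
    speed_sq = vx * vx + vy * vy
    if speed_sq == 0.0:
        # Same velocity: the distance never changes, so report the start time.
        offset = 0.0
    else:
        # Minimize |d + v * s| over s, then clamp s to the interval.
        offset = -(dx * vx + dy * vy) / speed_sq
        offset = min(max(offset, 0.0), duration)
    cx, cy = dx + vx * offset, dy + vy * offset
    return t0 + offset, math.hypot(cx, cy)


crossing = closest_point_of_approach((0, 0), (10, 0), (5, 5), (5, -5), 0, 10)
parallel = closest_point_of_approach((0, 0), (10, 0), (0, 3), (10, 3), 0, 10)
```

In the first call the two objects cross paths halfway through the interval; in the second they travel side by side, so their distance stays constant.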

20:48

We abstracted the trajectory model into three layers. At the bottom, we need to know how to store the data,

21:03

and in the middle we need to decide how to organize the data, and at the top we need to specify the functions for queries.

21:21

We developed trajectory support very differently from PostGIS, because we believe data has weight, so it is very hard to load it into the database. For example, if you have more than five terabytes of GPS data, you need to spend two hours loading it into the database.

21:44

That is very time consuming. So instead of loading the data into the database, we can just keep it in the GPS logs, like something in a black box. We can expose the GPS data through one kind of FDW, I mean an external table,

22:04

to bring the data into the database. That's it. We can then query it from the database directly. Or we can load the GPS data into the database,

22:22

and it will be fast. We can also use other ways, like the GPS and other tools, to load the data into the database. So both ways are efficient for us. We also defined several data types for use in trajectory queries.

22:42

What is interesting is the chip. A chip is an intermediate data structure for trajectories. It stores just the metadata of a trajectory, and our queries are performed on chips. A chip is very small, so it is very easy to move between segments

23:02

of Greenplum, and it reduces the traffic on the interconnect. We also have these functions on trajectories.
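The talk does not say which metadata a chip holds. As a rough illustration of why a small summary structure helps, the sketch below assumes a chip stores a trajectory's time span and bounding box, so a query can discard most trajectories before touching the full data. The `Chip` fields and helper names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chip:
    """Hypothetical trajectory summary: time span plus bounding box."""
    traj_id: int
    t_min: float
    t_max: float
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def make_chip(traj_id, samples):
    """Summarize a non-empty list of (t, x, y) samples."""
    ts = [s[0] for s in samples]
    xs = [s[1] for s in samples]
    ys = [s[2] for s in samples]
    return Chip(traj_id, min(ts), max(ts), min(xs), min(ys), max(xs), max(ys))


def may_match(chip, t0, t1, x0, y0, x1, y1):
    """True if the chip's time span and bounding box overlap the query window."""
    return (chip.t_min <= t1 and chip.t_max >= t0
            and chip.x_min <= x1 and chip.x_max >= x0
            and chip.y_min <= y1 and chip.y_max >= y0)


trajectories = {
    1: [(0, 0.0, 0.0), (10, 1.0, 1.0)],
    2: [(0, 8.0, 8.0), (10, 9.0, 9.0)],
    3: [(100, 0.5, 0.5), (110, 0.6, 0.6)],
}
chips = [make_chip(tid, pts) for tid, pts in trajectories.items()]
# Only trajectories whose chips overlap the window need a detailed check.
candidates = [c.traj_id for c in chips if may_match(c, 0, 20, 0.0, 0.0, 2.0, 2.0)]
```

Because each summary is a handful of numbers, shipping summaries between nodes costs far less than shipping the raw samples, which is the traffic reduction the speaker mentions.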

23:21

I wanted to give you a demo, but I find that I need more time. Maybe next time I will do a demo. For example, in the functions that differ from PostGIS,

23:41

we have the details. We need to consider the temporal dimension rather than only the spatial one, so it will produce results different from PostGIS. Okay, that's all.

24:01

Thank you. Yes, I was wondering, the trajectory part,

24:21

is it only available in the Pivotal database? Because if I understand it correctly, the Pivotal database is like an extension to Postgres. Oh yes, it's a good question. I want to do that. After we release it, I will move it to PostGIS.

24:42

Maybe I will add it as a separate component to PostGIS. Okay, so if you have, let's say, a smaller dataset, it would be possible to just use a regular Postgres? Yes. Okay.

25:01

Why don't you use the M dimension you already have in PostGIS? In geometry you have X, Y, and Z, but also M, and you can put whatever you want into it. So, just a question: why don't you rely on this feature to manage time,

25:24

for example, in your trajectory? Yes, there are reasons we don't build on it directly. First, we need to implement it in Greenplum. It is an MPP architecture, so it is very different from PostgreSQL.

25:42

I have mentioned examples of the differences between MPP and PostgreSQL. One is the SRID, which is hard to support, and there is also the interconnect and some other things. That is the major reason.

26:00

Second, we treat it as a new data model. If we developed it inside that project, it would be affected by the existing code style. So

26:20

we are planning to release it separately first, and then consider merging it in there. Okay. We'll switch to the next speaker. Thanks again. Thank you. Thank you.