Trajectory: a novel geospatial data model of Pivotal Greenplum database
Formal Metadata
Part Number | 7
Number of Parts | 110
License | CC Attribution 2.0 Belgium: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers | 10.5446/30918 (DOI)
FOSDEM 2016, 7 / 110
Transcript: English (auto-generated)
00:05
Yes, everyone, it's my pleasure to introduce our work on this database. It's very similar to the last talk: we tried to achieve geospatial data management in a database. Here it's a traditional database, called Greenplum.
00:22
And Greenplum comes from Pivotal. In this talk, first of all, I'll give a quick overview of the database. Then I'd like to introduce our plans for geospatial, especially for trajectory. I want to introduce the database a little bit first because it is related to our design of trajectory.
00:49
The Pivotal guys really like open source. We have released many products under the Apache and BSD licenses.
01:00
For example, some of you have used Redis and Spring. Now we have released several other data management systems. If you want to do some real-time data processing, you can use GemFire. And if you want to run interactive queries, you can use Greenplum and HAWQ.
01:23
And if you want to run batch jobs, there is another choice [product name unclear]. Okay. In our database suite, we have been developing Greenplum for more than 10 years.
01:41
This database is usually used in enterprises. It was open sourced five months ago, and now you can find it on GitHub. Just now I captured the development activity of this database, and you can see many commits.
02:03
We have about 20 to 40 commits every week, so it's very active. If you are interested in this database, you can start from the sandbox, which you can download from the website.
02:25
We also provide a tutorial. You can follow it step by step.
02:41
Actually, we have made a lot of innovations. More than 10 years ago, we developed Greenplum based on PostgreSQL 8.2, so that is the kernel. Now we are upgrading to a newer kernel.
03:01
But we need more time. By the end of this year, maybe, we can finish it and upgrade the kernel from 8.2 to PostgreSQL 9. I know that many people are talking about the geospatial part.
03:20
One group comes from the database community, and another group comes from earth science. They are working on mapping, on buildings. But when it comes to data, it's hard to work across the two areas. When I was a postgraduate student, I attended some geospatial conferences. I found that the guys from the database side,
03:44
they know little about GIS, but they are working on some practical questions in their research. So let's skip this page on the kernel of Greenplum.
04:07
Here is an example of how a query runs in Greenplum. If we want to retrieve the price of a beer in Brussels,
04:22
we can submit a simple query like this. Greenplum is an MPP (massively parallel processing) database cluster. We need to collect the data from different segments, so we need the interconnect between segments.
04:44
We need to make sure the throughput is good. That depends on the optimizer: it can generate the query plan very smartly and reduce the traffic between the segments.
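To make the point about segments and the interconnect concrete, here is a minimal Python sketch of the scatter/gather idea. It is not Greenplum code: the rows, the segment layout, and the function names `partial_avg` and `gather_avg` are all hypothetical. Each segment computes a small partial aggregate locally, and only that partial result travels to the master.

```python
# Hypothetical rows (city, beer, price), already distributed over three segments.
SEGMENTS = [
    [("Brussels", "Chimay", 4.0), ("Ghent", "Duvel", 3.5)],
    [("Brussels", "Duvel", 3.0), ("Brussels", "Orval", 5.0)],
    [("Antwerp", "Chimay", 4.5)],
]

def partial_avg(rows, city):
    """Runs on one segment: return (sum, count) for the matching rows only."""
    prices = [price for c, _, price in rows if c == city]
    return sum(prices), len(prices)

def gather_avg(segments, city):
    """Runs on the master: combine the small per-segment partial results."""
    total, count = 0.0, 0
    for rows in segments:
        s, n = partial_avg(rows, city)  # only two numbers cross the "interconnect"
        total += s
        count += n
    return total / count if count else None

average_price = gather_avg(SEGMENTS, "Brussels")
```

The design choice illustrated is that filtering and partial aggregation happen where the data lives, so the amount of data moved does not grow with the table size.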
05:01
Also, we have many ways of storing data in the database. For example, we can keep hot data in row-oriented tables, cold data in column-oriented tables, and rarely used data in the file system.
05:29
If you are interested in the database, we have published many papers. Let's move on to geospatial. We have some plans to develop geospatial support in this open-source database.
05:45
We have integrated geometry and geography in Greenplum. Now we are working on raster and trajectory. I will get to these later.
06:01
For geometry, I think many of us are familiar with it, so I'll skip it. The same goes for geography: we can perform queries in the database with SQL on points, polygons, and so on.
06:21
Also, we can retrieve the relations between the data. I think I need to say something about indexes. In Greenplum, we use the GiST index to support two-dimensional data types.
06:41
It was developed by Oleg and others. But I think it is not easy to use. Two months ago, I talked with a data scientist in Japan who had a simple query, and it took him a very, very long time, maybe more than 10 minutes, to run it.
07:03
I didn't know why, because he has just two tables. One is a big table with millions of rows. The other is a small table with only 2,000 rows. If he wants to do a spatial join,
07:21
and he creates an index on both tables, matches them, and runs the query, it still costs him more than 10 minutes. So that means we need to use indexes very carefully. In my opinion, there are many cases where we need to be careful about creating an index.
07:40
For example, if the data is updated very frequently, the index nodes will split and move very frequently, and the database is really, really busy with this and has no time to deal with queries. So instead, you can load all the data together,
08:00
and then create the index. That is, you drop the index, load the data, and create the index again. It will be much, much faster. Also, if you have too much redundant data, I mean the values of the column are similar or equal,
08:20
then it's not efficient to use the index. Also, in some cases the function cost is hard to estimate, so the query planner may generate a plan that is not that good, and you may need to control the query plan yourself.
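The "drop the index, load the data, create the index again" advice above can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's built-in SQLite with an ordinary B-tree index as a stand-in, not Greenplum or a GiST index, and the table name `points` and index name `idx_points_xy` are hypothetical.

```python
import sqlite3

def bulk_load(conn, rows):
    """Drop the index, load all rows in one batch, then rebuild the index once."""
    cur = conn.cursor()
    cur.execute("DROP INDEX IF EXISTS idx_points_xy")
    cur.executemany("INSERT INTO points (id, x, y) VALUES (?, ?, ?)", rows)
    # One index build over the full table instead of one index update per row.
    cur.execute("CREATE INDEX idx_points_xy ON points (x, y)")
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (id INTEGER PRIMARY KEY, x REAL, y REAL)")
conn.execute("CREATE INDEX idx_points_xy ON points (x, y)")

bulk_load(conn, [(i, i * 0.1, i * 0.2) for i in range(10_000)])

row_count = conn.execute("SELECT COUNT(*) FROM points").fetchone()[0]
index_names = [r[1] for r in conn.execute("PRAGMA index_list('points')")]
```

How much faster this is depends on the engine, the index type, and the data volume; the sketch only shows the order of operations.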
08:41
So for me, I don't always trust the plan generated by default by the query optimizer. Okay, here is something interesting. Several months ago, I was watching television,
09:01
and they said that when you go from China to the North Pole and then to the South Pole, the shape of the route is just like the number eight. I thought it was a strange way to go. Why would people follow this twist? I googled it and found that the trajectory
09:23
is just like the red one. It's very interesting. So we need to make sure projection is very fast. You know, we implemented geospatial in an MPP engine, and this is hard to achieve there, because the SRID, I mean the spatial reference ID,
09:45
is looked up very frequently inside the functions. If the SRIDs are stored on separate segments, we need to retrieve them frequently from other segments. I mean, the functions need to access the data over the network,
10:04
so the functions will be very slow. We needed to make many modifications to support the SRID in an MPP engine, and we have also sponsored two universities;
10:21
they are doing some research on this. For example, one is working on a replicated table: we copy the table to every segment, to make the queries faster at least. Also, we now support raster data in a branch.
10:43
Take the example of an island in the South China Sea. Two years ago, this island was very small, about the size of two houses. They had just one building on this island,
11:01
but now you see it's very big; there is an airport on this island. I mean, in this case geometry is not available, so we need raster to describe the island.
11:24
That leads to some interesting things. We can analyze data by intersecting raster data with geometry data. Below is an example. On the left is the temperature distribution,
11:41
and we can intersect it with some geometry data and analyze it with a simple SQL query. Also, for data like point clouds, we can analyze it with some raster functions,
12:04
but I think in the future we need to support point clouds in a native way, not through raster. Okay, let's talk about trajectory. I've seen many people ask about it
12:21
because we use it every day. A trajectory is the spatial location of a moving object over time. For example, everyone uses a smartphone, and it can record your location as time goes by.
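A trajectory in this sense can be sketched in Python as a time-ordered list of samples, with linear interpolation standing in for the position between two recorded fixes. This is a minimal sketch, not the data type from the talk: the `Sample` class, the `position_at` function, and the choice of linear interpolation are assumptions made for illustration.

```python
from bisect import bisect_right
from dataclasses import dataclass

@dataclass(frozen=True)
class Sample:
    t: float  # time in seconds
    x: float  # position, e.g. projected metres
    y: float

def position_at(samples, t):
    """Linearly interpolate the position at time t; samples must be sorted by t."""
    if not samples or t < samples[0].t or t > samples[-1].t:
        return None  # outside the recorded time range
    times = [s.t for s in samples]
    i = bisect_right(times, t)
    if i == len(samples):  # t is exactly the last timestamp
        return (samples[-1].x, samples[-1].y)
    a, b = samples[i - 1], samples[i]
    f = (t - a.t) / (b.t - a.t)
    return (a.x + f * (b.x - a.x), a.y + f * (b.y - a.y))

track = [Sample(0.0, 0.0, 0.0), Sample(10.0, 10.0, 0.0), Sample(20.0, 10.0, 20.0)]
midway = position_at(track, 15.0)
```

Further per-sample attributes such as speed or direction could be added as extra fields or derived from consecutive samples.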
12:42
In mathematics, it's a continuous function, but in practice it's just a set of sampled points. It has two basic dimensions, one is time and the other is position, and also other dimensions like speed
13:02
and direction, something like that. The next question is where we can get trajectories. The first source is taxis. I know that in China, most traffic information
13:21
is generated from taxis. The company gathers GPS data from the taxis and generates the traffic information, so the company is nervous to see the update. Also, we can put navigation and GPS equipment on some animals.
13:42
The cat tracker is used in Australia. We also get data from satellites and from check-in data, from Flickr. Some multimedia data have geotags. For example, if you record a video using a Sony camera
14:01
or if you take a picture of a place with GPS enabled and upload it to a website, the company can view the media. Trajectories can also be created from video and Wi-Fi.
14:22
I think Wi-Fi is a good data source for locating objects. It makes it very easy to capture the behavior of the objects. For example, I know many web map services try to capture your behavior mainly based on Wi-Fi.
14:44
For example, if you stay connected to one Wi-Fi network for a long time, that means you are in the office or at home. And if your SSID changes very frequently, you are walking around in some public area.
15:02
It's very useful for sending you relevant messages, and it's also very cheap. Another interesting thing is that if you are using a cell phone, it will send out your location [unclear].
15:21
It's easy to find out your frequent locations with your iPhone, so the iPhone is the best way to find out about your private life. Here is the data about taxis in Beijing.
15:41
From this data, we can learn how to build a smart city with trajectories, if we can mine it very well. Actually, we have done some research with Microsoft Research, and we tried to find traffic congestion
16:01
caused by the road network design. We found that in some months, traffic jams were detected very frequently in this area, but the next year they were hard to find. Then we found the reason: they had built
16:22
two subways across this region. So this can help us find solutions to improve the traffic. Are you using any machine learning? Yes. First, we need to cluster the data to find popular regions
16:42
and then use frequent-pattern and association-rule mining to find patterns between nearby regions, and to find out which transfer patterns are very hard for people. So we need to use many methods.
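The first step mentioned here, grouping points to find popular regions, can be illustrated with a very simple grid-density count in Python. This is a stand-in chosen for brevity, not the clustering method used in the research described; the function name `popular_cells` and the parameters `cell_size` and `min_count` are hypothetical.

```python
import math
from collections import Counter

def popular_cells(points, cell_size, min_count):
    """Snap (x, y) points to square grid cells and keep cells with enough points."""
    counts = Counter(
        (math.floor(x / cell_size), math.floor(y / cell_size)) for x, y in points
    )
    return {cell: n for cell, n in counts.items() if n >= min_count}

# Hypothetical pick-up points: three close together, two isolated.
points = [(0.1, 0.2), (0.4, 0.9), (0.8, 0.5), (3.2, 3.3), (7.5, 1.1)]
popular = popular_cells(points, cell_size=1.0, min_count=3)
```

The resulting cells could then serve as the "regions" between which frequent transfer patterns are mined.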
17:05
This means trajectories come not only from GPS but also from other data, like the pictures and messages we upload to Twitter. We can also collect them from RFID, NFC, and other sensors.
17:26
Also, credit cards are another source for detecting your location. For example, my wife used to monitor my behavior by reading through the bill very carefully.
17:42
Trajectory data can describe movement in free Euclidean space, and it can also be described in a constrained space. For example, we can map our GPS data onto the road network.
18:02
We call it a road-network-constrained trajectory. That means we can represent the data in different spaces. Before we developed geospatial support in Greenplum, we had done a lot of research on trajectories, for example on prediction.
18:23
Every morning you drive from your home to your office, leaving at 8 o'clock and arriving at the office at 8:30. That's your pattern. If you give me more than one month of your data,
18:41
I can detect this pattern. So, tomorrow is Monday: if you start your car, I know that with very high probability you will go to your office. That is called prediction. We also try to do some analysis on semantics.
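The commuting example can be sketched as a simple frequency model in Python: count where past trips went for each (weekday, departure hour) and predict the most common destination with its observed share. This is only an illustration of the idea, not the prediction method from the speaker's research; `train`, `predict`, and the trip history are hypothetical.

```python
from collections import Counter, defaultdict

def train(trips):
    """trips: iterable of (weekday, departure_hour, destination) tuples."""
    model = defaultdict(Counter)
    for weekday, hour, destination in trips:
        model[(weekday, hour)][destination] += 1
    return model

def predict(model, weekday, hour):
    """Return (most likely destination, observed share), or None if unseen."""
    counts = model.get((weekday, hour))
    if not counts:
        return None
    destination, n = counts.most_common(1)[0]
    return destination, n / sum(counts.values())

# Hypothetical history: five Monday (weekday 0) departures at 8 o'clock.
history = [(0, 8, "office")] * 4 + [(0, 8, "gym")]
model = train(history)
monday_guess = predict(model, 0, 8)
```

With more history, the share returned alongside the destination approximates the "very high probability" mentioned in the talk.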
19:03
GPS data is hard to read, so we need to translate it into text, like: you stayed at home for four hours and then at your office for another four hours, something like this.
19:26
What we found with existing trajectory databases is that many people are working on them and have developed prototypes, but none of them is built into a general-purpose database.
19:42
You really don't want to install two databases: a general-purpose database like Greenplum, Oracle, or PostgreSQL, plus one of the prototypes. So a good idea is to develop trajectory
20:02
as a component of a general-purpose database. The good news is that I found PostGIS has done some work on trajectories, starting three months ago. So far they have implemented only one function,
20:24
called closest point of approach. It's a classic algorithm for trajectories. So we tried to develop trajectory support with some easy SQL APIs to analyze the data.
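For readers who have not met it, closest point of approach can be sketched in Python for the simplest case: two objects moving at constant velocity over a shared time interval. This is a textbook formulation written for illustration, not the PostGIS implementation or the speaker's code; the function name `cpa` and its parameters are hypothetical.

```python
import math

def cpa(p1, v1, p2, v2, t_max):
    """Return (time, distance) of closest approach within [0, t_max].

    p1, p2 are start positions (x, y); v1, v2 are constant velocities (vx, vy).
    """
    dpx, dpy = p1[0] - p2[0], p1[1] - p2[1]  # relative position at t = 0
    dvx, dvy = v1[0] - v2[0], v1[1] - v2[1]  # relative velocity
    dv2 = dvx * dvx + dvy * dvy
    # Minimise |dp + dv * t|; if relative velocity is zero the distance is constant.
    t = 0.0 if dv2 == 0 else -(dpx * dvx + dpy * dvy) / dv2
    t = max(0.0, min(t_max, t))  # clamp to the observed interval
    return t, math.hypot(dpx + dvx * t, dpy + dvy * t)

# Two objects driving toward each other along the x axis meet at t = 5.
head_on = cpa((0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (-1.0, 0.0), t_max=60.0)
```

For real trajectories made of many segments, the same calculation would be repeated for each pair of time-overlapping segments and the minimum kept.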
20:48
Actually, we abstracted trajectory into three layers. At the bottom, we need to know how to store the data,
21:03
in the middle we need to decide how to organize the data, and at the top we need to specify the functions to query it.
21:21
We developed trajectory very differently from PostGIS, because we believe data has weight, so it's very hard to load it into the database. For example, if you have more than five terabytes of GPS data, you need to spend two hours loading it into the database.
21:44
It's very time-consuming. So we don't need to load the data into the database; we just keep it in the GPS logs, like something in a black box. And we can read the GPS data through a kind of FDW, I mean an external table,
22:04
to access the data from the database. That's it. So we can query it from the database directly. We don't need to load the GPS data into the database,
22:22
and it will be fast. We can also use other ways and other tools to load the data into the database. Both ways are efficient for us. We also defined several data types to use in trajectory queries.
22:42
What's interesting is the chip. A chip is an intermediate data structure for a trajectory. It just stores the metadata of the trajectory, and our queries are performed on chips. A chip is very small, so it's very easy to transfer between segments
23:02
across all the segments of Greenplum, and it reduces the traffic on the interconnect. We also have these functions on trajectories.
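The talk only says that a chip stores trajectory metadata and that queries run on chips. As a hedged illustration of why that helps, the Python sketch below assumes a chip holds a time range and a spatial bounding box, and uses it to discard trajectories that cannot match a query window before any full data is touched. The `Chip` fields, `make_chip`, and `candidates` are all assumptions, not the actual Greenplum structure.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Chip:
    """Hypothetical trajectory summary: time range plus spatial bounding box."""
    traj_id: int
    t_min: float
    t_max: float
    x_min: float
    y_min: float
    x_max: float
    y_max: float

def make_chip(traj_id, samples):
    """Summarize (t, x, y) samples; the full samples stay wherever they are stored."""
    ts, xs, ys = zip(*samples)
    return Chip(traj_id, min(ts), max(ts), min(xs), min(ys), max(xs), max(ys))

def candidates(chips, t0, t1, x0, y0, x1, y1):
    """Return ids of trajectories whose chip overlaps the query window."""
    return [
        c.traj_id
        for c in chips
        if c.t_min <= t1 and c.t_max >= t0
        and c.x_min <= x1 and c.x_max >= x0
        and c.y_min <= y1 and c.y_max >= y0
    ]

chips = [
    make_chip(1, [(0, 0.0, 0.0), (60, 1.0, 1.0)]),
    make_chip(2, [(0, 5.0, 5.0), (60, 6.0, 6.0)]),      # far away in space
    make_chip(3, [(500, 0.5, 0.5), (560, 0.6, 0.6)]),   # far away in time
]
hits = candidates(chips, t0=0, t1=100, x0=0.0, y0=0.0, x1=2.0, y1=2.0)
```

Because each summary is a handful of numbers, shipping summaries between nodes costs far less than shipping the underlying samples, which is the traffic saving the talk refers to.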
23:21
I wanted to give you a demo, but I find that I would need more time. Maybe next time I'll do a demo; I don't have enough time today. For example, here is a function that differs from PostGIS,
23:41
and we have the details here. We need to consider the temporal dimension rather than only the spatial one, so it will produce results different from PostGIS. Okay, that's all.
24:01
Thank you. Yes, I was wondering, the trajectory part,
24:21
is it only available in the Pivotal database? Because if I understand correctly, the Pivotal database is like an extension of Postgres. Oh yes, it's a good question. I want to do that. After we release the trajectory extension, I will move it to PostGIS.
24:42
Maybe I will add it as a separate component of PostGIS. Okay, so if you have, let's say, a smaller dataset, it would be possible to just use regular Postgres? Yes. Okay.
25:01
Why don't you use the M dimension you already have in Postgres? In geometry you have X, Y, Z, but also M, and you can put whatever you want into it. So, just a question: why don't you rely on this feature to manage time,
25:24
for example, in your trajectory? Yeah, we don't build on it directly. First, we need to implement it in Greenplum. It's an MPP architecture, so it's very different from PostgreSQL.
25:42
In particular, I have mentioned three examples of differences between MPP and PG. One is the SRID, which is hard to support, and there is also the interconnect and some other things. That's the major reason.
26:00
The second is that we treat it as a new data model. If we develop it inside the existing project, it will be constrained by the code style. So
26:20
we are planning to release it separately, and then we'll consider merging it in there. Okay. We'll switch to the next speaker. Thanks again. Thank you. Thank you.