GeoCouch: Operating multidimensional data at scale with Couchbase
Formal Metadata
Number of Parts | 183
License | CC Attribution - NonCommercial - ShareAlike 3.0 Germany: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this
Identifiers | 10.5446/32104 (DOI)
Production Year | 2015
Production Place | Seoul, South Korea
FOSS4G Seoul 2015 | 159 / 183
Transcript: English (auto-generated)
00:04
All right, it's time to get started with my talk. My talk is about GeoCouch: operating multidimensional data at scale with Couchbase. It's an insanely complicated title, but at the end of the talk, you will hopefully understand what it is about.
00:23
So I'm happy to be here, and first, a few words about me. I'm Volker, and I'm a developer. I mostly code in Erlang, JavaScript, Rust, and Python. And yeah, I love open source. This is also why I'm here. And probably the most successful project that I did
00:42
is GeoCouch, which I'm also talking about. And yeah, I work for Couchbase. Couchbase is a bit confusing because it's the company name as well as the product name. One important thing is that it is not the same thing as Apache CouchDB. It's a similar database.
01:02
It has the same data model, but just think of it like comparing MySQL to Postgres: that would be like CouchDB to Couchbase. Couchbase is well adopted in the enterprise world, but hardly anyone in the open source world
01:23
is aware of Couchbase, although it's fully open source. It's licensed under the Apache license, so you can just check out the code, compile it, and run it as much as you want. All right, so what actually is Couchbase?
01:42
So I'm, of course, talking here about the product, and I don't talk about the company. So the product is a database. It's a NoSQL database, and it's a so-called document-oriented one, which means you store your data as JSON documents normally. It can also store binary data,
02:01
because it inherited some mechanics from its past as a persistence layer for memcached; therefore it can also store just binary data. But in this talk, I concentrate on the JSON stuff, because, well, your geo data will probably be encoded as GeoJSON.
02:21
The strengths of Couchbase: many people ask me, well, why should I use Couchbase or GeoCouch if I can just use PostGIS? Well, if your system works with PostGIS, just use it. Don't switch to anything else. You're good.
02:40
The strength of Couchbase shows when it's about scaling up. So if your data is so big that you need to have it distributed on several servers, then you might look into alternatives, and then Couchbase might be your choice. Another strength that I see in Couchbase is the administration and the DevOps side.
03:01
There are several different distributed databases these days, but the important thing is how you can administer the database if, for example, something goes wrong. And together with this, there are also RESTful APIs, which means it is also easy to build your own tools around some API endpoints.
03:22
And the good news is that the internal Couchbase tools we bundle also use the RESTful API. The advantage is that the APIs are not just somehow put in so that it sounds cool; we really use those APIs ourselves, so we can be sure they are tested and they work. And yeah, you can use them yourself
03:40
or use the tools we bundle. As we are at a geo conference: the spatial features are what I work on at Couchbase. On the indexing side, you can just store any GeoJSON you like and index it. So you can have polygons, geometry collections, line strings, whatever you like.
04:02
And on the query side, it's only bounding boxes, but it's multidimensional bounding boxes, and that's the cool thing about it. So I currently don't support any other kinds of queries. The plan is, of course, to have something like k-nearest neighbor and radius queries and so on in the future. Currently it's only bounding boxes,
04:21
but this already solves a lot of problems, because you would probably use Couchbase more as a data dump: you really store your data there, get it out, and then you do your processing on top of the data that you got. And the multidimensional thing, I will dive a bit deeper into why it is so cool and what it means. So you can not only store your geometry,
04:42
which might be an area, a polygon, but you can also store additional attributes, which might be, for example, a date. So you have spatio-temporal data and queries. That's still kind of normal these days, but you can do even more. For example, you can have something like ages or categories. So for example, if your use case is something like,
05:02
give me all the buildings in this area, like give me all the buildings in Seoul which have a certain height and were built after 2008, for example, then you would have a four-dimensional query. And this is what you can do with GeoCouch.
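The four-dimensional query described here can be pictured as a box-overlap test that runs once per dimension. The following is a minimal Python sketch of that idea, not GeoCouch's actual implementation. The building records, the dimension order (longitude, latitude, height, year built) and the helper name `intersects` are made up for illustration, and `None` stands for an open end of a range.

```python
from typing import Optional, Sequence, Tuple

# One (min, max) pair per dimension; None means "open" on that side.
Range = Tuple[Optional[float], Optional[float]]


def intersects(entry: Sequence[Range], query: Sequence[Range]) -> bool:
    """Return True if the entry's box overlaps the query box in every dimension."""
    for (e_min, e_max), (q_min, q_max) in zip(entry, query):
        if q_min is not None and e_max is not None and e_max < q_min:
            return False
        if q_max is not None and e_min is not None and e_min > q_max:
            return False
    return True


# Hypothetical buildings: (longitude, latitude, height in metres, year built).
buildings = {
    "tower": [(126.98, 126.99), (37.55, 37.56), (120, 120), (2011, 2011)],
    "old_hall": [(126.97, 126.98), (37.56, 37.57), (15, 15), (1995, 1995)],
    "far_away": [(129.0, 129.1), (35.1, 35.2), (200, 200), (2012, 2012)],
}

# Roughly "in Seoul", at least 50 m high, built from 2008 on.
query = [(126.8, 127.2), (37.4, 37.7), (50, None), (2008, None)]

matches = [name for name, box in buildings.items() if intersects(box, query)]
print(matches)
```

Only the first record overlaps the query in all four dimensions; the other two fail on height and on longitude respectively.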
05:25
So as I already said, the general idea is that the core database is kept really small, as opposed to PostGIS, for example, where you do a lot of things and processing within the database. And if you want to do some processing, you would do it on top and not directly in the database.
05:41
All right, but now it's time for a demo, and this is the majority of the talk, and that will be exciting because demos always fail. So let's see. Thank you. I've already started a cluster of Couchbase. So what you can see here is four terminal windows, and they basically stand for four server instances. I run them locally on my machine,
06:02
but you can imagine them as four separate servers. Green means they are running fine and everything's good. Because what I will talk about is really how to manage things, why the administration is so cool, and yeah, what happens if things go wrong. So the servers are running well, so we check out the web interface for it.
06:21
So that's the administration console of Couchbase. So I go to servers and look, okay. All right. I did a previous run, and so I will just show you how to add a new server. So I have now four servers running, and I want to add another, like, okay, sorry.
06:43
I need to switch the demo around a bit now, but that was to be expected. So my demo application is just showing road work in Seoul. When I prepared the talk, I was really impressed by how good the open data portal from Seoul is. They have huge amounts of data, it's really great.
07:03
It was a bit hard for me because it was all in Korean, but I figured it out. So I looked for a data set, and this is what I have: it's road work in Seoul. So I request it, it takes some time, but the majority of the time is just OpenLayers rendering the stuff. So what we can see is,
07:21
if you can't see it that well at the back, the number on the side is the number of features. So we have about 15,000 polygons here. This is my data set. And it is, I think, about when roads were repaired,
07:41
where the actual area was where they were repaired, when it started, when it stopped, what the pavement was, and so on. So this is what the data set is about. And I have built the index, and I just queried the full data set. So now, as I told you, we have a three-node cluster,
08:00
as you can see here, but I've started four nodes. So what I do is I just add the fourth server to it. Let me do this, it's a highly secure password.
08:22
So now I added the server to the cluster, but it's not really included yet, because I could also add, for example, five servers at a time. Then I do an operation called rebalance, which means reshuffling the data that is currently on the servers, so that it is again evenly distributed across the cluster.
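As a rough mental model of what a rebalance achieves, assume the data is split into a fixed number of partitions that are assigned to servers. The sketch below is a toy model in Python and not Couchbase's actual algorithm; the partition count, the server names and the `rebalance` helper are hypothetical. It moves only as many partitions as needed so that every server ends up with an even share.

```python
from collections import defaultdict
from typing import Dict, List


def rebalance(assignment: Dict[int, str], servers: List[str]) -> Dict[int, str]:
    """Return a new partition -> server map spread evenly over `servers`.

    Partitions stay where they are when possible; only the surplus moves.
    """
    quota, extra = divmod(len(assignment), len(servers))
    # The first `extra` servers may hold one partition more than the rest.
    limit = {s: quota + (1 if i < extra else 0) for i, s in enumerate(servers)}

    result: Dict[int, str] = {}
    load: Dict[str, int] = defaultdict(int)
    homeless: List[int] = []

    for partition, server in sorted(assignment.items()):
        if server in limit and load[server] < limit[server]:
            result[partition] = server
            load[server] += 1
        else:
            # Server is over its share (or no longer in the cluster).
            homeless.append(partition)

    for partition in homeless:
        target = next(s for s in servers if load[s] < limit[s])
        result[partition] = target
        load[target] += 1
    return result


# Toy cluster: 12 partitions on three servers, then a fourth server is added.
servers = ["node1", "node2", "node3"]
before = {p: servers[p % 3] for p in range(12)}
after = rebalance(before, servers + ["node4"])

counts = {s: list(after.values()).count(s) for s in servers + ["node4"]}
moved = sum(1 for p in before if before[p] != after[p])
print(counts, "moved:", moved)
```

In this toy run each of the four servers ends up with three partitions, and only the three surplus partitions change their home.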
08:43
So now you can see we have four servers here. They are all running fine, they are green. And let's query the whole thing again, and you will see it's again the full data set. So it was a seamless operation, nothing changed,
09:01
everything is fine. All right, now we come to the problem: what happens if a server goes down? It's easy, I'm just killing one. So as you can see, it's red, which means one server is down. We look at the admin console, and you can see, oh, one server is down.
09:21
So now what happens is what you would expect: I request the data, and you won't get back the full data set, because one server is down. That's obviously not what you want. You don't want to lose any data. The good news is that the server was configured so that you always have a replica of your data. So what the administrator needs to do,
09:42
or what your tool might do, is to click on failover. Failover means: I'm aware this node is broken, please activate the replicas. So I click on failover, I need to confirm. And now, if I request again,
10:00
you will see that we have the full data set back again, because we have activated the replicas. The problem now is that, if you have a big cluster, the data might not be evenly distributed across the cluster. So what you do is you rebalance again, to make sure it is redistributed across all servers,
10:21
and you see nothing changed, it's still working fine. So this was about, yeah, things going wrong. What can also happen is that a server doesn't really fail, but you just want to shut down one server. One reason would be that you want to do a kernel upgrade,
10:43
for example. Then you need to shut down the server, and what you would do is click on remove, and then the server gets taken out of the cluster. Then you do a rebalance, get the server out, do the maintenance work on the server, and get it back in again.
11:03
A customer of ours, a huge one called Amadeus, does, I think, 80% of the flight bookings and flight searches worldwide. What they told me is that, I think they have about 20 servers, and when they do upgrades, they can really rebalance one server out,
11:22
and the system is still running. They don't have any additional peaks or anything, it keeps running smoothly. And when the upgrade is done and they rebalance it in, also during the rebalance operation, everything is fine, so they don't have any downtime at all, or any peaks. So, this was about, yeah, fault tolerance
11:42
and what happens if things go wrong. What I will show you now is how to query the whole thing, because, well, we've built up the index. This index that I built is four-dimensional. What I just requested is: give me all the data. If you don't add any parameters, it will just query the whole thing.
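To make the queries in this part of the demo more concrete, here is a small Python sketch that only builds the request URL and does not send anything. Several things are assumptions rather than facts from the talk: the parameter names `start_range` and `end_range` and the use of `null` for an open end follow my recollection of Couchbase's spatial view queries, and the host, port, bucket, design document and view names are hypothetical, as is the encoding of dates and paving types as plain integers.

```python
import json
from typing import List, Optional, Tuple
from urllib.parse import parse_qs, urlencode, urlparse

Range = Tuple[Optional[float], Optional[float]]

# Hypothetical view location; adjust to your own cluster and design document.
VIEW_URL = "http://localhost:8092/roadworks/_design/demo/_spatial/by_place_time_paving"


def spatial_query_url(view_url: str, ranges: List[Range]) -> str:
    """Build a query URL from one (min, max) pair per dimension.

    None marks an open end ("wildcard"). With only wildcards, no
    parameters are added and the whole index is queried.
    """
    if all(lo is None and hi is None for lo, hi in ranges):
        return view_url
    start = json.dumps([lo for lo, _ in ranges], separators=(",", ":"))
    end = json.dumps([hi for _, hi in ranges], separators=(",", ":"))
    return view_url + "?" + urlencode({"start_range": start, "end_range": end})


# Everything: four dimensions, all open.
everything = spatial_query_url(VIEW_URL, [(None, None)] * 4)

# Bounding box around Seoul, from 2008 on, paving category 1 only.
filtered = spatial_query_url(
    VIEW_URL, [(126.9, 127.1), (37.4, 37.7), (2008, None), (1, 1)]
)

params = parse_qs(urlparse(filtered).query)
print(everything)
print(params["start_range"][0])
print(params["end_range"][0])
```

An exact category match is expressed here as a range whose start and end are the same value, which mirrors how the other dimensions are queried.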
12:00
So now, the first dimension: I will put in wildcards, I call them wildcards, so basically it will still request everything, but you get an idea of what is in the data. So we have the longitude and the latitude, then we have the date, and the last one is interesting, because I wasn't really sure what it is, so I put the original word in here,
12:20
I think it's something like pavement, so if anyone knows this word in English, just shout it out. It's something like the type of concrete that was used, but it's some category, so in the data set it has the numbers one to four. All right, so now I query the full data set again,
12:41
just to make sure everything works. And now it gets exciting: so now, for example, for the longitude, I want only the data from 127 on,
13:00
so I want everything from here on, but nothing else. Those people that are from Seoul might know what happens: it's basically right across the city. So now I only request this data. And of course, if you request data, you would normally do a bounding box, so let's fill in the full thing,
13:20
and also put in the latitude, and you will see this will be just like a normal bounding box request on your two-dimensional data. So now you can see it's only 6,000 items left with this query. And now we also put in a date,
13:40
so for example, let's say you only want to get the data on the street work from 2008 on, so it's even less. And you could of course do things like: you want only the data from 2008 to 2009,
14:04
and you can play with the wildcards again, and say you want, for example, all the data until 2009. This is how you query. And last but not least is the pavement type. As I said, it's a category, so when you index the data with the,
14:22
we call them spatial indexes, what you need to do is map it to a number somehow. But categories can normally easily be mapped to integers, like an enum if you're a coder, so I've mapped it to an integer. So let's say I only want to see the road work with the category two pavement,
14:46
this is now more exciting because you can really see the difference. All those with category three, oh, there's only one. All right, then let's take category one, which will hopefully be more. Yes,
15:02
all right, so this was about querying, and you can imagine it could be more dimensions, but I would probably rather go for five, six, seven dimensions. If you use more, you probably want to use some other system, I guess, because then the overhead would probably be too big. But I haven't tested it yet, so I will definitely test it also with 20 dimensions
15:22
just to see what's happening, but I haven't yet. And now I will show you how to build up the index, so how do you actually create the index. It might sound super scary, but it actually isn't. What you do in Couchbase is you write a JavaScript function,
15:42
and you can really use the full power of JavaScript. This function gets applied to every document, so every GeoJSON you store in the database is the input of the function. The important part is that it's applied whenever you update a document,
16:00
which means it isn't run at query time, because this would be super slow: if you had 5 million items, you would run JavaScript for all 5 million items every time you query. It's just run once, and whenever you update a document and query again, your index will be updated,
16:21
and only those changes will run through the function. What we see here is what you always should do: at the beginning, put an if statement which checks for all the attributes that you want to use further down in the function, because if you have errors in the function, that document won't end up in the index,
16:42
and this way you check, okay, everything's fine, and you don't have any exceptions in your function. So this is the part which looks scary, but it's super easy, it's just a simple if statement for all the properties. And now it's a bit cut off, let me quickly change the font size,
17:02
then, so, the construction is the construction time, and again, what I do with indexing,
17:21
is I index spatially with bounding boxes. You can also index time ranges, so it's not only a point in time: the construction has a start and an end time, so the construction itself is a time range. So I just store the time range, which is just an integer for the start and one for the end,
17:40
then I put another if statement in. It just says the construction start time should be smaller than the end. Normally it's clear that you can't start after you finished, but that's often the problem with open data, you just have errors, and it was also the case in this data set
18:01
that some dates were wrong, and I wanted to filter them out. So this is how I filtered them out, they just won't be indexed. And now I use the paving, and this was the category. And finally I emit, that's a custom function from Couchbase,
18:22
emit means put it into the index, and it has two parameters. The first one is an array where every item is a dimension, except in this case for the first one: if you put in a GeoJSON geometry, it will just automatically index it and make two dimensions out of it,
18:40
so the geometry is two dimensions, the third dimension is the construction time, and the fourth dimension is the paving. And then you can also emit any value you want, and there I have emitted the area of the polygon, so that for example in the web application, if you click on a feature, you will see the size.
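The indexing function walked through above is written in JavaScript in Couchbase. The following Python sketch only simulates its logic so the steps can be run and checked in isolation: guard against missing attributes, drop records whose time range is inverted, map the category to an integer, and emit a four-dimensional key plus a value. The field names (`start`, `end`, `paving`, `area`), the category names and the helpers `bbox` and `map_doc` are all hypothetical, and the bounding box is computed by hand here, whereas in the talk the geometry is handed over and indexed automatically.

```python
from typing import Any, Dict, List, Optional, Tuple

# Hypothetical category -> integer mapping ("like an enum").
PAVING_CODES = {"asphalt": 1, "concrete": 2}


def bbox(geometry: Dict[str, Any]) -> List[Tuple[float, float]]:
    """Bounding box of a GeoJSON Polygon's outer ring as two (min, max) ranges."""
    ring = geometry["coordinates"][0]
    xs = [point[0] for point in ring]
    ys = [point[1] for point in ring]
    return [(min(xs), max(xs)), (min(ys), max(ys))]


def map_doc(doc: Dict[str, Any]) -> Optional[Tuple[list, Any]]:
    """Return (key, value) to put into the index, or None to skip the document."""
    props = doc.get("properties") or {}
    # Guard first: check every attribute that is used further down.
    needed = ("start", "end", "paving", "area")
    if not doc.get("geometry") or any(name not in props for name in needed):
        return None
    # Filter out bad data: work cannot start after it finished.
    if props["start"] >= props["end"]:
        return None
    code = PAVING_CODES.get(props["paving"])
    if code is None:
        return None
    # Two spatial dimensions, then the time range, then the category.
    key = bbox(doc["geometry"]) + [(props["start"], props["end"]), (code, code)]
    return key, props["area"]


square = {
    "type": "Polygon",
    "coordinates": [[[126.97, 37.55], [126.98, 37.55], [126.98, 37.56],
                     [126.97, 37.56], [126.97, 37.55]]],
}
docs = [
    {"geometry": square, "properties": {"start": 20080301, "end": 20080415,
                                        "paving": "asphalt", "area": 350.0}},
    {"geometry": square, "properties": {"start": 20090601, "end": 20090101,
                                        "paving": "asphalt", "area": 80.0}},
    {"geometry": square, "properties": {"start": 20100101, "end": 20100201}},
]

index = [entry for entry in map(map_doc, docs) if entry is not None]
print(index)
```

Only the first document ends up in the index: the second has its dates the wrong way round and the third lacks the paving and area attributes.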
19:02
All right, that's it for my presentation, thanks for your attention. Are there any questions? Even if this was about Couchbase, you can of course also ask me questions about CouchDB, because, well, I'm also at home in the CouchDB community,
19:23
so yeah, we can also talk about the whole Couch ecosystem, or PouchDB, or TouchDB, or whatever you like. So you said you have a good API. When you're looking at managing the cluster,
19:42
like you demonstrated when there are problems, are there good ways to manage failovers and whatnot through an API in an automated fashion, so that no human has to fix the server? Yeah, so for example this,
20:01
if you click in the web interface, it also basically uses the HTTP API, so when I clicked failover, I could do the same thing with cURL. And then of course you can watch the cluster, for example, and say okay, when I find out that a node isn't responding anymore, I want to do a failover.
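A watcher of the kind described in this answer could be sketched as follows. This is a simulation in Python under stated assumptions, not a Couchbase tool: the health results are scripted instead of coming from real checks, `fail_over` is a hypothetical callback that in a real setup would call the cluster's HTTP API, and requiring three consecutive failed checks before acting is my own choice, made so that a brief network hiccup does not trigger a failover.

```python
from typing import Callable, Dict, Iterable, List, Set


def monitor(health_rounds: Iterable[Dict[str, bool]],
            fail_over: Callable[[str], None],
            threshold: int = 3) -> None:
    """Call `fail_over(node)` once a node has missed `threshold` checks in a row."""
    misses: Dict[str, int] = {}
    failed: Set[str] = set()
    for health in health_rounds:
        for node, alive in health.items():
            if node in failed:
                continue  # already failed over, nothing more to do
            # A successful check resets the counter; a miss increments it.
            misses[node] = 0 if alive else misses.get(node, 0) + 1
            if misses[node] >= threshold:
                fail_over(node)
                failed.add(node)


# Scripted check results: node2 has a one-off hiccup, node3 stays down.
rounds = [
    {"node1": True, "node2": False, "node3": True},
    {"node1": True, "node2": True, "node3": False},
    {"node1": True, "node2": True, "node3": False},
    {"node1": True, "node2": True, "node3": False},
]

triggered: List[str] = []
monitor(rounds, triggered.append)
print(triggered)
```

Run from something as simple as a cron job, the same loop would poll real nodes and replace `triggered.append` with a call that performs or merely reports the failover.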
20:24
Does this answer your question? Yeah, so you might have something as simple as like a cron job or something, monitoring the system? Yeah. Okay, excellent. Do you have any monitoring tools already in place? So we don't have any monitoring tools,
20:41
what we have is a built-in feature called auto failover, for exactly this case: when something goes down and the cluster sees, oh, it's down, you might want to fail it over automatically. But that's very limited, because it can only fail over one server, I think, because the problem is,
21:00
normally if something goes wrong and something goes down, you really want to have full control over what you're doing next, because it could, for example, also be just a hiccup, and the connection is down just for a moment. So I think it rechecks every 20 seconds or something, and if it's just a hiccup and in those 20 seconds it goes down,
21:21
and then it comes back again, then you have already done the failover, and stuff like this. So you probably just want to have a notification, and there's also a feature that you get an email notification, for example, if something goes wrong, and then you can deal with it yourself or with your tools. And it's also very customer specific
21:40
how you deal with failures. But to keep it short, we don't have any monitoring tools yet, as far as I know. That's really cool. Thank you. I was wondering, what's the extent of geometry representation,
22:02
like point, line, polygon, curves maybe, 3D maybe, so what kind of geometries can you store? So the geometry type is every GeoJSON geometry type that exists, like points, line strings, polygons,
22:21
and the multi-versions, and even geometry collections. But what currently is the case, to dig deeper, is that they are indexed only with their bounding boxes, so I currently don't do any polygon intersection. So you might get, for example, a good example would be,
22:42
Norway, for example, I guess: the bounding box is quite big, and if you then do a query and the bounding box intersects, you will get it back, although the polygon itself doesn't intersect. But that's something that's on the to-do list of course, because yeah, you really want proper polygon-polygon intersection for your queries and so on. But storing them works,
23:01
but yeah. Okay, thank you. Any further questions? All right then, thanks again for the attention.