GeoCouch: An N-dimensional Index For Apache CouchDB And Couchbase

Formal Metadata

Title
GeoCouch: An N-dimensional Index For Apache CouchDB And Couchbase
Title of Series
Number of Parts
95
Author
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this
Identifiers
Publisher
Release Date
Language
Production Place
Nottingham

Content Metadata

Subject Area
Genre
Abstract
Databases that support spatial queries are often limited to three dimensions, but the requirements increase. You might want to query in more dimensions, for time ranges or other attributes like trajectories. Documents are represented as JSON. The values that will be stored in the index can be extracted from anywhere within such a JSON document. Even conversions like reprojections are possible. Apache CouchDB and Couchbase are document databases, hence belong to the non-relational space which is also known as “NoSQL”. One of the strengths of Apache CouchDB is the (multi-master) replication. You can keep the data from several different instances easily in sync, even if you change the data on different instances. The replication isn't limited to Apache CouchDB, but it's a whole ecosystem. It's even possible to sync with your web browser and store it in its offline storage. This way the user can access the data offline, without the need to be always connected to the server. In contrast Couchbase has its strong point in working at scale. The data gets automatically sharded across machines. Adding and removing servers at a later stage can be performed through a simple web interface. If a server goes down the system can still work without any interruptions. GeoCouch, Apache CouchDB and Couchbase are open source and licensed under the Apache License 2.0.
Transcript: English (auto-generated)
And it's about scaling. Many people say that scaling things up is always complicated. And I just want to give an example because, of course, you can also scale up your Postgres database. And you just have several instances, and have a proxy in front, and have your data sharded, and so on. But the real problem with scaling up
is the operations. What happens if one server goes down? So let's take this example. You have three documents. You distribute them on your cluster. And in this case, they are just distributed equally across those three servers. And, of course, you want to have, so these are the original documents. You, of course, want to have copies. You want to have replicas on the servers
as well in case something goes wrong. So now one server goes down. And, of course, you can't access C anymore, the document C, because it's down. So you just go to the admin interface, click one button, and say activate the replicas. So now the C replica gets activated.
And now you have access to this document again. You have to imagine it's a live application. You have thousands of users accessing it. And it basically just keeps on running smoothly. There might be just a short downtime when a server goes down. But once you say activate the replicas, it keeps running smoothly, and you don't have any further downtime. The system just keeps on running.
And then, of course, you say, well, I now don't have a backup of C anymore. And if something bad happens, I might lose data. So therefore, you just click another button. It gets reshuffled again. So now we have two servers. You have the copies on the different machines. So again, if one server went down, you would still have access to all the data.
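The failover sequence from this example (three documents, one server failing, replicas activated, then reshuffled) can be sketched as a small simulation. This is only an illustration of the idea, not Couchbase's actual implementation: the `Server` class and the `readable`, `activate_replicas` and `rebalance` functions are hypothetical names invented for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    # Hypothetical model of a cluster node; not a Couchbase API.
    name: str
    active: set = field(default_factory=set)    # documents served from this node
    replicas: set = field(default_factory=set)  # passive copies held for other nodes
    up: bool = True


def readable(cluster, doc):
    """A document is readable if a running server holds its active copy."""
    return any(s.up and doc in s.active for s in cluster)


def activate_replicas(cluster):
    """Promote a replica for every document whose active copy is on a failed server."""
    lost = set()
    for s in cluster:
        if not s.up:
            lost |= s.active
    for doc in lost:
        for s in cluster:
            if s.up and doc in s.replicas:
                s.replicas.remove(doc)
                s.active.add(doc)
                break


def rebalance(cluster):
    """Make sure every active document has a replica on another running server."""
    live = [s for s in cluster if s.up]
    for i, s in enumerate(live):
        others = [o for o in live if o is not s]
        for doc in sorted(s.active):
            if others and not any(doc in o.replicas for o in others):
                others[i % len(others)].replicas.add(doc)


if __name__ == "__main__":
    cluster = [
        Server("s1", {"A"}, {"C"}),
        Server("s2", {"B"}, {"A"}),
        Server("s3", {"C"}, {"B"}),
    ]
    cluster[2].up = False          # the server holding document C goes down
    activate_replicas(cluster)     # the "one button" in the admin interface
    rebalance(cluster)             # the second button: recreate missing copies
    for s in cluster:
        print(s.name, "up" if s.up else "down", sorted(s.active), sorted(s.replicas))
```

After the rebalance step, each of the two remaining servers holds replicas of the other's documents, so losing either one would still leave every document recoverable.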
The operation just keeps running smoothly all the time. But now, to the point: this talk is about GeoCouch. I don't want to say what I wanted to say on this slide, so I'll keep going. GeoCouch works with Apache CouchDB and with Couchbase.
But of course, as CouchDB has different goals than Couchbase has, I also see the goals for GeoCouch in different spots. So for Apache CouchDB, I see the point as publishing fast. As I said, you just store your documents as JSON. And I was on a project where they had the data
in Excel spreadsheets. So you just extract it somewhere, get it into JSON. And then you put it in your CouchDB, build your index on it, build an open-source application that would access the JSON. You can even put the open-source application in your CouchDB,
and you're done. That's it. You don't need any GeoServer or any other application server in between. This is all you need to do. This is what I mean by publish fast. And this is where I want to see GeoCouch being used, because so often government agencies just want to have a quick way to publish the data they already have.
But setting up the whole stack just takes a lot of time, a lot of knowledge. And yeah, Couchbase, as I said, is about scaling up. So here, really, I want it to be a distributed, highly scalable geodatabase.
This is basically what I said in the example in the beginning. If you have satellite imagery, big data, and you want to index this, this is where you would use Couchbase. And what features does GeoCouch support? It's, again, funny that we had the talk previously from MongoDB, because it's kind of the same.
GeoCouch doesn't support many features, but I agree with you in saying that it's probably 80% of the use cases. And it's for the same reasons, because it's just easier to scale up if you don't have so many features, and it's just simpler, therefore faster, and so on. So we have our geometry types, so we
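The "extract it somewhere, get it into JSON" step could look roughly like the following sketch. It assumes the spreadsheet has been exported to CSV with `name`, `lon` and `lat` columns; those column names, the `rows_to_documents` helper and the sample row are made up for illustration. The `{"docs": [...]}` wrapper matches the request body that CouchDB's `_bulk_docs` endpoint accepts, but no request is sent here.

```python
import csv
import io
import json


def rows_to_documents(csv_text, lon_field="lon", lat_field="lat"):
    """Turn CSV rows into JSON documents carrying a GeoJSON point geometry.

    The column names are assumptions for this sketch; adapt them to the
    actual spreadsheet export.
    """
    docs = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        lon = float(row.pop(lon_field))
        lat = float(row.pop(lat_field))
        docs.append({
            # GeoJSON uses [longitude, latitude] order for coordinates.
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            # Every remaining column is kept as a plain attribute.
            "properties": dict(row),
        })
    return docs


if __name__ == "__main__":
    sample = "name,lon,lat\nTown hall,-1.15,52.95\n"
    # Body shape suitable for a bulk upload to a CouchDB database.
    print(json.dumps({"docs": rows_to_documents(sample)}, indent=2))
```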
can store polygons, points, line strings, everything that GeoJSON supports. And on the query side, you, of course, have bounding box search. And this is already probably 50% of the applications. You just have a web mapping application and want to show something from a database. Then there's polygon search.
This is currently not in the Couchbase version, but in the Apache CouchDB version, it's on a private branch. So what I want to say is it's finished, it's done, and it works, but it might take some effort to get it running if you want to play with it. In the background, it uses GEOS, because GEOS
does the hard work. It exists, it works, and there's no point in re-implementing it myself. So then there is k-nearest neighbor search. This one also works, but it's kind of a sad story, because it's implemented, but I haven't published it yet, because it was actually a student that worked together with me on getting it done.
But he ported some code from PostGIS to make it work on a sphere, but he just ported some algorithms from PostGIS. PostGIS is GPL, and you might consider it a derived work. So I'm not sure whether I can use it or not, but I just don't want to get into legal trouble.
So if you want to have k-nearest neighbor search, contact me, get the code. I'm happy to give it to you, but I just haven't published it yet, because I couldn't be bothered to think about legal stuff. It will be a problem, because you're under Apache license, right? Yes. Yeah, you'd have to then re-release your code under the GPL. Yeah, but the point is that he ported it from C to Erlang.
So is this still a derived work, or did he just read the algorithms? And so it's, ugh, yeah. Anyway, so it's, yeah. Yeah. And of course, what the talk really is about is, also sorry, in the beginning, is the multi-dimensional search. This is curr, yes, ah, OK. And this is currently working with Apache CouchDB only.
I will make a release soon. I wanted to make it for FOSS4G, but on the train I didn't have enough time; I fell asleep on the train. So there will be a release soon for the new Apache CouchDB 1.4 version, which was released one or two weeks ago,
which will then contain a multi-dimensional search. And I might just put in the geometry search as well. What multi-dimensional search means is really you can build up indexes with any numeric value you like. So let's say one dimension is the geometry, which
is only two dimensions. So this example comes from a trade office. Then they want to say, OK, I want all bakeries, which is another dimension, that opened in 2010, which is another dimension, which have a certain size. So we would have a six-dimensional query. And yeah, this is what it supports then.
So from a technology perspective, GeoCouch is mostly Erlang. We're currently porting things to C for performance reasons. The algorithms are for the geeks among you. So the single inserts use the revised R-tree, R*-tree,
sorry, which is from the same guy who did the R*-tree, which is basically the R-tree algorithm you normally use. It's used in Oracle Spatial. It's used in PostGIS. So this is the way to go. But this is an even better version. And for bulk loading, I use a paper called Sort-based Query-adaptive Loading of R-trees.
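The idea of treating every numeric attribute as one more dimension can be sketched with plain interval tests. The sketch below is a linear scan, not an R-tree and not GeoCouch's query API; the `overlaps` and `search` functions, the dimension order (x, y, business type code, opening year, size) and the sample entries are all assumptions made for illustration.

```python
def overlaps(entry, query):
    """True if two n-dimensional boxes intersect in every dimension.

    Each box is a list of (low, high) pairs, one pair per dimension.
    """
    if len(entry) != len(query):
        raise ValueError("entry and query must have the same number of dimensions")
    return all(lo <= q_hi and q_lo <= hi
               for (lo, hi), (q_lo, q_hi) in zip(entry, query))


def search(index, query):
    """Linear scan over (doc_id, box) pairs; a real index would use a tree."""
    return [doc_id for doc_id, box in index if overlaps(box, query)]


if __name__ == "__main__":
    # Hypothetical dimensions: x, y, business type code, opening year, size.
    index = [
        ("bakery-1",  [(10.0, 10.0), (48.0, 48.0), (1, 1), (2010, 2010), (120, 120)]),
        ("bakery-2",  [(10.5, 10.5), (48.2, 48.2), (1, 1), (2005, 2005), (80, 80)]),
        ("butcher-1", [(10.1, 10.1), (48.1, 48.1), (2, 2), (2010, 2010), (100, 100)]),
    ]
    # Bounding box in space, type code 1 (bakery), opened in 2010, size 100 to 200.
    query = [(9.0, 11.0), (47.0, 49.0), (1, 1), (2010, 2010), (100, 200)]
    print(search(index, query))
```

A plain bounding box search is the special case where only the first two dimensions are constrained; an unconstrained dimension can be expressed as `(float("-inf"), float("inf"))`.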
And I've just put it there so you can click on it and get the paper and read all about the GeoSpatial stuff. It's a really interesting thing. And this is what I currently implement for the C version of GeoCouch. And finally, the future. As I said, it's about scale. It's about performance.
So I think I'm really at a point, so I've worked on the project already for five years. And now I think I have a good understanding of R-trees. So finally, the performance. My goal is to be faster than PostGIS. It should be quite simple. The reason is not because I'm so smart, but because, again, PostGIS does a lot more than GeoCouch does.
And if you do a lot more, of course you have a lot more overhead. So it shouldn't be too hard to be faster. I wanted to use an LSM R tree. There is already a paper from 10 years ago about it. And there's an upcoming paper. It was promised to be out in October from a university in California that does the LSM R tree.
From the source code, it looks promising. I'm keen to read the paper. This is what I really want to implement. Then for the geometry stuff, I use GEOS. But I'm not happy about it, like many people, because this LGPL is always an issue. And I really want to, hopefully, gather people
to just create another geometry library that uses some BSD or MIT license, so it doesn't have the limitations, because yeah, it just sucks. And one thing is that the multi-dimensional index should also support strings.
So it can also, for example, search for address data. This is, as far as I'm concerned, an unsolved problem. I've never seen a hypercube which supports strings. So if anyone has heard of a hypercube that supports Unicode, let me know. And of course, as it's called GeoCouch, it should do serial calculations.
Thanks for your attention. Thank you. Thank you.