
Getting The Best Performance For GeoJSON Map Visualizations: PostGIS Vs CouchDB Backend


Formal Metadata

Title
Getting The Best Performance For GeoJSON Map Visualizations: PostGIS Vs CouchDB Backend
Series Title
Number of Parts
95
Author
License
CC Attribution - ShareAlike 3.0 Unported:
You may use, modify, and copy, distribute and make publicly available the work or its content in unchanged or modified form for any legal and non-commercial purpose, provided that you credit the author/rights holder in the manner specified by them and that you pass on the work or its content, including in modified form, only under the terms of this license.
Identifiers
Publisher
Year of Publication
Language
Production Place: Nottingham

Content Metadata

Subject Area
Genre
Abstract
In order to deliver a rich user experience to users, features (attribute data and geometries) have to be sent to the client for mouse-over visual effects, synchronization between charts, tables and maps, and on-the-fly classifications. GeoJSON is one of the most popular encodings for the transfer of features for client-side map visualization. The performance of client visualizations depends on a number of factors: message size, client memory allocation, bandwidth, and the speed of the database back-end being amongst the main ones. Large GeoJSON-encoded datasets can substantially slow down loading and stylization times, and can also crash the browser when too many geometries are requested. A combination of techniques can be used to reduce the size of the data (polygon generalization, compression, etc.). The choice of an open-source DBMS for geo-spatial applications used to be easy: PostGIS is a powerful, well-supported, robust and fast RDBMS. On the other hand, unstructured data such as (Geo)JSON may be better served by a document-oriented DBMS such as Apache CouchDB. The performance of PostGIS and CouchDB in producing GeoJSON polygons was tested with different combinations of factors that are known to affect performance: compression of GeoJSON (gzip) to reduce transmission times, different levels of geometry generalization (reducing the number of vertices in transferred geometries), precision reduction (reducing the number of decimal digits encoding coordinates), and the use of a topological JSON encoding of geometries (TopoJSON) to avoid the redundant transfer of shared edges. We present the results of a benchmark exercise testing the performance of an OpenLayers interface backed by a persistence layer implemented using PostGIS and CouchDB. Test data were collected using an automated test application based on Selenium, which allowed us to gather repeated observations for every combination of factors and build statistical models of performance. These statistical models help to pick the best combination of techniques and DBMS, and to gauge the relative contribution of every technique to the overall performance.
Transcript: English (automatically generated)
Well, this is the end of my presentation. I mean — so, I'm going to talk to you about the performance of GeoJSON in the browser. But you know, there are more important topics in life, so actually I'm going to talk to you about death, famine, war, taxation.
No, this is a joke. I think I will never get the Sol Katz award, ever, but yeah. OK. So, a bit of background. I work at the University of Melbourne, on the AURIN project.
And we are building a system. I'm part of the e-research group, which means that we do e-science — basically, IT applied to science. And we're trying to build a sort of laboratory in a browser
for urban researchers, which is pretty vague, because there's no such thing as an urban researcher. But the idea is that people like epidemiologists, urban planners and traffic analysts
share an interest in the same set of data and tools, because all of them are working on the same space — urban space, built-up areas. So we are building the software to do exactly this: to provide them with data collected from various sources across Australia,
and tools like R modules, Java modules and whatever, so that they can combine tools and data in their browser. They can upload data. And everything is supposed to work smoothly.
Now, me personally, I had this issue because it was decided at the beginning of the project to use GeoJSON vector graphics on the client. Why? To give the best possible user experience: you can change the classification of the map on the fly,
you can use brushing, you can have tooltips, you can highlight polygons, and so on. Good. The problem is that you may end up with something like this. So there are 2,200 polygons across Australia
for these particular statistical areas. This is just a subdivision of Australia into homogeneous statistical areas at level 2. As you may see, Australia is a big country. But all the population is here,
which means that you can have polygons like these — pretty big, but pretty simple, so just a few points. At the same time, you have very small, very detailed polygons. And of course, you need to find a way
to have all those polygons sent to the browser in an efficient manner. So this is the problem statement. Now, being a statistician by training, I wanted to build a model of this.
So let's start thinking about what the factors affecting performance could be — or rather, the factors affecting performance that I could control. There was, of course, the size of the response in bytes; I supposed that was one of the most important factors.
The server DBMS performance, of course. And the protocol used — actually, we were forced to use HTTPS for various reasons, but I really would have liked to understand what I was losing in terms of performance by using
HTTPS as opposed to plain HTTP. Now, this sounds like a statistical model, but this is just for the size. If you take the size itself, it can be thought of as a combination of these factors:
compression, because we tried to gzip-compress the output; precision reduction, because we tried that as well — we're dealing with geographic coordinates, so just reduce the number of digits after the decimal point
and see what happens; the format of the response, GeoJSON or TopoJSON; and the number of features and the number of points, because you can have a few very detailed polygons, or very many polygons without much detail on them.
So basically, it's just the number of polygons. Oh, and by the way, we dealt only with polygons, OK? Lines and points we didn't really try. But I think polygons are the most complex feature that you can send to a browser.
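A back-of-the-envelope way to see how much each of these size factors buys on a given dataset — this is only a sketch, not the benchmark code, and it assumes a GeoJSON FeatureCollection already loaded as a Python dict:

```python
import gzip
import json

def rounded(geojson_obj, digits):
    """Round every float in the structure to `digits` decimals (crude precision reduction)."""
    def walk(x):
        if isinstance(x, float):
            return round(x, digits)
        if isinstance(x, list):
            return [walk(v) for v in x]
        if isinstance(x, dict):
            return {k: walk(v) for k, v in x.items()}
        return x
    return walk(geojson_obj)

def payload_sizes(geojson_obj, digits=4):
    """Compare raw, precision-reduced and gzipped sizes of a GeoJSON payload, in bytes."""
    raw = json.dumps(geojson_obj).encode()
    reduced = json.dumps(rounded(geojson_obj, digits)).encode()
    return {
        "raw": len(raw),
        "reduced precision": len(reduced),
        "gzip": len(gzip.compress(raw)),
        "reduced precision + gzip": len(gzip.compress(reduced)),
    }
```

For example, `payload_sizes(json.load(open("sa2.geojson")), digits=4)` (the file name is hypothetical) returns the four byte counts side by side.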
OK. So, a little bit more about the factors. Of course, TopoJSON is supposed to be somewhat faster, because it reduces the size of the output.
HTTPS is included as a factor because — OK, we'll see it later — I didn't think it would actually affect performance, since after the handshake the connection is usually kept alive (HTTP 1.1),
so it should be fast enough. That was my understanding. OK, the average response size is tens to hundreds of kilobytes, just to give you an idea. And we tested two DBMSs, CouchDB and PostGIS.
Oh, and of course, the number of points. This can be reduced as well — so it's not a given — because we can use generalization to reduce the complexity of a polygon. How? I think you should be familiar with this: it is the venerable Douglas-Peucker algorithm.
Basically, you simplify a line — and a polygon, in a sense, is a set of lines — by connecting successive vertices along the line
and setting a threshold: if the distance from this segment to this point is less than the threshold, the point gets deleted. So this is a way to simplify a polygon, to drop points without altering the shape of the polygon too much.
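For reference, a minimal sketch of the Douglas-Peucker idea in Python — not the code used in the talk (PostGIS does this natively):

```python
import math

def perpendicular_distance(pt, start, end):
    """Distance from pt to the infinite line through start and end."""
    (x, y), (x1, y1), (x2, y2) = pt, start, end
    dx, dy = x2 - x1, y2 - y1
    if dx == 0 and dy == 0:                       # degenerate segment
        return math.hypot(x - x1, y - y1)
    return abs(dy * x - dx * y + x2 * y1 - y2 * x1) / math.hypot(dx, dy)

def douglas_peucker(points, tolerance):
    """Drop vertices closer than `tolerance` to the chord joining the endpoints."""
    if len(points) < 3:
        return list(points)
    # find the vertex farthest from the chord
    dmax, index = 0.0, 0
    for i in range(1, len(points) - 1):
        d = perpendicular_distance(points[i], points[0], points[-1])
        if d > dmax:
            dmax, index = d, i
    if dmax <= tolerance:                         # everything is close enough: keep only the endpoints
        return [points[0], points[-1]]
    # otherwise split at the farthest vertex and simplify both halves recursively
    left = douglas_peucker(points[:index + 1], tolerance)
    right = douglas_peucker(points[index:], tolerance)
    return left[:-1] + right
```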
Now, that's fine if you have one polygon. But if you have, as we had, polygons which are contiguous to each other, then the generalization of one polygon may be different from the generalization of the polygon adjacent to it, so you end up having gaps or overlaps.
You don't want that, and that's why we used the ST_SimplifyPreserveTopology function of PostGIS. And of course, all the spatial pre-processing was done in PostGIS.
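As a rough sketch of what such a query can look like from Python (table and column names are made up; the tolerance is in degrees because the data are in geographic coordinates):

```python
import psycopg2  # assumes a PostGIS-enabled database

conn = psycopg2.connect("dbname=urban")   # hypothetical connection string
cur = conn.cursor()

# Simplify each polygon with a topology-preserving tolerance and cap the number of
# decimal digits written into the GeoJSON output.
cur.execute("""
    SELECT ST_AsGeoJSON(
               ST_SimplifyPreserveTopology(geom, %(tolerance)s),
               %(max_digits)s
           )
    FROM sa2_polygons                                   -- hypothetical table of SA2 polygons
    WHERE geom && ST_MakeEnvelope(%(xmin)s, %(ymin)s, %(xmax)s, %(ymax)s, 4326)
""", {"tolerance": 0.01, "max_digits": 4,
      "xmin": 144.5, "ymin": -38.5, "xmax": 145.5, "ymax": -37.5})

geojson_geometries = [row[0] for row in cur.fetchall()]
```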
Now, TopoJSON — I think many of you are familiar with it — is basically GeoJSON using topology, as the name may suggest. In GeoJSON, every polygon is defined by itself, like an island, which means that when you have two adjacent polygons you are actually replicating data, replicating arcs. Whereas if you take another view of a polygon, as a collection of arcs, you can share the same arc between adjacent polygons. Like in this case: you define a polygon as a collection of arcs, and there is a vector of arcs, so another polygon can reuse the same arc. Maybe the second polygon points to, say, this arc here, and the size is reduced.
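A hand-made, schematic TopoJSON-like object (not a real dataset) shows the idea: two adjacent unit squares share the arc along their common edge, so its coordinates are stored only once.

```python
# Arc 0 is the shared edge; arcs 1 and 2 are the outer boundaries of the two squares.
topology = {
    "type": "Topology",
    "arcs": [
        [[1, 0], [1, 1]],                      # arc 0: the shared edge, stored once
        [[1, 1], [0, 1], [0, 0], [1, 0]],      # arc 1: outer boundary of the left square
        [[1, 0], [2, 0], [2, 1], [1, 1]],      # arc 2: outer boundary of the right square
    ],
    "objects": {
        "squares": {
            "type": "GeometryCollection",
            "geometries": [
                # A ring is a list of arc indexes; ~i (one's complement) means "arc i reversed".
                {"type": "Polygon", "arcs": [[0, 1]]},    # left square: shared edge, then arc 1
                {"type": "Polygon", "arcs": [[~0, 2]]},   # right square: shared edge reversed, then arc 2
            ],
        }
    },
}
```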
So, generalization. We chose, for no particular reason, two levels of generalization expressed in degrees — because we had geographic coordinates — which translated into roughly 1 kilometer and, at the highest level of generalization, 5 kilometers. Then we added two more, more detailed levels, but I didn't use those data for this experiment. Now, the test procedure we used:
we had one intern at our university who patiently simulated what a user does. With Selenium, he recorded about 200 pan and zoom operations. That recording was then duplicated in order
to get roughly 1,000 actions. Those 1,000 actions — every action being a pan or a zoom or something like that — were then played back with Selenium, using different combinations of factors: different database, compression yes, compression no, and so on,
so that we ended up with about 17,000 different observations. Oh, and of course, we built a small front-end with OpenLayers 2 and a small back-end with Node.js, connected to CouchDB and PostGIS.
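A minimal sketch of such a playback loop in Python (the `map` object exposed by the page, the `window.loadingDone` flag and the recorded actions are hypothetical; the real harness used a recorded Selenium script):

```python
import csv
import time
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

# Hypothetical recorded actions: each is a JavaScript snippet that pans or zooms the
# OpenLayers map, which the test page is assumed to expose as a global `map` object.
ACTIONS = ["map.zoomIn();", "map.pan(200, 0);", "map.zoomOut();"]

def run_once(url, actions, out_csv):
    driver = webdriver.Firefox()                  # the benchmark ran Firefox on a dedicated VM
    driver.get(url)
    with open(out_csv, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["action", "seconds"])
        for js in actions:
            start = time.time()
            # reset the (hypothetical) "done" flag the page sets once features are drawn
            driver.execute_script("window.loadingDone = false; " + js)
            WebDriverWait(driver, 60).until(
                lambda d: d.execute_script("return window.loadingDone === true"))
            writer.writerow([js, time.time() - start])
    driver.quit()
```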
Then, in order to reduce the variability in our observations as much as possible, we ran these tests at night or during weekends. We had a dedicated Windows VM with Firefox on it.
We reduced the bandwidth to half a megabit, just to test a realistic environment. And of course, there was no caching in the browser, because every time my little application server sent
something to the client, it set headers saying that the response must not be cached. So no caching was allowed, neither on the client nor on the server. That's it — we tried to reduce the noise to a bare minimum.
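The no-caching part is just a matter of response headers; the talk's back end was Node.js, but the idea, sketched with Python's standard library, looks like this:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class NoCacheHandler(BaseHTTPRequestHandler):
    """Toy handler that forbids caching, as the benchmark server did."""
    def do_GET(self):
        body = b'{"type": "FeatureCollection", "features": []}'   # placeholder payload
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        # Tell the browser never to reuse a cached copy, so every request hits the server.
        self.send_header("Cache-Control", "no-store, no-cache, must-revalidate")
        self.send_header("Pragma", "no-cache")
        self.send_header("Expires", "0")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), NoCacheHandler).serve_forever()
```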
Still, we had some weird results — different times for the same operation. Weird. So I did a little bit of cleaning. Actually, I noticed that the variance
was much higher when more than 400 geometries were returned, so I cleaned those out. I dropped 14% of the points, because otherwise I wouldn't have been able to do proper modeling.
Despite this, the results were, yeah... So this is the distribution of the times, with a peak here around 0.3 seconds or so, but there is still a long tail, which I didn't expect, to be honest. Size — yeah, it's not exactly...
OK, that is to be expected, because the density of the geometries differs from one part of Australia to the other. Number of geometries — same thing. And these are the measured throughputs, in geometries per second, which is what we are interested in,
in the end, because we just want to put as many polygons as possible on the map in the shortest amount of time. And this is roughly a Gaussian, which is heartening. Now, first and foremost, we wanted to use TopoJSON,
but it was not supported by OpenLayers 2. By the way, it is now, but we did this work a few months ago. So I will use the model to give you an estimate of how much TopoJSON could improve performance — using the statistical model we
developed, but with no real data behind it. Well, I shall rewrite the interface using either D3 or OpenLayers 3, I think. OK, so: the database factor.
First, we ruled out using CouchDB. Actually, we use CouchDB and we're pretty happy with it, but not for this kind of stuff, because the current implementation of GeoCouch is slower than PostGIS. How much slower? Well, a fair bit, I would say — from 50% to 150%.
This was done just by running bounding-box queries on the same data loaded into PostGIS and into CouchDB; yeah, we did a little bit of testing. So from then on, we focused only on PostGIS,
because we found out that GeoCouch was not yet up to speed. Actually, I tried a few things to make GeoCouch work faster: I played with list functions, and I tried different views, different types of views, storing the geometries in different ways, but it didn't work out.
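For the record, the bounding-box comparison boils down to something like the sketch below: database, table, design-document and view names are made up, and the GeoCouch side assumes its usual `_spatial` endpoint with a `bbox` parameter.

```python
import time
import psycopg2   # PostGIS side
import requests   # GeoCouch exposes spatial views over HTTP

BBOX = (144.5, -38.5, 145.5, -37.5)   # xmin, ymin, xmax, ymax (hypothetical viewport)

def postgis_bbox(conn):
    cur = conn.cursor()
    # `sa2_polygons`/`geom` are made up; && is PostGIS's index-assisted bbox overlap test.
    cur.execute(
        "SELECT ST_AsGeoJSON(geom) FROM sa2_polygons "
        "WHERE geom && ST_MakeEnvelope(%s, %s, %s, %s, 4326)", BBOX)
    return [row[0] for row in cur.fetchall()]

def geocouch_bbox(base_url):
    resp = requests.get(base_url + "/sa2/_design/geo/_spatial/polygons",
                        params={"bbox": ",".join(str(v) for v in BBOX)})
    return resp.json()["rows"]

if __name__ == "__main__":
    conn = psycopg2.connect("dbname=urban")       # hypothetical connection string
    for name, query in [("PostGIS", lambda: postgis_bbox(conn)),
                        ("GeoCouch", lambda: geocouch_bbox("http://localhost:5984"))]:
        started = time.time()
        rows = query()
        print(name, len(rows), "geometries in", round(time.time() - started, 3), "s")
```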
So yeah, GeoJSON versus TopoJSON. What I did was just to grab some data in GeoJSON and convert it to TopoJSON using a command-line utility.
And I found out that, consistently, at least for our data, the size reduction was dramatic: the size was reduced to 30% of the original. So if we had some polygons that were, say, 100 kilobytes,
you may expect them to be reduced to 30 kilobytes, which is pretty good. For the statisticians among you, this is a standardized quantile plot of this little experiment, and you can see that the results are roughly Gaussian, so I'm pretty happy with them.
And so now, the size model. This is the way statisticians model the world: I presumed that size was influenced by precision,
where precision means the number of digits after the decimal point. We tried 4 and 15: 15 is full precision, 4 is a reduced one, but at some zoom levels it's still good enough, because the user
won't notice the difference. So: precision, generalization — you remember, the reduction of the number of points of a polygon — plus e, where e is whatever I haven't considered yet; it's supposed to be white noise. OK, so first, modeling the size.
Then I will use the size to model performance — so, two models. Oh, one more thing: I actually used size per geometry, because that was more useful for modeling performance. So I'm not actually modeling size, but size per geometry,
because size, of course, depends on the number of geometries you have. So, using an analysis of variance, what you get are these effects. This is the mean effect, which is the same throughout all
the combinations of factors. Then, if you have a precision of 4, you add the corresponding effect to the 422; if you have a generalization of 0.01, you add this one to the 278 and the 422. So you have basically three things to add,
and to get the expected — the predicted — value for a particular combination of factors, you do this: what is the expected size of 100 geometries with a precision of 4 and a generalization level of 0.01? It is 100 geometries multiplied by the size per geometry, which gives, as I said
before, roughly 174 kilobytes. What if I want a precision of 15? Well, we get a bigger size, of course — actually, a 45% increase. So it's a predictive model.
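Written out schematically (a reconstruction of the structure just described, not the fitted coefficients), the size model is:

\[
\widehat{s}_{\text{geom}} \;=\; \mu \;+\; \alpha_{\text{precision}} \;+\; \beta_{\text{generalization}},
\qquad
\widehat{\text{size}}_{\text{response}} \;=\; n_{\text{geom}} \cdot \widehat{s}_{\text{geom}},
\]

where \(\mu\) is the mean effect, \(\alpha\) and \(\beta\) are the effects of the chosen precision and generalization levels, \(\widehat{s}_{\text{geom}}\) is the predicted size per geometry, and \(n_{\text{geom}}\) is the number of geometries in the response.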
Performance model — almost done. The performance model is in geometries per second, and it is based on these factors: size per geometry, the protocol (HTTP versus HTTPS), compression, and of course white noise.
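In the same schematic notation (again a reconstruction, not the fitted equation), the throughput model is:

\[
\text{throughput} \;=\; \gamma_0 \;+\; \gamma_1\, s_{\text{geom}} \;+\; \gamma_2\, \mathbf{1}[\text{protocol}=\text{HTTPS}] \;+\; \gamma_3\, \mathbf{1}[\text{gzip}] \;+\; \varepsilon,
\]

measured in geometries per second.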
Now, the results are interesting, but this hurts a bit as well. I used a linear model for this, not an analysis of variance, actually. And we found out that compression is not relevant — that number over there is far too high. The protocol plays a part,
and of course the size per geometry plays a part. So basically, these are the predicted values — the green line is HTTPS and the blue line is HTTP — which means that the throughput in geometries per second decreases when the size per geometry increases.
OK, that's kind of obvious. And HTTPS plays a part in it, which kind of surprised me, because I didn't expect that. So we, as a project, are losing something by using HTTPS — we are losing this bit of throughput. And now I can quantify it:
I can say, look, we decided to use HTTPS, fine, but remember that we are paying this much in terms of performance. Another thing that hurts a bit is that, as you may notice, there is still a lot of variability, a lot of variance — you see here that the points are very dispersed.
I don't know what happened, to be honest. I tried to get all the factors in there, but there is still something I haven't considered. It could be network latency; it could be, I don't know, the client. I don't know, really. I tried to get rid of all possible external factors,
to have a controlled environment, but still — you may notice the R-squared is just 0.17 for this model, which is pretty low. So yeah, that is the performance model, and these are the predictions of this model.
Basically, I'm using the size: first I compute the size per geometry, then I use it to compute the throughput, OK, with those parameters. So if I have a generalization of 0.05 and the reduced precision, then for 96 geometries — which is the average response
size of our data, in terms of polygon geometries — I will get a time of 0.3 seconds. If you use another combination of factors, like 0.01 and 15 digits of precision, then you get an eight-times-worse performance: 2.6 seconds.
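In other words, the prediction chains the two models together (schematically):

\[
\widehat{t} \;=\; \frac{n_{\text{geom}}}{\widehat{\text{throughput}}\big(\widehat{s}_{\text{geom}},\ \text{protocol}\big)},
\]

so for the average response of 96 geometries the two factor combinations above come out at roughly 0.3 and 2.6 seconds.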
This gives you an idea of how important these factors are. Now, the last slide: predicting the impact of TopoJSON. It is the same model as before; the only thing that changes is the size per geometry, which we know is reduced by 70%. So I introduced that factor, and what I get is that it should roughly double the performance. So by using TopoJSON instead of GeoJSON, I'm expecting our system to be twice as fast. That is the prediction — I hope it will turn out to be true. So, here's the summary slide. OK: generalization has a positive impact,
and precision as well. The protocol plays a part, despite my first thought. Compression has no impact, so it's pointless to gzip things. And the database, of course, has a relevant impact, because PostGIS is way faster than the current version of GeoCouch. And TopoJSON is expected to give us a performance boost. And that's it. Questions?
Are the tests available somewhere, so I could try them on my own machine? Yeah, sure. Actually, they are on GitHub, but in a private repository. I'm going to make them available. Yes, sure. Test data and R scripts — there are a lot of them. So yeah, yeah, sure. Other questions? OK.