
Benchmarking R and Python for spatial data processing


Formal metadata

Title
Benchmarking R and Python for spatial data processing
Series title
Number of parts
17
Author
License
CC Attribution 3.0 Germany:
You may use and modify the work or its content for any legal purpose, and reproduce, distribute, and make it publicly available in unchanged or modified form, provided that you credit the author/rights holder in the manner they specify.
Identifiers
Publisher
Release year
Language
Producer
Production place
Wageningen

Content metadata

Subject area
Genre
Abstract
This workshop compared the most popular packages for raster and vector data processing in R and Python. The differences between them were also examined, and tests were run to find out which performs best.
Transcript: English (automatically generated)
My name is Krzysztof Dyba, I'm a PhD student at Adam Mickiewicz University in Poznań, Poland. My research is mainly focused on using remote sensing data from satellites, Sentinel-2, Landsat 8, and now 9 too,
and using machine learning algorithms to build classification and regression models. We can start with the topic of today's workshop.
Go to this tutorial on GitHub; you can download the whole folder or use only the scripts. There are two versions, one for Python and one for R. We will start with R, and Python will be second.
The repository also has a README file, so please read it. I think you should use the notebooks; HTML files are also available, but the notebooks are the most important.
For R there is an Rmd file, and for Python a notebook.
You post the recommended notes so people don't have to directly make it.
I think some users on YouTube will be interested too.
So this is RStudio with our terminal. Can you please increase the font size?
Okay, so we can start. The topic is spatial data processing benchmarks in R.
You should install the bench, microbenchmark, sf, stars, and terra packages. If you didn't do this yet, install them now.
One important thing: this is a practical workshop, so I will show how you can make benchmarks and measure the performance of functions.
This is not a professional benchmark, because, first, we are using a small dataset. If you want to know the true timings, you should use larger datasets.
And secondly, as I can see, we are using laptops. On laptops, you can have a problem with decreasing CPU frequency, that is, thermal throttling. So this is not good equipment for benchmarking.
I believe we should use a normal PC, not laptops. So, do you know what benchmarking is?
By definition, this is simply the process of measuring the performance of functions or programs. So you can basically check how fast they are, and which function or package is the most efficient.
And this is all. Do you have these packages installed? Okay. So, in R, you can do benchmarks in many ways, because there is a built-in function, system.time().
This is the basic one, but you can use a more advanced approach, like using specialized, dedicated packages.
The first is bench. This package is from the RStudio team. The second is microbenchmark. We will check the base function, and then bench and microbenchmark.
So you know the name of this function. To get help in R, we can do this in several ways. You can use the help() function, and in this window you can see the information for this function.
I recommend reading it. You can also use a single question mark, which opens the same window.
Or you can use the double question mark. This will search for a topic, like "distribution".
And you get many pages related to distribution topics. Okay. So now we will test this function, system.time(), and we need some function which generates data for us.
The easiest way is to just sample something. One more thing: if you don't like notebooks, you can use a simple R script and copy the code from the notebook into it.
So I'm going to create a sequence from 1 to 100. And we can draw a sample with replacement.
And we get some values from this vector from 1 to 100, for example 7, et cetera.
And now we will test how fast this function is. As I said before, we can use the function system.time(). Fast. And now we get the results. As you can see, we have three values: user, system, and elapsed.
We should use the elapsed time, because this is roughly the sum of the user and system time. So we don't need the other two values, only this last one.
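The workshop's second half covers Python, so as a minimal sketch (standard library only, not the speaker's code) the same measurement can be written there. It assumes that `time.perf_counter()` plays the role of the elapsed value and `time.process_time()` the role of user plus system CPU time; the helper name `draw_sample` is hypothetical.

```python
import random
import time


def draw_sample(n=100):
    """Sample n values with replacement from the integers 1..100."""
    return random.choices(range(1, 101), k=n)


wall_start = time.perf_counter()   # wall-clock time, comparable to "elapsed"
cpu_start = time.process_time()    # user + system CPU time of this process
values = draw_sample()
elapsed = time.perf_counter() - wall_start
cpu = time.process_time() - cpu_start

print(f"elapsed: {elapsed:.6f} s, cpu: {cpu:.6f} s")
```

A single run like this is noisy, which is why repeated measurements are preferred.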
Is it the same in Python? Yes, they have analogous functions, but I will talk about this when we go to the Python scripts, because for a comparison to be fair, you have to really compare exactly the same thing.
Yes, exactly the same thing. You have to be careful about this. You have, for example, a function that computes something, but in different packages it can be implemented in different ways, using different algorithms.
So, even if the function is the same, in the back end there can be different algorithms. This is an important point. To select single values from this vector, you can use an index.
Like this. I assign the result of the system.time() function to a variable. You can index it like this, or by name.
Is everything okay for now? Okay, so this is a base function which you can use. As you can see, it does only one repetition, so if you want to run many tests, for example
1000, you have to use another function, replicate(), with system.time(). But I don't recommend this, because we have dedicated packages just for this. So I think the right way is to use a package like bench or microbenchmark. Okay,
we can check the bench website, because there you get information about installation, features, and usage.
I will show you the page for the second one, microbenchmark.
This one doesn't have a pkgdown website, only a page on CRAN. It's okay. As I said, I recommend using packages like bench or microbenchmark. This is your decision.
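For the Python side of the workshop, repeating a measurement many times can be sketched with the standard `timeit` module. This is an illustrative analogue under that assumption, not the speaker's code; `draw_sample` is a hypothetical helper.

```python
import random
import statistics
import timeit


def draw_sample():
    """Sample 100 values with replacement from the integers 1..100."""
    return random.choices(range(1, 101), k=100)


# Five independent measurements, each calling the function 1000 times.
totals = timeit.repeat(draw_sample, repeat=5, number=1000)
per_call = [total / 1000 for total in totals]

print(f"min: {min(per_call):.2e} s, median: {statistics.median(per_call):.2e} s")
```

Reporting the minimum and median rather than one timing makes the result less sensitive to background activity on the machine.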
But we can do a more complex example. Let's create a data frame with three columns: x, y, z.
And we're going to generate some data from a normal distribution. And we can try the bench package.
Firstly, there is one nice feature, because you don't have to load the whole package with
library(). You don't have to do it. You can do it this way: you call bench, double colon, and the function for benchmarking is mark(). So you see, this is pretty nice.
And let's see the documentation. There are different arguments here, but the most important is iterations.
This one, iterations. And check. If you set check to TRUE, this function checks
whether all outputs from the different functions are exactly the same. So this was the documentation. To compare, let's use a simple operation: summing rows in a data frame.
You can do this in two ways: using the function rowSums() or using the famous apply() function from the apply family.
Do you think the performance will be exactly the same or maybe different? These are different functions doing the same thing. So do you think the times will be identical or not? Let's see.
The results are here. What do you think? Minimum time, median time, iterations per second.
The first one is rowSums, the second is apply. Are they identical?
See iterations per second. No, the first one is faster, 350, and this one is slower.
Much slower. And this was the bench package. The second is microbenchmark. The name of the iterations argument there is times.
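The comparison above (several iterations, an optional check that all outputs agree, then minimum, median, and iterations per second) can be sketched in plain Python. The `mark` and `same` helpers below are hypothetical and only imitate the idea; they are not the API of bench or microbenchmark, and the data is random rather than the workshop's data frame.

```python
import math
import random
import statistics
import time


def same(a, b):
    """Compare two lists of floats element by element with a small tolerance."""
    return len(a) == len(b) and all(
        math.isclose(x, y, abs_tol=1e-9) for x, y in zip(a, b)
    )


def mark(funcs, iterations=20, check=True):
    """Time each function `iterations` times; optionally verify equal outputs."""
    if check:
        outputs = [func() for func in funcs.values()]
        if not all(same(outputs[0], out) for out in outputs[1:]):
            raise ValueError("functions returned different results")
    results = {}
    for name, func in funcs.items():
        times = []
        for _ in range(iterations):
            start = time.perf_counter()
            func()
            times.append(time.perf_counter() - start)
        results[name] = {
            "min": min(times),
            "median": statistics.median(times),
            "itr_per_sec": iterations / sum(times),
        }
    return results


random.seed(1)
rows = [[random.gauss(0, 1) for _ in range(3)] for _ in range(10_000)]


def builtin_sum():
    # sum() is implemented in C inside the interpreter
    return [sum(row) for row in rows]


def manual_loop():
    # the same row sums computed with an explicit interpreted loop
    out = []
    for row in rows:
        total = 0.0
        for value in row:
            total += value
        out.append(total)
    return out


results = mark({"builtin_sum": builtin_sum, "manual_loop": manual_loop})
for name, stats in results.items():
    print(name, stats)
```

The tolerance in `same` is a design choice: two correct implementations can differ in the last floating-point digit, so an exact equality check would be too strict here.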
And check, as we said. And these two packages, do they have anything to do with each other, or do they extend each other?
I think they are very similar. No, no, I just showed the ways you can use. We are doing exactly the same operation.
But you can notice that you have different columns. Here there are minimum time, lower quartile, mean, median, upper quartile.
But in bench
there are different columns. You can check them in bench::mark(). Okay, so there is garbage collection, the number of seconds, the number of iterations, the number of garbage collections.
All these numbers. Yeah, I think so. And iterations per second. But someone can be interested in the memory footprint, the memory allocation. I think this is pretty interesting, but as you know,
terra is using a C++ class as the back end and you can't follow the memory footprint, because this is an external pointer. This data is not in R. This is
complicated. So we have some statistics. And I think there's a better way: not to show only the numbers but to make some plots. We can use ggplot2 or maybe base R graphics.
If not this, I can recommend this package; it is a fantastic way to create plots of different kinds. And we can just create a simple boxplot based on base graphics.
The code is short; there are more advanced options.
And rowSums is faster than apply; how to understand why it is faster? Yes, exactly. This function, rowSums, is implemented in C, and apply is using an R loop, and this is why it is slower.
So I talked earlier about benchmarking, but this is not the only way to measure the performance of your code. You can do something called code profiling.
This is not for a single function; you can profile an entire block of code. Let's say you have five functions in a block. You can check which function is the slowest, where the most overhead is.
And this is called code profiling. And of course you can do this in R. There is a dedicated function, Rprof(), from this package. And I wrote a note for you:
if the package uses external C++ code or data structures, you can't profile this, and this applies, for example, to GEOS and other libraries.
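Profiling a whole block of code, rather than timing one function, can be sketched in Python with the standard `cProfile` and `pstats` modules. This is an illustration under that assumption, not the speaker's code; the three functions are hypothetical stand-ins for a processing pipeline.

```python
import cProfile
import io
import pstats


def slow_part():
    # deliberately heavier step
    return sum(i * i for i in range(200_000))


def fast_part():
    # cheap step
    return sum(range(1_000))


def pipeline():
    fast_part()
    return slow_part()


profiler = cProfile.Profile()
profiler.enable()
pipeline()
profiler.disable()

# Collect the report in a string, sorted by cumulative time, top 5 entries.
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

As in the note above, compiled code is opaque to such a profiler: a call into a C function shows up as a single entry with no breakdown of what happened inside it.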
There are some interesting books about this. Okay, so let's go to the benchmarks.
There is information about the resolution and data type. And this line is for sf, because since some version sf has two engines for
geometric operations in R: the first one is planar, GEOS, the second is spherical, s2. We want to compare exactly the same things, and terra has not implemented the spherical engine, so we
have to turn this off and use GEOS instead, so you need to use this line: sf_use_s2(FALSE). Now let's download data from this URL. There is a condition. I suggest you run not this whole
block but this line. This probably checks for missing files; I skip this.
And now the main part really starts, with benchmarking the raster data structures. In stars you can load data using the function read_stars(); you give it the path to the raster.
And there is this argument, proxy. As I said, you can process data from disk or from memory, so we want to load the whole raster into memory, not read from disk, because this is slower.
Okay, proxy = FALSE. Now we can see what the structure looks like.
Of course we can make a plot of the image.
In context and more precise.
90 meters of resolution, 6000 by 6000 pixels. And this data is an Int16 raster. In terra you can load data using the rast() function.
Class, dimensions, resolution, spatial extent, CRS, and the name of the source. And we can make a plot using the same function.
And some important notes. Please read the documentation of the rast() function.
So, there is a difference between the packages, between stars and terra. In stars, when you load data using the function read_stars(), you can set the argument proxy to TRUE or FALSE, but in terra this works in a different way.
By default the raster is not loaded into memory; it's only processed from files, so you can process very big rasters that can't fit into memory.
And so we have to load this raster into memory. Let's check with this function, inMemory().
We see we didn't load this raster into memory. And if you want to
load this raster into memory in terra, as in stars, you have to use this function, set.values(). Now check inMemory(): is the raster in memory? The output is TRUE.
So, we can now do the benchmark, comparing how fast we can load raster data using stars and terra.
Please use five iterations, because we don't want to wait long for processing; check = FALSE, because we have different structures of rasters; and memory = FALSE, because we can't follow the memory footprint.
So we have some results.
But remember, once more, I will say that we have a very small dataset. If you want to do a professional benchmark, you have to use larger datasets. From this example we can see that terra is faster. And I started talking about data types.
So, this issue is pretty complicated, because this raster is Int16, but let's check this in stars.
We loaded the data into stars, but we get a double (float) data type, not Int16.
And this is because operations done by stars, and by terra too, as I will say soon, are always processed as double, because it is easier for the user and the developer to use the most complex type, not integer. I
think this is the biggest advantage of Python, because in Python you have many data types: byte, int16, int32, float16, float32, and many more, but in R we have only integer and double.
I don't remember the size of this raster. This raster is 70 megabytes, but check its size in R, in stars.
Do you see? Originally this was 70 megabytes, Int16, but in R this is 270 megabytes.
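The jump from about 70 to about 270 megabytes follows from the bytes per pixel. A minimal arithmetic sketch in Python, assuming the 6000 by 6000 pixel dimensions quoted earlier and using the standard `array` module only to look up item sizes:

```python
from array import array

rows, cols = 6000, 6000                 # raster dimensions quoted in the talk
int16_bytes = array("h").itemsize       # signed 16-bit integer: 2 bytes
double_bytes = array("d").itemsize      # double-precision float: 8 bytes

size_int16_mb = rows * cols * int16_bytes / 1e6
size_double_mb = rows * cols * double_bytes / 1e6
print(f"Int16: {size_int16_mb:.0f} MB, double: {size_double_mb:.0f} MB")
```

A double takes four times the space of a 16-bit integer, so the same pixels need roughly four times the memory, which is consistent with the sizes mentioned above.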
Yes, in stars you can change the data type. Here we are changing double to integer.
And check how it looks in terra. This is very interesting. What do you think, why do we have one kilobyte in terra, but in stars over 100 or 200 megabytes?
What do you think? The data is not really loaded into R in terra, right? No.
I'll talk about it. Let's use this function, inMemory(), from terra. And it's TRUE. This raster is in memory in terra. Okay, I will explain.
terra loaded the data, but into an external C++ structure, not into R. The data is in memory, but in R there is only a pointer to the C++ object.
The data is in memory, but not exactly in R. If you open the system task manager, Ctrl+Alt+Delete, and check the memory usage, you will see an increase in memory usage.
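The effect of a tiny reported size for a large in-memory raster can be imitated in Python as an analogy only. In this sketch the `RasterHandle` class is hypothetical; it shows that the shallow size of an object that merely references its data says nothing about how much memory the data itself occupies.

```python
import sys


class RasterHandle:
    """Hypothetical wrapper that only holds a reference to the pixel buffer."""

    def __init__(self, n_bytes):
        self.buffer = bytearray(n_bytes)   # the actual data lives here


handle = RasterHandle(10_000_000)
shallow = sys.getsizeof(handle)            # size of the wrapper object only
actual = sys.getsizeof(handle.buffer)      # size of the referenced data
print(f"wrapper: {shallow} bytes, data: {actual} bytes")
```

This is why the system's memory monitor, rather than the object size reported inside the language, is the more reliable check in such cases.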
Okay, so the next operation we test is cropping a raster. The extents are exactly the same. Remember, you have two different classes, so slightly different outputs.
This is the input bounding box, the raster extent, and we crop this to a smaller extent. We have to define it. There is a simple function: st_bbox() in sf and ext() in terra.
We can make some plots.
So, I created an issue on terra and I had a conversation with Robert Hijmans about this, but it isn't fixed yet.
So this is using begins and basic graphics. I think you don't read all the answers, but on the begins, there is doing something.
If you want to crop a raster, there are simple functions: st_crop() in stars and crop() in terra. Generally, you have equivalents of stars functions in terra, but in stars you must add st_ as a prefix.
You can see the new extent. Okay, here is the benchmark. In this operation, in this case, stars is faster.
Yes, because we have a small dataset.
I can show you. In this case I used satellite scenes from Landsat 8, and a server, not a laptop.
Okay. This is slightly different, but you probably used the raster package before, and you can see how slow it was. This is the old package; now everyone is using terra.
Yeah, this is fast because they are working in place. They don't create a new object but replace the values in memory. So this is the fastest option.
Did you use the raster package before? Then you should move on to terra, because it's much faster and more efficient.
Okay. And the last task in this section is resampling, that is, how we can change the resolution. In stars this is
pretty complicated, but I hope this will be improved in the future, so that you can change the resolution just like in terra. In stars it is a little bit complicated, because you have to
define a new object, a new raster with a new extent. And this can be done this way.
dx, dy is the new resolution. You have to create this new raster with the new extent and the new resolution, and use this function, st_warp().
Yeah, we can see now we have a lower resolution. In terra this looks exactly the same: you have to specify the resolution, CRS, and extent.
This is the raster with the destination geometry, and this is the resampled raster. And we see the resolution is this. Does that make sense? Here we are using stars, and there terra.
Yeah, exactly. But stars in this example is not only doing the resampling because, as you can see, stars is creating a new file, so you have not only the overhead from resampling but you also have to count the overhead from creating a new file.
In contrast, terra is only processing data in memory. So that's why terra is faster in this operation. Now the benchmark.
Yeah, you can see stars metal, the rich use the true, and Tara is doing this by that. People are function is function which you object because we don't need. If you have a gigabytes of RAM, you recommend you use this object we created before because they are.
They are using all memory. Okay. And you see the Tara is three times faster than stars because stars is creating the product by with all the books. Yeah, this is not necessarily as fast.
There are also different ways to do resampling and aggregating, but let's skip this. And there is an exercise, but we don't have time. You can do this after the workshop and write to me, or we can meet and talk about this.
Here I give some ideas for how you can do this. Let's skip it, and now the second part, about vector data. You can restart this notebook if you don't have memory.
Remember, in the previous part we used real data, data from a satellite. In this part we will use synthetic data, so it is going to be generated by us.
This is how we can generate the data, and this is the number of points. Generating data from a statistical distribution is stochastic, so we want to make the results reproducible.
So we need to set the seed of the random number generator, this way.
Okay. Yeah, and we generated a simple feature data frame with points, in a projected CRS.
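The workshop generates these points in R. As a rough Python analogue of the same idea, here is a minimal sketch that draws points uniformly from a bounding box with a fixed seed. The function name `generate_points`, the box limits and the seed value are all assumptions made for illustration, and the points will not match the ones generated in R.

```python
import random


def generate_points(n, xmin, xmax, ymin, ymax, seed):
    """Return n (x, y) tuples drawn uniformly from a bounding box."""
    # A private generator keeps the global random state untouched.
    rng = random.Random(seed)
    return [(rng.uniform(xmin, xmax), rng.uniform(ymin, ymax)) for _ in range(n)]


# Hypothetical extent in a projected coordinate system (units: metres).
points_a = generate_points(1000, 0.0, 100_000.0, 0.0, 100_000.0, seed=1)
points_b = generate_points(1000, 0.0, 100_000.0, 0.0, 100_000.0, seed=1)
points_c = generate_points(1000, 0.0, 100_000.0, 0.0, 100_000.0, seed=2)

print(points_a == points_b)  # same seed: identical points
print(points_a == points_c)  # different seed: different points
```

Fixing the seed is what makes a benchmark on synthetic data repeatable: every run processes exactly the same points.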
And to save this data to a file we can use this write function, so let's do it.
Yeah, let's do the same operations which we did for raster data. So, reading. In stars it was read_stars; to read vector data in sf it is read_sf. In terra we had rast for raster reading; for vector reading it is vect.
Yeah, and the benchmark. Terra is twice as fast for reading vector data.
Next operation, very popular in GIS: creating a buffer. Simple. Why would terra be faster at reading? The same reason: the vector data is kept in an external C++ structure, and then you don't have to make the conversion to an object in R.
sf works on a data frame, or a tibble, and terra works in a different way, because it just keeps an external structure in
C++, and you don't need to read the data into R. The function is st_buffer from sf, and buffer from terra.
Yeah, you can see the names are pretty much the same. We have to define the distance, or width. This is the same. Then the number of segments to create a quarter of the circle. Let's set it to five.
Let's do a fast check.
Yeah, before we had points. Now we have buffers, polygons. Polygons, polygons, terra polygons.
This is the visualization. We used only five segments to create a quarter of the circle, so you can see the effect on the roundness.
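To make the effect of the segment count concrete, here is a small pure-Python sketch. It is not how sf, terra or GEOS build buffers internally; it only places vertices on a circle, with `quad_segs` segments per quarter circle, and compares the polygon area with the true circle area. The function names and the radius are assumptions for illustration.

```python
import math


def buffer_point(x, y, distance, quad_segs):
    """Approximate a circular buffer around (x, y) as a closed ring of vertices."""
    n = 4 * quad_segs  # number of segments in the full circle
    ring = [
        (x + distance * math.cos(2 * math.pi * i / n),
         y + distance * math.sin(2 * math.pi * i / n))
        for i in range(n)
    ]
    ring.append(ring[0])  # repeat the first vertex to close the ring
    return ring


def ring_area(ring):
    """Area of a closed ring via the shoelace formula."""
    return abs(sum(x1 * y2 - x2 * y1
                   for (x1, y1), (x2, y2) in zip(ring, ring[1:]))) / 2


coarse = buffer_point(0.0, 0.0, 100.0, quad_segs=5)
fine = buffer_point(0.0, 0.0, 100.0, quad_segs=30)
true_area = math.pi * 100.0 ** 2

# Number of segments and the share of the true circle area each polygon covers.
print(len(coarse) - 1, ring_area(coarse) / true_area)
print(len(fine) - 1, ring_area(fine) / true_area)
```

Fewer segments mean fewer vertices to compute and store, at the price of a polygon that covers slightly less than the true circle.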
And the benchmark results.
Please remember the timings, because if we compare with geopandas, geopandas is much faster. I don't know why, but geopandas in this example is much faster, over twice. The last operation is calculating the distance between points. We have to select a
small sample, because it is a very expensive operation and we don't want to run out of memory. So let's select only 1000.
There is the function. It is st_distance in sf, and in terra it is distance, and we get a matrix from this. Yeah, we have a matrix, n by n, of distances.
This is the last benchmark in our R part. That is all. There is one more exercise but, as I said before,
let's do it after this workshop. And we can go to the Python part. Any questions about the R part?
Okay.
Yeah, exactly. And just go to the GitHub repository and this link.
You can see a more complex, more advanced benchmark for vector and raster data, and I use more packages in this benchmark.
So let's go to Python. I use Miniconda, so I have to go to this terminal console and I keep it there.
Okay. So let's start doing some benchmarks in Python using Python packages. We will use rasterio; I think this is the most popular package for processing raster data in Python. For vector data, Python users use
GeoPandas. Tom, you asked about an alternative to the system.time function in Python. So, there is the timeit library,
and the class for benchmarking some function is the Timer class. You can do this in this way.
Load the timeit library and this class. Using the question mark you can get help, or using the help function. Exactly the same. This is the window with the documentation of Timer.
Yeah, let's just check the performance of something in Python. This is the function choices, from the random library.
You can just specify it in this way. Now there are values, using the k argument. Yes, in R we have the system.time function; in Python this is Timer.repeat.
In the statement, you should write the function to benchmark and specify the environment, globals. Number is how
many times this function will be executed, and repeat is how many benchmark tests you do. So, one function execution, and the number of benchmark repeats. But, like
we did in R, to calculate some statistics like median, average, minimum, maximum, we have to import a library.
The raw times are hard to read; the statistics look better, and you can also use minimum and standard deviation. In R these functions are built into base R, but in Python you have to import the statistics library. And in the Jupyter notebook there is a
very interesting feature, because you don't have to do it this way. You can use the percent sign and just write timeit, and this will benchmark one function, or you can time the entire cell.
In that case, write two percent signs, then the number of repeats, the number of function executions, one, and the function. So, here only one function is benchmarked.
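The approach described above can be sketched with the standard library alone. `timeit.Timer`, its `repeat` method and the `statistics` functions are real; the population size, `k=1000` and the choice of 10 repeats are assumptions for illustration, not the values used in the workshop.

```python
import random
import statistics
import timeit

population = list(range(1000))

# Timer takes the statement to time and the namespace it should run in.
timer = timeit.Timer(
    "random.choices(population, k=1000)",
    globals={"random": random, "population": population},
)

# 10 independent measurements, each executing the statement once.
times = timer.repeat(repeat=10, number=1)

print("min   ", min(times))
print("median", statistics.median(times))
print("mean  ", statistics.mean(times))
print("max   ", max(times))
print("stdev ", statistics.stdev(times))
```

In a Jupyter notebook, a comparable measurement can be written with the IPython magic, for example `%timeit -r 10 -n 1 random.choices(population, k=1000)`, or with `%%timeit` on the first line of a cell to time the whole cell.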
There is an exercise, but let's skip this. The data source is exactly the same. If you downloaded it in R, you can use it here as
well. But if you didn't, there is no problem, you can do it in Python. Yeah, the URL is exactly the same. Okay, and the benchmarks.
We use rasterio. I would say this is the most popular in Python, but there is also rioxarray. Rasterio is using NumPy, but rioxarray is using
xarray. Let's only check this one. And data loading using rasterio is more complicated than using the R packages, because you have to open a connection to the file, load the metadata, load the raster values, and close,
finally close this connection to the file. So, as you can see, this is more complicated. And of course this is only
one file. You can load a raster with one band or with multiple bands, but if you have rasters saved in separate files, you have to write a loop by yourself. In R this is automatic: you can just give an object with a list of files, and they will be automatically loaded.
But I think this is a bit complicated. Load this library.
Open the connection, get the metadata, read the values to a matrix, to a NumPy array, and close the connection to the file. Okay. And run. Yeah. And there you go. You have got integer 16.
I recall that in R it was double in terra, and in stars we can convert it, but still the size of this file was pretty high.
We can access the values this way, using indexing. If you follow me, you can change the indexes and get other pixels. In this case
we can see that there was a mask. So we don't have values there.
If we change the indexes to somewhere in the center of this image, you can see values now: 29, et cetera.
Moreover, we can check the size of this object in memory. Yeah, it's 68 megabytes. In R it was over 200 megabytes, or over 100 as integer; in Python this is exactly 68. Okay.
Yeah, and like in R, to create a plot in rasterio we need this module and this function.
And one important thing: you need to specify the parameters of the affine transformation, because if you don't, you get the coordinates in pixel dimensions, so from one to 6000.
Now, if you use the parameters of the affine transformation, we have the coordinates in degrees, from 10 to 15. Yeah, and the benchmark, using this double percent sign.
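What the affine transformation does can be shown without any raster library. The sketch below assumes the common six-coefficient convention, where x = a*col + b*row + c and y = d*col + e*row + f. The pixel size and corner coordinates are hypothetical values chosen only so that the x range matches the "10 to 15 degrees" mentioned in the talk.

```python
def pixel_to_geo(transform, col, row):
    """Map a (col, row) pixel index to (x, y) using six affine coefficients."""
    a, b, c, d, e, f = transform
    x = a * col + b * row + c
    y = d * col + e * row + f
    return x, y


# Hypothetical north-up raster: 0.001 degree pixels, upper-left corner at (10.0, 50.0).
# The pixel height is negative because row numbers grow downwards.
transform = (0.001, 0.0, 10.0, 0.0, -0.001, 50.0)

print(pixel_to_geo(transform, 0, 0))        # upper-left corner of the image
print(pixel_to_geo(transform, 5000, 3000))  # a pixel further into the image
```

Without these coefficients a plot can only label its axes with column and row numbers, which is the "pixel dimensions" case described above.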
The next operation is cropping, like we did before.
We need to import this mask module, and another library to create some shapes. And the next library, for processing geometries, is shapely.
And we define the extent. Okay, both examples are cute. Oh, I'm not.
So, okay, yeah, thank you. Thank you.
These are the values of the extent of the box. Take care of the order.
And another thing is that it can't be just a box; it must be a list. And then in this mask function, you need to set the crop argument to True.
We got the raster, but this is in the pixel coordinate system, not geographic, from zero to... And the benchmark, okay.
Probably I should prepare a table where we can put these benchmark results, but with the setup we are using here I don't think they would be correct. So I recommend you check
this repository with a better benchmark, with more data and more benchmarks, okay. Oh, the next operation is resampling. This works in a different way than in R,
because in R we had the data loaded into memory. In rasterio, as you can see, you are reading the data, but you don't read it in full resolution,
in, let's say, 10-meter pixels; the data is loaded after resampling. So this will be faster than in R. But you have to specify the resolution,
the number of pixels in the grid. Open the raster file, read the metadata, do the resampling. And one more thing: in R,
if you do some spatial operation, everything is updated. I mean, the extent and the resolution are automatically updated. In Python, in rasterio, you have to do this by yourself.
You have to recalculate the resolution parameters in this way. And of course, close the connection to the file. Yeah, and we can check the resolution of the output.
Yeah, it's small, 500 by 500. Oh, I was talking about these parameters of the affine transformation. You can see the resolution, we updated this,
and now we have 0.01 resolution, pixel resolution. Yeah, we did it by hand. This is not an automatic operation.
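The manual update of the transformation after resampling comes down to scaling the pixel size by the ratio of old to new grid dimensions. Below is a minimal pure-Python sketch of that arithmetic, using the same six-coefficient convention as rasterio's affine transform. The original grid size of 5000 by 5000 and the 0.001 degree pixel are assumptions; only the 500 by 500 output and the 0.01 resolution come from the talk.

```python
def rescale_transform(transform, old_shape, new_shape):
    """Return affine coefficients covering the same extent on a grid of new_shape."""
    a, b, c, d, e, f = transform
    old_rows, old_cols = old_shape
    new_rows, new_cols = new_shape
    sx = old_cols / new_cols  # old pixels spanned by one new pixel, horizontally
    sy = old_rows / new_rows  # the same, vertically
    # Column terms scale with sx, row terms with sy; the origin (c, f) stays put.
    return (a * sx, b * sy, c, d * sx, e * sy, f)


# Hypothetical source grid: 5000 x 5000 pixels of 0.001 degrees.
old_transform = (0.001, 0.0, 10.0, 0.0, -0.001, 50.0)
new_transform = rescale_transform(old_transform, (5000, 5000), (500, 500))

print(new_transform[0], new_transform[4])  # new pixel width and (negative) height
```

The upper-left corner is unchanged and the extent stays the same, because ten times fewer pixels are each ten times larger.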
Yeah, benchmark. Yeah, one more task for you, but we don't have time. And vector data, yeah.
GeoPandas is, I think, and this is only my opinion, very similar to sf. If you're using sf in R, I think the transition to GeoPandas will be very easy for you. Yeah, in R we created a simple data set.
In Python we can do this too, but if you check some random function, for example generating samples from some distribution, a normal distribution, and set in R, for example,
seed number one and in Python seed number one, you'll get different results, because the random number generators are implemented differently.
So if you use the same seed values in R and Python, you get different results, but there is this link under the word "ways", where you can check how you can solve this problem.
So this is why we will use the data generated in R. There is simple code for how you can do this in Python, but let's skip this. And as you can see, we use the data generated in R
to compare exactly the same data sets. Yeah, let's read this file. We have a column with geometries, point coordinates.
As you remember, in R, for the terra and sf packages, we have the classes SpatVector, SpatRaster and sf. In GeoPandas there is a class too,
but you have to print the CRS separately and the extent separately; in R it was done automatically. Yeah, and the benchmark: pretty slow, yeah.
So it's probably looks and another data set.
Do you see the difference? I'm loading a GeoPackage: terra, sf, geopandas. Geopandas is pretty slow, but there is a solution for this. 30 seconds, every time you need to load it, 30 seconds.
Yeah, this is true, but there is a solution. There is another package for faster, vectorized loading of data in geopandas, pyogrio. Oh, yeah, this is it.
If you use this, loading and writing data will be much faster. They state, as you can see, that if you use this library,
pyogrio, this will be much, much, much faster.
So don't worry. Creating buffers. Yeah, a simple method of the vector object: buffer, the distance, the resolution,
which is the same as the number of segments in the R packages. Do a plot and a benchmark, yeah, okay.
And this is very interesting thing. Please look at this.
I don't know why, but sf and terra are twice as slow compared to geopandas and the pure geos implementation, the bindings from R to GEOS. So I don't know what the reason is, but we have to talk with Edzer and Robert.
Yeah, as you can see, geopandas is twice as fast. Personally, for terra, I would expect the same speed as geos and geopandas, but it isn't, and I don't know why.
And the last one, yeah, like in terra, the function distance,
but this works quite differently, because you need to specify one object and a second one. And if you do this in R, you will get a distance matrix, but not in geopandas. You will get zero everywhere,
because this calculates distances row by row, like in a table: between the first object in A and the first object in B, et cetera, yeah. So it's not like a matrix, but like two lists. And we have to do this in a different way.
In R, do you know the apply function? Yeah, we used it at the start of the R part, and here it is exactly the same. Of course, we could write a loop in Python, but we used R before, so we can do this similarly,
using this apply function. Of course, use the smaller data set, only the sample. We don't need to write a whole function;
we can use a lambda expression, this is a very good tool. We can run this. We have a matrix with distances between points,
and the benchmark with the timeit function.
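The difference between the two behaviours can be illustrated with plain Python lists. This is an analogy only and does not call the geopandas API: the element-wise version pairs the i-th point of one list with the i-th point of the other, while the all-pairs version builds the full matrix. The three sample points are made up for illustration.

```python
import math

points_a = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
points_b = list(points_a)  # the same layer compared with itself

# Element-wise: first with first, second with second, and so on.
# Comparing a layer with itself this way gives zero everywhere.
elementwise = [math.dist(p, q) for p, q in zip(points_a, points_b)]

# All pairs: one row per point in A, one column per point in B.
matrix = [[math.dist(p, q) for q in points_b] for p in points_a]

print(elementwise)
for row in matrix:
    print(row)
```

The matrix has n times n entries, which is why the talk restricts this operation to a sample of 1000 points.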
Okay, about four seconds, over four seconds.
And let's see this benchmark. Yeah, in this case, terra is the fastest tool. I don't know. There is some overhead from Python, and I think geos
is slower because terra, I think, uses internal C++ code, and I'm using this function from R, and you can see geopandas is a little slower. Yeah, geos is this R implementation, the R bindings for GEOS.
Yeah, it was the last benchmark. There is one more exercise, and I think this is a pretty interesting exercise,
because you combine raster and vector data: how you can sample raster values using a geometry layer, a point layer,
using rasterio and geopandas. Here there is the example code. You should benchmark this as homework.
So, the last 10 minutes: do you have any questions? Should I return to some code? Okay. Actually, there was somebody on Twitter yesterday who asked about your benchmark visualization,
where you produce graphs for all the vector and raster benchmarks. There, each package, each library, is a dot. Could we repeat the same calculation many times and, instead of using a single dot,
try to show the variance around the dot? So maybe one run is very fast, but sometimes... Yeah, I know what you are talking about. Usually the first run is the slowest, the next runs are similar, and the differences between them
are very small. So yeah, this is a very good idea, to create plots with more points or even more benchmarks. There is a well-known benchmark from data.table, you probably know it,
and they are using only two runs, a first and a second run. They create bar plots for the visualization, and you can see the longer bar is the first run,
the shorter one is the second run, I think. Yeah, the first run and the second run, yeah. Can you maybe explain to all of us your benchmark setup? Because here you showed how we can benchmark parts of code,
but you are already doing that for larger data. So can you explain this setup very quickly? Okay. So I test more packages,
there is a list of all the packages tested. You have tips on how to reproduce this benchmark. For the data set I use a satellite scene: here is the resolution and the number of pixels.
There is the ID of this scene and two links to download it, to reproduce these results. And the most important thing is the PC configuration. You can see this here, and a lot more.
And the scripts, yeah. Every script is available. So I think everyone can repeat this test. Do you have more questions?
How do you gather all of the results? Do you call all of the packages from the command line, or do you call them directly from R and Python, gather the results and compare them? How do you gather all of the content into this plot as well? So generally I prepared a batch script.
Every script is run separately from R and Python, not using RStudio or Jupyter notebooks, but just the plain console of R or Python. And next I have one prepared R notebook which calculates the statistics,
the aggregation, some more statistics. This is that file. When I run a benchmark, let's say cropping a raster, I'm saving the result of the benchmark to an object. And finally, from every benchmark I
save a CSV file with the results. Next, as I show, I read all these files, the CSV files. And on this screen, it's the R notebook.
These lines, yeah, I'm making the plots, just, yeah. And the times are aggregated using the median. I'm not sure I explained this well,
but you can ask, yeah, if you have more. Are there some more questions?
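The speaker does this aggregation in an R notebook. As a rough Python sketch of the same step, the code below reads timing rows in CSV form and reduces them to one median per task and package. The column names, the placeholder package names `pkg_a` and `pkg_b`, and every number are invented for illustration; they are not results from the benchmark.

```python
import csv
import io
import statistics
from collections import defaultdict

# Made-up timings in seconds, standing in for the per-benchmark CSV files.
raw = """task,package,time
crop,pkg_a,0.9
crop,pkg_a,0.5
crop,pkg_a,0.5
crop,pkg_b,1.2
crop,pkg_b,1.0
crop,pkg_b,1.1
"""


def median_times(csv_text):
    """Group timing rows by (task, package) and return the median of each group."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[(row["task"], row["package"])].append(float(row["time"]))
    return {key: statistics.median(values) for key, values in groups.items()}


summary = median_times(raw)
for (task, package), value in sorted(summary.items()):
    print(task, package, value)
```

The median is a reasonable choice here because a slow first run, as discussed earlier in the questions, shifts the mean but barely moves the median.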
What would you still like to add for benchmarking? What would you still like to compare? Oh, I have many ideas. I think in my previous work I used only one data set, with the integer data type.
I think I could use a bigger data set, saved as double, and check not only processing files from memory, but also from disk. This is done by default in R, not in Python, I think.
The next thing is, I put my results on Twitter and some people gave me some tips, among others to compare not only Python and R, but also to include Julia, C++, and GIS software like ArcGIS, QGIS, GRASS, SAGA.
So I think there are many great ideas, but someone has to do it. All of this GIS software, that would be very interesting to see in this case.
We can talk about Julia, because we have some... C++ and things like that, actually, I mean, R is maybe C++ underneath, right? So you don't need the rest, I mean. Yeah. And what about, for example, if you are interested in prediction, this kind of prediction, like a random forest model, and you want to run the prediction,
but that actually costs the most. I think the most... Overall, the prediction costs the most. I think most of the time is consumed by the machine learning algorithm, not by assigning values to the map,
because everything is a vector, yeah? Okay. So I think you should choose a good, fast machine learning algorithm. But there's a space that the way that C++ that the compiled same line with the file for R and Python.
Like if you... I did some geomorphological mapping, mapping using random forest, XGBoost, it's here.
And I was surprised because, do you know ranger? Yeah. And I thought this was probably the fastest for making predictions, but I was wrong. I will show you, so I remember.
Yeah. I was very surprised by these timings. I made this, yeah.
I compared this random forest package written in R and this super fast package, ranger, written in C++, and just see these prediction times, yeah.
It's much faster. No, it's slower, one minute. Ah, this would mean it, okay. I was thinking one more minute. Oh, okay, yeah. But you can see the prediction is not very well optimized.
We should do some profiling, so also some under-direct misses, yeah. So I think in your work you should make a good choice and use optimized packages, yeah. Ranger is very fast for training the model, but as you can see, it's slow for prediction.
And I think that assigning values to the map vector is pretty fast. Most of the overhead is from making the prediction, like this, yeah. There are only 1 million values here, and maps are much bigger, yeah.
So this is, I think, important. Do you want to put it at the bottom, I think? Oh, I don't remember. I can send you the link to this issue. Yeah, it's a fun version for maps, it's simple, but yeah.
Yeah, interesting, yeah. I was surprised when I checked this. Send it, please, on Mattermost. Okay.
We are slowly moving towards the end, so let's try to summarize. It looks like for many standard operations terra is really cutting edge. Yeah, I think if you are still using raster,
you should definitely move to terra. I think, if you're using Python, you should stay with Python. If you use R, you should stay with R. Yeah. But you mentioned that with geopandas there is something you gain a lot from.
Yeah. So geopandas is Python, it's slow, but the developers created something called pygeos, and this is vectorized.
So if you have installed this package, you will see a big speed-up. If you don't have it, geopandas will be slow. Let's say somebody is an R user and, like, 80% of the code is in R,
but they have a few steps where they know that it's like two or four times faster in Python. So a good thing would be to then incorporate Python into R, into that process, just to run that step and then get the output. I'm not sure.
Are you talking about raster data or vector data? Good, good one. Because... Like, imagine I need to run something for one hour, and now I know that if I do it in geopandas it is done in, you know, 12 minutes. Yeah. Then I want to change my code to save that, you know, 40 minutes, I don't know.
I think if you do some huge projects, you shouldn't use Python, but focus on C++. That's what you think, all right, all right. Let's say that's an easy way. But let's say, you know,
I see that many of these packages are really cutting edge, like using C++ in the back, right? So if we feel comfortable programming in R, we could just... Yeah, stick to R or Python. And then just call the batch or, you know, Python in the step where you get,
where you wait for one hour, you know, and there it's fast. Then we'll incorporate it probably, because you just need to speed up, you know, one step in the whole process and it takes one hour. I think it could be problematic in the private sector, because in the projects where I participate
we stick to one environment. You can't mix: if the project managers use Python, our project will have to be in Python, yeah. But if you, you know, make sure it's all tidy, it shouldn't be a problem. Yeah. Do you use Python and R, or?
Yeah, yeah, I'm using both. Use both, okay. And what's your experience? Personally, yeah, I prefer R, but Python is much, much more popular, so yeah. If the team was using Python before, nobody will be switching all the projects to R,
so you just have to learn Python and code. Okay, well, that's for me. Or just use the batch, the batch programming, and then call R, Python, Julia, whatever. Yeah. I can mention one interesting thing,
because we have in R and Python not only frameworks like sf and geopandas, but we also have low-level bindings, like this package from Dewey Dunnington.
Yeah, from him. Yeah, he's very famous. He created s2 and geos, yeah.
So you can use this, like bindings to C++ from R.
And you remember, sf's back end, you see, is using a tibble or a data frame, but yeah, we can use data.table, yeah.
This is what I mean, not this one, this one. Right, it is written in C, and this is very fast, and as this back end you don't have to use a data frame, but can use this data.table,
and I have some interesting blog posts. If you would like to get some performance, yeah,
you can use data.table or collapse. There is one interesting post: how to integrate geos, data.table and sf, and this is pretty fast.
Of course, only for vector data, right? So I think I'm finished.