
Data engineering for Mobility Data Science (with Python and DVC)


Formal Metadata

Title: Data engineering for Mobility Data Science (with Python and DVC)
Title of Series:
Number of Parts: 17
Author:
License: CC Attribution 3.0 Germany: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers:
Publisher:
Release Date:
Language:
Producer:
Production Place: Wageningen

Content Metadata

Subject Area:
Genre:
Abstract:
This session introduces MovingPandas and DVC for Mobility Data Science. MovingPandas is a Python library for the analysis and visualization of movement data. It is built on top of GeoPandas and provides functions to analyze, manipulate and plot trajectories. To get a better idea of the type of analytics that MovingPandas supports, visit https://movingpandas.org/examples. DVC is a data version control (and machine learning experiment tracking) library. It follows a similar logic to source code version control systems (such as Git) and is typically used together with Git to keep track of data and experiments, while Git keeps track of the source code. In this session, we will use DVC to keep track of our movement data analytics workflow. Participants are expected to come prepared with a working MovingPandas & DVC Python environment. Basic previous experience with (Geo)Pandas and version control systems (i.e. how pull, commit, push works in Git) is expected.
Transcript (English, auto-generated)
So, hello, welcome. My name is Anita Graser. I'm a scientist at the Austrian Institute of Technology in Vienna, and it's my pleasure today to introduce you to mobility data science with Python and DVC. In particular, we will be covering some data engineering with MovingPandas and DVC. We have one and a half hours, which is not that much, but I hope I can show you some neat little tricks and show you why it might be interesting to try both of these tools if you ever work with mobility data or just want to do data versioning with DVC in general. To get my curiosity out of the way right at the beginning, can I ask: who of you has already worked with MovingPandas? Okay, a couple of people.
And DVC? No one at all. Okay. Some of you looked at the tutorials. That's awesome. It's already a really good start. But let's go over the most important basics anyway. To follow the tutorial, please go to the MovingPandas examples repository. There, in the OpenGeoHub branch, we have a dedicated tutorial for this session. In the main branch, you will find all the tutorials for MovingPandas in general, always for the latest release. They are also included here in this branch, in the subfolders one, two, and three. And in subfolder zero, we have today's session materials. If you followed the pointers to the preparation study materials, you already know the home pages
of DVC, GeoPandas, and MovingPandas. But let's quickly revisit them anyway. MovingPandas is a library for data exploration and analysis, but specifically for movement data and, even more to the point, for tracking data, for trajectories: that is, individual objects moving through time and space. Its use cases are not restricted to only human movement or only movement ecology; it's meant to be general purpose, as you can see from the couple of examples that we already have here on the landing page. This example visualizes migrating birds flying from northern Europe to Africa. And you can see that, by default, we use interactive plots that are nicely linked and make for easy data exploration. That's one of my main motivations
in developing this library. It's a bit of a connection to the previous session we had here downstairs on understanding machine learning: this is more about understanding the data that maybe goes into machine learning. So a really strong focus of this library, for me, was to make the plotting useful for exploring data sets. You can look at this kind of global data, and you can also look at very local data. On my blog you will find a recent post about video-based trajectories. This is from Denmark, where they had put up a camera with an object tracking algorithm and then extracted the trajectories of bicyclists at this intersection. You can use MovingPandas to visualize these kinds of trajectories too, and you can put the camera image in the background so you have nice context for looking at the data, to see what's going on. Maybe
you have some weird artifacts that you're interested in better understanding. So that's also possible. You can also use custom projections, so it doesn't always have to be Web Mercator. For example, you can use a south polar stereographic projection if you want to look at icebergs.
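As a side note on what such trajectory handling boils down to: a trajectory is essentially a time-ordered sequence of positions, and metrics like speed fall out of consecutive fixes. MovingPandas computes this for you (its Trajectory class offers an add_speed() method); the following is only a rough pandas-only sketch of the idea, with made-up coordinates, not MovingPandas' actual implementation.

```python
import math
import pandas as pd

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    r = 6371000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def add_speed(track: pd.DataFrame) -> pd.DataFrame:
    """Derive a speed column (m/s) from consecutive GPS fixes."""
    track = track.sort_values("t").reset_index(drop=True)
    dist = [0.0] + [
        haversine_m(track.lat[i - 1], track.lon[i - 1], track.lat[i], track.lon[i])
        for i in range(1, len(track))
    ]
    dt = track["t"].diff().dt.total_seconds()
    track["speed"] = pd.Series(dist) / dt  # first row has no predecessor -> NaN
    return track

# Two made-up fixes ten seconds apart, roughly 100 m of eastward movement
track = pd.DataFrame({
    "t": pd.to_datetime(["2023-08-30 10:00:00", "2023-08-30 10:00:10"]),
    "lat": [48.2082, 48.2082],
    "lon": [16.3738, 16.3752],
})
track = add_speed(track)
print(track["speed"].iloc[1])  # roughly 10 m/s
```

In MovingPandas itself you would instead construct a Trajectory from a GeoDataFrame and call add_speed(), which also handles projected coordinate systems and units.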
And of course we have all kinds of analytical tools as well, such as stop detection. You can see an example here: on the left-hand side is a plot of the stop locations, and on the right-hand side those have been used to split the original trajectory into individual pieces.
So then they're all colored by their new ID. And you can even build some nice interactive data applications with user interfaces that run either inside Jupyter notebooks or as standalone applications, if you desire to do so. Basically, all these examples can be found in the Jupyter notebooks in the MovingPandas examples repository, in the folders one, two, and three. You can look at them afterwards whenever you want to. The main code repository for MovingPandas itself is movingpandas/movingpandas. This is where the development is actually ongoing. You can find all the many contributors that we have there, and you can always check the latest version. If you installed the environment, you will probably have version 0.17.0 or 0.17.1, which was recently released; the 0.17.1 release just fixes one installation issue, so it doesn't matter if you have 0.17.0 installed today. We also have the installation examples here. If you follow the instructions and the materials provided, I recommend installing Mamba, which is a faster solver than Conda.
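For reference, an environment file for this setup might look roughly like the following; this is a sketch, not the tutorial's actual file, so the environment name, pins, and extras may differ:

```yaml
# environment.yml (sketch; names and pins are illustrative)
name: mobility-dvc
channels:
  - conda-forge
dependencies:
  - python=3.10
  - movingpandas
  - hvplot
  - jupyterlab
  - dvc>=3
```

With Mamba installed, `mamba env create -f environment.yml` resolves and creates such an environment considerably faster than plain Conda would.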
So for all these libraries there are a lot of dependencies, particularly if you add the interactive plotting as well. By installing Mamba, you can reduce the installation time of the environment significantly. That's why I always recommend it; it definitely reduces the pain of handling Python environments a little. There are also related Python packages, if you're ever wondering, and in R there's a ton. So I have a list here of tools for the analysis of movement data that you can have a look at. In Python, there are around ten that I'm aware of, the earliest dating back to 2018, but there's continuous development right now and new ones keep coming in. There are a couple in C++, and there's this huge review paper about R libraries, which covers a couple of dozen more. So I haven't listed all of them here. I
would just refer you to this paper if you're particularly into movement ecology. So that's the related work. Okay, with MovingPandas out of the way, let's look at the second part that we will be covering today, and that is DVC. DVC stands for data version control, or, well, here they say not just data version control, but the part that we are going to cover is the data version control part. What does that mean? I'm sure you're all familiar with Git. You have been using it at least to fork or download some repositories. The issue with Git repositories is that you usually don't want to check in large amounts of data or large binary objects, because they just mess up your repository. They bloat it up. If you update them, a whole new version gets uploaded.
It just gets really big and unwieldy. So people have been thinking about how to handle that. Quite often, if we want to collaborate with someone, we have the code in Git and the data might just be on a shared drive, on a Google Drive, some SharePoint, or whatever other evil thing. And sooner or later you have results somewhere, maybe in a paper that you submitted for review; you have code on some GitHub repository or a local GitLab; and you have data somewhere on a shared drive. Someone continues working on the code because they need to fix a bug. Someone finds a problem with the data sets and changes that. And nine months after you submitted your paper, you get it back with revisions, and you have to redo a lot of stuff in your analysis and in your plots. Do you still know which version of your data and which version of your code generated these plots? No, you don't. Nobody does. In this case, you have to sit down and figure it out. Maybe you just run it with the latest version again and put the updates in for your revision. If you work with clients, it's even worse. If you have to give them an analysis, you have to present those results, and then they question every single piece, and then they ask you to just fix this one
tiny thing, and you maybe have to rerun really long computations. DVC can help us cut down on the time it takes to run experiments, because it has, for example, caching features: it remembers things that you have already run that have not changed and can therefore be taken out of the cache. It helps you keep track of which version of the source code has been run with which version of the data and has produced which results, so that you can state more confidently that, yes, we actually know which version of the code and which version of the data produced these figures, and we can reproduce them if we need to go back in time to those results. And we will see today a couple of the core things of how this works. It's a command line interface, very much like Git. The commands are very similar, as you will see, so I think you will feel very comfortable after a really short time. There are also integrations, for example in VS Code; already on the main page they advertise their plugin for VS Code.
It can be nice particularly to visualize the experiment tracking results which is something that we don't have time today but you can see here on the right hand side an example basically for every revision of the code you could keep tracks of some metrics of your machine learning
model. It has a nicer visualization than this command line that is built into vs code via this plugin but it also is still a bit flaky so I recommend using the command line to start out with. It gives you really nice instructions for what the next step might be and what you might want to do
with your configurations to get it set up nicely and it's way more transparent than the user interface that they currently provide via this vs code plugin. Okay, great, so let's get started I would say. If you have checked out the branch moving pandas open ghop 2023
and if you have set up the environment — there is this environment file, of course, we can quickly look into that one. I've pinned the DVC version to greater than three here, which should be coming in by default anyway, but just for safety reasons. We have, of course, MovingPandas, and for the interactive plots we are using hvPlot. We will see those in a moment in the Jupyter notebook that I have prepared, but these are the main libraries that make this tutorial happen. Okay, let's go. In the tutorial markdown file you can follow along with all the instructions, also at your own pace, obviously. The first thing in data versioning, of course, is that we need a data set. To start tracking a data set, we should first prepare our repository, or the directory in which we want to have DVC working, and for this tutorial I suggest working in the OpenGeoHub session start directory, which is an empty directory here in this subfolder. You can also peek into the solution folder, which already gives you the results, in case you ever get stuck or in case you have not installed the environment and just want to have a look. So the first thing we need to do is initialize DVC in this start directory.
So first let's activate our environment, and then I still need to change into the MovingPandas examples and OpenGeoHub session start directory. Great. And here we do a dvc init with the subdirectory flag. We need to add the --subdir flag here because we are not doing this in the root of our repository. If we were doing it for the whole repository at once, we could omit the subdirectory flag. Basically, this initializes the DVC repository. It also shows you — this is what I mean — they have really nice output on the command line, which gives you all the instructions for how to get started. We can also have a look at what happened in the folder. Remember, at first there was only the readme file. Now we have a .dvc directory and we have a .dvcignore file here. This is all that happens in the beginning. Also in VS Code, of course, we can
look at the .dvc directory and .dvcignore, which is basically empty so far for us. The next thing is that we can download a file, and here I wanted to point out this really neat function that DVC includes: dvc get. You can just point dvc get at GitHub in two parts: first the repository, and then the path to the file within that repository, and it will download exactly that file from GitHub for you.
So this is really neat and can also be used in many other contexts. Let's download this small csv file of boat positions. This is the one that we are going to work with. Not very big, but it needs a moment. There it is. So now we can see we have a data folder, and in the data folder we have a positions csv.
So now, already with the original version of the data set, we should start tracking. This is the original one; now we want to make sure that whatever we change from now on will be recorded by DVC. So the next thing — similar syntax as in git — instead of git add we do dvc add with the data file. And even the outputs are similar to git. When we add our first file, we get the information that we should now also do something with git: we should update the .gitignore and we should add this new .dvc file for the positions csv. Let's have a look at what this strange file actually is. So in the data directory we now have a placeholder file with the .dvc extension, and basically this only contains a hash, information about the file size, the type of the hash, and the original path — the name of the file that is now represented by this placeholder. This placeholder is going to be put into our git repository, and it will tell DVC which version of the file to collect when we check out this commit from GitHub. That's how all these things will be glued together, based on these hashes. And the csv file itself we have to put in .gitignore so that it does not end up in git, because that's what we wanted to avoid all
along, right. So of course we can always do it in this two-step manual process: dvc add, and then update the .gitignore manually. But there is also a flag to tell DVC to help us a bit: we can set dvc config core.autostage, and then it will automatically make sure that whenever we add something to DVC, it is also added to the .gitignore, and the .gitignore is automatically staged for the next git commit, so that we don't always have to do that manually.
It's just a bit of a convenience function. You still have to push, you still have to git commit and git push manually, but at least you don't have to do the git add all the time as well. This is basically what they also recommend doing here: to enable auto-staging, run dvc config core.autostage true. Let's do that, and then we can basically add all of these new files to git. Let's first look at the status, maybe.
So git status says that we have new files and modified files, right? We have the .gitignore in .dvc, we have the DVC config, we have .dvcignore, and we have the important new data directory that we also need to add. So let's do that. Perfect. And similar to git status, we also have dvc status. dvc status currently tells us that all the data and the pipelines are up to date. We will soon change that: if we modify one or the other, we will see how DVC reacts to those changes as well. But for now, we have put our original data file in, and we can commit the current situation.
For example, just git commit with the message "add dvc". It will commit all of these files, and we are set. Now we can actually start with some modifications. For example, if we look at our boat positions file, the header is really a bit of a mess, with capitalized column names mixed with lowercase column names, and — let's just decide — they are all way too long for us to keep typing; it's just going to be a mess. So let's do that.
And we save our changes in this file, and when we now do dvc status, DVC notices that we have changed our input data file. So whatever depends on this input data file would now have to be redone. What we can do, if we want to keep these changes, is commit them in DVC. This warns us: are we sure that we want to change the .dvc placeholder file for the csv? Because there is a new hash and maybe a change to the size of the file, we get this little warning — but yes, in this case we want to save our changes. Now if we do dvc status again, everything is fine on this side. On the git side, on the other hand, we now of course have the changed placeholder file. So we can add this changed placeholder file, and we can commit that, for example with the message that we updated the header.
All fine so far, or do we have any questions about these basic steps? We're just keeping track of a couple of changes. Let's now imagine we change our mind: we don't want this change after all, it messed something up, we want to go back. In git we can go back with git checkout HEAD~1 — that is, going back by one step. You can increase the number if you want to go back more steps. And then we just say we want to go back only for this one file, only for the .dvc placeholder file.
It confirms that it has updated one path to a previous version. And what you then need to do is run dvc checkout. This looks at the hash that is now in the .dvc placeholder file and gets the correct file associated with this hash. So it also makes sure that the data set is reverted, and really, if we just switch back to the csv file in Visual Studio Code, the old header appears again. In dvc status, if we want to have a look at that, everything should be fine, because we are just going back in history — it's a known state and everything is as it should be. To return to the latest version, you can again do git checkout, this time with HEAD without a number attached to it. This lets you go back to the latest version. Again, first you just get the placeholder file, and then in the second step, with dvc checkout, you get the actual file back in working order. So now we are back to the short header here. Of course, instead of HEAD and positions relative to HEAD, you can also use the hashes from the git commits if you want to go back to a specific commit based on its hash.
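The revert-and-restore dance from above might look like this as commands (the file name is again a placeholder):

```shell
# Go back one commit, but only for the placeholder file:
git checkout HEAD~1 -- data/boat-positions.csv.dvc
dvc checkout                # restore the matching data file from the cache

# Return to the latest version:
git checkout HEAD -- data/boat-positions.csv.dvc
dvc checkout
```

Git moves the tiny placeholder; dvc checkout then swaps the actual data file to match the hash inside it.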
To find those hashes, you can of course look at the website in your browser, but you can also look at the git log, and in particular git log --oneline is pretty nice. So, all of these hashes — you could go back to any of them at any time and get both the code and the data sets that were
registered at that point in time. So this is how we can modify the data, but of course we want to use it, because this is about data engineering and data analytics. So let's set up a data pipeline; let's finally use these boat positions with some MovingPandas. Usually the workflow is to first develop the analytics workflow in a Jupyter notebook — at least that's what I do. And then, when I'm happy with it, for use with DVC I migrate the code from the notebook into a Python script that can be run from the command line, because that is basically what we need in order to put it into these data pipelines for DVC. But let's look at the notebook example first, and I'll walk you through what we can
for example do with the boat positions. This notebook is in the solutions subdirectory, and it uses many of the libraries I previously mentioned. Another exciting one, particularly for visualization, is the Datashader library that we have also been using. If you remember, in the previous talk it came up a few times that we shouldn't run this or that plot because it would run for a while — there are just so many points to plot. Datashader is really great at plotting many, many points — hundreds of thousands, millions of points — in your Jupyter notebook, because it runs on the graphics card: it just rasterizes everything and pushes the raster image to your notebook, which makes it possible to really plot lots of data points at once. It's a really neat little library in that
way. So let's import all that stuff. Great. MovingPandas' show_versions will show you the most important dependencies for MovingPandas. If you ever have an issue that you want to report, please include the output of show_versions, because it really helps me debug these things — some of the problems only show up in certain combinations of the different library versions. The next cell just sets up some defaults: I'm picking the background tiles that I want to use, and I'm also already defining the ship ID here, because I'm particularly interested in one specific ID — it's the Ever Given, if you remember, the ship that got stuck in the Suez canal — and it's in this data set, we will find it. Reading the data is the usual read_csv, of course. The only thing to note about this particular data set is that the timestamp format is not a default one, so you have to specify the format here, but that is straightforward, and the rest you have seen often before. And now we can already plot
them, and I already told you that Datashader is fast — this is Datashader. Two things that you need to know about it: Datashader will run on the graphics card, but if you want to overlay it on top of a background map in Web Mercator, you have to first make sure that your coordinates are also in Web Mercator, otherwise things will not align. So they have this utility function here that turns latitude and longitude into meters in Web Mercator. Run this one first and make your x and y columns, and then we can create a map with the background tiles. The asterisk operator puts another layer on top of the plot, and this other layer is our data frame with the x and y columns — basically just a scatter plot of x and y with datashade=True — and this gives us a nice density map that is interactive and updates when we zoom in, of all the tens of thousands of points in this data frame. And it's really cool to zoom into the details of these ship data sets, because they have these really nice patterns: for example, if the ships are at anchor, they make these circle patterns that we see here. And of course we see the main traffic through the Suez canal here as well, with the stops here in the Bitter Lakes and the other stops, obviously, on the north end of the whole thing. So far so good — nothing suspicious yet. The next step is to turn all those individual
points into trajectories, obviously, if we want to analyze anything besides point densities. So we have the MovingPandas TrajectoryCollection object, which we just feed the data frame. We tell it that we can distinguish the individual moving objects by the id column, we tell it what the timestamp column is, we tell it the latitude and longitude columns and the coordinate reference system, and then we get a trajectory collection. I notice here that we get a couple of warnings from GeoPandas, this SettingWithCopyWarning, so let's keep that in mind for our script — we'll ignore those warnings, since they come from some GeoPandas update, I think. In any case, the trajectory collection is created, with 250 distinct trajectories in it. We can plot them in many different ways. For example, we can turn them
into a data frame of lines — that would be with the trajectory collection's to_traj_gdf function. What that does is create a GeoDataFrame with line geometries and the start time and end time of every individual trajectory, its length, and its direction. And we can of course also plot this with the normal GeoPandas plotting — in this case with hvPlot, which again gives you interactive plots based on hvPlot and HoloViews. That means we can nicely zoom in and see all the details of what is going on in these trajectories. We can use different background tiles to look at the data and really get a better feel for what is going on here. If we want to look at just an individual trajectory, we can use the trajectory collection's get_trajectory with the ID of the trajectory we're interested in. The nice thing about the most recent versions of MovingPandas is that I have included a hack that puts a nice little arrow marker at the end of the trajectory, so you can actually tell whether they were going north to south or south to north. This was remarkably fiddly, because it's just not a default thing to put a marker at the end of a line. But that's something for a single trajectory, right — this is just for one ID right now. Technically, we can also apply this to a trajectory collection plot, and as you can see in this example, every trajectory has this arrow marker at the end.
Another thing we can do is calculate the speeds. A nice addition in recent versions is that you can specify the units that you want the speed to be calculated in, so you can have any combination here — meters or nautical miles, per second, per week, per year — whatever tickles your fancy. But kilometers per hour is of course nice, because I can never remember what knots actually are or how to convert them into anything reasonable, so my ships are also going in kilometers per hour. This will add the speed between consecutive measurements for every record, for every individual trajectory in the whole trajectory collection. So if we look at, say, tc.get_trajectory(1).df, we have a new speed column added here, in kilometers per hour. A question: in urban environments the GPS data might be very noisy, giving very
high speeds — is there anything to filter? Yeah, that's an excellent question. Yes, there are multiple ways of filtering, depending on what kind of noise you expect. Sometimes in GPS trajectories you have these really long jumps, these outliers, which might end up hundreds of meters away from where they actually should be. We have a cleaner for that, based on the speed value becoming really unrealistic for a short time — that is the trajectory cleaner that we have implemented as a class. Whenever you want to see all the available functions, we of course have a Read the Docs site in addition to all the tutorials. Among the cleaners you have an outlier cleaner, which you can apply to trajectories based either on a maximum speed that you say is realistic, or on a factor of one of the higher percentiles — if a speed is, say, three times as high as the 95th percentile in the data set, then you want it removed. That is the one for these jumps. And the other one is a smoother, which is Kalman-filter based, and this will remove the noisy wiggling of the trajectory. On the website you will see this example down here, where we have a slightly noisy trajectory, still, after the first cleaning step, and if you want it really nicely smoothed, you can apply the Kalman filter on top of it, and it will give you the smoother trajectory. Of course, you have to play with it, because it might also remove some reasonable things that are going on, but it will definitely help to get rid of the most egregious errors that we usually have in trajectory data sets. Cool. Do we have any other questions at this point — about the speeds, about handling units, any of that
stuff? Don't hesitate to interrupt me at any point, in general. Yeah — every one of these trajectories just has a data frame object that you can access and analyze whenever you like, with the regular pandas functionality.
Of course we can also use the speed in our plots. We can tell hvPlot that we want to color the lines according to the speed value with the c parameter. We also have control here over our color maps — for example the viridis color map — and we can limit the range of the values, for example here between zero and 10 kilometers per hour, to get a better feel for where the faster and the slower movements of our trajectories are. Now let's look at what the Ever Given was doing. Basically, again, we filter to one specific trajectory, the one of the Ever Given, and plot it on top of all the other trajectories. So first, as a background layer, I say: give me all the lines from the trajectory collection and plot those as white lines with 0.5 alpha, so they are somewhat transparent. And then on top I would like to get only the Ever Given trajectory, and of that one the speed, with this color map, and in this case without any background tiles, please — otherwise I would just be covering up what is on the layer below, right? So whenever you stack multiple of these trajectory plots on top of each other, you have to deactivate the tiles on the higher ones, otherwise you will not see what's below. That's a bit of a caveat, but it's a nice and reasonably fast way to visualize all of this. And we can now see the Ever Given: we can see how it came from the south, how it was stopped, or at least moving slowly, down here, like many of the other ships do as well, and then it continued its way until it got stuck at this location. That's where it ends in our data set — we don't have the resolution of the drama when it got moving again; we only have it stuck here at this position. We can also combine the Datashader and our MovingPandas plots. So we can also have the scatter plot with datashade=True as a background image, if that is necessary — for example because the data set is so huge that the previous plot wouldn't work anymore. If we plot the line strings, we still have the option to do this one instead. You can also see here how to customize the color map for the datashaded heat density map. I've just chosen three color names here — light pink, hot pink, and dark blue — because they give a nice contrast to the viridis colors, but really, whatever you want, you can specify here in this color map and set up a plot like this with quite a large amount of data. Now let's do some analytics; let's step away from the visuals for a second.
The Ever Given got stuck, of course, so it stopped unplanned, but many of the other stops the ships make are of course planned, and in many other use cases, when we look at movement data, we are also interested in where and for how long our objects stopped — maybe to infer what they were doing there, or even what kind of objects they are. For that we have this neat little TrajectoryStopDetector class, which you provide with the trajectory collection. You can also use multiple threads now, in the recent version, to speed up the computations a bit. It's still not lightning fast, because MovingPandas for me is first and foremost a tool for rapid prototyping and data exploration — it's not something that I would necessarily put on a compute cluster in production for the next 10 years without touching it. But every tool has its job, right? So we create the stop detector and then we run the get_stop_points function, which we have configured such that a stop should be at least three hours long, and within those three hours the ship should stay in an area with a diameter of at most one kilometer. The diameter is in the units of the coordinate reference system, and in the case of geographic coordinates it's in meters — we take care of the conversion internally. Great, let's do that.
That's now running for our 250 trajectories. We automatically get a column with the duration in seconds, so we also calculate a duration in hours, as you can see here. Some trajectories might have multiple stops, and some might only have one, but we always get the where and the how long, and of course the trajectory ID, so we can understand what was going on. A question came up: the parameters that you specify for the stop detector — that's the duration and the radius as well, right? Yes, there is a minimum duration and a maximum diameter. And of course you can tune those depending on your application; those values can be widely different. We specify the duration as a timedelta, which is also kind of important for me, because I want to be open to many different use cases. Some of them might be interested only in stops that take at least a week; others might be interested in things that take just 15 seconds. I don't want to make users specify it in seconds and then calculate how many seconds 15 weeks is — that's just silly if we can have a timedelta here. And of course we can also plot the stops, because they are just another GeoDataFrame. So again we can use hvPlot, and we can use the duration in hours that we just calculated to scale the size of the point markers, so we get bigger markers for longer stops, which is also nice for exploring our data. Obviously, we can also create non-spatial plots from the data frames, because we can simply cast the GeoDataFrame to a regular DataFrame and then use things like hvplot.scatter on any x and y — for example to compare the start time and the duration of the stops. This is particularly handy because, in red, you see the two stops of the Ever Given: first in the south, when everything was still okay — that is the earlier one — and then the later one, when suddenly everything else got stuck. And basically, when you see this full diagonal here in this plot, it means that all the ships that arrived at these stops stayed there until the end of our data set — they never got moving again. Everything on this diagonal are stops that lasted until the end of our data set, and this is a really good indicator that something is really off in this system now. Of course, there can always be ships that just anchor somewhere and don't want to move, but these ships here — they wanted to move.
And the final cell I included just because I stumbled over this nice way of also visualizing data in tabular form, and I thought I'd share that with you as well. In this tabular view you can get cell highlighting really easily, by just applying the style and the color map to it, and I find that also a really neat way to explore data in a Jupyter notebook, which I don't often see presented in tutorials — so maybe you find that interesting as well. A question: at which point in the processing does that step appear? Just this last one. Okay, maybe we'll look at that afterwards, because that is neither DVC nor MovingPandas, really; I just wanted to throw it in because I find it interesting. I think I also have a completely new environment, so let's figure that out afterwards. One more question: you customized the background gradient, and I was wondering — does the font color also change according to the background? Yes, automatically. You don't have to take care of that yourself; they do it automatically, which is another sign that someone actually put some thought into it.
So we now have a simple analysis; we have set up a stop extraction workflow. We are reading the original csv file, we are creating the trajectory collection, we are putting it in the stop detector, we are calculating the stop points, we add the stop duration column to the GeoDataFrame, and we can also save the GeoDataFrame with regular GeoPandas functionality — to file, as CSV, GeoPackage, GeoJSON, or whatever is supported by GeoPandas. If we now want to automate this particular workflow — because we're happy with it and think we'll have to reproduce it in the future — we just create a script out of it that pulls everything together neatly. It's really just a few lines of code in the end; if you know what you're doing, it's usually simple, right? I've also put in here that we want to ignore these SettingWithCopyWarnings that we saw in the notebook — we don't want those. So we take this couple of lines and create a new script in our directory, a new file — I forgot what I was planning to name it — extract_stops.py, let's put that in here. Again, basically just a main and a run function, nothing fancy here. And as an output we get a GeoJSON.
So this now runs from the command line: if we just type python extract_stops.py, it should run. Yeah — it's reading, creating... it takes a while, obviously, because, just as before, the stop detection algorithm is not lightning fast, but it will finish, and we basically get a print of the stops data frame. Yes, there it is — and it created the GeoJSON file as well.
How do we now tell DVC that it should take care of this process — that it should be able to run this data processing step? This is done using dvc stage add. Every step in our data process is called a stage. We have to give a name to our stage; in this case, let's call it stop_extraction. We have to give it its dependencies; we have to let it know about all the things it is supposed to use. In our case, it needs the script and it needs the input file, so we add two dependencies: the Python script and our csv file. We tell it that there will be one output — the stops GeoJSON — and then at the end we tell it what to run, and that is again python extract_stops.py. So this is the command that will be executed, and the other things are just so that DVC knows what is coming in, which dependencies it's relying on, and what it is creating. When we do that, we get as output that DVC added a stage called stop_extraction in the dvc.yaml file. So let's have a look at the dvc.yaml file — that is new. It will eventually contain all our stages; we can add as many as we want. The first one is stop_extraction: we can see the command here, the dependencies, and the outs. These stages can also take care of additional things, such as parameters. For example, if we wanted to make it a bit more complicated, we could pull out something like the minimum stop duration that we use in the stop detection, put that in a parameters file, and tell DVC that there is a parameter value it should track. Then we can just change the value in this parameters file for whatever range of values we want to test, and DVC will pick up: okay, the code maybe hasn't changed, the data set maybe hasn't changed, but this parameter has changed, so I need to rerun this process. You can also skip the command-line magic: if you don't like doing dvc stage add and remembering all the flags, you can edit this yaml file directly. DVC doesn't care — you can just edit it here, and if you don't like writing these commands, that is perfectly fine.
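Along those lines, the generated dvc.yaml might look roughly like this — the stage name, paths, and the optional params section are illustrative, based on the walkthrough:

```yaml
stages:
  stop_extraction:
    cmd: python extract_stops.py
    deps:
      - extract_stops.py
      - data/boat-positions.csv   # placeholder file name
    outs:
      - stops.geojson
    # Optional: track a value from params.yaml so that changing it
    # triggers a rerun of the stage:
    params:
      - min_stop_hours
```

Whether this file is written by dvc stage add or edited by hand, the result is the same: DVC now knows the command, its inputs, and its outputs.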
So from now on we don't have to remember how to run our scripts, basically, because now DVC knows how to run them. All we need to do, if we want to reproduce our experiments, is run dvc repro, and it goes through all the stages that we have configured and runs them. Let's read what it says: the .dvc file for the positions csv didn't change, so nothing to do there, and it goes on to run the next stage. Then we basically get the output that we had before, when we ran our script directly from the command line, and it finished successfully and created an output. And because we have an output, we also have something that we can push to storage, if we want to. We haven't really talked about storage, because what I'm showing you here is just us working locally, but you can configure different places where you want to store your data files, so that they are accessible for multiple people to collaborate. The simplest examples that you will find on the DVC site use Google Drive, where multiple people can push their data sets to share them. I've also successfully used it on a file share within a company, and you could also use other clouds like AWS, though I haven't tested those myself yet for data storage. So for this tutorial we'll just stay local; we are not dvc pushing to anywhere. However, if we now run dvc repro a second time — maybe the weekend has just passed, we don't remember what we did on Friday, and we just want to make sure that we have the latest version of all our experiments; it's Monday, so we do dvc repro again — this time it tells us: your data has not changed, your stage has not changed, everything is ready, you don't need to do anything, you're on the latest version, nothing to do here. So it may have saved us some time, because we didn't have to rerun this experiment — there are no changes that make it necessary to rerun the stage.
Let's see what Git has for us now. Git has a couple of new files: we of course set up the dvc.yaml file, we also have the lock file (dvc.lock) and some changes in .gitignore, and we still have the script to check in. So let's add the script. All green, okay, we can commit: we now have our first stage set up.
That's great. And now, of course, we continue working on our script and improve it. For example, we will remove this noisy print statement that is just filling up the output whenever we run the reproduction. If we comment it out here and then do dvc status again, DVC realizes that we have changed something in the stop extraction stage; in particular, we changed one dependency, namely the extract-stops script. So it knows something is different and needs to be rerun. When we now run dvc repro, it knows that the data set hasn't changed but the stage has changed, so it reruns that one.
And now, what happens when we have multiple stages: does dvc repro run all the stages that we defined? Yes, it will
run for all the stages. It will for each single one check if anything has changed, if it needs to be run or whether it can use the results from the cache. But it will run them and it will make sure that it runs them in the correct order. So if you specify the output of one stage as the input for another stage, it will know in which order to run the stages. And is there a way to run only one
stage if we want? Yes, you can specify the name of the stage in dvc repro if you don't want to run everything. So now our output is nicer: we don't have this noisy print of the GeoDataFrame in the middle, which is exactly what we wanted. We have updated our script, so we can again commit our changes to Git: we have removed the print statement.
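The stage-ordering behavior described a moment ago, where the output of one stage is the input of another, is a topological sort of the stage graph. A minimal sketch with hypothetical stage names (not DVC internals, which derive the graph from deps/outs paths):

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Each stage lists the stages whose outputs it depends on (hypothetical names).
stages = {
    "extract_stops": {"clean_positions"},
    "clean_positions": {"load_csv"},
    "load_csv": set(),
}

# static_order() yields every stage so that dependencies come first.
order = list(TopologicalSorter(stages).static_order())
print(order)  # → ['load_csv', 'clean_positions', 'extract_stops']
```

This is why specifying outputs and dependencies in dvc.yaml is enough for DVC to know the correct execution order.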
All good. The other use case, of course, is that something changes in the input data: either we got a new data set, some revision from the client, or we found some issues and had to clean the data a bit more. So, for example, let's delete a couple of lines here, just so that we have a change. When we do that and run dvc status, we now see that, on the one hand, we have a change to the stop extraction stage, because we have changed the CSV, and on the other hand a change to the placeholder (.dvc) file, because the hash of the file has changed and the size of the file has changed.
So in this case, when we run dvc repro, it runs a verification step first, but otherwise it doesn't have to do much about the placeholder changes. It does know that it has to run the stop extraction and make sure that all our experiments are now using the latest versions. dvc status of course now shows a new placeholder file and a new lock file, because all our outputs
have changed. So we would again commit this. And let's have a look at the whole history of our tutorial so far: we added DVC, we changed something in our original file for the first time, we added our first DVC stage, then we changed something in the script, and we changed something in the data file again. We have a track record of everything that we did, to both our data and our code, which makes it more reliable and easier to provide proof of the work that you have been doing. Of course, as said before, one of the big advantages is that we can undo our changes. So, for example, if I undo the deletion
of the rows here and save the changes, DVC will of course pick up on it; it again notices that there are changes. But if I do dvc repro, we get another nice thing: it knows there have been changes, but it also knows that it has seen exactly that version before. It recognizes that we just reverted some changes, and it can simply pull the results from the cache; it doesn't actually have to run again. So even in this case it saves you time, because you don't have to recompute results that are already in the cache, and you don't have to manage the cache manually, which I think is really neat. Okay, that's the main quick tutorial that I wanted to show you today.
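The pull-from-cache behavior can be sketched as a content-addressed lookup (a simplification, not DVC's real cache layout): results are stored under a hash of the stage's inputs, so reverting the inputs reproduces a key that is already in the cache.

```python
import hashlib

cache = {}  # maps input fingerprint -> stored result

def fingerprint(data: str) -> str:
    """Key the cache by a hash of the input content."""
    return hashlib.md5(data.encode()).hexdigest()

def run_stage(data: str) -> str:
    """Run a (hypothetical, stand-in) stage, or reuse a cached result."""
    key = fingerprint(data)
    if key in cache:
        return cache[key]   # seen this exact input before: no rerun needed
    result = data.upper()   # stand-in for the real, expensive computation
    cache[key] = result
    return result

v1 = "positions v1"
run_stage(v1)               # computed and cached
run_stage("positions v2")   # new input: computed and cached
assert run_stage(v1) == "POSITIONS V1"  # reverted input: served from cache
```

Nothing here is DVC API; it only illustrates why reverting a change makes the rerun free.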
Obviously, this is not the only way you can use DVC. As I mentioned, their website has lots of examples, for instance on how to track experiments and how to add parameters to the stages, as mentioned before. I also have another example on my blog where I combined DVC and the QGIS Model Builder. Model Builder also has these stages, right? You can think of the individual components of your model like the stages in a DVC workflow, and you can convert them pretty straightforwardly into DVC. This DVC visualization, which you get if you type dvc dag (directed acyclic graph), is basically a similar visualization to the one you have in the Model Builder. Of course, this is only fancy if you have multiple stages, so it doesn't really make sense to show it with our tiny example with just one stage.
But basically, what you have to do here is export the individual tools used in the processing model as individual scripts. In this case, I have one that creates random points and one that buffers the points, just these two, and then I put them together in a dvc.yaml file with minimal configuration.
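A minimal dvc.yaml along those lines might look like the following sketch. The script and file names are hypothetical; stages, cmd, deps and outs are the standard dvc.yaml fields:

```yaml
stages:
  random_points:
    cmd: python create_random_points.py
    deps:
      - create_random_points.py
    outs:
      - data/points.geojson
  buffer_points:
    cmd: python buffer_points.py
    deps:
      - buffer_points.py
      - data/points.geojson   # output of the first stage, so it runs first
    outs:
      - data/buffers.geojson
```

Because data/points.geojson is an output of one stage and a dependency of the other, dvc repro infers the execution order automatically.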
Okay, this is too small to show you here, so I think you have to watch that yourself; it's HD content. But basically, there are a lot of possibilities: everything that you can put into a Python script that can be run from the command line and that creates file outputs somewhere can be put into DVC, so that you can track the whole process and really gain more confidence in being able to reproduce a certain stage, a certain set of results, with the original data sets that you were meant to be using, and maybe update them accordingly. Do we have any more questions regarding DVC or MovingPandas in general? Yes, please. So this may be interesting to other people,
but I downloaded the data and it had the wrong name, and I could not find an easy way to rename or move files. How does that work in DVC? You downloaded the data set with dvc get, is that what you mean? Yes, and because of a mixed-up path separator (backslash versus forward slash) it ended up in a file called data instead of the intended location. Now I want to move it while it is tracked by DVC. How does this work? So you had already added it to DVC under this wrong name, and then you want to change it? Yes, I changed the file. That is an excellent question. If it were Git, then you could maybe do a git add and remove. Yeah, I imagine they follow exactly the same workflow, because that's what they usually do: they try to mirror the workflows that Git has. I think Git has an mv command, but mv is basically just a convenience function for deleting the file and adding it again. The other thing is: if you update the path here in the placeholder file to the correct name and rename the file on the command line, I think it should also work, because that path is the only thing that holds the two together. Yeah, but then you have a .dvc file pointing like that; normally the .dvc file sits together with the data file, and it's not,
which is not nice, right? Well yes, you will have to move the CSV file into the correct directory manually on the command line. That's true, yeah. So this lock file basically lists all the files that belong together at a certain stage.
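This bookkeeping file is DVC's lock file, dvc.lock. A fragment might look like the following sketch; the paths are hypothetical and the hashes and sizes are dummy values:

```yaml
schema: '2.0'
stages:
  extract_stops:
    cmd: python extract_stops.py
    deps:
    - path: data/positions.csv
      md5: 0123456789abcdef0123456789abcdef   # dummy hash
      size: 12345
    outs:
    - path: data/stops.geojson
      md5: fedcba9876543210fedcba9876543210   # dummy hash
      size: 6789
```

The recorded hashes are what lets a Git commit of this small text file pin down the exact versions of the large data files.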
So when you commit this lock file into Git, you basically get the status of all the individual files that belong to this commit: it keeps track of the dependencies of the stage and of the outputs of the stage. This GeoJSON file, for example, does not exist on GitHub; it only exists as a reference in this lock file and in your DVC storage. If you have set up a DVC storage, for example on Google Drive, it will be there, and if you collaborate with other people and they do a dvc pull, like you would do a git pull, it fetches these results from the storage. Yes, I was wondering: there must be a limit to the DVC cache, right? And if you happen
to know what that limit is? For example, time-wise, does it keep data from ten years back, or something like that? That is an excellent question, and to be honest I haven't dug into it yet. That's something we could certainly follow up on: for how many steps in the history it keeps the cache. I imagine it is defined not as a time frame but as a certain number of commits that it keeps in the cache. But to be honest, I'm not sure how much control you have as a user at this point, or whether the DVC developers are still working on making that particular thing more configurable. I guess the most important part would be to know how much space I have, so that if I try it a hundred times, I know when it starts throwing out the first try, something like that. No, I absolutely agree.
Besides the cache size, which unfortunately I'm not sure about, I think the DVC config has settings that affect the cache if you want to customize it. What I found a bit fiddly is keeping track of all the different .gitignore and .dvcignore files that are created along the way, because you always have to be careful that you really keep your data files out of the Git history. The auto-staging sometimes places entries into .gitignore files in unexpected locations: in the root of the repository, but also in subdirectories, like the data folder here. So if a file doesn't show up anymore, it's probably listed in a .gitignore somewhere, and that's why you can't commit it to Git. I'm sure they are working on a better solution for that. DVC is quite a new project, but I find it very exciting to work with, and it has certainly saved me a bit of time already, and
I've been mostly learning still, so I've not been as productive as I hope to be with it in the future. If you ever want to reach out to MovingPandas, besides the issue tracker we also have a discussion forum where you can ask any kind of user question, not necessarily related to an issue. You can just open a thread there, and we can discuss new feature ideas before you put them into tickets, or any questions on how to do a certain analysis. For example, as part of the projects that I talked about in yesterday's panel, we are also planning to add more support for network-constrained movement data: GPS tracks that should be matched to a street network before analysis, because you don't have as-the-crow-flies Euclidean distances between the records; you actually have to follow the road network, and the speeds that you calculate should be based on the underlying road network rather than the Euclidean distance. These are the kinds of things that we want to tackle in the next year or two, so if you have ideas, or data sets that could be useful for demos, feel welcome to reach out and share them. I'm also still curious to add an example with eye-tracking data, because that's one of the more unusual kinds of movement data, which I have not covered in the tutorials yet. So if you have eye-tracking data, maybe with a neat little story, we could add that to the tutorials as well. Nice, thanks, that would be great.
If you have some more time: in the MovingPandas examples, or let's go to the main page, movingpandas.org, you can see all the examples from the tutorials, which explain how the individual features work. For example, because we were talking about it before, to smooth trajectories you will find the example tutorials here already executed, as HTML files, so you don't even have to set up the environment. You can see that the interactive plots still work, which is nice. But if you want to change anything, you would of course have to either go to the interactive notebook on MyBinder via this link, or you can always get to the IPython notebook source code on GitHub via the corresponding button. So this is the way we have set up the documentation: by giving people a quick way to just look at the examples in pre-executed notebooks, while also linking to the interactive versions.
The sports analysis examples, I think, are pretty neat. What you can see here is basically how to use an ordinary PNG file as a background image and then plot your local coordinates on top. And here we have the plotting of the whole trajectory collection in different colors for the different teams, so the attacking team and the defending team have different colors. This is just one short situation: the attacking team was going towards the goal, and the others were of course following them to cover them. It's not just a shot; it's all the players. And I guess that was when Liverpool scored 1-0 against Fulham; I think that is what the code here means. If you're interested in data sets like this, I think I have it linked: Friends of Tracking Data (FOTD) have these soccer game trajectories that you can play around with. It's wonderful what one can find on GitHub.
If you have any idea about a high-resolution background map for Mars, I would love to hear about it for my Mars rover demo. Right now I have the nice Mars rover and its helicopter buddy, and I have their trajectories. I'm calculating speeds, unfortunately, with the circumference of Earth instead of Mars, so the numbers are wrong, but don't tell anyone.
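The Earth-versus-Mars issue comes down to the sphere radius used in the great-circle (haversine) distance, which speed calculations depend on. A small sketch, not MovingPandas code; the radii are approximate mean values:

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0   # approximate mean radius
MARS_RADIUS_KM = 3389.5    # approximate mean radius

def haversine_km(lat1, lon1, lat2, lon2, radius_km=EARTH_RADIUS_KM):
    """Great-circle distance between two lat/lon points on a sphere."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * radius_km * asin(sqrt(a))

# The same angular separation covers far less ground on Mars:
d_earth = haversine_km(18.44, 77.45, 18.45, 77.46)
d_mars = haversine_km(18.44, 77.45, 18.45, 77.46, radius_km=MARS_RADIUS_KM)
# d_mars / d_earth equals MARS_RADIUS_KM / EARTH_RADIUS_KM (about 0.53),
# so Earth-radius speeds on Mars data come out nearly twice too large.
```

The distance, and hence the speed, scales linearly with the radius, which is exactly the error described above.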
What I'm really lacking is a proper background map. I have a couple of pointers, but they're either not high-resolution enough, or I don't know exactly how to use them properly. Some of them should be loading now. No luck here. But basically, if anyone knows some good tiles for Mars, those would be highly appreciated.
Right, if we're out of questions, I want to thank you for staying so active and for not falling asleep. And I think it's finally time, the sun has also come out, to explore the city a bit. Thanks so much, and feel free to reach out this evening if you have any other ideas or questions. Thank you.