An Analysis of Capital Bikeshare Trips in Washington D.C. with Open-Source Geospatial Tools
Formal Metadata
Title: An Analysis of Capital Bikeshare Trips in Washington D.C. with Open-Source Geospatial Tools
Number of Parts: 266
License: CC Attribution 3.0 Germany: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/66347 (DOI)
Transcript: English (auto-generated)
00:08
Hey, guys. Welcome. I'm Max, and this is a little personal project on Capital Bikeshare that I did during the pandemic when I was bored. So, a little bit about me first:
00:21
I am from Washington, D.C., which explains my obsession with D.C. bike politics and Capital Bikeshare in particular. My background is in coastal engineering, and civil engineering more generally, but right now I'm working as a geospatial developer at Van Oord in the Netherlands. So, a little bit about the Capital Bikeshare system first.
00:45
Capital Bikeshare is primarily docked bike share, so you bike between fixed points. It's primarily intended for short trips: any trip over 30 minutes incurs an additional fee. So, it's intended mostly for urban mobility, and
01:03
you don't keep the bike with you for a day or anything. There are 700 docking stations and, as of last month, 35 million or so trips. And these are the current stations, all 700-something of them. It's kind of hard to see in the center of D.C. here because there are so
01:22
many stations, and the network also extends a little out into the suburbs. Capital Bikeshare makes all their data publicly available, including the data for every single trip, released monthly in CSV format. You have the start time of
01:40
the trip, the end time of the trip, an index number that tells you which station it started at and which one it ended at, and whether the trip was done by somebody who has a Capital Bikeshare membership or by a casual user who just took out a short-term membership. They're published every month: at the end of each month, they publish the last month's trips on the Capital Bikeshare website.
02:02
So, this is just part of an individual CSV file. There are a bunch of individual trips there, plus more columns that didn't fit on the slide. And there are 35 million trips overall.
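As a rough illustration, loading one of these monthly files with pandas might look like the sketch below. The talk doesn't spell out the column names; these follow Capital Bikeshare's published CSV schema, so treat them as assumptions.

```python
import pandas as pd

# Hypothetical filename; column names follow the published
# Capital Bikeshare trip-data schema (an assumption here).
trips = pd.read_csv(
    "202401-capitalbikeshare-tripdata.csv",
    parse_dates=["started_at", "ended_at"],
    dtype={"start_station_id": "string", "end_station_id": "string"},
)
print(trips[["started_at", "ended_at",
             "start_station_id", "end_station_id",
             "member_casual"]].head())
```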
02:23
So, what I wanted to get out of this dataset was, first, how many trips occurred between every unique combination of stations: how many trips between station A and station B, for all possible station pairs.
02:46
The second milestone was to find a route, some possible route, for each one of these trips. We don't actually know where the bikes went, of course, because there are no GPS trackers on the bikes; we just know where they started and ended, so we have to kind of guess the route. And goal three was to try to combine all of these routes and the number of trips between
03:05
the stations, to get how many bikeshare trips occurred on every individual street in the entire region. So, what were the biggest challenges I encountered? For one thing, a little spoiler alert for later: there are about 100,000 unique station pairs,
03:24
and getting all of these put through a routing engine in an efficient way. And then the routing gives you a vector geometry for each pair, so: how to combine those and sum up all the trips that occurred on each of them, in an efficient way. So, here's what I ended up doing.
03:47
First, I got every single CSV file in the system history. That's, I think, since 2008, when the system was officially inaugurated. Using the Python library pandas, a common data science library, I parsed all of those,
04:05
cleaned up the data a little bit, and then found the number of trips between each unique station pair. That was all in pandas. And then, using Valhalla, a common routing engine, I was able to find a route between each station pair, and then aggregate all of the trips to get statistics
04:25
per section of road. So, the data cleaning. First of all, for the individual trips, I loaded all 35 million into a single pandas data frame. Then I calculated the total trip time. I assume anything longer than four hours is probably a
04:43
leisure trip, and my ability to guess the route is significantly reduced, because it probably did not follow the most logical route according to the routing software. And of course, for trips starting and ending at the same station, we have no idea where they went. And then there are some trips in the dataset that just have a zero as the start or
05:05
end station. Capital Bikeshare says they clean the data, and they mostly do, but I found a few hundred, I think, that were missing a start or end station for some reason. So, I threw out about 175,000 trips because they were too long. And then, in combining them,
05:22
I got about 183,000 unique station pairs, counting A to B and B to A as separate. If we consider those as the same trip and sum up the numbers between them, we get 105,000 or so unique routes. And for each of those 105,000, I have the number of trips that
05:44
occurred. And so, yeah, this was just done in pandas. It's not that exciting. It's on my GitHub if you're interested, but it just involved looping over a pandas data frame.
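The speaker's actual code is on his GitHub; a minimal sketch of the cleaning and pair-counting step, continuing from the loading sketch above and assuming the same column names, could look like this:

```python
import numpy as np
import pandas as pd

# `trips` is the frame from the loading sketch above.
trips["duration"] = trips["ended_at"] - trips["started_at"]

mask = (
    (trips["duration"] <= pd.Timedelta(hours=4))               # drop likely leisure trips
    & (trips["start_station_id"] != trips["end_station_id"])   # drop loop trips
    & trips["start_station_id"].notna()
    & trips["end_station_id"].notna()                          # drop missing stations
)
clean = trips[mask]

# Directed counts: A->B and B->A kept separate (~183,000 pairs).
directed = (clean.groupby(["start_station_id", "end_station_id"])
                 .size().rename("n_trips").reset_index())

# Undirected routes: sort each pair, then sum (~105,000 routes).
directed[["a", "b"]] = np.sort(
    directed[["start_station_id", "end_station_id"]].to_numpy(), axis=1)
undirected = directed.groupby(["a", "b"], as_index=False)["n_trips"].sum()
```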
06:02
And these are the results. A little chaotic. I have an interactive web map of this, but here is one individual station, and you can see how many trips occurred between this one station and the others in the system, ranging from about 14,000 at the highest down to fewer than 2,000 at the lowest. So,
06:22
yeah, every single trip that started from this station or ended at this station is here. But of course, this is quite hard to visualize, to keep in your head; it looks completely chaotic, and I couldn't find a good way to display this data. So, next: getting the routes between the
06:50
stations. A to B, we don't know where it goes. The solution involved using Valhalla, the routing engine. There are a lot of different routing engines out there; it's a
07:02
common problem in GIS analysis: what's the most efficient way to get between two points given certain conditions. I ended up using Valhalla for a couple of reasons. First of all, Valhalla gives you really good costing parameters to describe the cycling behavior. So,
07:23
you can set a parameter for the bicycle type; I used a city bike here. You can also set a parameter for the cyclist's willingness to use roads. I set that quite low, because most Capital Bikeshare users are just using it for transportation and are probably going to be hesitant to bike in D.C. traffic. And there's the willingness to bike up hills:
07:46
D.C. is a very hilly city and the Capital Bikeshare bikes are very heavy, so I set that to the minimum. That just means the routing algorithm will prefer a flatter route even if it's significantly longer. And you can also set the average
08:01
cycling speed as a parameter; I think I just used the default for a city bike.
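Those knobs map onto Valhalla's bicycle costing options. A sketch of what such a request fragment could look like; the parameter names are Valhalla's, but the exact values here are guesses, not the speaker's:

```python
# Valhalla bicycle costing options along the lines described above.
# Parameter names are from Valhalla's docs; values are illustrative.
costing_options = {
    "bicycle": {
        "bicycle_type": "City",   # heavy, upright share bike
        "use_roads": 0.1,         # low willingness to ride in traffic
        "use_hills": 0.0,         # prefer flat routes, even if longer
        "cycling_speed": 18.0,    # km/h; roughly the city-bike default
    }
}
```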
08:24
So, how do we route all of these efficiently? The default way to use Valhalla as a routing engine is to run it in a Docker container, which gives you an HTTP interface that Python can call on your local computer. This is what I tried the first time. However, the layer of calling it over HTTP slows it down significantly. So,
08:43
instead there's a package called pyvalhalla: you can download a Python API for Valhalla along with the binaries, build the routing tiles that are the input format for Valhalla,
09:04
and call it all directly from Python. That ended up speeding things up by probably 50 times. I think I can route all 100,000 trips, if I remake the database, in about an hour. So, it's quite feasible to do on my laptop.
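A minimal sketch of that in-process approach, assuming pyvalhalla's Actor API and a prebuilt tile extract (the file path and coordinates are hypothetical):

```python
# In-process routing with pyvalhalla, skipping the HTTP layer.
from valhalla import Actor, get_config

# Assumes a Valhalla tile extract has already been built from OSM data.
config = get_config(tile_extract="valhalla_tiles.tar")
actor = Actor(config)

route = actor.route({
    "locations": [
        {"lat": 38.8893, "lon": -77.0502},  # example start dock
        {"lat": 38.8977, "lon": -77.0365},  # example end dock
    ],
    "costing": "bicycle",
    "costing_options": {"bicycle": {"use_hills": 0.0, "use_roads": 0.1}},
})
# The response contains an encoded polyline that can be decoded into
# a line string for loading into PostGIS.
```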
09:21
Okay. And this, I think, was the hardest challenge of all. Goal three: how to combine them. I know how many trips occurred between each station pair, and I have a guess of what the route might be. How do I combine and sum up all the trips that occurred on each of those line strings? The first way I looked at doing this was to try to use vector algebra packages, or to try
09:45
to do some kind of spatial join to sum where they overlap. I also tried exploding these lines first and then doing a spatial join and sum. But none of these methods was proving remotely feasible. I tried them on a small subsample of about 200 trips to see if they
10:02
worked, and they did not work out, because they were excruciatingly slow with 200, and extending that to 100,000 is not practical. So, the solution I ended up coming to, which worked really well, is topo geometry: a geometry that's defined topologically.
10:25
Say this is the trip between station A and station B. We can split this line into unique segments, and we give these all a number in the database. The default way of storing geometry is, of course, a series of coordinates. But this,
10:44
instead of a series of coordinates, just stores the trip as a unique path through this system of unique integers. So, this trip would be 1, 2, 3, 4, 5, 6. And these individual topology components get reused if I add another trip that overlaps this one.
11:05
And this topo geometry is implemented in PostGIS, and PostGIS can build it automatically. I have a series of triggers in PL/pgSQL, so that if I add a new row to the geometry table, it automatically creates a new row in the
11:22
topo geometry table. As I'm loading the database, it's automatically calculating all these topologies for me. This step takes about two hours to rebuild the entire database. So, it's relatively efficient, considering the problem it's solving, at least.
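For reference, a minimal sketch of what such a PostGIS topology setup and trigger could look like, driven from Python as the glue language. The topology, table, and column names are hypothetical, not the speaker's, and the topology layer id is assumed to be 1:

```python
import psycopg2

conn = psycopg2.connect("dbname=bikeshare")  # hypothetical database
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT topology.CreateTopology('bike_topo', 4326);
        SELECT topology.AddTopoGeometryColumn(
            'bike_topo', 'public', 'routes', 'topo_geom', 'LINESTRING');

        -- Trigger: whenever a plain-geometry route row is inserted,
        -- derive its topo geometry automatically.
        CREATE OR REPLACE FUNCTION routes_to_topo() RETURNS trigger AS $$
        BEGIN
            -- layer id 1 is an assumption; it is returned by
            -- AddTopoGeometryColumn above.
            NEW.topo_geom := topology.toTopoGeom(
                NEW.geom, 'bike_topo', 1, 1e-6);
            RETURN NEW;
        END;
        $$ LANGUAGE plpgsql;

        CREATE TRIGGER trg_routes_topo
            BEFORE INSERT ON routes
            FOR EACH ROW EXECUTE FUNCTION routes_to_topo();
    """)
```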
11:43
So, yeah, PostGIS can automatically do this for me. And from there, once we have a series of trips defined using this topology, it's a really simple SQL query that gives you the geometry of each individual piece of road and the total
12:03
sum of trips that have occurred on it. So, for example, if we add another station pair, C and D, to this example from before, we can say, okay, there are about a thousand trips on this red line and 8,000 trips on this green line. Because they're using the same little component here, we can easily sum up those trips: roughly 9,000 on the shared segment.
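Under the same hypothetical names as the setup sketch above, that per-segment query can come straight out of the topology's relation table, something like:

```python
import psycopg2

conn = psycopg2.connect("dbname=bikeshare")  # hypothetical database
with conn, conn.cursor() as cur:
    cur.execute("""
        -- Sum trips over every topology edge that any route passes through.
        SELECT e.edge_id,
               e.geom,
               SUM(r.n_trips) AS total_trips
        FROM bike_topo.edge_data AS e
        JOIN bike_topo.relation AS rel
          ON rel.element_id = e.edge_id          -- edge used by a route
        JOIN routes AS r
          ON id(r.topo_geom) = rel.topogeo_id    -- route owning that edge
        GROUP BY e.edge_id, e.geom
        ORDER BY total_trips DESC;
    """)
    busiest_segments = cur.fetchall()
```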
12:23
So, how did I implement this in practice? Originally, I had a Docker Compose file to define everything: a couple of Docker containers, with PostGIS inside a
12:43
container. When I switched away from using the Valhalla container for routing, I still kept it in the Docker Compose file, because the container is very intelligently designed: it automatically rebuilds the routing tiles if there are any
13:03
changes in the OpenStreetMap data. So, you can feed it OpenStreetMap data from Geofabrik or any other kind of OpenStreetMap extract, and it will automatically rebuild the tiles. It's quite a cool feature. And then, yeah, I have a bunch of Python
13:20
scripts to load all the data: they call the routing software, feed in the individual points, get back the line strings, and move those into PostGIS, all using pandas and GeoPandas. And then I have a makefile that ties everything together. It's a little old school, and the language for defining a makefile can be
13:45
very tricky, but I don't know a good alternative that can provide that functionality. Essentially, what the makefile does is: I can run it, and it tracks what has changed and only reruns the parts that need to change. But if anybody has any
14:01
ideas for more modern ways to accomplish that, please let me know. And so, this was kind of my end result. I have an interactive version of this on my GitHub; I can provide a link if you're interested. The biggest takeaway, since it's obviously hard to see exactly what's going
14:22
on on the screen from far away, is that this is a logarithmic color scale: every graduation of the color map is an order of magnitude more trips. So,
14:40
in the city center there's an exponential distribution: the highest 10% of streets have millions more rides on them than the lowest 10%. But, yeah, I have an interactive map on the GitHub that explains the data a little better.
15:03
But what were my takeaways? First of all, are these results accurate? It's kind of hard to say. I don't have any good set of validation data for this. Of course, we don't know where the bikes actually went. I would speculate that it's
15:24
not terribly accurate, but I think the relative ratios of how much bike traffic occurs on each street are probably representative. And also, I'm presenting this here more as: this is how I solved some interesting GIS problems.
15:43
I don't know if this would pass muster at an urban planning or traffic engineering conference. But, yeah, the GIS method is more what I'm trying to get across here than my scientific approach to how many trips occur on each street. So, what were the most important things that made this pretty large data analysis practical?
16:06
Using pyvalhalla and the binaries for routing, instead of the microservices approach of keeping it all in a separate container: I brought it all into a little monolith, and it runs much more efficiently just directly calling the library. And the topo geometry, using
16:27
a topologically defined geometry, made this feasible; otherwise I could not find a way to do it feasibly. And, yeah, using SQL triggers to automatically generate the topo geometry I think also made things a lot easier. Because previously, I was doing it manually:
16:46
I would load all the routes as line strings and then generate the topo geometry afterwards, and that was much slower. And if I had to start from scratch, I would probably do more in SQL. Python is really great for loading the data and reading CSVs, but
17:05
I think I could probably do basically the entire analysis in SQL queries if I were so inclined. But Python is a great glue, so it worked out pretty well in Python. I just think having more components in the PostGIS environment would probably make it a little easier.
17:24
And, yeah, these are the libraries that were most critical here. Also, a shoutout to OpenStreetMap, which provides the input data for Valhalla. Thanks.