Building a data analytics library in Python
Formal Metadata
Title: Building a data analytics library in Python
Number of Parts: 351
License: CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/69057 (DOI)
Production Year: 2022
FOSS4G Firenze 2022, talk 187 of 351
Transcript: English (auto-generated)
00:01
Hi, I'm Seth. I'm going to talk about a tool we built to support our analytics work at the New York City Department of Transportation. So first, for some context, my team is responsible for analysis and data products related to transportation safety for the department. And our work spans from requests that need to be answered in a few minutes to research projects that take months.
00:24
And while we do maintain a number of core databases, both in Postgres and SQL Server, the bulk of our work requires analyzing how external data sets relate to our data. And so we spent a lot of time doing ETL work, and every analyst was doing it their own way,
00:40
which made coordination difficult and led to a lot of duplicated effort. So we cobbled together our most common workflows and built a simple Python library to standardize and simplify our work. We wanted something that would be an easy way to handle all of our data I/O between the various data sets and our databases,
01:02
handle the common errors, and keep all of our workflow together in a single executable Python file. We also wanted to make sure that we stayed organized and kept our database clean; since we get a lot of requests, our database was filling up with a lot of garbage. So, just to go through what this looks like with some basic examples,
01:23
importing data is pretty simple: you just define the file type and pass in the path. Exporting is pretty much the same thing: you pass in a table or query and where you want it to go. Essentially, we're just writing the GDAL commands here, but we're adding a little bit on top. For example, when you import data, it does some standardization and cleaning of column names. If you're coming from an Excel file or a CSV, it will
01:47
do some inference on the data types to create a slightly cleaner table. If you're exporting to a shapefile, it checks for timestamps and will split them into separate date and time fields so you don't lose any data. And if you're writing to a database,
02:02
it will add a comment describing where the data came from, a little bit of metadata. The other thing it does: any table created or imported via the library is, by default, added to a log for automatic deletion after a set period of time, which of course is configurable.
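To make this concrete, here is a minimal sketch of what those import and export calls might look like. The module, class, and method names (dotlib, DbConnect, import_data, export_data) and all of the parameters are hypothetical stand-ins, since the talk doesn't show the library's actual interface.

    # Hypothetical sketch -- names are illustrative, not the real API.
    from dotlib import DbConnect

    db = DbConnect(server='db-server', database='analytics', user='seth')

    # Import a CSV: column names are standardized, data types are inferred,
    # and the new table is logged for automatic deletion later.
    db.import_data(file_type='csv', path='crashes_2022.csv', table='crashes_2022')

    # Export to a shapefile: timestamp columns are split into separate
    # date and time fields so no information is lost.
    db.export_data(table='crashes_2022', file_type='shp', path='crashes_2022.shp')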
02:22
So for cleaning or processing your data, you can do it in Python. Here we're using an API to reverse geocode some data and send it back to the database. Or you can do it just in plain SQL, and the benefit of using this for your SQL as opposed to, say, pgAdmin is mostly just to keep your workflow together, so you can rerun everything and not miss any steps. But also, if you create a table or rename a table, it will update the log, and
02:42
it will rename any associated indexes for consistency. For analysis, again you can do it in SQL, or you can tell it to return your data as a Python list or a pandas DataFrame, which is really useful if you want to send it on to be operationalized, feed it to a machine learning model, or just do some visualization.
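Continuing the same hypothetical sketch, running SQL and pulling results back into Python might look something like this; the query method and its return_type argument are assumptions, not a documented interface.

    # Run SQL directly; table creation and renames also update the
    # deletion log and rename associated indexes behind the scenes.
    db.query("CREATE TABLE manhattan_crashes AS "
             "SELECT * FROM crashes_2022 WHERE borough = 'MANHATTAN'")

    # Pull results back as plain Python objects or as a pandas DataFrame.
    rows = db.query("SELECT borough, COUNT(*) FROM crashes_2022 GROUP BY borough",
                    return_type='list')
    df = db.query("SELECT * FROM manhattan_crashes", return_type='df')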
03:04
And then we also added a function to send queries to Plotly to generate basic plots. It's been really useful for our exploratory analysis. Then when you're done, you can drop everything you've created at the end of your session, which is very nice if you don't need it anymore.
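A helper along those lines can be approximated with plotly.express; the snippet below is a hedged sketch of the idea (and drop_session_tables is an invented name), not the library's actual function.

    import plotly.express as px

    # Query straight into a DataFrame, then a quick exploratory bar chart.
    df = db.query("SELECT borough, COUNT(*) AS n FROM crashes_2022 GROUP BY borough",
                  return_type='df')
    px.bar(df, x='borough', y='n', title='Crashes by borough').show()

    # At the end of a session, drop everything created during it.
    db.drop_session_tables()  # hypothetical method name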
03:21
If not, the next time you connect to the database, it will check the log for any tables that you've created and drop any that have expired.
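That expiry check could be implemented with a small log table along these lines; the table_log schema and the cleanup logic here are assumptions, since the talk doesn't describe the internals.

    # Assumed log schema: table_log(table_name text, expires date).
    expired = db.query(
        "SELECT table_name FROM table_log WHERE expires < CURRENT_DATE",
        return_type='list')

    for (table_name,) in expired:  # each row is a one-element tuple
        db.query(f'DROP TABLE IF EXISTS {table_name}')
        db.query(f"DELETE FROM table_log WHERE table_name = '{table_name}'")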
03:43
So just to wrap up: this is a very simple library, but it's been really useful for us in making things move faster and more efficiently. Everything is clean and consistent, collaboration has become very easy, and it has kept our database from being cluttered with old data sets that are no longer needed. Thank you.