
Building a data analytics library in Python


Formal Metadata

Title
Building a data analytics library in Python
Number of Parts
351
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Production Year
2022

Content Metadata

Abstract
The primary mission of the Data Operations Systems and Analytics team at NYC DOT is to support the Agency's data analysis and data product needs relating to transportation safety. The team's work producing safety analyses for projects and programs typically involves merging data from a variety of sources with collision data, asset data, and/or program data. The bulk of the analysis is performed in PostgreSQL databases, all with a geospatial component. The work necessitates ingesting input data from other databases, CSV/Excel files, and various geospatial data formats, and it is critical that the analysis be documented and repeatable. Moving data around, getting external data into the database, transforming it, geocoding it, etc., previously occupied the bulk of the team's time, reducing capacity for the actual analysis. Additionally, the volume of one-off and exploratory analyses resulted in a cluttered database environment with multiple versions of datasets with unclear lineage and state of completeness.

Modeled on the infrastructure-as-code idea, we began building a Python library that would allow us to preserve the entire analysis workflow, from data ingestion through analysis to output generation, in a single Python file or Jupyter notebook. The library began as a way to reduce friction and standardize the process of ingesting external data into the various database environments we use. It has since grown into the primary method of facilitating reproducible data analysis processes that include data ingestion, transformation, analysis, and output generation. The library includes basic database connections and facilitates quick and easy import and export of flat files, geospatial data files, and data from other databases. It provides both inferred and defined schemas, allowing both quick exploration and more thoroughly defined data pipeline processes. The library standardizes column naming, comments, and permissions. There are built-in database cleaning and geocoding processes, and we have started building simple geospatial data display functions for exploratory analysis. The code relies heavily on numpy, pandas, GDAL/ogr2ogr, pyodbc, psycopg2, shapely, and basic SQL and Python. The library is not an ORM, but it occupies a similar role, geared towards analytic workflows.

The talk will discuss how the library has evolved over time, its functionality and use cases in the team's daily workflows, and where we would like to extend the functionality and open it up for contributions. While the library is not currently open source, we are actively working on creating an open version and migrating to Python 3.x. The library has greatly improved the speed and simplicity of conducting exploratory analysis and enhanced the quality and completeness of the documentation of our more substantial data analytics and research. It should be of interest and utility to anyone working with data without the support of a dedicated data engineering team who needs to collect multiple datasets in a variety of formats, as well as to anyone looking to standardize their data analysis workflows from beginning to end.
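To make that workflow concrete, here is a minimal sketch of the kind of single-file ingest-transform-output script the library is meant to standardize. It uses only common tools (pandas and SQLAlchemy) rather than the library itself, which is not public; the connection string, file name, table names, and the borough column are placeholders.

import pandas as pd
from sqlalchemy import create_engine, text

# Placeholder connection string; the team's environments use PostgreSQL and SQL Server.
engine = create_engine("postgresql+psycopg2://user:password@host:5432/analytics")

# Ingest: pandas infers a schema from the CSV; column names are standardized first.
df = pd.read_csv("collisions.csv")
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
df.to_sql("tmp_collisions", engine, if_exists="replace", index=False)

# Transform/analyze in SQL, kept in the same file as the ingestion step.
with engine.begin() as conn:
    conn.execute(text("""
        CREATE TABLE tmp_collisions_by_borough AS
        SELECT borough, COUNT(*) AS crashes
        FROM tmp_collisions
        GROUP BY borough
    """))

# Output: pull the result back into pandas for further analysis or export.
result = pd.read_sql("SELECT * FROM tmp_collisions_by_borough", engine)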
Transcript: English (auto-generated)
Hi, I'm Seth. I'm going to talk about a tool we built to support our analytics work at the New York City Department of Transportation. So first, for some context, my team is responsible for analysis and data products related to transportation safety for the department. And our work spans from requests that need to be answered in a few minutes to research projects that take months.
And while we do maintain a number of core databases both in Postgres and SQL Server, the bulk of our work requires analyzing how external data sets relate to our data. And so we spent a lot of time doing ETL work and every analyst was doing it their own way,
which made coordination challenging and led to a lot of duplicated effort. So we cobbled together our most common workflows and built a simple Python library to standardize and simplify our work. We wanted something that would be an easy way to handle all of our data I/O between the various data sets and our databases,
handle the common errors, and keep all of our workflow together in a single executable Python file. We also wanted to make sure that we stayed organized and kept our database clean. Since we get a lot of requests, our database was filling up with a lot of garbage. So just to go through what this looks like with some basic examples,
importing data is pretty simple. You're just defining the file type and passing in the path. Exporting is pretty much the same thing. You're passing in a table or query and where you want it to go. Essentially, we're just writing the GDAL commands here, but we're adding a little bit on top. For example, when you import data, it does some standardization and cleaning of column names. If you're coming from an Excel or CSV, it will
do some inference on the data types to create a slightly cleaner table. If you're exporting to a shapefile, it checks for timestamps and will split them into date and time so you don't lose any data. And if you're writing to a database,
it will add a comment describing where the data came from, a little bit of metadata. The other thing it does is that any table created or imported via this library will, by default, be added to a log for automatic deletion after a set period of time, which of course is configurable. So for cleaning or processing your data, you can do it in Python.
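As a rough illustration of what an import helper along these lines could look like under the hood, the sketch below shells out to ogr2ogr and then records provenance and an expiration entry with psycopg2. The function name, connection details, and the table_log bookkeeping table are assumptions for illustration, not the team's actual code.

import datetime
import subprocess

import psycopg2

# Placeholder connection details; "table_log" is an assumed bookkeeping table.
PG_DSN = "host=localhost dbname=analytics user=user password=password"
OGR_PG = "PG:" + PG_DSN

def shp_to_table(shp_path, table, conn, days_to_live=30):
    """Load a shapefile into Postgres via ogr2ogr, then record provenance and expiry."""
    # ogr2ogr handles the geometry and schema; -nln sets the destination table name.
    subprocess.run(["ogr2ogr", "-f", "PostgreSQL", OGR_PG, shp_path, "-nln", table],
                   check=True)
    today = datetime.date.today()
    with conn.cursor() as cur:
        # A table comment records where the data came from.
        cur.execute(f"COMMENT ON TABLE {table} IS %s",
                    (f"Imported from {shp_path} on {today}",))
        # Log the table for automatic cleanup after a configurable period.
        cur.execute("INSERT INTO table_log (table_name, expires) VALUES (%s, %s)",
                    (table, today + datetime.timedelta(days=days_to_live)))
    conn.commit()

conn = psycopg2.connect(PG_DSN)
shp_to_table("bus_stops.shp", "tmp_bus_stops", conn)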
So here we're using an API to reverse geocode some data and send it back to the database. Or you can do it just in plain SQL, and the benefit of using this for your SQL as opposed to, say, pgAdmin is mostly just to keep your workflow together, so you can rerun everything and you're not missing any steps. But also, if you create a table or rename a table, it will update the log, and
it will rename any associated indexes for consistency. For analysis, of course, you can do it in SQL again, or you can tell it to return your data as a Python list or a pandas DataFrame, which is really useful if you want to send it on to be operationalized or to a machine learning model, or if you just want to do some visualization.
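A hedged sketch of that last step using standard tools (psycopg2, pandas, and plotly.express); the query, table, and column names are made up for illustration and are not the library's API.

import pandas as pd
import plotly.express as px
import psycopg2

conn = psycopg2.connect("host=localhost dbname=analytics user=user password=password")

# Results as a plain Python list of tuples...
with conn.cursor() as cur:
    cur.execute("SELECT borough, COUNT(*) AS crashes FROM tmp_collisions GROUP BY borough")
    rows = cur.fetchall()

# ...or straight into a pandas DataFrame, ready for a model or a quick plot.
df = pd.read_sql("SELECT borough, COUNT(*) AS crashes FROM tmp_collisions GROUP BY borough",
                 conn)
px.bar(df, x="borough", y="crashes").show()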
And then we also added a function to send queries to Plotly to generate basic plots. It's been really useful for our exploratory analysis. Then when you're done, you can drop everything you've done at the end of your session, which is very nice if you don't need it anymore.
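One way the log-based cleanup could be implemented, again as an assumption built on the hypothetical table_log table from the earlier sketch rather than the library's actual code:

import datetime
import psycopg2

conn = psycopg2.connect("host=localhost dbname=analytics user=user password=password")
with conn.cursor() as cur:
    # Find tables whose expiry date (recorded at import time) has passed...
    cur.execute("SELECT table_name FROM table_log WHERE expires < %s",
                (datetime.date.today(),))
    for (table,) in cur.fetchall():
        # ...drop them and clear their log entries.
        cur.execute(f'DROP TABLE IF EXISTS "{table}"')
        cur.execute("DELETE FROM table_log WHERE table_name = %s", (table,))
conn.commit()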
If not, next time you connect to the database, it will check the log for any tables that you've created and drop any that have expired. So just to wrap up, this is a very simple library, but it's been really useful for us in making things move faster and more efficiently. Everything's clean and consistent, collaboration has become very easy, and it has kept our database from being cluttered with old data sets that are no longer needed. Thank you.