
Building a data analytics library in Python


Formal Metadata

Title
Building a data analytics library in Python
Number of Parts
351
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Production Year
2022

Content Metadata

Abstract
The primary mission of the Data Operations Systems and Analytics team at NYC DOT is to support the Agency's data analysis and data product needs relating to transportation safety. The team's work producing safety analyses for projects and programs typically involves merging data from a variety of sources with collision data, asset data, and/or program data. The bulk of the analysis is performed in PostgreSQL databases, all with a geospatial component. The work necessitates ingesting input data from other databases, CSV/Excel files, and various geospatial data formats, and it is critical that the analysis be documented and repeatable. Moving data around, getting external data into the database, transforming it, geocoding it, etc., previously occupied the bulk of the team's time, reducing capacity for the actual analysis. Additionally, the volume of one-off and exploratory analyses resulted in a cluttered database environment with multiple versions of datasets with unclear lineage and state of completeness.

Modeled on the infrastructure-as-code idea, we began building a Python library that would allow us to preserve the entire analysis workflow, from data ingestion through analysis to output generation, in a single Python file or Jupyter notebook. The library began as a way to reduce friction and standardize the process of ingesting external data into the various database environments we use. It has since grown into the primary method of facilitating reproducible data analysis processes that include data ingestion, transformation, analysis, and output generation. The library includes basic database connections and facilitates quick and easy import and export of flat files, geospatial data files, and data from other databases. It provides both inferred and defined schemas, allowing both quick exploration and more thoroughly defined data pipeline processes. The library standardizes column naming, comments, and permissions. There are built-in database cleaning and geocoding processes, and we have started building simple geospatial data display functions for exploratory analysis. The code relies heavily on numpy, pandas, GDAL/ogr2ogr, pyodbc, psycopg2, shapely, and basic SQL and Python. The library is not an ORM, but it occupies a similar role, geared towards analytic workflows.

The talk will discuss how the library has evolved over time, its functionality and use cases in the team's daily workflows, and where we would like to extend the functionality and open it up for contributions. While the library is not currently open source, we are actively working on creating an open version and migrating to Python 3.x. The library has greatly improved the speed and simplicity of conducting exploratory analysis and enhanced the quality and completeness of the documentation of our more substantial data analytics and research. It should be of interest and utility to anyone working with data without the support of a dedicated data engineering team who needs to collect multiple datasets in a variety of formats, as well as to anyone looking to standardize their data analysis workflows from beginning to end.
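To make that workflow concrete, here is a minimal sketch of the kind of single-file ingest-transform-output script the library is meant to standardize. It uses only common tools (pandas and SQLAlchemy) rather than the library itself, which is not public; the connection string, file name, table names, and the borough column are placeholders.

import pandas as pd
from sqlalchemy import create_engine, text

# Placeholder connection string; the team's environments use PostgreSQL and SQL Server.
engine = create_engine("postgresql+psycopg2://user:password@host:5432/analytics")

# Ingest: pandas infers a schema from the CSV; column names are standardized first.
df = pd.read_csv("collisions.csv")
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
df.to_sql("tmp_collisions", engine, if_exists="replace", index=False)

# Transform/analyze in SQL, kept in the same file as the ingestion step.
with engine.begin() as conn:
    conn.execute(text("""
        CREATE TABLE tmp_collisions_by_borough AS
        SELECT borough, COUNT(*) AS crashes
        FROM tmp_collisions
        GROUP BY borough
    """))

# Output: pull the result back into pandas for further analysis or export.
result = pd.read_sql("SELECT * FROM tmp_collisions_by_borough", engine)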
Transcript: English (auto-generated)
Hi, I'm Seth. I'm going to talk about a tool we built to support our analytics work at the New York City Department of Transportation. So first, for some context, my team is responsible for analysis and data products related to transportation safety for the department. And our work spans from requests that need to be answered in a few minutes to research projects that take months.
And while we do maintain a number of core databases both in Postgres and SQL Server, the bulk of our work requires analyzing how external data sets relate to our data. And so we spent a lot of time doing ETL work and every analyst was doing it their own way,
which made coordination challenging and led to a lot of duplicated effort. So we cobbled together our most common workflows and built a simple Python library to standardize and simplify our work. We wanted something that would be an easy way to handle all of our data I/O between the various data sets and our databases,
handle the common errors, and keep all of our workflow together in a single executable Python file. We also wanted to make sure that we stayed organized and kept our database clean. Since we get a lot of requests, our database was filling up with a lot of garbage. So just to go through what this looks like with some basic examples,
importing data is pretty simple. You're just defining the file type and passing in the path. Exporting is pretty much the same thing. You're passing in a table or query and where you want it to go. Essentially, we're just writing the GDAL commands here, but we're adding a little bit on top. For example, when you import data, it does some standardization and cleaning of column names. If you're coming from an Excel or CSV, it will
do some inference on the data types to create a slightly cleaner table. If you're exporting to a shapefile, it checks for timestamps and will split them into date and time so you don't lose any data. And if you're writing to a database,
it will add a comment describing where the data came from, a little bit of metadata. The other thing it does is that any table created or imported via this library will, by default, be added to a log for automatic deletion after a set period of time, which of course is configurable. So for cleaning or processing your data, you can do it in Python.
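As a rough illustration of what an import helper along these lines could look like under the hood, the sketch below shells out to ogr2ogr and then records provenance and an expiration entry with psycopg2. The function name, connection details, and the table_log bookkeeping table are assumptions for illustration, not the team's actual code.

import datetime
import subprocess

import psycopg2

# Placeholder connection details; "table_log" is an assumed bookkeeping table.
PG_DSN = "host=localhost dbname=analytics user=user password=password"
OGR_PG = "PG:" + PG_DSN

def shp_to_table(shp_path, table, conn, days_to_live=30):
    """Load a shapefile into Postgres via ogr2ogr, then record provenance and expiry."""
    # ogr2ogr handles the geometry and schema; -nln sets the destination table name.
    subprocess.run(["ogr2ogr", "-f", "PostgreSQL", OGR_PG, shp_path, "-nln", table],
                   check=True)
    today = datetime.date.today()
    with conn.cursor() as cur:
        # A table comment records where the data came from.
        cur.execute(f"COMMENT ON TABLE {table} IS %s",
                    (f"Imported from {shp_path} on {today}",))
        # Log the table for automatic cleanup after a configurable period.
        cur.execute("INSERT INTO table_log (table_name, expires) VALUES (%s, %s)",
                    (table, today + datetime.timedelta(days=days_to_live)))
    conn.commit()

conn = psycopg2.connect(PG_DSN)
shp_to_table("bus_stops.shp", "tmp_bus_stops", conn)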
So here we're using an API to reverse geocode some data and send it back to the database. Or you can do it just in plain SQL, and the benefit of using this for your SQL as opposed to, say, pgAdmin is mostly just to keep your workflow together, so you can rerun everything and you're not missing any steps. But also, if you create a table or rename a table, it will update the log, and
it will rename any associated indexes for consistency. For analysis, of course, you can do it in SQL again, or you can tell it to return your data as a Python list or a pandas DataFrame, which is really useful if you want to send it on to be operationalized or to a machine learning model, or if you just want to do some visualization.
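A hedged sketch of that last step using standard tools (psycopg2, pandas, and plotly.express); the query, table, and column names are made up for illustration and are not the library's API.

import pandas as pd
import plotly.express as px
import psycopg2

conn = psycopg2.connect("host=localhost dbname=analytics user=user password=password")

# Results as a plain Python list of tuples...
with conn.cursor() as cur:
    cur.execute("SELECT borough, COUNT(*) AS crashes FROM tmp_collisions GROUP BY borough")
    rows = cur.fetchall()

# ...or straight into a pandas DataFrame, ready for a model or a quick plot.
df = pd.read_sql("SELECT borough, COUNT(*) AS crashes FROM tmp_collisions GROUP BY borough",
                 conn)
px.bar(df, x="borough", y="crashes").show()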
And then we also added a function to send queries to Plotly to generate basic plots. It's been really useful for our exploratory analysis. Then when you're done, you can drop everything you've done at the end of your session, which is very nice if you don't need it anymore.
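One way the log-based cleanup could be implemented, again as an assumption built on the hypothetical table_log table from the earlier sketch rather than the library's actual code:

import datetime
import psycopg2

conn = psycopg2.connect("host=localhost dbname=analytics user=user password=password")
with conn.cursor() as cur:
    # Find tables whose expiry date (recorded at import time) has passed...
    cur.execute("SELECT table_name FROM table_log WHERE expires < %s",
                (datetime.date.today(),))
    for (table,) in cur.fetchall():
        # ...drop them and clear their log entries.
        cur.execute(f'DROP TABLE IF EXISTS "{table}"')
        cur.execute("DELETE FROM table_log WHERE table_name = %s", (table,))
conn.commit()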
If not, next time you connect to the database, it will check the log for any tables that you've created and drop any that have expired. So just to wrap up, this is a very simple library, but it's been really useful for us in making things move faster and more efficiently. Everything's clean and consistent, collaboration has become very easy, and it has kept our database from being cluttered with old data sets that are no longer needed. Thank you.