
From telemetry data to CSVs with Python, Spark and Azure Databricks

Formal Metadata

Title
From telemetry data to CSVs with Python, Spark and Azure Databricks
Series Title
Number of Parts
115
Author
Contributors
License
CC Attribution - NonCommercial - ShareAlike 4.0 International:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose, as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content, also in adapted form, is shared only under the conditions of this license.
Identifiers
Publisher
Year of Publication
Language

Content Metadata

Subject Area
Genre
Abstract
Tenova is an engineering company working alongside client-partners to design and develop innovative technologies and services that improve their business, creating solutions that help metals and mining companies reduce costs, save energy, limit environmental impact and improve working conditions for their employees. In the context of Industry 4.0, Tenova equips each machine with a field gateway, named Tenova Edge, to collect telemetry data, perform edge analytics with AI models and send the data to the Tenova Platform (hosted on Microsoft Azure) for further processing.

To develop analytics solutions, data scientists and process engineers need the data in a manageable format. Furthermore, continuous retraining of the AI models is necessary to guarantee high performance and reliable results. For all of these reasons, we needed to implement an ETL solution to transform the raw data into formats ready for analysis and retraining. In particular, the key requirement was to convert the JSON Lines files coming from the field into CSV files ready to be used. The CSV files have to satisfy the following conditions:

- each file contains the data for a single device
- there is only one file per device per day
- each file has a midnight row containing, for each cell, the value recorded at midnight or the last value of the previous day (SPOILER: this is where the fun happens!)

For this purpose, we have implemented a series of Databricks Notebooks, run daily by Azure Data Factory, that leverage PySpark and pandas to transform the raw JSON Lines files into nicely formatted CSVs.
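To make the midnight-row requirement concrete, the sketch below shows one way such a daily conversion step could look in PySpark and pandas. The column names (device_id, ts), the mount paths and the hard-coded day are illustrative assumptions, not the notebooks presented in the talk.

```python
# A minimal sketch of one daily JSON Lines -> CSV conversion step.
# device_id, ts and all paths are hypothetical, not the talk's actual code.
from datetime import datetime, timedelta

import pandas as pd
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

day = datetime(2020, 7, 23)            # the day being processed (assumed)
prev_day = day - timedelta(days=1)

# Read the raw JSON Lines for the target day plus the previous day,
# whose last values may be needed to build the midnight row.
paths = [f"/mnt/raw/telemetry/{d:%Y-%m-%d}/*.jsonl" for d in (prev_day, day)]
raw = spark.read.json(paths)

device_ids = [r.device_id for r in raw.select("device_id").distinct().collect()]
for device_id in device_ids:
    # One device at a time: filter in Spark, then finish in pandas.
    pdf = (raw.filter(F.col("device_id") == device_id)
              .toPandas()
              .assign(ts=lambda d: pd.to_datetime(d["ts"]))
              .set_index("ts")
              .sort_index())

    # Midnight row: keep the value recorded exactly at midnight, or
    # carry over the last value of the previous day (forward fill).
    midnight = pd.Timestamp(day)
    if midnight not in pdf.index:
        before = pdf[pdf.index < midnight]
        if not before.empty:               # assumes previous-day data exists
            pdf.loc[midnight] = before.iloc[-1]
            pdf = pdf.sort_index()

    # Keep only the target day: one CSV per device per day.
    daily = pdf.loc[f"{day:%Y-%m-%d}"]
    daily.to_csv(f"/dbfs/mnt/csv/{device_id}/{day:%Y-%m-%d}.csv")
```

The per-device pandas step is where the fun mentioned in the abstract happens: if no sample lands exactly on midnight, the row is synthesized by forward-filling the last observation of the previous day, so every daily CSV starts with a complete snapshot.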