
Data pipelines with Celery: modular, signal-driven and manageable


Formal Metadata

Title
Data pipelines with Celery: modular, signal-driven and manageable
Title of Series
Number of Parts
131
Author
Contributors
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
Writing pipelines for processing large datasets has its challenges – processing data within an acceptable time frame, dealing with unreliable and rate-limited APIs, and unexpected failures that can cause data incompleteness. In this talk we’ll discuss how to design & implement modular, efficient, and manageable workflows with Celery, Redis, and signal-based triggering. We’ll begin by exploring the motivation behind segmenting pipelines into smaller, more manageable ones. The segmentation simplifies development, enhances fault tolerance, and improves modularity, making it easier to test and debug each component. By leveraging Redis as a data store and Celery’s signals, we introduce self-triggering (or looped) pipelines that efficiently manage data batches within API rate limits and system resource constraints. We will look at an example of how we did things in the past using periodic tasks and how this new approach, instead, simplifies and increases our data throughput and completeness. Additionally, this facilitates triggering pipelines with secondary benefits, such as persisting and reporting results, which allows analysis and insight into the processed data. This can help us tackle inaccuracies and optimise data handling in budget-sensitive environments. The talk offers the attendees a perspective on designing data pipelines in Celery that they may have not seen before. We will share the techniques for implementing more effective and maintainable data pipelines in their own projects.
Transcript: English (auto-generated)
So, the title of the talk is "Data pipelines with Celery: modular, signal-driven and manageable". What we will be talking about today are some of the pros and cons and the use cases where we used Celery for data pipelining, for collecting, transforming and storing data.
So who am I? I'm a software engineer at Seek & Hit, which is a digital marketing and software engineering agency. We are also a Kiwi.com vendor, which is one of the sponsors of this EuroPython.
As for my work, I'm a member of the MarTech automation team, MarTech being marketing technology, if someone doesn't know. We collaborate with Kiwi.com's SEO, SEM and other teams,
which basically means that we have a lot of data pipelines and a lot of background processes that need to run on a regular schedule. My own personal interests include scalable systems and data engineering.
And with that, let's start with what we mean by data pipelines in Celery. When we talk about data pipelines, we mean a series of processes for collecting, transforming and storing data.
Celery, and I'm sure most of you have at least heard of it, is a distributed task queue framework. If we visualize a data pipeline in its most basic form, it looks something like this: three steps connected in series, where we collect, transform, and export
or store some data. So when we talk about data pipelines in Celery, we are really talking about chains: we chain tasks into a series that executes some work.
And here's a small code example, if you can see it, of how we define such a chain.
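The slide code isn't captured in the transcript, so here is a minimal sketch of what such a chain could look like; the task names (collect, transform, store) and the Redis broker URL are illustrative assumptions, not taken from the talk:

```python
# A minimal sketch of a three-step pipeline expressed as a Celery chain.
# Task names and the broker URL are illustrative.
from celery import Celery, chain

app = Celery("pipelines", broker="redis://localhost:6379/0")

@app.task
def collect(source):
    # Fetch raw records from an API or a database.
    return [{"id": 1, "value": "raw"}]

@app.task
def transform(records):
    # Clean and reshape the collected records.
    return [{**r, "value": r["value"].upper()} for r in records]

@app.task
def store(records):
    # Persist the transformed records (e.g. to Redis or a warehouse).
    return len(records)

# Chain the tasks so each step's result feeds the next one.
pipeline = chain(collect.s("my-source"), transform.s(), store.s())
pipeline.delay()
```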
Now, the pros and cons of using Celery for data pipelining, starting with the pros. Celery gives us scalability: we can add more worker nodes when needed. It has a signal-driven architecture, as we will see later in the presentation, that allows us to hook into the lifecycle of tasks and, more generally,
into the lifecycle of the pipeline, to perform some additional work. Integration: it can be integrated with other environments, the most basic example being Django.
It's easy to use and flexible. Basically, if you know how to write Python functions, a Celery task can encompass whatever a Python function can do, with the same logic, so the learning curve is a little smoother than with some other data pipeline frameworks.
As for the cons, there is setup and maintenance complexity: as your application scales and becomes larger,
it gets more difficult to set up and maintain. There is also a dependency on message brokers, which adds some infrastructure cost. Our team uses Redis all of the time; someone else might prefer RabbitMQ or something else.
And since we use a message broker to communicate, we introduce latency in the worker-broker communication. Finally, there are limited data pipeline features, of course, because Celery is a
general-purpose task queue, so some features require custom development. With that said, what are the challenges in data processing?
So what are the things we need to take care of? First, idempotency, which basically means that restarting the pipeline or a task should produce the same result, no matter how many times it is restarted.
Another challenge is handling large data sets efficiently: we need to be able to process the data within a reasonable time. We also need to handle API limits and ensure maximum utilization
of API requests within the allocated limits. Then, isolating failures: we need to be able to manage and handle failures at each step.
Often, before we make an API request or store some data, we first need to load it from Redis, and if anything fails, we need to return it so that we know where we left off
and can continue processing; these failures do happen. We should also isolate all of the work that is not related to the main pipeline output.
And of course, modularity: we want to break down complex tasks and complex pipelines into more manageable, self-contained ones. So, how did we do things before?
For example, gathering keyword metrics from an API for one and a half million rows. Let's be real, this is not exactly a large data set; it's maybe medium-sized. But how would we do it? We would load some data into Redis first using the first pipeline.
Then the second pipeline would periodically load a batch of rows and make some API requests for that batch. We would get back some results from the API and store the new data
to Redis again (a third pipeline), and then a fourth pipeline would periodically load a batch of the new data and again make API requests. So we basically had four pipelines for something that should be much simpler.
The issues were that we had processing idleness: we weren't processing data as fast as we could. For example, we would schedule the pipeline to run every two minutes, but in reality we would only need, say, 80 seconds to process everything.
After those 80 seconds we would already be able to make another API request, but we weren't utilizing that; we were wasting 40 seconds on each run,
and those 40 seconds would accumulate, so we weren't processing all of the data as we should. Another issue was resource synchronization between pipelines. When you have pipelines that run over multiple days, that are
constantly being scheduled and that are additionally interconnected by some logic, you need to introduce some mutex variable to sync resource access.
And this may be the worst thing for me personally: we also had increased maintenance complexity, because developers needed to carefully sync the schedules between the pipelines. They needed to think about when the first pipeline would run, then the second, third, fourth,
and they also needed to know about the mutex variable and the states that variable can be in. So this was really a pain. And, of course, all of this increases development time.
So, we decided to come up with a new approach where we basically utilize Celery signals to hook other pipelines or tasks into different points of a pipeline's execution.
Yeah, basically what I just said: hook into a specific lifecycle event of a task or a step in a pipeline. I'm not sure if you have used Celery signals before, but this is an example of how you can set up a basic hello-world signal.
When the hello task is finished, when it has completed correctly, the post-run trigger is called and it triggers the world task. That's basically the basis for our new approach.
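The slide isn't in the transcript, but a minimal sketch of that hello-world signal, assuming a Redis broker and illustrative task names, could look like this:

```python
# When the `hello` task finishes successfully, the task_postrun handler
# triggers the `world` task. Broker URL and task names are illustrative.
from celery import Celery
from celery.signals import task_postrun

app = Celery("signals_demo", broker="redis://localhost:6379/0")

@app.task
def hello():
    return "hello"

@app.task
def world():
    return "world"

@task_postrun.connect(sender=hello)
def trigger_world(sender=None, state=None, **kwargs):
    # Only continue if the hello task completed correctly.
    if state == "SUCCESS":
        world.delay()
```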
I'm going to go through a few of the use cases that I have prepared. So, in this first use case, we are making API requests based on data from a database.
Basically, we have an analyst, a stakeholder, who prepares a query for us; we need to run it to gather the data, and based on that data we need to make some API requests. Just to note that we're using BigQuery.
Now, the issue with this use case is that the API requests had an impact on net revenue, which meant that we needed to be careful. So we said: okay, you want us to make those requests
based on the data that you provide us, but what if something goes wrong in the future? What we wanted to do is store the results of the query and store all of the failed API requests,
so that at any point in the future we can check what went wrong. Sending the API requests in batches is our main goal, but what we proposed is not something that's for the stakeholders
or for the business or the main goal of the pipeline; it's basically a side job to store the historical logs and the log of failed requests. So, if we stick to our old approach, the pipelines would look something like this.
We would have two pipelines: one that flushes Redis, reads the SQL file, and loads and stores the gathered data to Redis; and then a periodic task that runs every, let's say, two or five minutes: load a batch from Redis, make the API requests, store the failed requests,
and at the end, signal that it's done. With the new approach, we broke it down to something like this. The first pipeline, the flush Redis, read SQL, and load-and-store-to-Redis steps, remains the same, but now, using Celery signals and passing the job ID,
we trigger an additional task that stores the query results to a table, and we also use the signal to trigger the next pipeline. That next pipeline will load a batch of data from Redis and make an API request,
and then we have another signal to store the failed API requests. So, basically, if anything goes wrong in the store-failed-requests or store-query-results-to-table tasks, it won't affect the main pipeline.
And when we're done with a single batch for, let's say, the first API request, the end task sends a signal that restarts the pipeline again to load the next batch. So we're not wasting time waiting for Celery beat or waiting every two minutes for the next run.
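As a rough sketch of that self-triggering (looped) pipeline, assuming the remaining work sits in a Redis list called pending_rows and using illustrative task names:

```python
# A self-triggering ("looped") batch pipeline: when one batch ends, the
# post-run handler immediately starts the next one if work remains,
# instead of waiting for a periodic schedule. Names are illustrative.
import redis
from celery import Celery, chain
from celery.signals import task_postrun

app = Celery("looped", broker="redis://localhost:6379/0")
r = redis.Redis()

BATCH_SIZE = 100

@app.task
def load_batch():
    # Pop up to BATCH_SIZE rows from the Redis list.
    rows = [r.lpop("pending_rows") for _ in range(BATCH_SIZE)]
    return [row.decode() for row in rows if row is not None]

@app.task
def make_api_requests(rows):
    # Call the external API for each row; collect failures so a separate
    # signal handler can persist them without touching the main pipeline.
    failed = []
    for row in rows:
        # The actual API call is omitted in this sketch.
        pass
    return {"processed": len(rows), "failed": failed}

@app.task
def end_batch(result):
    return result

@task_postrun.connect(sender=end_batch)
def restart_pipeline(state=None, **kwargs):
    # Re-trigger the same chain while there is still data left to process.
    if state == "SUCCESS" and r.llen("pending_rows") > 0:
        chain(load_batch.s(), make_api_requests.s(), end_batch.s()).delay()
```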
I'm not sure how much sense it makes to have this here, but I just wanted to give a quick overview of how we do it. We get the table by name, we have a delete-from-the-destination-table query,
we have some transform-column function, and at the end we upload the data frame to the table. At this point I just wanted to point out, as I already mentioned, that one of the challenges
is making sure the tasks are idempotent, and one of the ways we do it here is that if the task is restarted, the delete query removes rows from the table based on the task ID that was stored.
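The actual slide code isn't in the transcript; a sketch of that idempotent store step, assuming the google-cloud-bigquery client and an illustrative table name and task_id column, could look like this:

```python
# Idempotent "store query results to table" step: delete whatever a previous
# run of this task wrote, then upload the dataframe tagged with the task ID,
# so restarting the task cannot duplicate data. Names are illustrative.
import pandas as pd
from google.cloud import bigquery

def store_results_to_table(df: pd.DataFrame, task_id: str,
                           table: str = "my-project.my_dataset.query_results"):
    client = bigquery.Client()

    # 1. Remove rows written by an earlier (failed or partial) run of this task.
    delete_sql = f"DELETE FROM `{table}` WHERE task_id = @task_id"
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("task_id", "STRING", task_id),
        ]
    )
    client.query(delete_sql, job_config=job_config).result()

    # 2. Tag every row with the task ID and upload the dataframe.
    df = df.assign(task_id=task_id)
    client.load_table_from_dataframe(df, table).result()
```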
As for the signal function, store failed requests, we have these two post-run connects just to show that you can connect different tasks, multiple tasks, to the same handler. You can also do some state checks there and check whether there is any data before running the next pipeline.
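A sketch of that pattern, with illustrative task names and stand-in bodies, might look like this:

```python
# One handler connected to several task senders via stacked post-run
# connects, plus a check that there is actually something to store before
# any further work is triggered. All names here are illustrative.
from celery import Celery
from celery.signals import task_postrun

app = Celery("handlers", broker="redis://localhost:6379/0")

@app.task
def make_api_requests(rows):
    return {"failed": []}

@app.task
def retry_failed_requests(rows):
    return {"failed": []}

@app.task
def store_failed_requests_task(failed):
    # Persist the failed requests somewhere durable (table, bucket, ...).
    return len(failed)

@task_postrun.connect(sender=make_api_requests)
@task_postrun.connect(sender=retry_failed_requests)
def store_failed_requests(state=None, retval=None, **kwargs):
    # Same handler for both tasks; only act on success and only when there
    # is actually data to store, so the main pipeline is never affected.
    if state == "SUCCESS" and retval and retval.get("failed"):
        store_failed_requests_task.delay(retval["failed"])
```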
Now for the second use case: generating text using an LLM API.
LLMs are everywhere today, and of course we also wanted to use them, but the API has a one-request-per-minute limit, so that's a limit we need to take into account.
How did we do it? We want to respect the API rate limit, and we want to separate the API requests from the text uploads. So we want one data pipeline that keeps sending API requests to the LLM API
to get the text back and stores that data somewhere, and another pipeline that just uploads, dumps the text wherever we need it.
Concretely, the upper pipeline checks whether any prompts were inputted, loads them, makes the API requests, and waits for the API time limit to end. Once the API time limit has ended, we go to the end task and re-trigger the pipeline from the start,
and also trigger the other pipeline to upload the text to some bucket or whatever destination we want to upload it to.
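A minimal sketch of that re-triggering with the rate limit respected, assuming a 60-second window and illustrative task names:

```python
# Respect a one-request-per-minute limit by re-triggering the chain with a
# countdown instead of relying on a periodic schedule. Names are illustrative.
from celery import Celery, chain
from celery.signals import task_postrun

app = Celery("llm_pipeline", broker="redis://localhost:6379/0")

API_LIMIT_SECONDS = 60  # one request per minute

@app.task
def load_prompt():
    # Fetch the next prompt, e.g. from Redis; None would mean nothing to do.
    return "example prompt"

@app.task
def call_llm_api(prompt):
    # One request against the rate-limited LLM API; the generated text would
    # then be stored for the upload pipeline to pick up.
    return {"prompt": prompt, "text": "generated text"}

@app.task
def end_run(result):
    return result

@task_postrun.connect(sender=end_run)
def retrigger(state=None, **kwargs):
    if state == "SUCCESS":
        # Start the next iteration only after the rate-limit window passes;
        # the separate upload pipeline would be triggered here as well.
        chain(load_prompt.s(), call_llm_api.s(), end_run.s()).apply_async(
            countdown=API_LIMIT_SECONDS
        )
```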
Okay, and this third case is, for me, basically a fun case. We have an internal API that also, of course, has some limits, but this one has daily limits: a limit of 10 requests per day and, I think, a limit of 1,000 rows per day,
so not much data, but it was a fun little experiment where, again, we essentially turned the pipeline into a for loop.
We load some data from Redis, we make an API request to the internal API, and we finish the task. Then, in the handler, we do some additional checks and calculate the parameters for the next pipeline run. Basically, when the pipeline finishes, we check whether the task finished successfully,
we get the iteration count, the rows-per-request limit and the daily request limit, and we calculate how many rows the next run of the pipeline will load from Redis.
And then we send these parameters to the next pipeline call, to the next pipeline run, in the last line. Hopefully, you can see it.
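That last line isn't visible in the transcript, but a sketch of the handler computing the parameters for the next run, with illustrative names and limits, could look like this:

```python
# A "for loop" built from Celery signals: when one run finishes, the handler
# works out how many rows the next run may load and kicks it off with those
# parameters. Limits and names are illustrative.
from celery import Celery, chain
from celery.signals import task_postrun

app = Celery("daily_loop", broker="redis://localhost:6379/0")

DAILY_REQUEST_LIMIT = 10
DAILY_ROW_LIMIT = 1000

@app.task
def load_rows(batch_size):
    # Stand-in for loading `batch_size` rows from Redis.
    return ["row"] * batch_size

@app.task
def call_internal_api(rows):
    return {"processed": len(rows)}

@app.task
def finish(result, iteration):
    return {"iteration": iteration, **result}

@task_postrun.connect(sender=finish)
def schedule_next_run(state=None, retval=None, **kwargs):
    if state != "SUCCESS":
        return
    iteration = retval["iteration"] + 1
    if iteration >= DAILY_REQUEST_LIMIT:
        return  # the daily request budget is used up
    # Give each request an equal share of the daily row budget.
    rows_per_request = DAILY_ROW_LIMIT // DAILY_REQUEST_LIMIT
    chain(
        load_rows.s(rows_per_request),
        call_internal_api.s(),
        finish.s(iteration),
    ).delay()

# First run of the day, iteration 0.
chain(
    load_rows.s(DAILY_ROW_LIMIT // DAILY_REQUEST_LIMIT),
    call_internal_api.s(),
    finish.s(0),
).delay()
```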
That's why I said this is a fun case: we basically created a for loop using these Celery signals. So, consider using signals if you want to break apart large chains;
you can also check whether the next Celery chain or the next tasks should execute at all. Consider them if you seem to be scheduling too many periodic tasks: if your developers are taking more than five minutes to decide when something should be scheduled,
then something might be off. Consider them also when you have some kind of secondary work that needs to be done, as we saw in the first use case, where you want to store some historical logs;
you don't want that secondary work to interfere with the main pipeline, with its main goal. And consider them if you want to avoid loops inside your tasks; here, instead, the pipeline is triggering itself. Some of these use cases could have been implemented with nested for loops that send the data in batches,
but from my experience, that makes the code less readable,
and it takes much more effort to debug, to find the issues when they occur, and to handle the errors. Thank you. Just allow me to thank Kiwi.com for their support in preparing this presentation.
If you want to follow Kiwi.com's tech content, you can scan the QR code or check out the booth. Also, just to mention, if anyone wants to talk to me, I'll be at the booth later. And if you want to get in contact, here are my contact details. Thank you.
Yeah, thank you a lot for this very nice talk. For the Q&A, we would ask you to queue up at the microphone in the middle of the rows.
So, just feel free to come up, queue up and ask your questions. We've got exactly five minutes left for the Q&A. Hi, thanks for the talk.
One question: from what I understand, you can trigger a task with one task, so a one-to-one relationship. Is there any way to trigger multiple tasks?
I guess so, but is there also a way to have one task triggered only when three other tasks, for example, are completed, so a many-to-one relationship? Okay, so basically, the lifecycle signals are connected to a single task, so they fire whenever that one task is finished.
So, what you could do is add a task at the end of the three tasks that you want to execute
and then hook the signal to that one task. I don't think Celery supports, out of the box, having three tasks trigger one. But you can have the three tasks in a group, have another task chained after the group, and when that task is finished, trigger something else.
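A brief sketch of that group-plus-callback pattern (a Celery chord), with illustrative task names and a Redis result backend assumed, since chords need one:

```python
# Run three tasks as a group, collect their results in a callback task, and
# hook the signal to that callback. Names and URLs are illustrative.
from celery import Celery, chord
from celery.signals import task_postrun

app = Celery(
    "fanin",
    broker="redis://localhost:6379/0",
    backend="redis://localhost:6379/1",  # chords require a result backend
)

@app.task
def step(n):
    return n

@app.task
def after_all(results):
    # Runs once, after all three grouped tasks have completed.
    return sum(results)

@app.task
def next_pipeline():
    return "triggered"

@task_postrun.connect(sender=after_all)
def trigger_next(state=None, **kwargs):
    if state == "SUCCESS":
        next_pipeline.delay()

# chord = a group of header tasks plus a callback that receives their results.
chord([step.s(1), step.s(2), step.s(3)])(after_all.s())
```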
Thank you. If there are no further questions, I would ask you to give another warm round of applause for Marin.