Develop and deploy a Machine Learning pipeline in 30 minutes with Ploomber
Formal Metadata
Title: Develop and deploy a Machine Learning pipeline in 30 minutes with Ploomber
Series: EuroPython 2021 (talk 80 of 115)
License: CC Attribution - NonCommercial - ShareAlike 4.0 International: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers: 10.5446/58760 (DOI)
Transcript: English (auto-generated)
00:06
So, I guess we are going for one ad, or directly to the next speaker. Well, it's 4:15, 16:15. So, let's go to the next speaker. Eduardo, on the stage, please.
00:27
All right. Thank you. Thank you, Consuelo. Hello, Eduardo. Hi. So, you're the next speaker up. Where are you connecting from? From Mexico City. Nice. Nice. Nice place. Very nice. I was there many, many years ago.
00:46
So, you're going to talk about develop and deploy a machine learning pipeline in 30 minutes with... Ploomber. How do you pronounce your... Ah, okay. Perfect. So, you can start sharing your screen if you have slides anytime.
01:04
Okay. Let me try. Can you see my screen? Now, yes. Perfect. So, I'll disappear. You have 30 minutes and take it away.
01:21
Great. Thank you. Welcome, everyone. Thanks for being here at my presentation. My name is Eduardo and I'm going to be showing a demo of a project I've been working on, Ploomber. So, the talk is going to be develop and deploy a machine learning pipeline in 30 minutes with Ploomber. So, I'm going to be coding. I'm going to be coding as fast as I can,
01:42
trying to explain as many details as I can. But bear in mind that the objective of this presentation is not for you to become an expert in Ploomber, but rather to get a glimpse of what the experience looks like so you can consider it for your next project. So, before we start with the demo, I want to show a few things. Otherwise,
02:02
I'm going to forget this by the end of the presentation. So, just a few things. The project is open source, so you can check out the code on GitHub. Here is the link. If you like the project, please show your support with a star on GitHub. Please also join our community if you have any questions or just want to chat; the link is in the GitHub readme.
02:23
Or you can also reach out to me on Twitter. So, here's my handle. Okay. Let's start. The first thing that we are going to do is create a base project. So, I'm going to run the first command, which is ploomber scaffold. I'm going to be using conda for my dependencies; I could also use pip. And we are going to create an empty project.
02:46
So, that's the first step: we are going to create a base project. The last thing we need to get started is a name; I'm going to call this demo. Then we go to the demo folder, and I'm going to start explaining what this pipeline thing looks like. So, a pipeline is just a
03:01
bunch of tasks. Right? We get some data, we clean some data, we generate some features, we train a model. And we usually split this into many small steps so that we can modularize our pipeline. So, the central piece in Ploomber is this pipeline.yaml file where we declare our tasks. So, that's what I'm going to do now. I am going to create my first task.
03:22
I'm going to say source. That's where my source code is. I'm going to store this in a scripts folder, and I'm going to say get.py. So, this script is going to get some data, just the raw data that we need. And it is going to generate two outputs. So, I say product, and the first one is going to be a notebook. Why a notebook? That's because
03:42
Ploomber treats scripts as notebooks. So, we can develop them interactively, but then we can execute them from the command line and we can get an output notebook. The idea is that if our script generates any kind of charts or tables, once we execute the pipeline, we are going to be able to get all of these in a file that we can take a look
04:02
at. So, I'll do this. I'll say products/get.ipynb. I can also change the format, for example, I can say HTML, but I'll leave it as ipynb. This is also going to generate some data, so, products/get.csv. This is where I want to save my data. And that's it for our first
04:23
task. Now, let's continue with the next task. We're going to be using the Iris dataset, so I'm going to generate a feature from the sepal columns. I'll just call it the sepal feature. Same idea: source code and products. Let's continue with the next task, where I'm going to be using the petal columns. Same thing. And finally, we train a model. So,
04:50
I'm going to call this fit. I'm going to change this because this doesn't generate data; this is going to be a model. So, I'm going to change the name and say model.pickle. So, now we have a basic structure, the basic layout. Now, I'm going to ask Ploomber to
05:05
generate some basic files for me. I made a mistake. Yes, this should be, yes. Okay,
05:20
let's try again. Right. So, now we have the base files and I can generate a plot from this. So, we see that Ploomber is recognizing these files as our tasks. You see, we have four tasks. This doesn't have any structure yet. So, that's what we are going to be working on, and I'm going to show the integration with Jupyter. At this point, the pipeline.yaml looks roughly like the sketch below.
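This is a sketch reconstructed from the talk: the script and product file names follow what is mentioned on screen, but treat the exact paths as illustrative.

```yaml
# pipeline.yaml -- sketch of the four tasks declared so far
tasks:
  - source: scripts/get.py
    product:
      nb: products/get.ipynb      # executed copy of the script, saved as a notebook
      data: products/get.csv      # the raw data

  - source: scripts/sepal.py
    product:
      nb: products/sepal.ipynb
      data: products/sepal.csv    # feature built from the sepal columns

  - source: scripts/petal.py
    product:
      nb: products/petal.ipynb
      data: products/petal.csv    # feature built from the petal columns

  - source: scripts/fit.py
    product:
      nb: products/fit.ipynb
      model: products/model.pickle  # a trained model instead of a data file
```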
05:40
So, I'm going to open JupyterLab and we are going to start coding the logic for our pipeline. Okay, let's give it a few seconds. Okay. So, now let's go to the first task. So, I'm going to be getting some data. This is something important to mention: I have my pipeline.yaml here
06:03
and as you can see, get generates two outputs, right? So, Ploomber is auto-completing that for me and telling me: this is where you are supposed to save your output. So, I can simply run this cell and I have the information that I need. I'm going to import, this is where I'm going to be getting my data from, load_iris. So, it's imported, yes. The integration with Jupyter
06:32
is really nice because it allows me to do these kinds of things, like working interactively. It just makes things much easier than using a plain script. And remember that
06:44
this is a regular script. It just happens that Ploomber has a plugin that allows us to open them as notebooks. We rely on the fantastic Jupyter package and we just add a bunch of things on top of it to make this work. So, I think I need this frame. Yeah, this contains everything I need. So, I'll just save this with to_csv, and then I'm going to use a variable
07:06
that Ploomber adds for me. So, product['data'], and I don't want to save the index, so I'll say index=False. Okay, so that's it for our first task; a rough sketch of the finished script is below.
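A minimal sketch of what scripts/get.py contains at this point, assuming the percent-format cells Ploomber uses for scripts; Ploomber injects the real upstream and product values when it runs or opens the file:

```python
# %% tags=["parameters"]
# Ploomber replaces these with the values declared in pipeline.yaml
upstream = None
product = None

# %%
from sklearn.datasets import load_iris

# fetch the raw Iris data as a single data frame (features + target)
df = load_iris(as_frame=True)['frame']

# save it to the path declared as the 'data' product in pipeline.yaml
df.to_csv(product['data'], index=False)
```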
07:24
Now, let's continue with the next one, the sepal feature, and I'm going to show something interesting here. We are going to generate a feature, but we depend on the raw data to do so. So, what I'm going to do is use this special upstream variable and say I want to use get as a dependency. I save my file and reload, and you see that Ploomber is going to auto-complete things for me. So, I have my output where I'm
07:43
supposed to save my output, and where my input is. So, I continue working: let me import pandas, and I'm going to read my raw data, the data that I generated in the previous task. This is going to be upstream['get']. Okay, now I have my raw data. This is where I'm going to
08:08
generate one feature. So, it's going to be a really simple feature. Let's call this the sepal feature and say it's going to be equal to df... let's just take this one. I'm just doing a classic feature engineering step and multiplying it by the other column.
08:30
A really simple thing just for the sake of example. What's going on here? Oh, this one is extra. Okay, and now I got my new column and I'm only going to save this one
08:43
because I already have the rest of the columns. I'm just going to save this one; to_csv, same thing. So, I use the product variable that Ploomber auto-completes, because that's where I should save my output. Okay, so we finished the second task; the script ends up looking something like the sketch below.
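A sketch of the sepal feature script; the column names come from the Iris data frame, and the multiplication is just the illustrative feature from the talk:

```python
# %% tags=["parameters"]
upstream = ['get']   # declare the dependency; Ploomber injects the resolved paths
product = None

# %%
import pandas as pd

# read the raw data produced by the get task
df = pd.read_csv(upstream['get']['data'])

# a deliberately simple "feature": multiply one sepal column by the other
df['sepal-feature'] = df['sepal length (cm)'] * df['sepal width (cm)']

# save only the new column; the raw columns already live in the get product
df[['sepal-feature']].to_csv(product['data'], index=False)
```

The petal script is the same idea, using the petal columns instead.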
09:02
Now, let's move to the second, sorry, the third one, the petal feature. The code is going to be really similar, so just to save some time I'm just going to copy a few things here. Oh, first I have to declare my dependencies. So, let's reload. Okay, now let's add this new feature. So,
09:24
it's going to be really, really similar. I'll just copy this thing. I just have to change something here. All right, I skipped one important step which is loading my data,
09:45
my raw data. Here, yes. So, we load the raw data. We generate the feature and we save it. Let's just make a quick check. Everything looks good. Okay, so now we finished our
10:06
third step. Let's go to the final task which is fitting the model. Oops, actually. Oh, I kind of overlooked this important detail. So, these are .py scripts. In order to open them
10:21
as notebooks, I have to double click and then open as notebook. Now, this final step uses all previous tasks as inputs. So, I'm going to make a list: I'm going to use the sepal feature, the petal feature, and the get task. Okay, so these are my dependencies. Now, I am going to reload
10:41
this. You can see I get everything I need and let's work on our machine learning model. Let's load the raw data. And then we say upstream. The raw data is here.
11:04
So, we see our raw data. And now, let's load the features that we generated. Let's start with sepal. As you can see, this auto-completion and all these things allow us to really break things down. What usually happens is that people
11:23
write really long notebooks and it becomes a real mess. This way, we are breaking down this huge notebook into many small files that we can chain one with another, and this helps a lot with organization and maintainability, and we can also
11:41
collaborate with people because people may work on different files without any issues. Okay, so I have everything I need. I'm going to create one data frame with everything. So, let's call this df and this has everything that I've been working on, right? So, this is my training set. We have the raw data. We have the features that I generated and we have the target
12:03
variable. So, let's now train a model. Just a random forest. Okay. And just to show some charts
12:23
on evaluation charts, I'm going to create a confusion matrix. Okay. Now, we have our data. Let's split this into X and y. So, let's drop the target (axis='columns'), and then y is going to be
12:46
df.target. All right. So, we have X and y. Let's train our model. So, this is going to be the random forest. Now, let's call fit(X, y). I'm going to skip the cross-validation part
13:01
just to save some time and to be quick. But in real life, you should be doing cross-validation to evaluate your models. So, don't do this, please, in a real machine learning project. I'm going to generate predictions on my training set: predict, then the confusion matrix,
13:21
and then we need y and y_pred. Right. So, we have our evaluation. So, we finished the Jupyter notebook. Now, I've run things interactively, but I want to make sure that my pipeline runs from start to finish for reproducibility. So, what I'm going to do
13:42
is I'm going to ask Ploomber to run everything for me from start to finish. And you're going to see that it's going to run things in order. So, you can see here, it's getting the data. Then, it's going to generate the first feature, then the second feature, and finally, it's going to train a model. So, we are making sure, oh, I forgot something important.
14:05
Yes. I didn't save the model. I just trained a model, but I didn't save it. So, let's go back to JupyterLab and fix that. Give a few seconds. Okay. So, let's come back here.
14:23
And this doesn't take too long to run, so I'm just going to run everything. Okay. So, here's where we have to save our model. We see here that we declare a model as an output. So, we have to do that. That's why Ploomber was complaining,
14:41
because it's saying, well, you told me you were going to save something, and I don't see it. So, tell me what it is. What else? Oh, I need to import pickle. And now, I have my model, and I'm going to save this in product['model']. So, write_bytes, and then
15:13
pickle.dumps. All right. So, now, we can close this. The full fit script ends up looking roughly like the sketch below.
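Putting the whole thing together, the fit script looks roughly like this. As in the talk, it evaluates in-sample with a confusion matrix and skips cross-validation, which you should not do in a real project; the plotting helper and the task names are my assumptions:

```python
# %% tags=["parameters"]
upstream = ['get', 'sepal', 'petal']  # raw data plus the two feature tasks
product = None

# %%
from pathlib import Path
import pickle

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import ConfusionMatrixDisplay

# build one training frame: raw data + generated features (raw includes the target)
raw = pd.read_csv(upstream['get']['data'])
sepal = pd.read_csv(upstream['sepal']['data'])
petal = pd.read_csv(upstream['petal']['data'])
df = pd.concat([raw, sepal, petal], axis=1)

X = df.drop('target', axis='columns')
y = df['target']

# fit a random forest (no cross-validation here, demo only)
clf = RandomForestClassifier()
clf.fit(X, y)

# in-sample confusion matrix, just so the output notebook has a chart
y_pred = clf.predict(X)
ConfusionMatrixDisplay.from_predictions(y, y_pred)

# save the trained model to the path declared as the 'model' product
Path(product['model']).write_bytes(pickle.dumps(clf))
```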
15:26
This can help me show some nice features of Ploomber, because I can call this ploomber build command again. So, I already built most of my pipeline: I ran get and the two tasks that generate features. So, if I call this again, check out what's going to happen. It's only running fit,
15:44
because I already have the outputs for the other tasks, and I haven't changed anything, so it can skip tasks that haven't changed since the last run. So, for example, if I run this again, it's not going to do anything, because I haven't done anything. So, it helps you to iterate faster on your pipeline. Okay. So, we finished the training pipeline. Let's work on
16:04
the serving pipeline. I'm going to show the new plot. Now that we established the relationships between the tasks, I can see these new charts. So, before, we had a plot without any structure. Now, we are saying we are getting some data, we generate some features, and we join
16:20
everything to train a model. Now, I want to generate a serving pipeline, and the only difference between this training pipeline and my serving pipeline is what happens at the beginning and at the end. When we are training a model, we want to get historical data. We process it, and then we train a model. When we want to make predictions, we are going to get new data. So,
16:42
all the new data points that we want to make predictions on, we have to apply the same pre-processing to generate the same features, and we are going to load up a model and make predictions. So, as you can see, what happens here in the middle is the same thing. So, I'm going to use that fact and reuse this code, so I don't have to compute my feature
17:00
generating code twice. So, that's what I'm going to do now. What I'm going to do is create a new file that separates what's common to both pipelines. So, I'm going to call this features.yaml, and then I'm going to create another file where I'm going to be declaring my serving logic. So, let's go back to the training pipeline. I'm going to take out these two tasks
17:24
which generate the features. So, I'm going to put them here. This is going to be common to both pipelines. So, I put this here, and now, to fix my training pipeline, I'm going to import that file. So, I use import_tasks_from and point it at features.yaml. Okay. So, that's it; roughly, the split looks like the sketch below.
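Conceptually, the split looks something like this: the two feature tasks move into features.yaml, and the training pipeline pulls them in via import_tasks_from (file and task names are illustrative):

```yaml
# features.yaml -- tasks shared by the training and serving pipelines
- source: scripts/sepal.py
  product:
    nb: products/sepal.ipynb
    data: products/sepal.csv

- source: scripts/petal.py
  product:
    nb: products/petal.ipynb
    data: products/petal.csv

# pipeline.yaml -- training pipeline, now importing the shared tasks
meta:
  import_tasks_from: features.yaml

tasks:
  - source: scripts/get.py
    product:
      nb: products/get.ipynb
      data: products/get.csv

  - source: scripts/fit.py
    product:
      nb: products/fit.ipynb
      model: products/model.pickle
```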
17:45
Now, for our serving pipeline, I'm going to reuse our training pipeline as a base and just make a few changes. For our serving pipeline, instead of getting historical data, we need to get new data. So, I'm going to create a new script called get-new. I have to
18:06
be compatible with the rest of the code. What else do we have to do? I have to change this. So, this is not going to train a model; this is going to make predictions. I'm going to do predict, and this is going to be a data file with the predictions. So, I'll call this predict.
18:22
Okay. So, now I'm going to parameterize these two pipelines because when I run the training pipeline, I want to save the output in one folder, and when running the serving pipeline, I want to save the files in a different folder. So, I create a new file to parameterize my pipelines, and I'm going to add an out parameter pointing to a train folder. So, my training pipeline is going to save its output
18:45
here. And for our serving pipeline, I'm going to create that, and out is going to be a serve folder. Now, I need to parameterize my pipeline. What I'm going to do is I'm going
19:01
to change the path to the output files, and I'm going to include that parameter that I just created. Okay. So, I know this is really fast, I'm skipping lots of details, but I just want to give you some idea of what the experience looks like. So, I parameterized my training pipeline. Now, I have to do the same thing with my serving pipeline. Out. And finally,
19:26
this other file. Okay, out. I'm going to test this thing. Oh, I missed something. I have to include my model. So, when serving predictions, we have to load our model. So,
19:45
what I'm going to do is say my model is in this folder, in a file called model.pickle, and this has to be a parameter. So, params: model. Now, let's get our pickle file, the one that we generated when we ran our training pipeline. I'm just going to copy that. With the parameterization in place, the serving pipeline and the env files look roughly like the sketch below.
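The {{out}} placeholder is resolved from env.yaml for pipeline.yaml and from env.serve.yaml for pipeline.serve.yaml; that naming convention, the folder names, and the task names are assumptions on my part, and the products in features.yaml get the same {{out}} treatment. The name: get override keeps the new-data task compatible with the shared feature tasks:

```yaml
# env.yaml -- parameters for the training pipeline
out: products/train

# env.serve.yaml -- parameters for the serving pipeline
out: products/serve

# pipeline.serve.yaml -- serving pipeline
meta:
  import_tasks_from: features.yaml

tasks:
  - source: scripts/get-new.py
    name: get                     # keep the name so the feature tasks still find their upstream
    product:
      nb: "{{out}}/get.ipynb"
      data: "{{out}}/get.csv"

  - source: scripts/predict.py
    params:
      model: products/train/model.pickle  # the model produced by the training pipeline
    product:
      nb: "{{out}}/predict.ipynb"
      data: "{{out}}/predict.csv"
```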
20:03
I'm going to delete the rest of this because I want to show you how, now that we parameterized our pipelines, I can run ploomber build again. This is going to run the training pipeline, and you are going to see that it's going to save everything in the train folder because we parameterized the pipeline. So, it's running everything from
20:21
scratch again. It's training a new model. Now that it finished, I'm going to do the same for the serving pipeline. We want to test that we can actually serve, oh, actually, I'm skipping a really important step, which is coding the logic for the serving pipeline. So, I'm just going to do that. I have to tell Ploomber that it has to use the serving pipeline instead of the training one. So, I'm just going to do this.
20:49
Now, I'm going to generate the base files right now because I don't have anything, I don't have this file or this file. So, that's what I'm going to do now. I'm going to call ploomber scaffold and point it at my serving pipeline file. Okay. So, we got those two files. We
21:05
can see them here. And now, let's go to JupyterLab. So, we code the logic that gets new data and the one that loads the model and makes predictions. So, just for simplicity,
21:20
I'm going to be loading the same data. It's not going to be new data because this sample data set is limited. So, again, please don't do this in a real machine learning project. This is just to make an example of how this works. In a real project, we would be getting new observations that we want to make predictions on. So, I just copied the code from the other task because
21:44
this is going to be really similar. So, let's assume this load_iris call gives us new data and we want to make predictions on it. I have to change this because we don't want the target variable. So, you can see we only have the raw data and we want to make predictions on it. So, I'm going to save this. I didn't. Oh.
22:08
Why is this not auto-completing things for me? All right. Let me see what's going on. Oh, I see what happened. This shouldn't be get-new; this should be get. All right. Let's see if
22:35
my output. Oh, I don't have the output folder. I can actually use the command line to
22:44
ask Ploomber to run my new code, so it generates that folder. I am missing the serve folder, so it cannot save that file, because I only have train. But just to show how the command line works, I'm going to say ploomber task and then get. So, I want to run this task
23:04
and point it at my serving pipeline. So, this is going to run this file from the command line and it's going to create that folder for me. All right. So, we finished that. We can ignore this and continue working. So, we are going to reuse the previous code, these two files,
23:25
the ones that generate features, this one and this one. So, I can simply run my pipeline because we already declared that in our serving logic. So, I'm going to do ploomber build. This, of course, is going to break at the end because I don't have the script that makes a
23:43
prediction. I have to work on that now. So, I am just going to generate the features. Now, I have the features that I need and I can continue working on my final step. I can show that I have the serve folder. All right. So, this is generated by the serve pipeline.
24:04
Okay. So, now let's continue working on this. We need everything from the previous tasks. Now, let's reload this thing. And I am going to borrow some of the code from here, not from here.
24:28
From here, just to make things a little faster. I need this. I'm just going to make a few changes here. Actually, I don't think we need to change anything. So, we are going to generate
24:47
the features and then we are going to load the model and make predictions. So, you see, we have all the features. We don't have the target variable, because this is the task that loads the model and predicts. Now, let's load our model: from pathlib import Path, and import pickle.
25:04
So, we have the path to our model here. Let's load that. I think it's model. Yes, model: read_bytes. And we need pickle.loads. This is going to return the object.
25:26
Okay. So, we load our model. We are going to make predictions now. I think it's called predict. Yeah, preds. And now let's just create a data frame with this,
25:40
so that we can save this as a CSV file. Okay. So, let's assume these are the predictions that we want to generate. Now, we save this: to_csv, product['data'], index=False. Okay. The finished prediction script looks roughly like the sketch below.
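A sketch of the prediction script; the model variable is filled in from the params entry in pipeline.serve.yaml, and the task names are the same assumptions as before:

```python
# %% tags=["parameters"]
upstream = ['get', 'sepal', 'petal']  # new raw data plus the two shared feature tasks
product = None
model = None  # injected from 'params: model:' in pipeline.serve.yaml

# %%
from pathlib import Path
import pickle

import pandas as pd

# rebuild the same feature matrix used for training (no target column here)
raw = pd.read_csv(upstream['get']['data'])
sepal = pd.read_csv(upstream['sepal']['data'])
petal = pd.read_csv(upstream['petal']['data'])
df = pd.concat([raw, sepal, petal], axis=1)

# load the model trained by the training pipeline and predict
clf = pickle.loads(Path(model).read_bytes())
preds = pd.DataFrame({'prediction': clf.predict(df)})

# save the predictions as this task's data product
preds.to_csv(product['data'], index=False)
```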
26:02
So, we finished coding. Let's make sure that our serving pipeline actually runs from scratch before we deploy this to the cloud. Okay. It's working. Great. We finished working in JupyterLab so we can close this. Let me shut this down. And now, so, we finish
26:25
with the coding part. We don't need this anymore, the output; we just need the model. I'm just going to delete it. Now, we use the second command line tool. So, Ploomber helps you write pipelines locally, and if you want to run things in the cloud, you can use
26:43
the second command line tool that we're going to use now. So, what I'm going to do is create a new deployment environment. I'm going to use soopervisor add, I'm going to call this serve, and we are going to use the AWS Batch backend. We can also use Airflow or Kubernetes. The experience is pretty much the same. The only difference is
27:06
the configuration. It's pretty much the same. Oh, I'm missing two files that I need. My dependencies. So, what I'm going to do is I'm going to get those files. I need
27:20
these three, just configuration files that I need: my credentials for the S3 bucket and my dependencies. That's why the command didn't work. So, now, okay, now it worked. And we have this new file, which is where we are going to be setting the configuration for the execution in the cloud.
27:43
So, these are the AWS Batch settings. You can ignore the details here if you are not interested in AWS Batch. These settings change if you change the backend. So, I need to get a copy of my repository URL. So, I have that here. Great. So, those are my settings; roughly, the generated file looks like the sketch below.
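For reference, the soopervisor.yaml generated for an AWS Batch target looks roughly like this. I'm reconstructing the keys from memory, so treat every name and value here as an assumption and rely on the file soopervisor generates for you (the command used in the talk is along the lines of soopervisor add serve --backend aws-batch):

```yaml
# soopervisor.yaml -- sketch only; key names are assumptions, values are placeholders
serve:
  backend: aws-batch
  repository: your-ecr-repository-url   # the repository URL mentioned in the talk
  job_queue: your-job-queue
  region_name: your-region
  container_properties:
    memory: 16384
    vcpus: 8
```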
28:06
I finished configuring this thing. Now, there's one remaining piece here, because when we run things in the cloud, we are going to be running each task, so each of these scripts, in a different container
28:22
that are completely isolated. So, if we run one task that depends on a previous task, we need to transfer the data. What we use is an S3 bucket, and I have to configure a client. So, I need to add a new file to configure the client, and I say
28:42
clients.get. And I'm going to create that file now, clients.py. And now, I have to configure my S3 bucket. So, S3Client: the get function returns an S3Client with my bucket. Now, the folder, I'll say
29:04
hello-from-python. And my credentials are in credentials.json. Okay. So, I think we're done. The clients.py file ends up looking roughly like the sketch below.
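A minimal sketch, assuming ploomber.clients.S3Client takes a bucket name, a parent folder, and a path to a JSON credentials file (check the Ploomber docs for the exact signature); pipeline.serve.yaml then points its File client at this function, e.g. with a clients: {File: clients.get} entry:

```python
# clients.py -- returns the client Ploomber uses to copy task products to and from S3
from ploomber.clients import S3Client


def get():
    # bucket name is a placeholder; the folder and credentials file follow the talk.
    # The keyword argument names are an assumption -- double-check against the docs.
    return S3Client(bucket_name='some-bucket',
                    parent='hello-from-python',
                    json_credentials_path='credentials.json')
```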
29:20
Let's check if this configuration works. So, ploomber status, and check our serving pipeline. And if this configuration can successfully connect to S3, which it did, it means that we are ready to deploy. So, what I'm going to do now is soopervisor export
29:42
and the name of my environment, which is serve. Okay. Let's run this. So, it's loading my pipeline. And now, it's going to make sure that it actually works. It's creating the Docker image. It's pretty fast because I already generated a base image. So, it's only adding the new code, but it's not installing dependencies because it already has
30:05
those things, just to make this faster. Now, it verified that the pipeline works, that you can import it. Then it checked the configuration with the S3 bucket. It pushed the image and it submitted the jobs. So, that's it. We deployed our pipeline. Let's
30:29
ignore this. These are just the practice runs that I did last night to make sure that I was able to do this in 30 minutes. So, you can see the new tasks here. These are the ones that we just submitted to AWS. And that's it. We finished. We finished on time.
30:58
All right, Eduardo. Thank you so much. Great talk. Great tool. Great presentation.
31:05
Live coding, everything. I think people in the Matrix chat were also really impressed. Well, we can cut a little bit into the break right now. So, I will ask you
31:21
one question, one very quick question that people had in the chat: can Ploomber be used without Jupyter notebooks? Yes, yes, you can. You can use it without Jupyter.
31:40
I have a really strong preference for Jupyter because it allows me to do things interactively. But if you like a text editor, of course, you can use the tool that you prefer. Fantastic. Great. Thank you so much, Eduardo. Folks, let's thank all our speakers again
32:01
for this session. I will do the chat clapping myself.