Simple ETL in python 3.5+ with Bonobo
Formal Metadata
Number of Parts | 160
License | CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this
Identifiers | 10.5446/33794 (DOI)
EuroPython 2017 (31 / 160)
Transcript: English (auto-generated)
00:06
Welcome everyone. Thank you for attending and thank you for the introduction. I hope I will meet the expectations. So yes, I'm here to present Bonobo, which is an ETL project in Python 3.5 and the next versions.
00:23
That is about six months old. I worked on it for much longer, but this version is a brand new rewrite from the beginning of the year. So real quick, I'm Romain Dorgueil. I have a French name and probably a French accent too.
00:42
I worked in a lot of different companies and different contexts, but I've been around web development and software engineering for around the last ten years. I've seen a lot of ETLs, market ETLs in different contexts, and I didn't find what I really wanted. So it's the main reason why Bonobo exists today and why I put so much energy in that.
01:05
Real quick also, I'm working currently as an advisor in the startup accelerators of BNP Paribas. So we are a team of former entrepreneurs building YBoost and FinTech Incorporated accelerators, where we basically give business and technical advice to founders.
01:23
So back to the beginning. My plan for this talk is to take maybe something like ten minutes to go into what exists in various languages, not only Python, because most of the ETLs I used were not Python,
01:41
what exists also in Python and why I decided that it was not meeting my needs. Then I will try to use most of the time of the talk to show you real cases and real world usage. Well, not really real world, but example usage so you can understand a bit more
02:01
and also walk you through the few things you have to know to start using it, and there is really not much. And then we have a conclusion with some pointers on where you can go from there. So ETL, as many of you probably already know, means Extract, Transform, Load.
02:22
According to Wikipedia, it was already popular in the 1970s, so definitely not something new. And it's basically everywhere you have more than one data store holding the same data. So if you have some master-slave data, if you have some stock system connected to some e-commerce website, for example,
02:43
you will probably use some kind of ETL to connect all that together. For those who don't know about that, the most simple schema I could come up with about ETLs is that you have a stack of data, here it's foo, bar, and baz.
03:01
You have a list of transformations you want to apply to each line of data, and the Extract here can transform foo into something else. When it finishes transforming it, it goes into Transform, and while Transform is taking care of the result of extracting foo, Extract can start handling bar, et cetera.
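The item-by-item flow described here can be sketched with plain Python generators. This is only an illustration using the standard library, not Bonobo code; the `events` list is a hypothetical helper added to make the ordering visible. Generators interleave the steps rather than run them in parallel, but they show the same idea: foo leaves Extract and reaches Transform before bar is even extracted.

```python
events = []  # hypothetical trace, only here to show the order of operations


def extract():
    """Yield the raw items one at a time."""
    for item in ["foo", "bar", "baz"]:
        events.append(f"extract {item}")
        yield item


def transform(items):
    """Transform each item as soon as it arrives."""
    for item in items:
        events.append(f"transform {item}")
        yield item.upper()


def load(items):
    """Collect the transformed items (a real job would write them somewhere)."""
    return list(items)


result = load(transform(extract()))
print(result)
print(events[:3])
```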
03:22
So, as it's completely independent, you can handle each transformation line in parallel, step by step. In the real world it usually looks like that. There are databases, there are mails sent, there is logging, maybe not mails, anything.
03:44
But the general concept is exactly the same, it's just not as linear as it is here. There are a lot of tools I've seen on the market, mostly Java-based, that look like that.
04:02
It's usually an IDE first, and probably you can code it. Here it's Talend Open Studio. Probably you can use code, but mostly it's a graphical interface that configures everything from dialogues. It's very handy, but when you're a programmer you feel very limited, very fast.
04:23
There is CloverETL, which I never used, but it looks exactly the same, just a bit different. This one is Pentaho Data Integration, also called Kettle. I used this one a lot, but it's the same concept. In this world we have mostly GUI first, eventually code, and mostly Java-based.
04:47
In the Python world there are a few libraries, and this is not at all exhaustive. Bubbles, I think, is now marked unmaintained. petl is much more of a fluent interface. There are a lot more.
05:01
Some people at the last conference told me about mETL. I think there is a Python ETL. But none of these, according to my analysis, were doing the same thing as the Java tools, which is simply connecting independent boxes together using a data flow. So I started to create Bonobo.
05:22
In fact, I started by creating another library, for Python 2.7, but it was just badly written, so the best thing to do, in my opinion, was to start again and completely drop Python 2 support. I'll explain more later. There are also related tools that you must know, I guess.
05:44
Joblib, Dask, Pandas. You may know at least some of them, maybe Pandas. They're amazing tools, but ETL is not really their main focus. For example, Pandas is really good at transforming a dataset into another dataset, and I'm using it maybe every day.
06:03
But when I want to do more engineering on data, like taking one item at a time and transforming in step A, step B, step C, step D, it's not really the topic at all. There are other scales of tools to transform data. Real quick, you may know IFTTT or Zapier,
06:23
which are cloud-based software-as-a-service tools to do small automations. Obviously, this won't run on your laptop. There are huge data tools, like Spark, Hadoop. I just listed a few here, but either you need a big infrastructure to start doing things,
06:42
or at least a decent infrastructure, or you're using a cloud-based thing, and you have the same problem of how you work without the cloud and how you avoid locking yourself into one vendor. As said in the description of the talk, there is no big data in this room.
07:03
I want to tell you a bit of a story about how I came to discover ETLs while I was co-founding a company. When we started, we had a few different partners. We were doing a marketplace about retail.
07:21
It was clothes, mostly for women. We needed to work with different partners to integrate the stocks and catalogues and colors and pictures, et cetera, on our marketplace, which was multi-brand. The first partner went very well. We just coded it in Kettle, Pentaho.
07:44
After a while, we moved things everywhere. After a while, it was working, so we were really happy. We got a few deals. The best idea to integrate the other partners is to copy-paste the code to the second and third partner
08:04
because it's about the same, but just a bit different. Of course, this is not a good engineering practice, but when you're used to subclass things and to instantiate things with different parameters and now you have a GUI, you're just lost
08:23
because you can't really do that, so you don't know how to not repeat yourself. Of course, the time comes when you need to fix a bug and you just go crazy. Maybe if it's not a bug, it's new features because you didn't support colors for now and now you have a different model of colors and you need to update everything.
08:42
Really, what I needed was something cheap I could install on my laptop, use on servers too, and using code as configuration and preferably Python code. This one is not for any good reason, except the fact that I prefer code in Python than anything else. Mostly, I needed something that used code as configuration
09:01
to do ETL, just like Pentaho was doing ETL. And yeah, that's Bonobo. It's a framework to write ETL jobs in Python using code, and eventually, someday, some kind of GUI may come to visualize things, but first, it's code, so you can write classes, you can subclass things and adapt things just like you're coding for the web
09:23
or maybe other engineering. I'll go very fast on that, but it's very different to all tools existing in the PyData world that I know. Maybe there are tools I don't know, so I'm very happy if you tell me about that.
09:41
And Hacker News told me that I'm a bit stupid. A bonobo is not a monkey, it's an ape. The French language apparently doesn't make the difference, so I did not even know that those were two different words. So, let's see. I will try to show you first how to bootstrap a project,
10:08
then I'll pause to show you all the different concepts that I used without telling before, and then I will go back to the demo and examples to apply the concept I showed to different demos.
10:22
So, the basics: it's pip install, and you have a generator using cookiecutter that is just bonobo init something, and you can run something with bonobo run. So, how do I switch to a terminal? Not like this, obviously.
10:41
Okay, so I already ran the init, but because I'm sure you don't believe me, or maybe you believe me, but I will just show it. You can bonobo init foo, for example, and it will just create a foo directory with a main.py file, and probably if I diff main.py with foo/main.py,
11:05
yeah, there are a few differences, but mostly ends of lines. So, it's the same file. So, I remove foo because I have another file that I will use after that, and just so that, yeah, I can bonobo run main.py, for example.
11:23
That's the default transformation that is bundled with the generator. Nothing really fancy, it just generates numbers and keeps only the odd numbers, and I can also run on a directory because the main.py file is considered the main,
11:41
and so bonobo run . will do exactly the same. Okay, that's not really interesting, but that's really the basics. So, that was that. So, now I want to show you what was actually run, so I won't show you the 1 to 42 or 0 to 41,
12:02
but I wrote a simple one here, which is basically the definition of three different functions: one yielding euro, then python, then 2017, one just applying title to the input string, and one just printing the thing. Once I define all that, I can create a graph instance.
12:23
Here it's a linear graph, which is the default thing we can do, but there is also an API method to add other chains, forking from some point in the graph. I will show it in an example. And yeah, I just define a graph, and because there is a graph instance here, I can use bonobo run on this file,
12:42
and it will just add some plugins for the display and run that. So, here I should have the first .py file, maybe it's a bit big, I don't know. I have the first .py file, which is not at all what I wanted to show.
13:01
Okay, I need to check out example one, and I replace the code in main.py, so it's exactly what I showed, and I can bonobo run main.py. I will see the output of the load, which prints the thing,
13:21
so euro, python, 2017, which are title-cased. And I see some statistics. It's very fast here, so it's already gray, but you will see the statistics move while it's running on a longer transformation. Okay, that was this one.
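The three functions from this first example can be reconstructed roughly as below. The slide code itself is not in the transcript, so this is a sketch: the function names follow the extract/transform/load vocabulary of the talk, and `build_graph` is a hypothetical wrapper that imports Bonobo lazily so the plain functions can be tried without it installed. For `bonobo run` to pick the graph up as described in the talk, the `bonobo.Graph` instance would sit at module level instead.

```python
def extract():
    """Source node: takes no input and yields three strings."""
    yield "euro"
    yield "python"
    yield "2017"


def transform(s):
    """Title-case each incoming string."""
    return s.title()


def load(s):
    """Final node: just print what arrives."""
    print(s)


def build_graph():
    # Hypothetical wrapper, not from the talk: defers the Bonobo import
    # until a graph is actually requested.
    import bonobo

    return bonobo.Graph(extract, transform, load)


# Without Bonobo, the same linear chain can be walked by hand:
for value in extract():
    load(transform(value))
```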
13:41
Yeah, so second one, a bit more complete. I just made a europython.txt file. I just extracted data from the EuroPython Society, which is the company behind EuroPython, about all the EuroPython conferences. It's like two or three lines each time that say, yeah, it was there, there were maybe a few attendees.
14:02
Sometimes we don't have the information about the number of attendees. Sometimes we have the date. We have the date every time, but not really formatted the same way. So I took this data and said, okay, I will extract all the paragraphs about each conference
14:21
and send it, yielding it to the next one. The detail of this code is not really important. Then I transform that using a few regexes to find the location, the number of attendees, if it's there, et cetera, and create a dictionary from that. And then I made a little helper function
14:40
called arg0_to_kwargs that changes the formatting of an input/output, but that's not really important yet. So here I create a graph the same way, and I'm using a built-in, which is PrettyPrinter, which is better than print for printing. And if I run that, yes, I need to run, actually.
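As a rough illustration of these two steps, here is a standard-library sketch of a regex-based transform that builds a dictionary, followed by a helper that turns that single dictionary argument into keyword arguments. Everything in it is an assumption: the sample paragraph is made up (the real europython.txt is not shown in the transcript), and `call_with_kwargs` only mimics the idea behind the helper mentioned in the talk, it is not Bonobo's implementation.

```python
import re

# Made-up paragraph standing in for one entry of the text file.
SAMPLE = "ExampleConf 2099\nSomewhere, Nowhere\n123 attendees"


def parse_conference(paragraph):
    """Turn one paragraph into a dict; 'attendees' is set only when present."""
    lines = paragraph.strip().splitlines()
    record = {"name": lines[0], "location": lines[1] if len(lines) > 1 else None}
    match = re.search(r"(\d+)\s+attendees", paragraph)
    if match:
        record["attendees"] = int(match.group(1))
    return record


def call_with_kwargs(func):
    """Wrap func so a single dict argument is passed as keyword arguments."""
    def wrapper(record):
        return func(**record)
    return wrapper


def describe(name, location=None, attendees=None):
    """Example consumer that expects keyword arguments."""
    return f"{name} ({location}): {attendees} attendees"


record = parse_conference(SAMPLE)
print(call_with_kwargs(describe)(record))
```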
15:03
I will see the name, venue, date, attendees, attendees only if available. For example, it's not yet available for EuroPython 2017. Yeah, it does that. So there are a few changes I can make to this transformation
15:21
to make it a bit more useful, but just before I want to explain what's happening under the hood. And maybe full screen is better. So what's happening here? We created a graph instance, and the graph is really a list of edges and nodes, nothing fancy.
15:42
It can be represented graphically like this, but it's just two lists, in fact. And to prove that, I just removed all the code that, yes, is in use, but not really useful, and, yes, the graph definition is that. On first call, you have a shortcut that calls add_chain and adds the first chain you pass
16:02
to the constructor of the graph, but then you can call add_chain anytime you want, and you can specify different inputs because you don't want every first node to be at the beginning of the graph, but maybe to fork an existing chain. Then once you define the graph, you either run it using the bonobo.run function,
16:24
or you can run it using the CLI like we did before in the shell. And what happens is that it takes this graph we defined before. There is an executor strategy that adds a few things. Here it adds a global context, a context for each node, and a thread around each node.
16:42
It creates FIFO queues that are thread-safe queues, not Python built-ins, but from the standard library, that are used to buffer input and output between the different nodes. In fact, the context here
17:00
is only used to keep the transformation contextless and stateless because if we need to keep, for example, statistics or maybe instantiation of something we need during the time of the execution, we don't want to modify the object you provided to the graph. So it looks like that,
17:22
the global context. It creates a context for each node, just what I said before. The strategy relies on Python's ThreadPoolExecutor, from concurrent.futures, I think. And we just create a runner, which is just something that will run
17:42
every time it gets something in the input queue. It will run a node and push the outputs to the next queues. Then it does nothing, and when it's finished, it shuts down. Those are implementation details. You definitely don't need to know that to use Bonobo, but it's not that complex what happens under the hood.
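To make the description of the executor more concrete, here is a minimal sketch of the same idea using only the standard library: one thread per node, thread-safe FIFO queues between nodes, and a `ThreadPoolExecutor` to run them. It is not Bonobo's actual strategy code; `END`, `run_node` and `run_chain` are names invented for this sketch, and contexts, statistics and error reporting are left out.

```python
import queue
from concurrent.futures import ThreadPoolExecutor

END = object()  # sentinel that marks the end of the stream


def run_node(func, inbox, outbox):
    """Consume items from inbox, apply func, push results to outbox."""
    try:
        while True:
            item = inbox.get()
            if item is END:
                break
            outbox.put(func(item))
    finally:
        # Always tell the next node that nothing more will arrive.
        outbox.put(END)


def run_chain(source, *funcs):
    """Run funcs as a linear chain, one thread per node (needs at least one func)."""
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    with ThreadPoolExecutor(max_workers=len(funcs)) as pool:
        for i, func in enumerate(funcs):
            pool.submit(run_node, func, queues[i], queues[i + 1])
        for item in source:
            queues[0].put(item)
        queues[0].put(END)
    # Leaving the "with" block waits for every node to finish.
    results = []
    while True:
        item = queues[-1].get()
        if item is END:
            return results
        results.append(item)


print(run_chain(["foo", "bar", "baz"], str.upper, lambda s: s + "!"))
```

Because each node has its own thread and its own input queue, a slow node only delays the nodes after it, while earlier nodes keep filling its queue.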
18:05
What you can use as transformations in Bonobo is various things. You can use functions like we did before, mostly if you have one line of output for each line of input. You can use generators if, for each line of input,
18:22
you have zero, one, or more output lines. For example, it's very useful to implement joins, Cartesian products, or even something that yields or not. You can use iterators, which are not really callable, but it's handy to say, okay, I can have transformations that have no input.
18:43
It's why it's called extract here, that it has no input and yields a bunch of outputs. Of course, you can use everything that is callable in Python. I'm just trying to call it. If it's callable, then, yeah, it's probably a duck.
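The different kinds of transformations listed here can be compared side by side in plain Python. This sketch does not use Bonobo; `apply` is a hypothetical helper that normalises whatever a node returns into a list of outputs, only to show how a function, a generator and a callable object differ in the number of outputs they produce per input.

```python
import inspect


def double(x):
    """Plain function: exactly one output per input."""
    return x * 2


def repeat(x):
    """Generator: zero, one or many outputs per input."""
    for _ in range(x):
        yield x


class AddOffset:
    """Any object with __call__ works like a function, but can carry settings."""

    def __init__(self, offset):
        self.offset = offset

    def __call__(self, x):
        return x + self.offset


def apply(node, item):
    """Hypothetical helper: call a node and return its outputs as a list."""
    result = node(item)
    if inspect.isgenerator(result):
        return list(result)
    return [] if result is None else [result]


print(apply(double, 3))         # one output
print(apply(repeat, 2))         # several outputs
print(apply(repeat, 0))         # no output at all
print(apply(AddOffset(10), 1))  # callable object
```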
19:01
So that's the handmade way to do that. You can define the __call__ dunder and it will work, but there is a handier way to do that. There is a Configurable class in Bonobo that lets you use a few descriptors to specify what kind of options and dependencies
19:23
you will have in your transformation. Of course, for simple transformation, you won't use that, but if you need to configure the transformation, it's probably easier to use that. Here, we define an option called table name that we will use to query a database. It has a default, but you can override it.
19:41
We'll see after. We define a service that we call database, which defaults to database.default. It's a symbolic name that will point to something. We'll also see that later. Yes, whenever you want to instantiate this query database class, you can override the different values or not,
20:02
and there can be validation, but that's a detail. Much more interesting, there are services, and a service like the database service we declared here is basically saying, okay, my transformation will rely on a database, but I don't want yet to tell you what implementation I will use.
20:20
I just say it's probably called database.default. At run time, provide me something called database.default and I'll try to use it like I thought it would work. At run time, you can provide, via a get_services function, a simple dictionary that provides the implementations,
20:40
allowing, for example, to provide a different dictionary for tests and so you will be able to provide a my database test implementation or mock implementation instead of this one to test the transformation without testing the external dependencies and PostgreSQL and et cetera, et cetera. There is bananas with Bonobo.
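The service idea can be sketched without Bonobo as follows. This is a simplified stand-in, not the library's `Configurable` machinery: the transformation only stores a symbolic service name, and the actual implementation is looked up in a dictionary at run time, so a test can pass a fake. `FakeDatabase`, its `select_all` method, and the default table name `users` are all invented for the illustration.

```python
class QueryDatabase:
    """Transformation with one option (table_name) and one named dependency."""

    def __init__(self, table_name="users", database="database.default"):
        self.table_name = table_name
        self.database = database  # symbolic name, resolved at run time

    def __call__(self, services):
        db = services[self.database]
        yield from db.select_all(self.table_name)


class FakeDatabase:
    """In-memory stand-in used instead of a real database in tests."""

    def __init__(self, tables):
        self.tables = tables

    def select_all(self, table_name):
        return list(self.tables[table_name])


def get_services():
    """Return the dictionary mapping symbolic names to implementations."""
    return {"database.default": FakeDatabase({"users": [{"id": 1}], "orders": []})}


print(list(QueryDatabase()(get_services())))
```

Swapping the dictionary is all it takes to point the same transformation at another implementation.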
21:02
Not a lot for now. It's kind of a standard library. It allows you to read files, write files. We will use that just after. Nothing fancy here. There are a few tools to work with the lifecycle or to debug things. We used PrettyPrinter before. There are a few extensions and plugins
21:20
or plugins and extensions in the order of presentation. This projector has really good quality because we can see the thing. It was not the case last time. The console plugin will show in real time the input and output of each transformation. I'm apparently not able to draw a tree,
21:42
an ASCII art tree in the console, to show the graph like a git log would do, but if some of you know how to do that I would be really interested because it would be the same feature but nicer. And of course I'm using Python logging to do that. There is a Jupyter plugin.
22:00
I should have the time to show you that after. Everything that relies on bigger libraries or big dependencies are bundled eventually as extensions. There is for example the SQLAlchemy extension I'm starting to use which allows to work with SQL databases.
22:22
There is a Docker extension that adds a runc command to the Bonobo CLI and tries to do the exact same thing as bonobo run but within a container. Yes, there is a different repository called Bonobo Devkit that allows to work on different forks
22:41
of the project at the same time, probably more something for me or for anyone who wants to contribute, but it's useful too. And yes, we have time for more examples so I will show you a lot of things. So, first, what up.
23:04
So, I'll start again from the demo I did before and we'll try to show you how to use a service instead of directly opening a file how to write the result to a CSV and how to write to JSON. So, what did we have?
23:24
It was this one. Okay, so we were reading the europython.txt, transform, etc. So, what I want, instead of opening this file directly, is to use something that I will be able to switch from local file system to S3, etc. And there is a very good library
23:42
we are relying on in Bonobo, which is called fs, PyFilesystem2, that does exactly that. Yes, so we're just depending on that and it will be installed. So, I can do something that is @requires('fs'), for example.
24:04
I will import requires. So, it will be provided as a parameter
24:24
by the decorator just above and I can use fs like this, using fs.open instead of just opening the file. So, I did an fs service, maybe it's already defined, but by default it's the only one that I defined by default
24:42
but in fact, all of this is not really useful. So, we'll just say, okay, fs is bonobo.open_fs(), which by default will use the current directory as the root of the file system object. So, bonobo run first.py should do the exact same thing it did before
25:01
but we are no longer opening the file directly. Next step was writing to a CSV. So, that's a good occasion for me to show you how we fork the graph and not make something just linear. So, I will use the add_chain method
25:22
to add bonobo.CsvWriter with a file name, which will also use the file system service, the same service. So, we'll write to europython.csv, and to explain that I don't want it to be a new chain
25:43
that just takes an empty impulse at the beginning of the transformation, I need to say, okay, the input of this chain, like the node before the first node of this chain, is arg0_to_kwargs. About arg0_to_kwargs:
26:01
in fact the previous transformation, the transform function, was returning one argument containing a dictionary, and it just transforms this first argument, which is a dictionary, into keyword arguments. And by default, it can be overridden,
26:21
but by default all writers take keyword arguments as input, which is also why it's here, and it works better with PrettyPrinter too. So, that should work. It's here that I would really like the ASCII art tree, but there is a CSV writer
26:41
that ran at the same level as PrettyPrinter. It didn't take the output of PrettyPrinter because PrettyPrinter didn't have any output. It took the output of arg0_to_kwargs, and this output, which is here, went to PrettyPrinter and to CsvWriter at the same time. I don't know if I removed the file before the presentation.
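The forked graph described here, where one node's output feeds two sibling nodes, can be sketched in plain Python. This is only an illustration under assumptions, not Bonobo code: fan_out, pretty_print and write_csv are hypothetical helpers, the rows are sample data, and everything runs sequentially rather than in threads.

```python
import csv
import io


def extract():
    # Sample rows, invented for illustration.
    yield {"year": 2016, "city": "Bilbao"}
    yield {"year": 2017, "city": "Rimini"}


def fan_out(rows, *consumers):
    """Send every row to every consumer, so the consumers are siblings."""
    for row in rows:
        for consume in consumers:
            consume(**row)  # the dict is expanded into keyword arguments


printed = []
buffer = io.StringIO()
writer = csv.writer(buffer)


def pretty_print(**kwargs):
    # Collects a readable line per row; it produces no output for other nodes.
    printed.append(", ".join("{}={}".format(k, v) for k, v in sorted(kwargs.items())))


def write_csv(**kwargs):
    writer.writerow([kwargs["year"], kwargs["city"]])


fan_out(extract(), pretty_print, write_csv)
```

Neither consumer reads from the other: both receive the same keyword arguments, which mirrors the printer and the CSV writer sitting at the same level in the graph.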
27:01
I will remove it and run it again so you're sure I'm not lying. And if I open that, it should contain all the data but formatted as CSV, with maybe the number of attendees, like for last year, or maybe not if we don't know it from the EuroPython Society site.
27:21
Okay, now we'll do... So, the next task I had was to write to JSON, which will be really easy because the syntax is exactly the same. If I wanted to change the formatting a bit, there are advanced options that are not the same for CSV and JSON, obviously,
27:42
but if I just want to write to a file it's easy, if I can use my keyboard. Okay, so yeah, it's exactly the same but europython.json. So, I don't have any europython.json file here.
28:03
I run the thing and now I should have the file containing the same thing but formatted as JSON. Okay, so that's very basic. So, I tried to find other examples to show.
28:26
I looked yesterday, I think, for Rimini open data and yeah, Rimini has open data in fact. So, I don't speak very good Italian, so I understand absolutely nothing about what it was about, but I understand JSON
28:40
so I could extract things and just play a bit. I think I need to git checkout something, which is example three, and it's not really happy so I will force the checkout. Yeah, okay.
29:02
Don't look at my commit messages. So, I have a rimini.py which is a bit similar to what we did before, but here we require a service that we call http. If I open the services.py file
29:23
I will see that I just defined that http is requests. I could have used anything else, but that means that probably I will rely on anything that works like requests. So, I use this http service to http.get a URL
29:41
while I have a next URL, because each batch of 100 results, if there is a next page, says okay, the next URL is that. Because the web service is not very good, I need to substitute /node with /node.json, because otherwise it returns HTML,
30:01
and I iterate until there is no next URL. Then I still use arg0_to_kwargs and just write a JSON file with that. So, it will take a bit longer because, obviously, I can't do it in parallel: I need to have the result of the first request to know the next URL before I can do anything.
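The loop described above, following next links until there are none, looks roughly like the sketch below. It is written under assumptions: the response shape (a "results" list and a "next" URL), the example.org addresses and the fake_fetch function are all invented so that the example runs without network access. In the talk the fetching is done through the http service.

```python
def to_json_url(url):
    # Assumption from the talk: the JSON variant lives at /node.json.
    return url if "/node.json" in url else url.replace("/node", "/node.json", 1)


def iter_pages(fetch, url):
    """Yield results page by page, following 'next' links until there are none."""
    while url:
        page = fetch(to_json_url(url))
        yield from page["results"]
        # Each page is needed to learn the next URL, so this cannot be parallel.
        url = page.get("next")


# Hypothetical canned responses standing in for the real web service.
FAKE_PAGES = {
    "http://example.org/node.json?page=0": {
        "results": [1, 2],
        "next": "http://example.org/node?page=1",
    },
    "http://example.org/node.json?page=1": {"results": [3], "next": None},
}


def fake_fetch(url):
    return FAKE_PAGES[url]


items = list(iter_pages(fake_fetch, "http://example.org/node?page=0"))
```

Because iter_pages is a generator, downstream steps can start working on the first batch while later pages have not been requested yet.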
30:21
So, I should have run it before I started to say that. But yeah, 100 by 100 there will be an extraction from HTTP, then arg0_to_kwargs is basically instant and JsonWriter is nearly instant too.
30:40
It's a bit different to use JsonWriter than to just JSON encode the whole thing, because I will just encode each line independently to avoid having to buffer everything. So, really what it needs is only one line at a time, and it will write just a few bytes every time in the file.
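The difference between encoding everything at once and encoding one line at a time can be sketched as follows. This is a plain-Python illustration, not the writer's actual code: write_json_stream is a hypothetical helper that emits a valid JSON array while holding only one row in memory.

```python
import io
import json


def write_json_stream(rows, stream):
    """Write rows as one JSON array, encoding a single row at a time."""
    stream.write("[")
    for index, row in enumerate(rows):
        if index:
            stream.write(",\n")  # separator goes before every row but the first
        stream.write(json.dumps(row))
    stream.write("]")


out = io.StringIO()
# A generator is passed in, so the rows are never collected into a list.
write_json_stream(({"n": i} for i in range(3)), out)
```

Calling json.dumps on the full list would produce equivalent JSON, but only after every row has been gathered in memory first.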
31:01
So, I guess it's about the elections in Rimini. We don't see a lot, but we just aggregated all pages from this REST API. Not REST, just an API. There are things I don't understand.
31:21
There are users, maybe the person that created the item on the website I don't know. And there are different districts so I guess that's related to where you have to go to vote when there is an election. But that's really a wild guess. That's Rimini. But for something that we all understand
31:41
and which is in English, I will show how to extract all the public EuroPython attendees in a notebook. Of course, you can say it's doable in other ways, but it's just for the example. So, I will write a Jupyter notebook. I guess everybody knows about Jupyter notebooks here.
32:10
So, here I have an attendee.json which should not exist. And I have this, so I will restart and clear output
32:23
just to be sure. Mostly to clear output, in fact. So, I'm using Selenium here which is basically something to control a browser if you don't know about it. It's not new, it's a very old library. And I have a few wrappers but really it doesn't contain much code. So, I say, okay, there is a Who's Coming page
32:43
on the EuroPython website. I need to implement a browser service, which is using bonobo_selenium.create_browser. You could just create a Selenium browser directly. And I open a file system.
33:00
Okay, so I'm a bit short on time so I will just execute the graph. Execute the graph building step. I have a graph, great. And then I will just use bonobo.run, and so there should be a Firefox window opening on the Who's Coming page that will scroll down. Every time the infinite scroll is not done
33:22
it will go on as long as it gets elements. And when it can't get elements anymore it will try to bounce top, bottom once, just to see if it's not some JavaScript that doesn't work or some lag. And if really we didn't get any new data it will exit after a few seconds.
33:42
I think there is something like 350 public attendees announced on the website. Of course, a lot have made it private and they're right, because there are people like me who crawl the thing. But I won't do anything with the data. It's just an example for here.
34:01
So probably here it was bouncing. Probably Lucas is the last one. Maybe it already bounced. And the plus sign here just changed to a minus to say that this transformation has finished. The other one is instant, so we should have a JSON with all the attendees now.
34:22
Yeah. And we know if it's a speaker. We know its tagline if you put one. Here there is no tagline. Here there is a tagline, et cetera, et cetera. So with the very few minutes left I will skip the CRAN example and go to the end.
34:43
So yes, Bonobo is a very young library. Six months old is not a lot, definitely. I'm trying to work as hard as I can but I'm not superhuman so it's not enough, of course. But yeah, I'm really excited by this because it's something I'm using every time
35:02
for everything, in fact. And I'd really like to get to 1.0 either the end of the year or early next year. And 1.0 mostly means for me a stable API you can rely on that is fully documented and fully tested, et cetera. It's already fully tested and a bit documented but I need much more.
35:21
Python 3.5 is a personal choice. I started this year. I don't want Python 2 anymore, so I'm trying to push as much as I can to only use Python 3.5. And there is some really handy syntax to work with data. I don't know what it's called, but the star star operator within a dictionary, to expand a dictionary instead of updating things in place
35:43
using .update() on dictionaries, is really, really awesome. We still have a global interpreter lock, of course. But maybe we will overcome this limitation of running on only one core using different strategies. For now it's the threading strategy by default, so we have the GIL.
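The star star syntax mentioned a moment ago is the dictionary unpacking available in dict displays since Python 3.5. A short sketch of the difference with .update(), using an invented sample row:

```python
row = {"name": "Ada", "speaker": True}  # sample data for illustration

# Expansion builds a new dictionary; the original row is left untouched.
enriched = {**row, "tagline": "hello"}

# The in-place alternative mutates the dictionary it is called on,
# so a copy is made first to keep the original row intact.
mutated = dict(row)
mutated.update({"tagline": "hello"})
```

In a pipeline where the same row may be delivered to several downstream nodes, building a new dictionary avoids one node silently changing what its siblings see.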
36:00
Maybe not a problem if you're IO bound, but if you're CPU bound it can be a problem. But yeah, probably process pool strategies, maybe a dask.distributed strategy; with everything like this we can try to limit a bit the trouble the GIL is bringing us. So 1.0 will stay, of course, 100% open source.
36:23
It's Apache licensed. I want a very light library. Of course it should do the basic things like CSV, etc. Most file formats and tools should be included, but all things containing dependencies and complex things should either be implemented by the user
36:40
or go to extensions. It's small scale. The goal is one minute to install, easy to deploy. It's not, once again, not big data, not statistics, not analytics. If you want to do blockchain with that you're probably not in a good conference. And it's basically lean manufacturing for data.
37:02
It's like I have a production chain where I use all little packets of data and one at a time I'm adding something, checking something, modifying something, etc. I'll skip on that but the Internet is completely crazy.
37:20
Like I can use this more concerned about me knowing about the actual taxonomy of monkeys and apes and primates. So I really like the last one that says Python not only has duck typing, it has the little non-primate typing feature
37:41
and yeah, this one saved my life. Not really but it was really funny. I'd really like it to become data processing for humans. Of course there is a lot more to do. You can read more on the website bonobo-project.org. You can read more in the documentation. You will find a link in the website.
38:01
There is a Slack channel. You can discuss, come. It's really open. And there is a GitHub. You will also find the link on the website. Yeah, one more thing because I've finished. I will try to organize a sprint whether you want to come or not. It's not a problem but you should really consider to go to whatever sprint.
38:20
It's really amazing at EuroPython to just code on a project by guys. Last year I did pytest, which was a really great way to learn. Come, of course, code on Bonobo, but if you don't, come to other sprints; it's a really, really, really great thing. Just before we take a few minutes for questions
38:40
just before that it would be really great if you could give me a really fast few lines of feedback as you think it. I really need raw feedback on this URL. I have a little form and yeah that would be really, really, really great for me. Thank you very much and if we still have a bit of time let's try to answer questions if you have some.
39:02
Thank you very much. So big applause, very interesting and I think there are many questions at least I have many questions but I would give you... It's always good to ask questions
39:21
in the back of the room because it's good for the health of the organizer. Hello, thank you very much for your great talk and working on this project. There's one question that caught my eye. When I work with ETL something can go wrong
39:40
especially with URLs and the web and stuff. How would I, in this framework, deal with that? Okay, so the question is about error handling. For now what's happening in the framework is that on each line it's calling the function of the node one time
40:03
so there are two possibilities. There are errors that I call unrecoverable. It's not really easy, I should find a shorter word. So there are unrecoverable errors that will just stop the graph and raise the exception, so I can't just run the graph
40:21
so you, the developer, should fix that. There are also recoverable errors, which are errors that happen only on one line of data or a few lines of data, so there is a default error handler, which you can override, that will just use the console to show the errors
40:41
and just skip to the next line of data. But if you want to handle it differently you could override this handle error thing and just do whatever you want. Probably there will be things like Sentry plugins or things like this, but for now it's really not a priority and it should be just a few lines to implement that.
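The two kinds of errors in this answer can be sketched in plain Python. This is an illustration under assumptions, not the framework's code: UnrecoverableError, run_node and default_handle_error are hypothetical names chosen for the example.

```python
class UnrecoverableError(Exception):
    """Hypothetical marker for errors that should stop the whole run."""


def default_handle_error(exc, row):
    # Default behaviour in this sketch: report on the console.
    print("skipped {!r}: {}".format(row, exc))


def run_node(func, rows, handle_error=default_handle_error):
    """Call func once per row; recoverable errors skip only the failing row."""
    results = []
    for row in rows:
        try:
            results.append(func(row))
        except UnrecoverableError:
            raise  # stop everything; the developer has to fix this
        except Exception as exc:
            handle_error(exc, row)  # report, then move on to the next row
    return results


seen = []
results = run_node(int, ["1", "oops", "3"], handle_error=lambda exc, row: seen.append(row))
```

Passing a different handle_error callable is where something like an error reporting service could be plugged in.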
41:00
Another question? Okay, then maybe I have a question. Do you know Kafka Streams? It somehow reminds me of that, just that Kafka Streams is meant for being distributed on a cluster, running big data on Kafka queues, but you also have queues
41:21
distributing the tasks to threads, I think. So more or less it looks like the same architecture, doesn't it? Yeah, so I'm not at all familiar with Kafka, but here we are talking of queue instances that are queue.Queue, Python queues within one process.
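Threads connected by queue.Queue objects inside a single process can be sketched with the standard library alone. The following is a minimal illustration, not Bonobo's execution strategy: the producer and doubler functions and the END sentinel are assumptions made for the example.

```python
import queue
import threading

END = object()  # sentinel marking the end of the stream


def producer(out_q):
    for i in range(5):
        out_q.put(i)
    out_q.put(END)


def doubler(in_q, out_q):
    while True:
        item = in_q.get()  # blocks until the upstream thread provides a value
        if item is END:
            out_q.put(END)  # propagate the end marker downstream
            break
        out_q.put(item * 2)


q1, q2 = queue.Queue(), queue.Queue()
threads = [
    threading.Thread(target=producer, args=(q1,)),
    threading.Thread(target=doubler, args=(q1, q2)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Drain the last queue in the main thread.
results = []
while True:
    item = q2.get()
    if item is END:
        break
    results.append(item)
```

The queues are first-in-first-out, so order is preserved along a single chain, and nothing here leaves the process, unlike a message bus between servers.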
41:40
I guess that Kafka queues, but maybe I'm wrong because I don't know, are some kind of message bus that is able to pass messages from one server to another, because it's an architecture designed for big data first. Here, what I really want to solve is the problem that, hey, I need to transform data right now, so let's install it and code something.
42:02
So it's intra-process queues. Probably it's a bit similar because you have first-in-first-out messages, so at one point you need a queue, but it's designed for a different scale of architecture first. Of course, tomorrow if I can do the same
42:20
with dask.distributed, for example, there will be the same kind of queues, but not the exact same type of object, that will be able to pass messages from one server to another. But then you have to think of funny problems, like how do you optimize the topology of your graph to group the nearby transformations on only one server
42:41
and maybe they don't cost the same, so how do you balance the number of transformations on each server? It's not really easy, and I'm pretty sure that if you have data of this scale you can definitely afford to install a big data infrastructure and use either Kafka or Hadoop
43:01
or PySpark for example or things like this. Thank you. Another question? No one? Okay. So everyone is hungry. So big thanks again. Thank you.