From Zero to Portability
Formal Metadata
Title: From Zero to Portability
Number of Parts: 561
License: CC Attribution 2.0 Belgium: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/44282 (DOI)
Transcript: English (auto-generated)
01:13
Hello, so yeah, my name is Max, and I'm a software engineer working on the Beam project, and
01:20
I want to tell you about Beam today and how Beam realized its vision of portability. What do I mean by portability? Portability can mean a lot of things. The short answer is that it enables you to run your data processing jobs on top of various
01:41
execution engines like Spark, Flink, Samza, or Google Cloud Dataflow, and you can do that in the programming language of your choice. That sounds pretty good, doesn't it? So I've put this agenda together: first of all, some of you might already know Beam, but I will give a short introduction. Then we will talk a little bit more about portability, and
02:03
then about how we can actually achieve it, because there are multiple ways to do that. And then let's recap and see how far we actually are with portability. So what is Beam? First of all, Beam is an open source project at the Apache Software Foundation. I don't know if you know the Apache Software Foundation, but it's like a framework for developing open source software,
02:26
in which they provide infrastructure and a kind of guide for how to develop software in the open. Beam is a project there, and it focuses on parallel and distributed data processing. You typically run your Beam job on multiple machines, because you probably have
02:44
mostly large data, but you can also run it on a single machine if you want. It has a really cool API which can do batch and stream processing at the same time. Often you have separate batch and streaming APIs and you have to port your batch job over to streaming, but in Beam it's all the same.
03:02
And once you've written your job, you can actually run it on multiple execution engines. That's why we sometimes say it's like an uber-API: you use one API, but you can execute with multiple backends or execution engines. And now you can also use your favorite programming language.
03:21
A little bit more detail on this. This is the vision of Beam: we have the SDKs here on the left side, that's Java, Go, Python, Scala, and SQL, and then we have some magic happening in Beam, which is the runners. There's a runner for every execution backend, and the runner translates the Beam job from the SDK into
03:47
the language of the execution engine. You can see a bunch of them there, and more and more are coming. Yeah, it's really nice to have that choice, right?
04:00
So how does the API work, just concept-wise? In Beam there are PCollections. First of all, there's the pipeline: the pipeline is the object that holds all your job information, and you create it from some options which you can pass in. Then you create PCollections. PCollections are created by
04:23
applying transforms to the pipeline, so you always apply transforms. It's really easy, and you can chain multiple transforms after each other, or you can also branch, like here where you create PCollection 2 as a branch off PCollection 1, as in the sketch below.
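To make this concrete, here is a minimal sketch using the Python SDK of creating a pipeline from options, chaining transforms, and branching off the same PCollection; the step names and data are made up for illustration.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Create the pipeline from options; it holds all the job information.
p = beam.Pipeline(options=PipelineOptions())

# PCollections are created by applying transforms to the pipeline.
pcoll_1 = p | 'Read' >> beam.Create(['to', 'be', 'or', 'not', 'to', 'be'])

# Chain transforms after each other...
upper = pcoll_1 | 'Upper' >> beam.Map(str.upper)

# ...or branch: a second transform applied to the same PCollection.
lengths = pcoll_1 | 'Lengths' >> beam.Map(len)

# Run the pipeline.
p.run()
```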
04:41
And then you can run that pipeline. That's pretty sweet. Transforms are actually quite a nice abstraction, because transforms can be either primitive or composite. What does that mean? In Beam we only have a few primitive transforms: ParDo, GroupByKey, assign windows, and Flatten.
05:03
I will explain two of them in a bit, but basically what that means is that you can define composite transforms which use these, and the composite transforms are expanded into these primitive ones. Which is really convenient, because as a runner author
05:21
you just need to implement those four primitive transforms. We can do optimizations for composite transforms, but it's enough to implement the primitive ones. So of course, because this is a big data framework, we have to do a little word count. For those of you who don't know word count: you have a list of words, like 'to be or not to
05:43
be', and you try to count how often each distinct word appears in that list. The way to do that in Beam is to use a ParDo, which stands for 'parallel do', and you would transform your words into key-value objects, with the word as the key and a one as the value,
06:03
which stands for the number of occurrences. Then you do a GroupByKey, which basically shuffles the data and gives you, for every distinct key, the list of all its values. Then you can sum them up, and you know that 'to' appears twice in this list, and 'be' as well, and the others just once.
06:23
Don't get confused now; this looks really ugly, but it's actually how you would write it in Beam, and we will see that we can simplify it a lot. We have the pipeline we created, we have our list of words, in this case 'hello, hello, FOSDEM', and we have this first ParDo which assigns the ones.
06:45
Then we do a GroupByKey, and then we have this loop here in the second ParDo which sums it all up. Yeah, that was pretty ugly, I agree, and not very comprehensible; a rough sketch of this raw version follows below.
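The slide shows this raw version in Java; a rough Python equivalent, built only from the primitive transforms ParDo and GroupByKey, might look like this (a sketch, not the exact code from the talk).

```python
import apache_beam as beam

class AssignOne(beam.DoFn):
    # First ParDo: turn each word into a (word, 1) key-value pair.
    def process(self, word):
        yield (word, 1)

class SumPerKey(beam.DoFn):
    # Second ParDo: after the shuffle, loop over the grouped ones
    # and sum them up for each distinct word.
    def process(self, element):
        word, ones = element
        total = 0
        for one in ones:
            total += one
        yield (word, total)

with beam.Pipeline() as p:
    (p
     | beam.Create(['hello', 'hello', 'FOSDEM'])
     | beam.ParDo(AssignOne())
     | beam.GroupByKey()   # shuffle: gives (word, [1, 1, ...])
     | beam.ParDo(SumPerKey())
     | beam.Map(print))
```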
07:00
Luckily, we have composite transforms, so we can simplify this further. Instead of the first ParDo with its DoFn we just use a MapElements function, which is somewhat simpler, and we use an integersPerKey composite transform,
07:21
which basically sums up the values, the numbers of occurrences, for each key. And we can simplify this even further by just using the Count.perElement composite transform. That looks pretty simple, right? There are a lot of these transforms in Beam, and if you read the documentation,
07:44
you can write really readable code, even in Java, because that was the Java API. Fortunately, we also have a Python API, which looks so much nicer. Here this would be the same initial example; we just use lambda functions
08:01
to do that word count. In Python we of course also have these composite transforms, so this is maybe slightly simpler, where we have the CombinePerKey transform and we pass sum as an argument. This was just a very quick look into the Beam API; I thought it would be useful. Both simplified variants are sketched below.
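Both Python variants mentioned above might look roughly like this, first mapping to (word, 1) pairs and combining per key with sum, then using the Count.PerElement combiner as a single composite transform (a sketch over the same toy data).

```python
import apache_beam as beam

with beam.Pipeline() as p:
    words = p | beam.Create(['hello', 'hello', 'FOSDEM'])

    # Variant 1: map to (word, 1), then a composite combine per key.
    counts_1 = (words
                | 'One' >> beam.Map(lambda w: (w, 1))
                | 'Sum' >> beam.CombinePerKey(sum))

    # Variant 2: a single composite transform that counts elements.
    counts_2 = words | beam.combiners.Count.PerElement()

    counts_2 | beam.Map(print)
```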
08:24
There are lots more composite transforms, you can create your own, and we have lots of IO. We have windowing, event time, watermarks, side inputs, state and timers, which maybe doesn't mean much to you at the moment if you haven't tried it, but these are really useful concepts once you learn more about Beam and your pipeline gets more complicated.
08:44
So what does portability mean now? I showed you Java, I showed you Python, so that should already be working, right? Let's first see what the two different kinds of portability are in the Beam context.
09:01
You have engine portability, which is the ability to run a job on different execution engines, and we have language portability, which is using different SDKs for composing the pipeline. If we look back at the vision which I showed you at the beginning, this is really how it should work, right? And
09:24
in terms of engine portability it is actually true. In the Java API these options would be passed to the pipeline; we just set the runner to FlinkRunner, then we call run, and it really runs on Flink. That's pretty amazing, so we have that part covered already; a sketch of the idea follows.
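The talk shows this with the Java API; here is the same idea sketched in Python, where the execution engine is chosen purely through pipeline options. The runner names are those used in Beam's documentation; actually running this of course requires the corresponding cluster.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# The same pipeline code runs on different engines just by changing
# the runner option, e.g. DirectRunner, FlinkRunner, SparkRunner,
# or DataflowRunner.
options = PipelineOptions(['--runner=FlinkRunner'])

with beam.Pipeline(options=options) as p:
    p | beam.Create(['engine', 'portability']) | beam.Map(print)
```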
09:44
Now what about language portability? Why would we use other languages? Well, it's kind of clear, I guess. Syntax, expressiveness, and communities are a big point, because a lot of people simply don't like Java, for various reasons,
10:01
which I can understand; I actually really like Java, but it's okay. We also have libraries as an important factor: things like TensorFlow are really huge libraries which are simply not available in Java, so that's a good reason to use Python. So, I was actually lying a bit to you:
10:22
this whole language portability didn't really work. It used to be the case that we basically only supported Java and Scala in the open source world, and when you used Google Cloud you could run Python, which is
10:43
not so cool, right? It kind of breaks the promise. So what we needed, and what we worked on in the past almost two years, is to build a language portability framework into Beam and its runners, so that we actually
11:01
can realize the full vision. So how do we achieve it? If we look at the very abstract translation process of a pipeline, it used to be like this: we had Java
11:23
and then a bunch of runners, and they all executed in Java. So each of them needed to implement its own translation, but once the job was translated it was fine. Now that we have language portability, it seems like a, well, maybe not very good idea, but it's certainly possible, to just let every
11:47
SDK figure out a way to translate to every execution engine, where each execution engine has its own various ways of supporting that language. But that just seems like a terrible idea: very complicated and
12:01
replicating a lot of work. So what we did is introduce the Runner API, which takes the pipeline from the SDK and transforms it into a language-agnostic format; that's called the Runner API. It's based on protobuf, but that doesn't really matter; it's just a format that is
12:23
consistent across languages. Then what we also needed is something for execution time. We have these language-dependent parts: the execution engines, most of them, actually all of them, are written in Java, so when you have Python you need to figure out a way to send data to that Python process, to access state, and all that. This is called the
12:47
Fn API. That way we pretty much only have these two extra layers; we just have to make sure the runners are compatible with them, and then we're good to go.
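As a small illustration of what the Runner API means in practice: the Python SDK can serialize any pipeline into that shared protobuf representation. A minimal sketch (the proto internals are normally hidden from users):

```python
import apache_beam as beam

p = beam.Pipeline()
_ = p | beam.Create(['a', 'b']) | beam.Map(str.upper)

# Convert the SDK-specific pipeline object into the
# language-agnostic Runner API protobuf message.
proto = p.to_runner_api()

# Every transform, coder, and environment is now plain data that
# any runner, in any language, can read.
print(proto.components.transforms.keys())
```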
13:02
Let me simplify this a lot. The old way was: we have the SDK and the runner, and we have for example an execution engine like Flink with a bunch of tasks, and all of these were in Java. That worked pretty well. The new way is a bit different.
13:22
In the new way we have the SDK here, which uses the Runner API to produce this universal pipeline format, and then we actually have the Job API, which is a way to send this pipeline to the job server. The job server is really a Beam concept now.
13:45
It used to be that every runner, every execution engine, had its own way of submitting applications, but we wanted to really get everything portable. So we created the job server, and in the job server the runner translates this Runner API pipeline and executes it on the engine of your choice; submitting to a job server is sketched below.
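Concretely, in the Python SDK talking to a job server is again just a matter of pipeline options. A sketch, assuming a Flink job server is already listening on localhost:8099; the endpoint and pipeline contents are illustrative.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# The PortableRunner speaks the Job API: it sends the Runner API
# pipeline to a job server, which translates and executes it.
options = PipelineOptions([
    '--runner=PortableRunner',
    '--job_endpoint=localhost:8099',  # e.g. a Flink job server
])

with beam.Pipeline(options=options) as p:
    p | beam.Create(['portable', 'job']) | beam.Map(print)
```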
14:02
But of course we have these Python blobs or Go blobs in between, which the runner doesn't really understand, and whenever we have that, we have a special task called an executable stage,
14:21
which is a fancy name for 'we don't know what to do with this, so we have to send it to an external process'. That external process is called the SDK harness, and a harness exists for every language: Java, Python, and Go.
14:41
We create the harness when we start a job, with the Python code for instance, and then whenever we receive data in that task, we send it to the external process; the external process does its processing and sends the result back. That's very simplified, and there are some challenges to it, because
15:01
there is not a huge cost, but there is some cost when you send data to an external process, right? You need to serialize that data and deserialize it again. So we built in an optimization called fusion, which tries to combine as many of these Python stages as possible into one SDK harness, so we
15:22
don't do duplicate serialization work. How does the SDK harness work? First of all, the SDK harness needs to be bootstrapped somehow, right? What we typically do is use Docker, so we have an environment which contains all the dependencies, like my TensorFlow or my NumPy, and
15:45
we just use this Docker image directly; we can specify that in the options. That's a really easy way of deploying, because you have an image registry, and you just download the image automatically and start it. But some people don't want to use Docker, because
16:01
of various reasons. So you can also use a process-based execution, but then you have to make sure you set up the environment manually. And it's also possible to run the harness embedded, in case you are using Java. The three variants map to options roughly as sketched below.
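A sketch of how these deployment choices surface as SDK harness environment options; the flag values come from Beam's portable runner options in later releases, the Docker image name is illustrative, and showing the embedded variant as LOOPBACK is an assumption about the closest matching option.

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Docker (the typical case): the harness runs in a container image
# that bundles all dependencies and is pulled from a registry.
docker = PipelineOptions([
    '--environment_type=DOCKER',
    '--environment_config=apache/beam_python3.7_sdk',  # illustrative image
])

# Process-based: no Docker, but you must prepare the environment
# (interpreter, dependencies) on the workers yourself; an extra
# environment_config describes how to launch the harness process.
process = PipelineOptions(['--environment_type=PROCESS'])

# Embedded/loopback: run the harness inside an existing process,
# e.g. when the pipeline itself is Java, or for local debugging.
embedded = PipelineOptions(['--environment_type=LOOPBACK'])
```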
16:20
There's a lot happening, with a lot of communication between the backend and the SDK harness. Obviously we need control, so we have a control plane and a data plane; we have a way to access state and to report progress, and also logging. Everything is logged, so you actually know what is happening inside the external process, because otherwise debugging it would be really hard.
16:47
So what is now missing, and kind of a problem: an SDK is only complete if you can read and write data, right? Because
17:03
it's not really worth anything just to support all the primitive transforms; we also have to make the connectors which we have in Java available in every SDK, and you can see there are a lot of them available. Now, it would be quite a lot of work to replicate them, and the language support differs.
17:24
For example, when you want to create a Kafka connector, the library support in Python is not so good, while in Java it's really good. So ideally we would just use the Java connector from Python and not recreate it in Python.
17:42
It turns out we can actually do that, and that's a pretty amazing solution: we can simply use the process which I've described to run cross-language pipelines. So how does it work, theoretically? We're finalizing the specification at the moment, but it's sort of like this: you have a Python job, and it's probably not going to be named 'IO expansion',
18:03
but it's kind of a placeholder object where you specify your IO, like KafkaIO, or maybe the full Java name (it will be made a bit simpler), and you pass in some configuration. Then, of course, Python doesn't understand this, but when we do the translation to the Runner API,
18:22
we actually have an expansion service running, a Java expansion service in this case, and we take this placeholder and expand it into a native Java Kafka transform.
18:41
Then we do the rest of the translation, and when the job runs we actually have two different kinds of SDK harnesses running: a Java one for our Kafka source, and then maybe some Python data processing afterwards, where we do some map and count. And we of course also have the native transforms of Flink, or whatever
19:05
execution engine you're using, like a GroupByKey, which just doesn't need an SDK harness or anything, because it's supported by the execution engine. So this is sort of how portability works; there are a lot of details, of course, but it's a 20-minute talk. A sketch of the cross-language idea follows.
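The exact API was still being specified at the time of this talk; in later Beam releases the cross-language Kafka source landed in Python roughly as below, with the transform expanded by a Java expansion service behind the scenes (a sketch; broker and topic names are made up).

```python
import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka

with beam.Pipeline() as p:
    # A Python placeholder that the Runner API translation expands,
    # via the Java expansion service, into the native Java KafkaIO.
    messages = p | ReadFromKafka(
        consumer_config={'bootstrap.servers': 'localhost:9092'},
        topics=['my_topic'])

    # Downstream Python processing runs in a Python SDK harness,
    # while the Kafka source runs in a Java SDK harness.
    messages | beam.Map(print)
```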
19:24
So how far are we? We have engine portability, and we have language portability almost, I would say. For developers: you can try it out yourself, and I have a link for you at the end.
19:42
You can try it out, it works; we just have to make it a bit better. We have to tune the performance a bit, although we have estimated only five to ten percent overhead in most cases. And cross-language pipeline support needs a bit more specification, but that's going to happen in the next weeks.
20:01
There's also this fancy thing called Splittable DoFn, which you can read up on, but that's not so important here. There's a compatibility matrix which tracks the portability status of all runners; there's a link here, and Flink is actually the best runner, I would say, because it supports the most features at the moment. The others are going to catch up.
20:25
That brings me to the end of my talk. Please check out the portability website, or just go to the normal Beam website if you want to learn more about Beam. We have mailing lists and an awesome Slack channel, where there is a lot of help for people.
20:41
Yeah, and that's it. Thank you.
21:12
Compile to what, sorry? Common bytecode. Yeah, so the question is why not use something like Apache
21:22
TinkerPop, which uses a common intermediate format between the languages, something like bytecode, which can then be executed. There are a lot of other frameworks which do that; for example, Flink has a Python API which uses Jython, which is sort of the same idea: you can generate bytecode from Python. But
21:43
we want to be able to support all kinds of libraries, like TensorFlow, which is a native C library, and that you can only achieve if you run a CPython interpreter and not some custom version of Python which only supports a subset of Python. That's the reason. So yeah, I'll repeat the question: how is the debugging experience with these
22:31
Python libraries? When you run into an error in Python, how fast do you see it when you execute on, well, essentially a Java runtime? It's actually pretty good, and it's been part of the design.
22:43
When you get an exception in Python, it will be forwarded directly to the Java operator, which will catch the error there, and thanks to the logging and so on you actually see immediately what happened. The error is also sent back,
23:02
so you see the error message immediately there, and your pipeline will fail, because if the runner receives a failure, it should fail. Yeah. Good question.
23:21
So the question is: Python 3, is it supported or not? It is supported, but it is like 99% done: it is there, you can use it, there are test cases and everything; it's just not officially released yet. I'm not working on the Python side myself, so
23:42
I expect it to be done in the 2.11 release, which is the next Beam release; it should be out next month. Yeah.