
Cloud-native ETL with Java Quarkus, Kubernetes, and Jib Container Builder


Formal Metadata

Title
Cloud-native ETL with Java Quarkus, Kubernetes, and Jib Container Builder
Title of Series
Number of Parts
56
Author
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
DataCater unlocks more value from organizations' data, faster. This talk walks you through our stack, architecture, and processes. We develop tools to deploy and run data-driven applications in a cloud-native environment. We will give a whirlwind tour of developing a Java Quarkus application, a CI/CD stack powered by GitHub Actions and ArgoCD, and building and deploying containerized Kafka Streams applications at runtime with the Jib container builder. Having established this common understanding, we will give a high-level overview of how we use modern Kubernetes and cloud tooling to manage multiple clusters across different organizations together with our customers.
Transcript: English (auto-generated)
So, what are we going to talk about, and who am I? I'm Hakan, I've been a software and data engineer for the last six or seven years, and right now I'm CTO of DataCater. We will first talk about the problem description, why I'm even giving this talk, then take cloud-native principles and apply them to ETL, where we believe the way forward for ETL lies, and then look at some code snippets showing what an ETL pipeline running in a cloud-native environment, in this case Kubernetes, would actually look like. So basically, the software industry has gone from dev and ops to a culture of DevOps, in the sense of "you build it, you run it".
We shift the responsibility left, but we provide the tools to actually operate the things that we build ourselves, and one of the mechanisms that came out of that is GitOps: infrastructure as code, where something changes in one of your deployable branches and you push it into one of your stages. This doesn't really exist for ETL yet. We are seeing Spark operators and these kinds of things, so we are getting there, but we believe there is more to be done to actually get there. ETL tooling right now differs severely by stage: dev looks completely different from test and prod, mostly because the data loads are quite different, and the sample data content might also be very different, so there is no real GitOps possible today. Scalability properties have to live in the same code base as your business logic, so in a Spark job you need to take care of how you can and want to parallelize, wide versus narrow transformations. And what happens is that the people operating your Spark cluster, for example on Kubernetes, write their manifests, but those don't really come together with your Spark code base.
So just to be clear, whenever I talk about ETL or data pipelines, in a nutshell, this is it: we have different data sources, and often what we see in the world is that we want to put that data into a data warehouse or an analytics database, or maybe into another application database to provide a REST API. This is one of the most common setups we see right now. Let's talk about CI/CD and what GitOps actually is. We have these stages; for us, we use GitHub, so we have GitHub Actions for building, testing, and merging things to our deployable branch, and then releasing artifacts. We use ArgoCD to take these artifacts into our Kubernetes manifests and actually deploy them to our production or staging cluster.
With this whole picture, there is one major question: where is the state? Where is my code? What state is my Kubernetes cluster in? And do I actually have the artifacts that I need for deployment? Looking at this, we can see that for build, test, and merge, everything usually is in your source control management system. For us, again, that's GitHub, so there you have your code, and usually your CI pipeline pushes something like a tag or pins a hash, so you know: my release artifact was built from exactly this Git hash. So the state for this whole workflow, for all the compute in CI/CD, is in your source control management; the state for your artifacts is in your artifact registry, which could be any cloud provider or JFrog; and when you deploy artifacts, tools like ArgoCD actually put metadata onto your Kubernetes resources and then exchange information to see: when was the last update? Which Git hash did I use for that? Do I need to upgrade everything or only parts of the system? So state is completely externalized. I think this is quite suitable for ETL as well, and
basically I'm here to gain allies. So, some cloud-native principles. As we also just saw with Kafka Streams, we try to have scalable applications which can be auto-scaled, for example with a horizontal pod autoscaler in Kubernetes. What is great for ETL is image immutability, because if you apply transformations, especially in a regulated environment, you want to be sure that you can tell the regulator: this is the process that transformed my data at time XYZ, and that's why my model came out the way it did. And what would be great is if we could put this into a declarative description. YAML is our enemy and friend in the cloud-native world, and a very simple pipeline that we envision would have a couple of filters and then your transformation steps described, and ideally some sort of way to have user-defined transformations. So how do we get there? One way would be to externalize your state to an event streaming system like Apache Kafka; there are multiple systems out there right now with a Kafka-compatible API. So we can actually take the same approach with this as CI/CD does. With event streaming, one of the biggest advantages we get for ETL is that once we acknowledge a message we can store the so-called offset in Apache Kafka, so even if we fail we know where to pick up the process again. Ideally you only declare your computations, and your pipeline immediately knows where it left off when it failed or restarted, because all the state is in Apache Kafka. So where we think cloud-native ETL is going is having multiple front ends.
This might be YAML, an API, or a UI on top of that, and then some sort of pipeline compiler. That could be DataCater, but we see others emerging as well. What then falls out is, for example, a Kafka Streams app, but it could also be MQTT or AMQP. And currently we also see Kinesis and Google Pub/Sub implementations out there, so you can essentially bring your own streaming or event streaming system.
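To make that concrete, here is a minimal sketch of the kind of Kafka Streams application such a pipeline compiler might emit; the topic names and the specific filter and transformation steps are made-up placeholders standing in for whatever the declarative description would specify:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

import java.util.Properties;

public class CompiledPipeline {

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // One filter step and one transformation step, standing in for whatever
        // the declarative pipeline description declares.
        builder.stream("pipeline-source", Consumed.with(Serdes.String(), Serdes.String()))
               .filter((key, value) -> value != null && !value.isEmpty())
               .mapValues(value -> value.trim().toLowerCase())
               .to("pipeline-sink", Produced.with(Serdes.String(), Serdes.String()));

        Properties props = new Properties();
        // The application id doubles as the consumer group, so the consumer
        // offsets (the pipeline's state) live in Kafka itself.
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "pipeline-42-rev-7");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        new KafkaStreams(builder.build(), props).start();
    }
}
```

Because the application ID acts as the consumer group, the offsets and any intermediate state live in Kafka, which is exactly the externalized-state property described above.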
So basically, what you can do now is: wherever your Kafka or event streaming system runs, you can build, test, and merge your ETL pipeline, release your pipeline images to a container registry, and your externalized state, everything you have already transformed or aggregated, stays in your event streaming system. The advantage is that with event streaming you have more predictable loads, and you have the same code base in testing and on your local machine as you would have in production. So how does it look practically? We essentially do a YAML apply and compile it to, in our case, a Java Quarkus app which abstracts away certain parts of the streaming implementation like Kafka Streams. It uses a library called SmallRye, and then we use Jib to build container images. Do you know what Jib is? Ever heard of it? Okay, we have two thumbs up from one person, that's great.
So Jib, a library written by the GCP folks, basically allows you to build container images off a base image from within Java code. We'll see how that works; you essentially do a Docker build within Java code without needing a Docker daemon or anything like that. In practical terms, we have this YAML pipeline, and what should come out in the end is a Kafka topic going into the data pipeline that falls out of this YAML, and a Kafka topic to write out to. The Java code is quite simple and looks similar to what we've seen with the Kafka Streams applications, but I want to bring your attention to the method signature of the basic pipe method. The annotations are SmallRye annotations where you say: I have two channels, they abstract away the topics, this is the source topic and this is the sink topic. The return type is a Multi, which means that if the content of the method executes successfully, it goes back to Kafka and says: I've successfully processed your message, you can now advance the offset, so that Kafka knows I don't want to reprocess this message again. If it fails, we usually write to a dead-letter queue, but you can be sure that you will reprocess the failed message again at a later point in time, depending on whether you want that or not.
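The slide with the code isn't part of this transcript, so here is a minimal sketch of what such a SmallRye Reactive Messaging processor in a Quarkus app could look like; the channel names, the filter, and the transformation are illustrative assumptions, not DataCater's actual code:

```java
import io.smallrye.mutiny.Multi;
import jakarta.enterprise.context.ApplicationScoped; // javax.enterprise.context on Quarkus 2.x
import org.eclipse.microprofile.reactive.messaging.Incoming;
import org.eclipse.microprofile.reactive.messaging.Outgoing;

@ApplicationScoped
public class BasicPipe {

    // "pipeline-in" and "pipeline-out" are channel names; the mapping to the
    // source and sink Kafka topics lives in application.properties. SmallRye
    // takes care of acknowledging processed records so the consumer offset in
    // Kafka can be advanced, and a failure strategy such as a dead-letter queue
    // can be configured per channel.
    @Incoming("pipeline-in")
    @Outgoing("pipeline-out")
    public Multi<String> pipe(Multi<String> records) {
        return records
                .select().where(value -> value != null && !value.isBlank()) // filter step
                .map(String::toUpperCase);                                  // transformation step
    }
}
```

The channel-to-topic mapping, serializers, and the dead-letter-queue failure strategy would be set in application.properties rather than in the code itself.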
What is great about Quarkus for us is that, as we saw in the talk just before this one, you can use Testcontainers to spin up a Kafka cluster for Kafka Streams. That is handy, but for us Redpanda works way better. It's basically a C++ implementation of Kafka-compatible technology; spinning it up in dev mode is way faster because it's a single binary, and second of all you don't have the broker reassignment of IPs that Kafka does, so even in dev mode you could easily go to two nodes and it can rebalance the traffic there.
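As a rough illustration (the talk doesn't show this code), here is a hedged sketch of spinning up Redpanda with Testcontainers and talking to it through the plain Kafka client; the image tag and topic name are arbitrary examples:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import org.testcontainers.redpanda.RedpandaContainer;

import java.util.Map;

public class RedpandaSmokeTest {

    public static void main(String[] args) {
        // Image tag is just an example; use whatever Redpanda version you test against.
        try (RedpandaContainer redpanda =
                     new RedpandaContainer("docker.redpanda.com/vectorized/redpanda:v22.2.1")) {
            redpanda.start();

            // Redpanda speaks the Kafka protocol, so the plain Kafka client works unchanged.
            Map<String, Object> config = Map.of(
                    "bootstrap.servers", redpanda.getBootstrapServers(),
                    "key.serializer", StringSerializer.class.getName(),
                    "value.serializer", StringSerializer.class.getName());

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(config)) {
                producer.send(new ProducerRecord<>("pipeline-source", "key", "hello"));
                producer.flush();
            }
        }
    }
}
```

In Quarkus dev mode, the Kafka Dev Services can start such a broker for you automatically, which is where the fast Redpanda startup really pays off.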
And if you've never used Quarkus, give it a try. I'm not a huge Java fan, but everything you get in the dev UI is just amazing, from method profiling to test coverage, it's all in there, and you can also look into the Swagger UI that comes out of it if you have a REST interface. Yeah, give it a try. So now we go to Jib. As you can see, this code snippet was prepared for another conference.
So basically, as we said before, we can give an exact name or hash to a container image, and every time our pipeline, its transformations, or its filters change, the pipeline ID and pipeline revision numbers increase, but you also have your container hash, which is uniquely identifiable. We basically always work with two registries, because we have the pipeline base image of the Kafka Streams application implementing the best practices that we just saw, and then we have the customer registry where we need to log in to put the pipeline image, so that customers can run it on their clusters. What is actually very nice with Jib is that it has event-based logging, so you can attach multiple handlers and log to whichever systems you need. And the last snippet, starting from Jib.from, basically says: I take my base image, let's say Nginx from Docker Hub, and then I just say add layer, which basically copies in your finished artifact, in this case the Java application that we just saw, but now compiled. If you compile with Quarkus you can get a native binary from that Java, and we basically set the start command and start running this pipeline.
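The snippet itself isn't reproduced in this transcript, so here is a minimal sketch of what such a Jib Core build could look like; the registry names, credentials handling, paths, and tags are placeholders, not DataCater's actual setup:

```java
import com.google.cloud.tools.jib.api.Containerizer;
import com.google.cloud.tools.jib.api.Jib;
import com.google.cloud.tools.jib.api.LogEvent;
import com.google.cloud.tools.jib.api.RegistryImage;
import com.google.cloud.tools.jib.api.buildplan.AbsoluteUnixPath;

import java.nio.file.Path;
import java.util.List;

public class PipelineImageBuilder {

    public static void main(String[] args) throws Exception {
        // Registry names, tags, and credentials are placeholders; in practice the
        // credentials would come from a vault or a Kubernetes secret.
        RegistryImage customerRegistry = RegistryImage
                .named("registry.example.com/pipelines/pipeline-42:rev-7")
                .addCredential(System.getenv("REGISTRY_USER"), System.getenv("REGISTRY_PASSWORD"));

        Jib.from("registry.example.com/base/pipeline-base:1.0")    // pipeline base image
           .addLayer(List.of(Path.of("target/pipeline-runner")),   // the compiled (native) application
                     AbsoluteUnixPath.get("/app"))
           .setEntrypoint("/app/pipeline-runner")                  // start command for the pipeline
           .containerize(
               Containerizer.to(customerRegistry)
                   // event-based logging: attach as many handlers as you need
                   .addEventHandler(LogEvent.class,
                           e -> System.out.println(e.getLevel() + ": " + e.getMessage())));
    }
}
```

Note that no Docker daemon is involved: Jib assembles the layers and pushes the image straight to the target registry.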
If you use Jib, what I would highly recommend is: whenever you start on a node, so on startup of your application, pull the base image so that it's cached and your customers don't have to wait for the initial pull of the base image. You have multiple mechanisms of authentication; what you do not have is a built-in way to attach Kubernetes secrets, which is actually quite non-intuitive since it is meant for running in Kubernetes. So what you always have to do is go to a Kubernetes secret, pull it yourself, and get the username and password out of it. And also, try not to use plain Kubernetes secrets, because they are just base64-encoded; go to whatever vault you use and pull the secrets from there.
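For illustration only, here is a hedged sketch of reading registry credentials out of a Kubernetes secret, using the Fabric8 Kubernetes client as one possible option (the talk doesn't name a specific client); the namespace, secret name, and data keys are made up:

```java
import io.fabric8.kubernetes.api.model.Secret;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

import java.util.Base64;

public class RegistryCredentials {

    public static void main(String[] args) {
        try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            // Namespace, secret name, and data keys are hypothetical.
            Secret secret = client.secrets()
                    .inNamespace("pipelines")
                    .withName("customer-registry-credentials")
                    .get();

            // Secret data is only base64-encoded, not encrypted, which is why a
            // proper vault is the better place for anything sensitive.
            String username = decode(secret.getData().get("username"));
            String password = decode(secret.getData().get("password"));

            System.out.println("Registry user: " + username
                    + " (password length " + password.length() + ")");
        }
    }

    private static String decode(String value) {
        return new String(Base64.getDecoder().decode(value));
    }
}
```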
And the detailed, granular logging events are quite helpful when something goes wrong while building that image. So, where are we right now? When we look into the ecosystem we are using to get closer to these cloud-native ETL pipelines, we need stronger self-containment, in the sense that right now, if you submit a Spark job, it doesn't tell you which resources it will need; that is highly dependent on the data peaks you will encounter. But if we go towards a more Kubernetes-like approach, where each resource says "these are my limits in terms of memory and CPU", we can self-contain our process on a given node or multiple nodes, depending on the scalability properties. So, most of the tools that we use
right now are actually pre-version-one: whether it's Jib or some parts of SmallRye, they run in production everywhere, but they are pre-1.0, and so is Redpanda. But I think the ecosystem is getting better to work with, and I'm quite happy especially about Redpanda and SmallRye here. What would be great in the future is to have a shared sense of what a declarative data pipeline should be. I mean, we have our implementation, there are others implementing it differently, but I think there is a place for something like the Kafka API, saying: this is the way we should declare filters and transformations on data. What we can do already is reap the benefits: thanks to the folks behind Apache Kafka and Redpanda we have a great dev ecosystem around Java Quarkus, and especially MicroProfile and SmallRye are doing a great job there when it comes to messaging and streaming.
Usually this talk includes a run-through of Strimzi, which I don't always have the time for, but I do have the time for today. Strimzi allows you to quite easily operate Kafka and Kafka Connect clusters on top of Kubernetes. Thanks, and I think we are roughly at the 20-minute mark, right? Okay, then we might have time for the excursion to Strimzi, unless you want to have drinks now. I'm asking the audience: drinks, or Strimzi and Kafka? Drinks? No? Kafka? Okay, I might have to close the presentation for a bit, sorry for that.
So let me quickly do this, apologies. I guess we already know what Kafka is, so, event messaging. Do we know what Kafka Connect is? Yes? No? Okay, it's a mix of yes and no. Kafka Connect basically connects to a database and fills your Kafka topics. What Strimzi allows us to do here, which is quite handy given the way a lot of people work in Kubernetes nowadays, is that it provides you CRDs for creating Kafka clusters and Kafka Connect clusters. So you first install the cluster operator together with the CRDs, and it then watches these custom resources and creates the Kafka Connect cluster and the Kafka cluster that we can see on the right-hand side. For now it also has to create a ZooKeeper cluster for distributed configuration; since Apache Kafka 3 that is becoming obsolete, but Apache Kafka does not yet recommend running their own protocol, called KRaft, in production. So you could go for it, but Strimzi doesn't yet, so it still installs the ZooKeeper cluster. What the connectors do is basically move data, as you can see, from the data source into the Kafka cluster, and then take data from the sink topic and out of the Kafka cluster again.
Everything within the blue box is basically managed by Strimzi. In this context, the data pipeline that we just created before now sits between databases: this could have been our sales DB, and that could have been our BigQuery on the right-hand side, the data sink. Kafka Connect and self-containment, that's a huge issue, because Kafka Connect is often deployed into the same cluster, and what this leads to is that there are a lot of connectors that you did not write yourself, so exception handling ends up flushing a lot of log messages and exceptions to that same node, which then has a peak in CPU. And even if you then set resource limits, most connector implementations have the problem that they will crash again and again and again until someone manually checks something, and that's actually quite bad for
self-containment. Did I not un-skip the other slides? Oh, sorry. So what we started doing in terms of self-containment here is that we moved the connectors out of the cluster, out of the management of Strimzi, put them into single instances in single-node mode, which you can do, and run them as sidecars of the pipeline. So whenever one of them fails, your data pipeline actually has more insight, and your data pipelines become isolated, because before that, every connector for each pipeline was running on the same Kafka Connect cluster. That's very important for the approach of self-containment, so that a data pipeline is reproducible through all stages and is self-contained like an application would be. With that being said, we can now go to questions, and we have quite some time for them. Who wants to go first?
Hi, thanks for the presentation. My question is: you mentioned the Jib library, is it also usable with Scala? Yes, Jib is usable with Scala. Only for the log handling you need to exchange, I think they are called consumers and producers; they are not natively implemented in Scala, so you need Java conversions, and depending on whether you use Scala 2.12 or 2.13 there's a different way of handling the logs. So it's suitable, but log handling is a bit tougher depending on the Scala version you use.
Any other questions? So, while people think, maybe I can ask a question. You presented this vision about moving to cloud-native ETL, and as far as I understood, you want this declarative way of describing ETL pipelines, and therefore you're proposing this YAML file. So what expressiveness do you imagine in there? I mean, what have you done up to now, and what's your plan for that?
So, we have a couple of no-code transformations, but the vision is rather a description of what steps should be taken. What we also think is that most data people work in Python; that's partly why we started this, because ksqlDB is also declarative, but once you want to implement user-defined functions, you need to write Java, and a lot of the data science and data engineering people, if you look at all the talks around data science at Berlin Buzzwords, work in Python. So the pipeline description for us now contains Python, or you could use Python, but the language itself, how you transform data, should not matter. You should be able to say: I want to transform this, and this is a user-defined function, and then plug and play a container based on that; that's totally fine. For now we have roughly 50 no-code transformations that you can directly use in the YAML. We do find, though, that it's way better for our customers to actually have the liberty to say: okay, this is the transformation and this is its execution, so to have less of a restriction. We don't want to define a new language, a new SQL so to say, that is not what we want; it's more about flexibility and having the ability to have one description which then always produces the same pipeline. I see, and maybe one last question from my side: when comparing it, let's say, to a scheduler, your proposal here is basically scheduling of operations plus infrastructure somehow?
Yeah, exactly. It's basically like in the CI/CD example: in my CI/CD pipelines, for example in GitHub Actions, what I describe is how to build it, but I never describe the content of my code, which is the state. What we propose is that the same thing can be applied to data: you focus on the computations and externalize all the state to the data itself. Of course, as you can see, it's also an evolution. Here on the left-hand side we started out with defining source schemas, and we deviated from that, because whether you use Kafka or any other database, schemas change over time and your code will change too, and if you lock your infrastructure onto that schema, there's now a third thing you need to manage. So we rather fail on execution instead of saying: okay, now I need to go into my schema registry and change my schema there until my pipeline works again. It can also still work this way, without the extra information. Does that make sense? Yeah, thanks. So if there are no other questions, let's thank Hakan again.