
Changelog Stream Processing with Apache Flink


Formal Metadata

Title
Changelog Stream Processing with Apache Flink
Title of Series
Number of Parts
56
Author
Timo Walther
Contributors
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
We all know that the world is constantly changing. Data is continuously produced and thus should be consumed in a similar fashion by enterprise systems. Message queues and logs such as Apache Kafka can be found in almost every architecture, while databases and other batch systems still provide the foundation. Change Data Capture (CDC) has become popular to capture committed changes from a database and propagate those changes to downstream consumers. In this talk, we will introduce Apache Flink as a general data processor for various kinds of use cases on both finite and infinite streams. We demonstrate Flink's SQL engine as a changelog processor that is shipped with an ecosystem tailored to process CDC data and maintain materialized views. We will use Kafka as an upsert log, Debezium for connecting to databases, and enrich streams of various sources using different kinds of joins. Finally, we illustrate how to combine Flink's Table API with DataStream API for event-driven applications beyond SQL.
Transcript: English (auto-generated)
Hello and welcome to the stage Frannz Salon. Next up is "Changelog Stream Processing with Apache Flink". Yeah, thank you very much. Also, thank you for having me back at Berlin Buzzwords and for letting me present Apache Flink today.
So this talk is about changelog stream processing with Apache Flink, but as always I will also have some introduction slides available, so that there is new content to share also for Flink beginners.
So first of all, who am I? My name is Timo Walther. I am a long-term committer in the Apache Flink project, actually even before it became part of the Apache Software Foundation in 2014. I'm a member of the project management committee there, and in the meantime I'm also one of the top five contributors based on commits and additions.
I'm one of the core architects around Flink SQL. Career-wise, I was one of the early software engineers at data Artisans, part of the SDK team there. data Artisans got acquired by Alibaba, then I was the SQL team lead at Ververica, and today I'm working on something completely new, but still around Flink, at a stealth startup.
So first of all, what is Apache Flink? I can actually start by first explaining what stream processing actually is. So what are the core building blocks for performing stream processing in general?
I split these building blocks into four categories. First, we need streams; then time, state, and snapshots. So what does streams mean? Of course we want to have a pipeline, so we need streaming pipelines. We might want to distribute streams across the cluster, so scale out, but we maybe also want to split the stream into side outputs and so on. We want to join streams together; maybe we have a main stream and a side stream, so we want to enrich the main stream with some side input data. We might have a control stream that controls the state of our application. And of course, we also want to replay the stream and do fast processing of historical data.
Then time is a very important concept in stream processing. We might need to synchronize or wait a little bit for data, but still we want to make progress as quickly as possible. Sometimes we want to time out if a matching record is not arriving. Sometimes we want to do fast-forwarding: when we do window processing, we don't want to wait an entire hour to reprocess historical data, so an hour should rather be a logical time that we can fast-forward during reprocessing. And of course, we also want to replay time.
Especially state is a key distinguishing factor. We want to store maybe some machine learning model, we want to buffer records and wait for the matching records, we want to cache records and avoid lookups to external systems. We want to grow state, maybe into the order of terabytes, and of course we want to expire and throw away state at some point. And if there is state involved, you have to deal with fault tolerance. So we want to create backups of our entire streaming application in different versions. We might want to fork our streaming application into a development cluster, we want to do A/B testing in different cluster environments, we want to time travel between different snapshots, and of course we want to restore in case of failures. So these four categories are all basically met by Apache Flink.
But what makes Flink unique is also its simplicity in how all of that works. You can really imagine it as if you would draw a pipeline on a whiteboard: we have some operators, some sources, maybe a normalization step here, a filter there, a join there, and a sink at the end. Once you have declared that, Apache Flink does the rest under the hood. Of course you want this application to be scalable; Flink takes care of that, so in this case maybe we scale to a parallelism of two. We have some operators that contain state, and as I said before, Flink has the ability of putting the state right next to the operator so that you have fast local state access. The state can also scale out and scale in together with the operator. Then of course there are events streaming through the pipeline. You don't have to deal with threading or network connections and so on, Flink takes care of that. And Flink can also perform an entire snapshot of the pipeline to durable storage.
So all in all, it's a very flexible framework. It is used for analytics, for data integration, for event-driven applications, for ETL applications. As input you can process transaction logs, IoT information, user interactions, events of any kind. Also, the involved systems, the ecosystem around Flink, is very diverse: we have messaging systems like Kafka or Pulsar, regular files on S3 for example, databases and key-value stores can be read, and on the other side it's exactly the same. You can write to Kafka, you can write to Elasticsearch, or you define your custom application and send out an email as an event, something like that.
Another frequently asked question is who is using Flink. This slide is a bit overloaded, but it also shows the diversity of the community. We have a Flink Slack channel, and there is an introduction channel where companies can brag a little bit, or introduce themselves and what they are using Flink for. So we have cloud data infrastructure at Apple, we have a feature platform at Square, security products at Microsoft, a feature generation system at ByteDance. We have gaming companies, we have little startups, various use cases as you can see.
So let's also talk about the APIs of Apache Flink, because how can you actually use Flink as a developer? This shows you a rough overview of the API stack. At the bottom you can see our dataflow runtime. On top of the dataflow runtime there is a low-level stream operator API; for example, Apache Beam goes against this low-level stream operator API. But the native APIs of Flink you can see at the top. The DataStream API is still the workhorse of the framework, of the platform. Then there is the Table & SQL API, which we will talk about in detail today. This has the specialty that there is an optimizer and planner under the hood, so less low-level access. And then on the right side we have Stateful Functions, which gives you some kind of serverless stateful functions where you can design applications similar to an actor model in a distributed way.
So let me also quickly describe the DataStream API. Here you can see a very simple example, almost a hello-world example. We are just initializing our environment, we set the runtime mode to streaming, we create a stream from the elements one to three, and we are just executing this simple stream and printing the result to the client locally. The properties of this API are that the DataStream API exposes the building blocks for stream processing, so you have access to timer services, you can build custom operators and custom topologies. All the business logic is written in so-called user-defined functions, so you can bring whatever library you want to bring, and you can also use arbitrary user-defined records between the operators. It might be just an integer in this case, but it can also be an Avro or Protobuf record. And the important property, talking about changelogs in this context, is that the DataStream API conceptually is always an insert-only or append-only log. The DataStream API doesn't know about updates or deletions and so on. As you can see when we are just printing this example, the output will be one, two, three. Very simple.
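A minimal sketch of what such a hello-world DataStream program might look like; the class name and job name are placeholders of mine, not taken from the slide:

    import org.apache.flink.api.common.RuntimeExecutionMode;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class HelloDataStream {
        public static void main(String[] args) throws Exception {
            // Initialize the environment and pin the runtime mode to streaming.
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.setRuntimeMode(RuntimeExecutionMode.STREAMING);

            // An insert-only (append-only) stream of three elements, printed locally.
            env.fromElements(1, 2, 3).print();

            env.execute("hello-datastream");
        }
    }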
So now let's look at the Table & SQL API. This is again a little example: we are initializing our environment, and then you have two options. Either you use the programmatic API, the first one, to create your tables, or you use the second one, which is just SQL, and as you know it's standard SQL. Again, we are executing and printing this locally. The important property of the Table API is that it abstracts the building blocks for stream processing, so you don't see state or time or anything like that. The planner decides how the topology will look, or whether a filter logically makes more sense at the beginning of the pipeline; the planner tries its best to optimize the entire pipeline. Also, the business logic is declarative, similar to relational algebra. Record types are internal to the engine, and there is also a concept of rows that is exposed if you want to access the pipeline programmatically. And the interesting property here is that conceptually we are working with tables, but under the hood the engine actually works with changelogs, and this I want to show you now in detail. For that, let's also execute and print this example, and you will see that the output in this case has a special column at the beginning when you print it to the console, an operation column, and this already shows you that we have some kind of insertions and deletions going on under the hood.
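As a rough sketch of the two variants; the concrete tables and values on the slide are not known here, so the data below is invented. Printing the SQL result exposes the op column with the change kinds:

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;

    public class HelloTable {
        public static void main(String[] args) {
            TableEnvironment env = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

            // Programmatic Table API variant: an inline table of three rows.
            env.fromValues(1, 2, 3).execute().print();

            // SQL variant: the printed result carries an "op" column (+I, -U, +U, -D)
            // that shows the insertions and updates flowing under the hood.
            env.executeSql(
                "SELECT name, COUNT(*) AS cnt " +
                "FROM (VALUES ('Alice'), ('Bob'), ('Alice')) AS t(name) " +
                "GROUP BY name").print();
        }
    }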
So let's get started with the changelog stream processing topic. The funny thing is, when I created the slides I got a little bit philosophical, because when you think about change, what is change, how is change evolving, and so on, I somehow had to add a cheesy quote from some website. So I added the quote from John F. Kennedy: "Change is the law of life. And those who look only to the past or present are certain to miss the future."
So let's get started. First of all, I want to start with stream processing. How does the Apache Flink community look at stream processing? Because actually, when you think about it, data processing in general is always stream processing. Somewhere at a company you start creating data, and the data comes in continuously; and if the data does not come continuously into the company, either the use case has ended or the company has gone bankrupt. So when you do batch processing at a company, what you're actually doing is cutting the stream into pieces, and these pieces are called bounded streams in Apache Flink. We only distinguish between bounded and unbounded streams. Unbounded streams have the nice property that you can start in the past and process through the present into the future, or you can start in the now and process the future. And batch processing is then just a special case in the runtime based on bounded streams. We also support batch processing, just to mention it here, but we will focus on stream processing now.
So when we talk about Flink SQL, how do we work with streams in Flink SQL? And the answer is that actually, as a user, you don't work with streams; users work with dynamic tables. This is a concept, it is actually similar to a materialized view in databases, and there is no table stored somewhere in state or so; it's really just a concept. What I mean with this is basically: as a user you define your SQL query, like in this case some summing of transactions, on the left side you declare the input table, on the right side you declare the output table, all of that happens in SQL, and you will see that the result is the same as if you would run it on a regular database, but in this case we're running it on a stream processor.
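A small sketch of such a summing-of-transactions pipeline; the datagen connector is only a stand-in so that the snippet is self-contained, the real input and output tables from the slide are not known here:

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;

    public class SumTransactions {
        public static void main(String[] args) {
            TableEnvironment env = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

            // Input table: an unbounded stream of transactions (random data here).
            env.executeSql(
                "CREATE TABLE transactions (name STRING, amount INT) " +
                "WITH ('connector' = 'datagen', 'rows-per-second' = '2')");

            // The continuously maintained result: the same query you would run on a
            // regular database, just updated incrementally while data streams in.
            env.executeSql(
                "SELECT name, SUM(amount) AS total FROM transactions GROUP BY name").print();
        }
    }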
So the question now is: is Flink a database? And the answer is clearly no, Flink is not a database, because you can bring your own data and your own external systems; you don't need to put your data into Flink. So let's discuss a little bit what this means. Okay, it's a table, but actually Flink is a stream processor, so how can we convert back and forth between streams and tables? This concept is also called the stream-table duality, which means a stream is just the changelog of a dynamic table, so sources, operators, and sinks work on changelogs under the hood.
Flink right now provides four different types of changes. We have regular insertions (+I); this is for example the default when you do scanning or when you work on bounded results. Then you have update-before (-U), which basically retracts a previously emitted result. Then we have update-after (+U), which updates a previously emitted result; if there is a primary key, you can also get rid of the -U. And then we have a deletion (-D), if we just want to remove the last result, for example in a database.
Each component, source, operator, sink, always declares the kind of changes that it consumes or produces. If the entire pipeline just produces, consumes, and processes insertions, we can consider this an append-only or insert-only pipeline. If it contains some -U or -D, we consider it an updating pipeline; if it contains -U, this is a retracting pipeline; and if it never contains a -U but only +U, we call this an upserting pipeline.
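To make the four change kinds concrete, here is a small hand-written changelog fed into the Table API; the names and amounts are invented, but RowKind and fromChangelogStream are the actual APIs for this:

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.table.api.Table;
    import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
    import org.apache.flink.types.Row;
    import org.apache.flink.types.RowKind;

    public class ChangelogKinds {
        public static void main(String[] args) {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

            // +I insert, -U update_before, +U update_after, -D delete
            Table table = tableEnv.fromChangelogStream(
                env.fromElements(
                    Row.ofKind(RowKind.INSERT, "Alice", 56),
                    Row.ofKind(RowKind.UPDATE_BEFORE, "Alice", 56),
                    Row.ofKind(RowKind.UPDATE_AFTER, "Alice", 145),
                    Row.ofKind(RowKind.DELETE, "Alice", 145)));

            // Printing interprets the changelog as a dynamic table again.
            table.execute().print();
        }
    }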
So here is a little example of how the stream-table duality looks in real life. Here you can see the changelog, and conceptually on the left and right side you can see the tables. On the right side especially you see: if you would apply the changelog to a key-value store or a database, what the end result in the database would look like. So again, we add some record, Alice with 56 in this case, so the changelog of course also contains this insertion. This insertion is processed by Flink SQL, the output would then be the first sum, I mean it's the first record, so it's still 56, and this ends up in the database. Then we continue, there's a Bob coming in; again, it's also in the changelog, it's also the first Bob that we have, so we're also adding this to the output table. And now there comes a second Alice, and that means we have to change the output that we previously emitted. So first we have to remove the old record from the database, the 56 is gone, and then we are emitting the new result, which is 145, and insert this into the database. As you can see, the end result is the same as if we would have done batch processing on a regular SQL engine. And if we want to do this a little bit more efficiently, we can also provide a primary key. A primary key means the downstream system now supports upserting, and this means we can save 50% of the traffic, because we don't have to delete the row first; we can now perform the update of this Alice record in place. Basically, as you can see, nothing is deleted, we are just doing in-place updates here.
I already said that sources, operators, and sinks all define a changelog mode. Let's also look at concrete examples, starting at the sources. We have a variety of different sources: the file system, we can connect to Kafka, to JDBC. We can also interpret Kafka not only as a log but as an upsert log: Kafka has the possibility, if the value is null and we have enabled log compaction on the topic, to work as an upsert source. Or we store Debezium JSON records in Kafka. For all these kinds of sources you have different changes coming in. The file system would just be insertions, regular Kafka is also just insertions, a Kafka connector with upsert, as you can see, would be insertions and deletions, deletions in case the compaction has not started in the Kafka topic yet. Then we have JDBC; in this case JDBC means we just perform a scan of the entire table and then we don't read this table again, so just a single scan through the table. And then there is Debezium JSON, for example, which gives you all kinds of changes in the runtime. The optimizer then tracks how these changelog modes propagate through the entire topology, whether there are primary keys involved and so on, and the sink finally declares the changes that it can digest.
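Two sketched source declarations along these lines; topic names, servers, and columns are assumptions, not taken from the talk:

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;

    public class SourceChangelogModes {
        public static void main(String[] args) {
            TableEnvironment tableEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

            // Kafka read as an upsert log: latest value per key, null value = deletion.
            tableEnv.executeSql(
                "CREATE TABLE customers_compacted (" +
                "  id BIGINT, name STRING, PRIMARY KEY (id) NOT ENFORCED" +
                ") WITH (" +
                "  'connector' = 'upsert-kafka'," +
                "  'topic' = 'customers'," +
                "  'properties.bootstrap.servers' = 'localhost:9092'," +
                "  'key.format' = 'json'," +
                "  'value.format' = 'json')");

            // Debezium CDC records in a regular Kafka topic: all change kinds enter the runtime.
            tableEnv.executeSql(
                "CREATE TABLE transactions_cdc (id BIGINT, amount DECIMAL(10, 2)) " +
                "WITH (" +
                "  'connector' = 'kafka'," +
                "  'topic' = 'transactions'," +
                "  'properties.bootstrap.servers' = 'localhost:9092'," +
                "  'format' = 'debezium-json'," +
                "  'scan.startup.mode' = 'earliest-offset')");
        }
    }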
Maybe one more remark, or explanation: what is the difference between retract and upsert mode? Because people are sometimes a bit confused about these two modes. In general, the retract mode is the most flexible changelog mode that we offer. First of all, it has no primary key requirement, it works for almost every system, for example a database table without a primary key, and it supports duplicate rows. And in distributed systems, retraction mode is often unavoidable. For this I have also brought a little example. Let's assume we have a SQL query where we are counting counts, so we are basically creating a histogram, and this for example is quite tricky in a distributed system. Let's assume there is a record coming in, we are grouping by user and calculating a count, then we are putting this count into state; in this case it would be count one. Then there is a second event coming in, we are counting again, this time the count is two. But if there is a shuffling step afterwards, the two counts end up at different operators, different subtasks, so you have different threads, you have different records, and you still need a deletion to also update the other subtask in this case. This is why retractions are usually the default mode within the engine. But as I said before, there is this upsert optimization, which saves you not only the traffic but also the computation internally, and towards downstream systems it also gives you in-place updates.
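A sketch of such a count-of-counts (histogram) query; the clicks data is invented inline so the snippet runs on its own, and the printed changelog shows the retractions traveling between the two aggregations:

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;

    public class CountOfCounts {
        public static void main(String[] args) {
            TableEnvironment tableEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

            // How many users clicked how often? When a user's count moves from 1 to 2,
            // a retraction for bucket 1 must reach the outer aggregation before the
            // insertion into bucket 2, which is why -U/-D are needed here.
            tableEnv.executeSql(
                "SELECT cnt, COUNT(*) AS num_users " +
                "FROM (" +
                "  SELECT user_name, COUNT(*) AS cnt " +
                "  FROM (VALUES ('Alice'), ('Bob'), ('Alice')) AS clicks(user_name) " +
                "  GROUP BY user_name" +
                ") GROUP BY cnt").print();
        }
    }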
We can also take a look under the hood. This is maybe for advanced people, but it might be useful to have seen this before. In this example we have a table transactions, with a transaction ID and an amount, and we have a table payments, with a transaction ID and the payment method that has been used. We want to basically just join the payments with the transactions and store the output in the result table, and the result table in this case accepts all changes, whereas the input tables are just insertions. This is a typical EXPLAIN of the SQL query from the engine, so this is the operator tree that gets executed. You see: this is one source with an insertion mode, another source with an insertion mode, then you're performing the join, the optimizer recognizes that both inputs are insertions, so the join will also just create insertions, and we store this in the sink. Now let's assume for example we have a left outer join. Left outer joins have the specialty that if there is no matching record on the other side, they will emit a null first, and later this null might be retracted again and replaced with a proper value when the join is able to find a matching record on the other side. So as you can see here, the changelog mode of the left outer join changes, and now we have update-before, update-after, and deletions as output.
So maybe now we don't only want updating results, maybe we even want to declare a primary key. For example, the result table defines a primary key on the transaction ID. The interesting thing is that this has also changed the sink: the sink has now gotten a special property. Why is that? The reason is that we declared a primary key on the result, but there is no primary key definition on the sources, so in theory there can be multiple transactions with the same transaction ID and multiple payments with the same transaction ID. The planner therefore tries its best to perform the so-called upsert materialization to meet this primary key constraint; there is an operator before the database that makes the records unique based on the primary key. But actually it would be even better to properly declare your pipeline. In this case all tables get a primary key on the transaction ID, and now you can also see in the explain that the join operator notices that we have two unique keys on both sides and can adapt the changelog mode accordingly. The update-before is gone now, and this is a very efficient pipeline.
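You can inspect these changelog modes yourself; a sketch using EXPLAIN with the CHANGELOG_MODE detail, where the datagen tables are only stand-ins for the transactions and payments tables of the example:

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.ExplainDetail;
    import org.apache.flink.table.api.TableEnvironment;

    public class ExplainChangelogMode {
        public static void main(String[] args) {
            TableEnvironment tableEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

            tableEnv.executeSql(
                "CREATE TABLE transactions (t_id BIGINT, amount INT) WITH ('connector' = 'datagen')");
            tableEnv.executeSql(
                "CREATE TABLE payments (t_id BIGINT, method STRING) WITH ('connector' = 'datagen')");

            // Prints the optimized operator tree including the changelog mode of each
            // operator; a LEFT JOIN variant would additionally show update/delete kinds.
            System.out.println(
                tableEnv.explainSql(
                    "SELECT t.t_id, t.amount, p.method " +
                    "FROM transactions t JOIN payments p ON t.t_id = p.t_id",
                    ExplainDetail.CHANGELOG_MODE));
        }
    }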
So in general, there are transitions going on under the hood. As I said before, if the pipeline only contains insertions, this is an append-only stream or append-only table, and this is usually highly state efficient. Depending on the operation, for example a left outer join, the table becomes updating. If you have a couple of updating operations, it still stays updating, but the runtime also has means, for example, to move from the updating optimization back to retracting for special cases. So in the pipeline itself you see special operators like changelog normalize, which converts an upserting stream into a retracting stream, and there is also the opposite operation shortly before the sink, called upsert materialize, which creates an upserting table out of a retracting table. This is very low level, but it is sometimes quite useful to understand how much power is in the engine under the hood that you usually don't see when you just declare your SQL and your tables.
So now I also want to show you a little demo. There is a lot of stuff to show; I also have a big repository on my GitHub
account called flink-api-examples, with a lot of different use cases for the DataStream API and the Table API. In this case I want to keep it simple, so I have just a MySQL database. The MySQL database is already running, and I have filled it with some records. First of all, we have two tables, the customer table and the transaction table. The customer table, let's make it bigger, has an ID column and a name, and there are three customers already in there: Alice, Bob, and Kyle. And then we have a transaction table with a transaction ID, an amount, and a customer ID, and I also added some transactions already in there. So I have created these tables already.
And now comes the interesting part: how do we actually access the data in the database? You have two possibilities. Either, as I said before, you perform a full scan and that's it, which would be the JDBC connector, or you want to continuously fetch the results from the database, so you want a monitoring source based on the database, which opens the space for so-called CDC, change data capture, processing. In this example I'm using the Flink CDC project, I also have a link on a later slide. Basically, this is how I declare it: I say 'connector' = 'mysql-cdc', then I give it a schema and some configuration options, and that's it.
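The declaration could look roughly like this; hostname, credentials, and column names are placeholders, and the flink-connector-mysql-cdc dependency plus the running MySQL instance from the demo are assumed:

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;

    public class MySqlCdcSource {
        public static void main(String[] args) {
            TableEnvironment tableEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

            // Monitoring source: an initial snapshot scan of the table, then continuous
            // reading of the MySQL binlog (Debezium under the hood).
            tableEnv.executeSql(
                "CREATE TABLE transactions_cdc (" +
                "  t_id BIGINT, t_customer_id BIGINT, t_amount DECIMAL(10, 2)," +
                "  PRIMARY KEY (t_id) NOT ENFORCED" +
                ") WITH (" +
                "  'connector' = 'mysql-cdc'," +
                "  'hostname' = 'localhost'," +
                "  'port' = '3306'," +
                "  'username' = 'flink'," +
                "  'password' = 'secret'," +
                "  'database-name' = 'demo'," +
                "  'table-name' = 'transactions')");

            // Scans the existing rows first and then keeps waiting for new changes.
            tableEnv.executeSql("SELECT * FROM transactions_cdc").print();
        }
    }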
Then I can actually already output my data: I run executeSql with a SELECT * FROM the transactions CDC table and print it. As you can see, first of all it will perform a scan of the entire table and print this within some seconds, but then it will continuously wait for new incoming records. So as you can see, the program is still running and it waits for input. I have a second program here; in theory I could have used the MySQL client to add new rows to the database table, but I thought, why not just use a Flink batch job to add new rows to the database. So I can just run this Flink job that simply adds new rows to the MySQL table, and the other running job will pick this up in a second, hopefully, if everything goes well. Otherwise it's not my fault. No, there it is.
Beautiful. As you can see, it's continuously monitoring. Let's have a different example now; let's join different tables together. So in this case, as I said before, the JDBC connector: first of all, here we want to join the customers and the transactions. The specialty here is that the transactions are CDC, so it's real time, it comes in continuously, but we declared the customers as a regular scan-it-once JDBC source. So again, we say 'connector' = 'jdbc', we declare the schema here and some connection properties. And the specialty is that there is a special kind of join in Flink called a time-versioned join. What you can do is perform real-time lookups to connectors; the JDBC connector, for example, supports this. So if you read this SQL statement: for every incoming transaction record, we basically perform a lookup in the database as of now, so the current processing time, the current wall-clock time, and we are searching for the matching customer ID.
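A sketch of such a time-versioned (lookup) join; again, connection details and column names are assumptions rather than the exact demo code:

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;

    public class LookupJoinDemo {
        public static void main(String[] args) {
            TableEnvironment tableEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

            // CDC stream of transactions plus a processing-time attribute that anchors
            // the lookup ("as of now").
            tableEnv.executeSql(
                "CREATE TABLE transactions_cdc (" +
                "  t_id BIGINT, t_customer_id BIGINT, t_amount DECIMAL(10, 2)," +
                "  proc_time AS PROCTIME()," +
                "  PRIMARY KEY (t_id) NOT ENFORCED" +
                ") WITH (" +
                "  'connector' = 'mysql-cdc', 'hostname' = 'localhost', 'port' = '3306'," +
                "  'username' = 'flink', 'password' = 'secret'," +
                "  'database-name' = 'demo', 'table-name' = 'transactions')");

            // Customers as a plain JDBC table: no continuous monitoring, but it
            // supports point lookups by key.
            tableEnv.executeSql(
                "CREATE TABLE customers_jdbc (c_id BIGINT, c_name STRING) " +
                "WITH (" +
                "  'connector' = 'jdbc', 'url' = 'jdbc:mysql://localhost:3306/demo'," +
                "  'username' = 'flink', 'password' = 'secret', 'table-name' = 'customers')");

            // For every incoming transaction, look up the current customer row.
            tableEnv.executeSql(
                "SELECT t.t_id, t.t_amount, c.c_name " +
                "FROM transactions_cdc AS t " +
                "JOIN customers_jdbc FOR SYSTEM_TIME AS OF t.proc_time AS c " +
                "ON t.t_customer_id = c.c_id").print();
        }
    }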
So let's also run this. I have to stop the other one first; this takes a while, but it will start. Here you can see, again, we are waiting for new transactions; the existing transactions and the existing customers have been joined right at the beginning. Now let's add another record, for example another transaction. This should work, it just takes a while. Exactly, so now we have joined the transaction with Kyle. So what happens, for example, if Kyle decides to become a Kylie? We are updating the customer data in the customer table. For now, nothing will change, because all the previous joins have already been done, so no join is performed again. But if we now add a new transaction, it will look up the current value for Kyle, and now Kylie, in the database, and it will show you Kylie when performing the next join. I hope this works. Yeah, you can see it: at that time it was still Kyle, but now it's Kylie, so it's updated based on the current database value.
There's much more that you can do; I would recommend, as I said, my GitHub repository.
And if you like, you can also go a level deeper. If you want to declare something in SQL and you notice, wait, SQL might not be the perfect fit for my use case, because I have some special operator, or I want to do some machine learning, or something very special that you cannot express with SQL, then you can also just bridge, or go back and forth, between the DataStream API and the Table API. You just call toChangelogStream on your table and then you're back in the DataStream world. You're dealing with a DataStream of rows here, and in Flink a row always carries a changelog kind, a row kind. So in this case, for example, we are filtering out deletions, and you can have arbitrary pipelines, and then you can also go back into the table world whenever you need it.
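A sketch of this bridging; the aggregation is just an arbitrary updating table so that the snippet is self-contained:

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.table.api.Table;
    import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
    import org.apache.flink.types.Row;
    import org.apache.flink.types.RowKind;

    public class BridgeToDataStream {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

            // Any updating table, here a simple count per name.
            Table counts = tableEnv.sqlQuery(
                "SELECT name, COUNT(*) AS cnt " +
                "FROM (VALUES ('Alice'), ('Bob'), ('Alice')) AS t(name) GROUP BY name");

            // Back in the DataStream world: rows carrying their changelog RowKind.
            DataStream<Row> changelog = tableEnv.toChangelogStream(counts);
            changelog
                .filter(row -> row.getKind() != RowKind.DELETE) // drop deletions
                .print();

            env.execute("bridge-to-datastream");
        }
    }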
So in summary, I would say the Flink SQL engine is a very powerful changelog processor, as you might have seen, and it's also a very flexible tool for integrating various systems: databases, key-value stores, logs, whatever you want. And you can also see this in the ecosystem. As I said already, there is the Flink CDC connectors package on flink-packages.org, which has a large number of CDC connectors and already 2.6K GitHub stars; here you can see which databases it integrates with. Then there is also something completely new on the block called the table store. It's at an early stage, but we are starting to have a unified storage engine for dynamic tables, so you can store a changelog in this table store, but you can also use the table store to run data warehouse or analytical queries using batch mode. This is also quite interesting and maybe worth looking into.
Yeah, I think that's it. I'm happy to answer questions now. Thank you.
Hello, so my question is around supporting schema evolution. Does it already support it? I saw some ticket which seems to have a lot of comments and work, but I don't know the degree of support for schema evolution. No, this is still on the to-do list, so for the Table API there is no schema evolution right now. But for the DataStream API, as I said, you can have custom records, which means for example Avro, and Avro already has schema evolution. So if you want to go back and forth between the DataStream API and the Table API, you could for example create the source in the DataStream API, prepare the records as you need them for the SQL pipeline, and then switch to SQL. So you could have the schema evolution as a pre-processing step before the SQL query, for example.
Any more questions? I saw that you used both the CDC and the JDBC one,
but in the case of JDBC, when you had the update on the customer table, that was not immediately shown in Flink. So my question is: should we always go via CDC, or is there a special use case for the JDBC one? I would say in most of the cases you want to do CDC. It was just to show you what the difference is between a snapshot and CDC. In practice, if you have a static data set lying in the database that will be static forever, then JDBC makes sense, but usually, as I said before, everything is changing continuously, so it usually makes no sense to use JDBC as a source, unless it is with the FOR SYSTEM_TIME AS OF syntax where you perform real-time lookups to the database. That still makes sense with the JDBC connection. Okay, thank you. There is also a question online I can
show you. Okay, so the question is: what are the practical limitations of the Flink Table API, how many tables can be involved in joins, the number of records per table? These questions are always difficult to answer because it really depends on the use case. In general, what I know is that the Table API is used at very large-scale companies like Alibaba that run SQL programs on thousands and thousands of nodes. You can configure, for example, how long you want to keep records in state, and that might improve how many joins of tables you can perform.
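One way to configure this is the idle state retention; a sketch of setting it, assuming one hour is an acceptable bound for the use case:

    import java.time.Duration;
    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;

    public class StateRetention {
        public static void main(String[] args) {
            TableEnvironment tableEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

            // State that has not been accessed for an hour may be cleaned up, which
            // bounds the state size of joins and aggregations.
            tableEnv.getConfig().setIdleStateRetention(Duration.ofHours(1));

            // Equivalent low-level option:
            tableEnv.getConfig().getConfiguration().setString("table.exec.state.ttl", "1 h");
        }
    }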
In general, as I said before, it's never the entire table that is stored in state; we only store in state what is necessary to compute the particular operation. So if you use all the optimizations that we can offer, also related to event time, which I did not mention in this talk, then the state size can be managed very efficiently. It doesn't grow infinitely, and in this case you can also join a lot of tables, but I cannot give you concrete numbers here. Any more questions? Okay, then thank you very much.
Thank you