
Streaming: Why should I care?


Formal Metadata

Title: Streaming: Why should I care?
Title of Series: EuroPython 2017
Number of Parts: 160
Author: Christian Trebing
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Language: English

Content Metadata

Abstract:
Streaming: Why should I care? [EuroPython 2017 - Talk - 2017-07-13 - Anfiteatro 2] [Rimini, Italy]

You think all that hype about streaming solutions does not affect you? I thought so too. But after playing around with the topic for some time, I realized that it sheds a different light on many issues I had struggled with for a while. In this talk I want to share what I discovered when switching from a database-centric view to stream-oriented processing:

- Splitting your application into smaller services gets easier, as you have more natural boundaries
- You have more options to run different data schema versions in different services (instead of one central db upgrade)
- More scaling possibilities
- Operations improvements

For sure, streaming does not solve every problem, but many more than I thought before. And in Python you have good support with several streaming clients. I will give some examples and comparisons for working with Kafka and Avro schemas.
Transcript: English (auto-generated)
Yeah, thank you. Good morning and welcome to my talk, Streaming: Why should I care? My name is Christian Trebing. I'm working at Blue Yonder, a German company doing retail solutions, predicting what retailers need and what they should order. Yeah, so that's my background.
And now let's start with the talk. What will await you in this talk? At first, I will give you a motivation on why it makes sense to look at streaming at all. And yeah, that's also the title of my talk, and I will give you some background on why I came to that question at all.
Then I want to give some introduction into what streaming is, what the basics of it are, and then show you how you can use this within Python. Finally, I'll show some challenges that might await you when you go down this road, or maybe swim down that stream,
so that you are prepared to tackle them. Okay, let's have a look at the situation we faced as a team for quite some time. We have a data processing application. As I already said, we're doing machine learning. We get lots of data from our customers and
we had a quite monolithic application. So what we have is we get input data from the customer. They send it in large XML files. We need to validate that data. We need to make sure that this is fine, that this has the right quality to go into the machine learning.
We do the machine learning, which takes lots of input data and writes out lots of output data, and then we have this output interface where the customer can query the data, the results of our machine learning. So this works quite fine. I mean, the results are really good, but
we have some issues with operations and with extending it. So yeah, we have this big database in between where we store all that data, which all these boxes are accessing. We're using SQLAlchemy to access it from Python, we're using Alembic for database upgrades, and all this works fine.
But if you have also an application of that type, you might know some of the pains that come with that. So for example we have several teams developing this application and now there are dependencies. We all depend on this one database and the customer desperately wants a new feature in machine learning.
And the machine learning team says, oh, that's great, it's no big deal. We just have to change our query a little bit, we have to write one more field to the output table, and then that's fine. But the other team is also working on that application and on the database, and they are doing some refactoring there.
They are changing their data validation. It's very good that they do that, but there are conflicts, and they say, well, it just takes two weeks, then we're finished, and then we can make sure there are no conflicts. So really,
yeah, now you as the machine learning team say, well, now we have to wait. That's bad, but there's nothing we can do about it. This is one challenge that you might face. Another thing is testing. It's always hard, if you have one big application, to establish clear boundaries, to have a
solid testing strategy there. You try really hard, but this is at least not an architecture that supports you in that. I also liked the testing-in-layers talk from Monday, where you see all these nice little boxes and how you can separate them into layers. And for sure you can do that with such a monolithic application, but it's just hard to do so.
Then, I went to EuroPython already two years ago, in 2015. I heard many talks about microservices and I really liked that. It was a great idea. It solves these issues in that
you split up the monolithic application into several boxes and you can handle those boxes separately. Each has its own data, each has its own upgrades, the teams can develop independently. Wow, that's great. And we had lots of discussions afterwards, after these talks, on how we could use that for us.
But well, as I said, we have a data processing application, and for data processing we have to process lots of data. And that's where we saw that we cannot use this model in its purest form. We would have to transfer lots of data between the services.
We would have to save lots of data in each of the services, as each service needs much of that data. So unfortunately, as nice as it looked, that was nothing for us. Yeah, there we stood. Microservices sound great, but, I mean, we had seen lots of case studies.
They were used in billing, they were used for more transactional applications, and this worked fine. But as I said, we could not use that. And then at other conferences so many people were talking about streaming, and also here at this EuroPython I already heard many talks, and it's great.
It's great to see. But most applications use it in financial services, for stock markets, where they have lots of data volume and have to process it really fast, or they are using it for online advertising, where they have to process click streams or logs, again really fast. And well, I thought these are the traditional domains of stream processing, and we are not in those domains either.
So what possibilities do we have? We're just standing in the middle: we can't go to microservices, and streaming seems to be the domain of other people. Sounds interesting, but yeah, maybe nothing for us.
Fortunately, during a different project I came into contact with streaming and did some evaluations there, and suddenly it came to my mind that this might be fine for us. Although we don't have millisecond-critical click streams to process, many things that might be more side effects of streaming, and not just this
pure millisecond processing are really, really good and we can use that to solve the issues we have. And that's the reason for me for giving this talk. Maybe if you are in a similar situation I hope that this will help you also and give you some ideas on how you can improve your application.
So let's do an introduction to streaming and give you some basic idea of what this is. My background, as I said, is coming from a database-centric application, and therefore I want to compare the two and give you some idea of where the differences are.
So let's look at databases and streams. I heard the talk that Martin Kleppmann gave at the Strange Loop conference in 2014, Turning the Database Inside Out, and I thought this was a very good way of thinking about databases and streams. Because essentially, in a database and in a stream you have the same information.
When you have a database then this database has internally a changelog and this changelog tells what did change within your database tables. And from this changelog this is essentially a stream and you can reconstruct the database content at every point in time.
So when you look at it here, the first entry says: change a row with key a to the value one, and then you see this table has one row with key a and value one. Then the next piece of information comes in: change row b to the value five, and you see it in the table. Then comes: change row
c to the value three, and you see that in the table. And then it gets more interesting: you have updates for existing rows. So a gets the value eight, then a gets the value four and c gets the value two, and at each point in time you have a consistent table in your database.
So that's the basic idea of, well, the similarity between tables and streams. And databases use this internally, for replicating to different nodes, for example. But why does this matter for us? The most interesting thing
in my situation was that the different services we have can be in different states. So we don't have this dependency on one single state, because each stream processor, each service, can have a different offset within that stream. Let's take this example here. We have service one, which is now at offset three, and
it sees the table as: a is eight, b is five and c is three. We have a different service that is already at offset five, and there a is four, b is five and c is two. And this is totally fine; both services are in a consistent state, and if service one catches up to offset five, then it will have the same information as service two.
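To make the changelog idea concrete, here is a minimal plain-Python sketch (not from the slides) that replays the change events from the example above; the table each service sees is just the state after consuming the stream up to its own offset:

```python
# The changelog entries from the example: (key, new value) change events.
changelog = [("a", 1), ("b", 5), ("c", 3), ("a", 8), ("a", 4), ("c", 2)]

def table_after(offset):
    """Rebuild the table state after consuming entries 0..offset (inclusive)."""
    table = {}
    for key, value in changelog[: offset + 1]:
        table[key] = value
    return table

print(table_after(3))  # {'a': 8, 'b': 5, 'c': 3} -- what service one sees
print(table_after(5))  # {'a': 4, 'b': 5, 'c': 2} -- what service two sees
```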
So the interesting thing is you can have several services that can operate at different speeds and this is one thing we have for our application. At one point in time we might get lots of new data from the customer. We might write lots of data into that stream and we have services that process that faster.
We have services that process that slower, but as long as they are able to catch up, that's totally fine. And you can add new services to that structure which can process the stream from the beginning on.
And you have just more control, you have more possibilities how to scale your services. So you don't have that all services need to be at the same speed, but you can design by your needs. Do I have services that need to get faster? Do I have services where it's fine that they are slower? For example, just aggregating some reporting mechanisms, whereas on the other hand you want to answer the customer fast.
Okay, so we have that possibility. That's great. But how can we use it? I mean I said some services might be faster, some services might be slower. How can we influence the speed of our services? Well, we can program better. That's fine. So we can just improve our code, but this has just a
limited effect. Sometimes we need even more. And here one idea comes into play that's also very helpful in streaming: you can partition your streams. And that's always a decision based on your business domain, on what partitioning makes sense.
For us, for example, we get sales streams. We get the information about in what location did we have which sale. And we can use this for partitioning our stream. So let's say we have here our sales stream. We have three locations.
We have Rimini, where we are now. We have Bilbao, the last EuroPython conference. And we have Karlsruhe, where our company is based. So we sell spaghetti, we sell ravioli, we sell pizza. And we have this at different times and in different quantities. And now we can have one processor that works on all these three partitions.
That's fine, and that's maybe the best place to start. But if we see that we need to get faster, we have the possibility to split this up and introduce one or two new processors. And they can work in parallel on these different stream partitions. So each partition could be handled by a different processor. And this sales stream example will follow us throughout the rest of this presentation.
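As a hedged sketch of how such a partitioning is usually expressed with a Kafka producer (assuming the confluent-kafka package; the topic name and the records are illustrative), using the location as the message key sends all sales of one location to the same partition:

```python
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

sales = [
    {"location": "Rimini", "product": "spaghetti", "quantity": 5},
    {"location": "Bilbao", "product": "ravioli", "quantity": 3},
    {"location": "Karlsruhe", "product": "pizza", "quantity": 7},
]

for sale in sales:
    # Records with the same key always land in the same partition, so one
    # processor per partition sees a location's sales in order.
    producer.produce("sales", key=sale["location"], value=json.dumps(sale))

producer.flush()
```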
What does it look like? So we don't have the database-centric architecture, we don't have the microservices, but we now have the streaming platform in the middle. And the idea is that we can be at different offsets in that. So what did we gain? We have clearer boundaries for our services.
We have all these ideas that we can deploy them independently, that we can operate them independently. And the data is mostly in the streaming platform. Well, you might ask yourself, why do we have these data bubbles still in the processors?
I'll answer that later. And what about these database schema changes that I talked about at the beginning, when the data validation team needs to update the schema and the machine learning team also needs to update it? Also save this question; it will be answered later.
So what did we gain? As I said, independent development, upgrade. And we have more options for scalability. Did we throw out the databases completely? We have the streaming platform in between. Well, think of these data bubbles and we'll come to that later. For me, one important question was, this all sounds so great.
This is such a good idea and it seems to solve so many problems. Sounds like magic. And magic, that's always something that makes me suspicious. Maybe there's something I don't see. Maybe there are new problems that arise, which I just don't know of at the moment. And yeah, it is not magic. It is a trade-off.
I mean, a database is very powerful. You have many guarantees that are given to you by the database. You have the ACID consistency guarantees in your transactions. You have the SQL language, which is such a powerful way to retrieve the data that is there in the database. You can do nearly everything.
But as we have seen, this comes at a price. We are depending on one single state, and scaling of the database is hard. So we have to think: do we need all these things that the database gives us in our application? And what are the things that we lose?
So with streaming, we don't have the ACID guarantees anymore. We have an ordering on a stream partition. What the streaming platform will guarantee us is that when we feed entries into the stream, each service that retrieves these entries will retrieve them in the same order.
This might be a small thing compared to the ACID guarantees, but it's fascinating what you can construct from that. And when you have several things on several streams, just given this ordering constraint, you are able to construct many of the guarantees that you need.
And we don't have the SQL queries anymore. So it is not possible to query a stream. And this is something that you really have to get in your head when you are thinking about that. You might be used to SQL, you might be used that you can query for any row with any value or doing a join on that. But this is not possible.
The stream just goes through your processor, and it's up to you to keep that state and to remember what the last value of a was. And you have to decide whether you can live with that or not, and what mechanisms you employ to help with that. I mean, you see what you lose, but at least I feel better now.
I know it's not magic that might come back to bite me at the worst moment, but it's a conscious decision that you can make, and you can see the trade-off and whether that's good or bad for you. OK, so much for the theory. Now, we are using Python at our company.
We really love to use Python in most of our services. And how can we do that in Python? Just taking a step back, Apache Kafka is a streaming platform. This is not Python. This is implemented in Java. But we'll come back to Python soon.
This is an example of the streaming platform. So you can have here producers that put data into the streaming platform. You can have consumers that retrieve data from the streaming platform and you have the stream processors that, well, they take that data and they wrangle that data
and they put it back to different streams then. And you can also have some connections to databases to get that data from there directly. So that's a really cool streaming platform. It's used by many people.
It's very scalable and it's really battle-proven. So that's a thing that you can build on. And there are also Kafka clients in Python; we also heard just yesterday from others that are using them. You have pykafka, kafka-python, and the Confluent Kafka client, which are the three I have seen.
And other people have already done very nice comparisons of them, for example the one from Activision Game Science shown here. You can have a look there in detail if you want. The interesting differences between them are: pykafka and kafka-python are both written completely in Python.
pykafka has a very Pythonic interface, whereas kafka-python more closely simulates the C interface. And we have the Confluent Kafka client. This is not pure Python, but it uses the C library librdkafka, and it's the most performant of these. So we decided to use the Confluent Kafka client.
There are many configuration options, and it's really worth looking at them because you can use them for performance tuning. At first I was a little bit surprised when I used the client and it seemed rather slow, but that was just due to my testing setup, where I had very few records,
and the default settings are not tuned for so few records but for more; there is more buffering in there. And if you reduce that, you can get to very low latency also in your test setup, which then feels much better. So let's see how we can use these clients.
First thing is to have a producer. I just have to give that producer the bootstrap server, which is my Kafka node, which by default is on port 9092. I have some data; you can see here the sales data in Rimini on Monday. We sold some ravioli and we want to input
that to a stream called sales input. And we are using JSON to serialize that data. Then we want to consume that. And also the consumer, it needs to know the Kafka node to connect to. Well, it has some further settings of which here the topic config might be most interesting for you
because topic config tells you do I just use the new values appearing on that stream or do I want to process the stream from the beginning on. So default is that you just look at new values, but here, especially in testing, it's always very helpful to start at the beginning. We subscribe our consumer to the same topic, to sales input.
And the most important things are here. We are polling the consumer. We're checking that this worked fine, that we have something in there. And then we just print the received message and it will show up as the JSON string we put in there. That's great.
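A sketch along the lines of what is shown on the slides, assuming the confluent-kafka client; the topic name, group id and record are illustrative, not taken from the talk:

```python
import json
from confluent_kafka import Producer, Consumer

BROKER = "localhost:9092"  # Kafka's default port

# Produce one sales record, serialized as JSON.
producer = Producer({"bootstrap.servers": BROKER})
record = {"location": "Rimini", "day": "Monday", "product": "ravioli", "quantity": 3}
producer.produce("sales-input", value=json.dumps(record))
producer.flush()

# Consume it again; the topic config makes a fresh consumer group start at
# the beginning of the topic instead of only seeing new values.
consumer = Consumer({
    "bootstrap.servers": BROKER,
    "group.id": "sales-printer",
    "default.topic.config": {"auto.offset.reset": "earliest"},
})
consumer.subscribe(["sales-input"])

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None:
        continue                    # nothing received yet, keep polling
    if msg.error():
        print("consumer error:", msg.error())
        continue
    print(json.loads(msg.value()))  # the record we produced above
    break

consumer.close()
```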
This already works as we needed to. Now we use JSON as a serialization format. That's also good for a starting point. But as you are working with many teams, it's always good to have some defined schema. In your databases, the schema was defined by the database and this is good so that everyone knows what is in there.
And also for your streaming applications, you can use more rigid schemas than just plain JSON. What we decided on is Apache Avro. This is a system where you define the schema first.
You give some data types, you give some fields; it has many possibilities to define that schema. And it also compacts very well. So it's not just writing out the JSON format; it has compaction in it, which is also very good if you want to save some space. But what excited me most about Apache Avro,
is that it defines also schema evolution. This means you can enhance your schema. You can add new fields and it defines the criteria on how you can enhance that schema. So for example, for new fields, you always have to give a default value
because this ensures that also processes that are at an older state can use that data. And if they want to retrieve data that was written by an older service, then the default value is applied. And if they're reading the data that was written by a newer version of the service, then it already has this field with what the service has put in there.
So this is really great. And this solves one issue which I promised to answer ten minutes ago: different teams wanting to enhance the schema. By that, they can work with different versions of records in a compatible way. And you don't have to reprocess all entries in the stream as you would have done
with a database upgrade script. So how does that look like in Python? We have here the schema defined. It has a name. It says it's a record of many fields. And I say, what are the types of that field? We have strings, we have quantities,
there are many other types also. And as I said, you can say whether a field is optional, you can say whether a field has a default value or not. So we're using this. And now we use a different producer and consumer: the AvroProducer and the AvroConsumer. And you can see here, we still have to give the Kafka node, and now we also have to give the schema registry URL.
So this is something where the schema is registered where every service can retrieve all versions of that. And this schema registry, this also checks that you only do compatible schema upgrades, which is great because as soon as you want to write a schema in a new version and it's not compatible, it raises an exception and tells the world this is wrong.
So you know it from an early point on. The other things are mainly the same. We are giving default schemas for the key and the value. We can use our data here, and now we don't encode it in JSON but give it directly to the Avro producer. It uses the Avro schema to encode the data and writes it to the stream.
Also for the consumer, the main new thing is that we have to give that schema registry URL so that it knows how to interpret things that are on that stream. And we are also using polling here, and we can just check what the value is. These are mainly the examples from the Confluent Kafka client.
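A hedged sketch of the Avro variant, again assuming the confluent-kafka client with its avro helpers and a schema registry on its default port; the schema, topic and field names are illustrative. The "day" field carries a default value, which is exactly what makes adding such a field later a compatible schema evolution:

```python
from confluent_kafka import avro
from confluent_kafka.avro import AvroProducer, AvroConsumer

value_schema = avro.loads("""
{
  "namespace": "example.sales",
  "type": "record",
  "name": "Sale",
  "fields": [
    {"name": "location", "type": "string"},
    {"name": "product",  "type": "string"},
    {"name": "quantity", "type": "int"},
    {"name": "day",      "type": "string", "default": "unknown"}
  ]
}
""")
key_schema = avro.loads('{"type": "string"}')

producer = AvroProducer(
    {"bootstrap.servers": "localhost:9092",
     "schema.registry.url": "http://localhost:8081"},
    default_key_schema=key_schema,
    default_value_schema=value_schema,
)
producer.produce(
    topic="sales-input-avro",
    key="Rimini",
    value={"location": "Rimini", "product": "ravioli", "quantity": 3, "day": "Monday"},
)
producer.flush()

consumer = AvroConsumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "avro-printer",
    "schema.registry.url": "http://localhost:8081",
    "default.topic.config": {"auto.offset.reset": "earliest"},
})
consumer.subscribe(["sales-input-avro"])
msg = consumer.poll(10)
if msg is not None and not msg.error():
    print(msg.value())  # deserialized back into a Python dict by the consumer
consumer.close()
```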
So you can also have a look there to dig in deeper and to have some more explanations on that. Okay, these are the basics how we can write and read from a stream. What do we do with that? So let's have a look at data validation. As you remember, that's the second box we had in there.
And what do we want to do there? We have the sales input and we need to check whether it's correct or not. So we're separating the valid and invalid sales records, in the same way as in the Cinderella fairy tale, where she has to separate the good peas from the bad peas. That's what we want to do here within our service.
Very basically, we just poll the new records. Let's say we have a function which checks whether the sales record is valid: whether the location is a valid one, whether the quantity is non-negative, for example, all these things you can think of. And if it's valid, you write it to a new stream, say sales validated, and if not, you write it to a new stream, sales error,
and then let the other processes handle this information, for example how to answer the customer that he sent an invalid sales record (a small sketch of such a validation loop follows below). That's fine. A very interesting thing about streaming is that you can add additional processes in a very easy way. So let's say either we want some new stuff,
we want to have some monitoring, some reporting on that, and we write this monitoring or reporting to a new stream, or we have a new validation logic which we just want to try. We don't want to put it directly into production, but we want to write its results out for a different topic
to check whether this worked fine. When we had the database-centric application, we would have had to remember the processing state. So we have one state, and for every record in our sales table we would have to know: is it validated or not, has it been processed or not. So a second service for validation would be really hard,
because it would have to know: well, this has been validated by that service, but I have not validated it yet, so I would need to introduce a new field or so. But this is not the case here. Each service knows for itself how far it has processed, so that it can work independently,
which is a thing that might not sound that interesting, but as soon as you have tried it, it really is fun to work with that because it makes things really much easier, especially when you develop and you want to try out new things. So these are the basics for using the streaming.
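Here is the small validation sketch promised above, assuming the same confluent-kafka setup; the is_valid function, topic names and the set of known locations are illustrative placeholders for the real validation logic:

```python
import json
from confluent_kafka import Consumer, Producer

KNOWN_LOCATIONS = {"Rimini", "Bilbao", "Karlsruhe"}  # illustrative master data

def is_valid(sale):
    """Hypothetical check: known location and non-negative quantity."""
    return sale.get("location") in KNOWN_LOCATIONS and sale.get("quantity", -1) >= 0

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "sales-validator",
    "default.topic.config": {"auto.offset.reset": "earliest"},
})
consumer.subscribe(["sales-input"])
producer = Producer({"bootstrap.servers": "localhost:9092"})

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    sale = json.loads(msg.value())
    # Valid records go to one topic, invalid ones to another; downstream
    # services decide what to do with each.
    target = "sales-validated" if is_valid(sale) else "sales-error"
    producer.produce(target, key=msg.key(), value=json.dumps(sale))
    producer.flush()
```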
Now let's come to the challenges. I said we do machine learning, and machine learning, especially the training, is not a thing that you do in streaming. The results, the actual machine learning answers, well, that might be different, but for us, as we do forecasts for sales on a daily basis,
this is also something we do in batch. So how can we work with these batch-like processes within our streaming? Our machine learning application needs to get all the data for the training in one batch,
or maybe in several, as it works in a partitioned way, but still there's a lot of data in there. So how do we get that input data? And remember, you can't query that stream. You can't say: give me the current value for all keys. So somewhere we need to save that data. We have several options. We can just keep it in the memory of our service.
We can use a separate database, a serving database that doesn't need to be as powerful as the database of our monolith, so we could use a smaller database. Or we could use a blob store, which is also just cheaper than a database, and we can save the values in there, and it can be used by our machine learning application.
And yes, that is data duplication. It feels bad at first, especially if you come from a very normalized database schema, but that's a price we will have to pay. We get several advantages, and we then have to live with that data duplication. But what's the idea behind such a duplication?
How can you explain that? And what I found very helpful in there was to differentiate between a write path and a read path. So for reasoning about does it make sense or not, I think that's a very helpful distinction. And in our old way with a database, we had a relatively short write path.
So we put the validated data into the database, and then the data validation is finished. And when we start our machine learning, we are doing a machine learning query that needs to fetch all this data with a very big join statement and then feed it into the machine learning. So this is the read path.
And this machine learning query is something that really puts the database under a lot of pressure, and you have to size your database so that it can handle such a big query in a relatively short time frame. Now let's compare this to how we would do it with streaming. You see the write path now is longer.
Because we have the data validation, we write that to our topics. And as soon as we have new data on that topic, we can already do the joining. So we are joining the new sales data with the location data, with the product information, and maybe further data, and write that to a blob store.
And there it sits and waits until the machine learning starts. What we have lost is our normalized schema, because now we have the data duplicated. But what we have gained is a very big operational advantage, because that write path can be scaled much better.
As soon as the data arrives, we can write it there, and it sits there until the machine learning starts. When the machine learning starts, it doesn't have to do that big join that really puts the database under pressure; it can consume the data in the format that it needs. So we have duplication, but we have gained operational advantages.
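A hedged sketch of what such a write-path join could look like, with the master data held as small in-memory tables inside the processor and the joined records appended to a local file standing in for the blob store; all names and the master-data contents are illustrative:

```python
import json
from confluent_kafka import Consumer

# Illustrative master data kept inside the processor.
locations = {"Rimini": {"country": "IT"}, "Bilbao": {"country": "ES"}}
products = {"ravioli": {"category": "pasta"}, "pizza": {"category": "baked"}}

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "sales-joiner",
    "default.topic.config": {"auto.offset.reset": "earliest"},
})
consumer.subscribe(["sales-validated"])

# As validated sales arrive, enrich them and append them to a file that the
# machine learning batch job can later read in the format it needs.
with open("joined_sales.jsonl", "a") as out:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None or msg.error():
            continue
        sale = json.loads(msg.value())
        sale["location_info"] = locations.get(sale["location"], {})
        sale["product_info"] = products.get(sale["product"], {})
        out.write(json.dumps(sale) + "\n")
```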
How would such a thing look like? So as I said, we have this location data, we have the products data, and we have the sales data. We can treat that locations and that products, it's master data that's not such a high data volume, we can keep that in the service as a table,
and we join that additional information to the sales stream. And then that joined data we can append to a file, and this file then sits there and waits until the machine learning starts. This is a possibility to cope with the challenge that the machine learning is still batch. You might have noticed, when I explained that,
that I said, well, the sales and the location and the product data, they are kept in the processor. So that's a state that we have in a processor. And state, as you might know, is the nightmare of every distributed systems engineer, because it's hard to handle. What state do we need there?
I mean, for streaming, the data could just rush through. But for example, in our scenario, we need to know what is the master data so that we can join that information for that. So that's the data you want to join with. But there are also other things that might be more subtle. So when you have some time window processing,
you want to aggregate a stream that comes in and you want to know the sum every five minutes, then you need to know the data of the last five minutes so that you can sum that up correctly. And you need to know when to start a new aggregation. And there are different ways of doing that. You can have hopping time windows or sliding time windows.
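As a small illustration of the state such windowing needs, here is a plain-Python sketch of the simplest variant, a tumbling five-minute window (hopping and sliding windows need more bookkeeping, but the idea is the same); the timestamps are illustrative:

```python
from collections import defaultdict

WINDOW_SECONDS = 300  # five-minute tumbling windows

# Window start -> running sum of quantities. This dict *is* the state the
# processor has to keep, and somehow recover after a restart.
window_sums = defaultdict(int)

def on_sale(timestamp, quantity):
    """Assign the record to its five-minute window and update that sum."""
    window_start = int(timestamp) - int(timestamp) % WINDOW_SECONDS
    window_sums[window_start] += quantity
    return window_start, window_sums[window_start]

# Two sales 100 seconds apart fall into the same window, a third one does not.
print(on_sale(1_500_000_000, 3))  # (1500000000, 3)
print(on_sale(1_500_000_100, 2))  # (1500000000, 5)
print(on_sale(1_500_000_400, 4))  # (1500000300, 4)
```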
So that's just different possibilities that have different requirements to the state. Formerly, the database did this for you. You could ask them. You could ask for master data. You could ask for things in the last five minutes. Now you have to know that within your service. Well, it's fine, you might think. You just keep that in memory. You know what came in.
Everything's fine. But there are some challenges. Well, a processor might fail and then it needs to restart, and all the state that was in its memory is lost. So where does it get the data for restarting from? Or something less dramatic: as I said, one thing that we really want to use is scaling.
So at first we had a processor that took care of all locations. And now we want for each processor to have one location. So this also changes the state in there. Or let's say we had three processors for each location and now we want to merge that into one. So at least the location master data needs now also to be merged
to this one processor. How can we do that? Well, as I said, we just keep in memory and maybe we just keep all state in memory that we might possibly need in the future. That's one option, but it's not the best one.
You can just reprocess the stream from the beginning to warm that up, but this might take a long time. Or each processor could keep its own database instance and save the state in there, so that it can be used at restart and you just need to know which database to connect to, which might also be interesting.
You can save your condensed state in a stream, which is an interesting thing because you have that streaming platform already. Or you can just ask a different service that hopefully knows all the master data and you can tell them, please give me all things I need to know. Several frameworks already exist to cope with that problem.
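One naive sketch of the "condensed state in a stream" option, assuming confluent-kafka; topic names and the snapshot layout are illustrative, and a real solution would use a compacted topic and commit offsets together with the state, which is what the frameworks mentioned next take care of:

```python
import json
from confluent_kafka import Consumer, Producer

BROKER = "localhost:9092"
STATE_TOPIC = "validator-state"  # illustrative changelog/state topic

def save_state(state, processed_offset):
    """Publish a condensed snapshot of the in-memory state to a stream."""
    producer = Producer({"bootstrap.servers": BROKER})
    snapshot = {"offset": processed_offset, "state": state}
    producer.produce(STATE_TOPIC, key="snapshot", value=json.dumps(snapshot))
    producer.flush()

def restore_state():
    """On restart, replay the state topic and keep the latest snapshot."""
    consumer = Consumer({
        "bootstrap.servers": BROKER,
        "group.id": "state-restorer",
        "default.topic.config": {"auto.offset.reset": "earliest"},
    })
    consumer.subscribe([STATE_TOPIC])
    latest = None
    while True:
        msg = consumer.poll(timeout=2.0)
        if msg is None:
            break               # assume we have reached the end of the topic
        if not msg.error():
            latest = json.loads(msg.value())
    consumer.close()
    return latest               # {"offset": ..., "state": ...} or None
```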
And it's very interesting, if you start with your own naive approach, to use that as a basis for having a look at such frameworks, what they do and why they might do it. Then you can really learn from their experience, even if you are not using the same language.
For example, you can have a look at Kafka Streams or Apache Samza. And up to yesterday, my impression was that there is nothing we can use in Python, and, well, we have to think: can we employ some learnings from these other things? Do we have to write our own framework? Does it make sense or not?
We really were searching for an answer there, and I was very glad to be in a talk from Winton yesterday. They said they have now open-sourced Winton Kafka Streams, which is a Kafka Streams implementation in Python, and I'm really eager to have a look at that. They also said it's at an early stage
and there are still some topics they need to solve, but it's great to have a starting point there and to check whether we, whether you, can contribute to that and just grow this functionality in Python, which would be really great for building more applications on it. So you see here, that's the GitHub link,
wintoncode, Winton Kafka Streams. Well, and that's the end of my talk. Let's summarize what we have learned. You have more options for your data processing application than you might have thought, but you also know the trade-offs.
And I want to encourage you just to broaden your way of thinking about your application to see there are small things that you could use and you need to check whether you can live with the trade-offs or not. And you know the challenges, you know some possible solutions. So yeah, now go on and build some great applications on that. That's it from my side.
Okay, well, great. Thank you very much for the great talk. That's awesome. We have almost 10 minutes for questions.
Great timing. Any question? Any question for Christian on streaming? Otherwise, I do have a question. Well, I'll start with my question. Very interesting technology.
I've never used streams. How do you deal, or maybe the framework deals, with missing values? Say you have a network glitch somewhere, somebody trips over a network cable and you're missing, I don't know, two or three values in your stream,
or do you care? Well, what the framework guarantees to you is that within a stream partition, you get all the values. I mean, if it cannot guarantee it, you will get an error. So this is something you really can rely on. But what you have to deal with is some streams might operate
at different speeds than other ones. So for us, one challenge will be, and this was just too complex to bring it into that short presentation. Let's say the customer delivers you some location and some product master data, and he delivers you some sales data. And now the location data is delayed.
There's some issue on the node that processes that, so you are at an old offset for the location data, but at a new offset for the sales data. And now you would raise an error saying the sales record is invalid because that location doesn't exist. But this would be bad, because the customer would tell you,
well, I already sent you that, why do you give me an error? And this is something your application logic then has to cope with. We will do that as follows: we have one input, the data input, and it always assigns delivery IDs and processing timestamps when something is delivered. And now your validation logic needs to check
that you have also processed the recent master data records. So this would be the solution to cope with that issue. But for the stream partitions, you can always rely on getting all values in the correct order. Okay, great. Thank you. Any other question? We do have a few minutes,
seven minutes more or less. I can keep asking questions, folks. But if you have any question, and a question over there. Sorry, I didn't see you. Hi, thanks a lot for the great talk.
It's really nice to see somebody else doing the same thing that we do as well in our company. I have a question about recovering state in the case that your consumer crashes. Did you try some sort of seeking back in the stream to find the latest state that you could recover from?
Well, that's the thing. It's not seeking back in the stream, but you can always have a snapshot of your data, so that in some data file
you save, for example, what master data you have received. This is also used when we do some previews of that data, to do some data science on it. And within that storage, you also save what the offset in the stream was
that corresponds to that snapshot. So by that, you always know from which point on you have to reprocess your stream if you want to get the updates compared to such a blob storage. This would be the way to find the updates in the stream that are missing. Thanks, I like this approach.
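A sketch of that idea with the confluent-kafka client: the snapshot stores the offset it corresponds to, and on restart the consumer is assigned to the partition right after that offset instead of searching the stream by time; the file name, topic and snapshot layout are illustrative:

```python
import json
from confluent_kafka import Consumer, TopicPartition

# Illustrative snapshot written next to the blob-store data: it records both
# the master data we had and the stream offset that state corresponds to.
with open("master_data_snapshot.json") as f:
    snapshot = json.load(f)  # e.g. {"offset": 1234, "locations": {...}}

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "master-data-catchup",
})
# Resume reading the partition right after the stored offset.
consumer.assign([TopicPartition("location-master-data", 0, snapshot["offset"] + 1)])

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None:
        break                   # caught up with the missing updates
    if not msg.error():
        update = json.loads(msg.value())
        snapshot["locations"][update["location"]] = update

consumer.close()
```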
If I can just quickly share what we did: we implemented a time-based seek in the stream. Just like you, we use Avro, and each of our Avro messages contains a timestamp. So we implemented a sort of binary search in the stream
so we can return to a particular point in time. It works, but it's not very elegant, so I like your solution better. Yeah, I mean, it's hard with time-based things, as we always want to think in a distributed systems manner, and then you have different processes that might have slightly different clocks.
So what we want to do is we want to use that stream of the delivery IDs, which then is guaranteed to always be in the same order. And if you reference these newly created things, then you can ensure that you are really in the correct sequence. Cool, excellent. So any other question? I'm standing here so I can see you.
Okay, no other questions. Well, let me see. A question here.
I have a beginner question. How do you integrate the data? It's all in different databases, right? Excuse me, can you repeat the question? How do you integrate the data of those several databases? You have data paths and they are saving the data in several databases, right?
Or not? Yeah, I mean, for each stream processor, you have different possibilities to save that in there. And not everyone uses a database and not everyone needs to be queried.
So we want that each service does not need to know what technology the other services use. The communication really happens via the streams, and the databases are there more so that each service can be restarted, or so that other parts of the application can say,
well, I need to know all the locations that you have in there. But then they query that service for it. So it's not a question of database integration. All this integration should work via the streaming platform, and the services should not need to know about each other's databases. The only place
where we need to know about it is, well, as I said, when we want to do some exploratory data science on this, which is not connected to the streams but directly to the output data. But then the data scientist knows where the data lies. Within the application, our communication should be via the streams.
Okay, we have time for one more question.
One question. Do you have any particular strategy to handle the migration or updating of data that is already queued in the topics of your streaming infrastructure, if the format changed or anything like that
due to code changes or things like that? Let me see whether I understood the question correctly. I mean, if it's not just the schema evolution we have but if you say that we need additional fields
in the query for our machine learning. Is that the question? How to handle that? Yeah, the question is, I mean, I suppose there is a service that puts the data into your streaming queue. Okay. And the data is probably formatted in some way. If you change that format in any way,
I mean, how do you tackle that? For that you then really have to reprocess your data. If you have a different output format and what you have already written there does not help you,
then you need to reprocess that data. Actually I was looking for the next speaker, sorry. So we have no more time for questions, sorry. I'm looking for the next speaker. Please identify yourself and come to the podium and let's thank Christian again.
A great talk, great questions. Thank you all.