Data governance in streaming at scale
Berlin Buzzwords 2021, part 13 of 69
License: CC Attribution 3.0 Unported. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
DOI: 10.5446/67353
Transcript: English (auto-generated)
00:07
Welcome to this talk. As you know, my company is Letgo, where I'm a data engineer. I'm going to start with a small introduction to us as a company.
00:21
Letgo is a second-hand marketplace, mainly on the mobile channel, although we also have a web page. We started in 2015, and since then we have had phenomenal growth, so the amount of data we need to manage today is quite large.
00:43
Our top market now is Turkey, and we have offices both in Barcelona and in Istanbul, although we have opened recruiting to any place in both countries, Turkey and Spain. As I was saying, when the company started
01:02
in 2015, like many companies, it started as a monolith. However, if growth means the same for you as for me, it means trouble. Pretty soon, we started migrating to an ecosystem of microservices. And as you know, an ecosystem of microservices
01:22
means that you also have a society of teams that need to collaborate with each other. And the collaboration between teams, in terms of data, was done by sending and receiving
01:40
many, many events. At the beginning, it was more or less a free-for-all: everyone could subscribe to anyone else's events, and permissions were very coarse-grained, so either you were able to read everything that was produced or nothing at all. But later on, as we were growing,
02:04
we implemented a data platform and a data bus to manage that number of event types, which is now more than 1,000. And that meant having security controls, for example; otherwise, it's impossible to comply with regulations
02:20
like CCPA or GDPR. We are going to talk about some of the principles of how we made this possible. I want to go back to the title of the talk, that is, data governance in streaming at scale. Let's start with data governance. Data governance means that there are some guarantees
02:43
that we can make. The very minimum guarantee is a minimum of quality, and one way to get that is to have some kind of format or representation where you know that, for example, some fields are present, some other fields are optional,
03:00
and they are always of the same type at a given time, so you can rely on them. But that's the very minimum. On top of that, it's very nice to be able to catalog your data, so you know which events exist and what each field means. And also, especially, where the private information
03:21
of the user is among all those fields. Because, as you can see in the word cloud on the right, there are a lot of things around the world of data that need to be taken care of. Some of the words, like CCPA or GDPR,
03:41
are privacy regulations. But even if we didn't have those privacy regulations, we would need to pay attention to these kinds of things. Because if you don't manage the data, the path of least resistance is to have access to everything from everywhere.
04:02
And that means risk, so it's not enough to comply with the regulation: you want to be able to know where the private information is and who needs to use it, so you can follow the principle of least privilege in your platform. Otherwise, it's very difficult to manage.
04:22
This kind of philosophy also allows you to implement things like data minimization. It's a concept that comes from German law and predates the GDPR. There are many German speakers here, so rather than try to pronounce the long German word,
04:43
I will just use its translation: data minimization. That's a very nice principle that says that you should not design any business function or application using more data,
05:01
or more private data, than the very minimum needed for the function you want to provide. Otherwise, it's a risk you should not take. Now let's focus on the streaming part. If you do this governance,
05:21
meaning checking the properties of your data, data integrity, data quality, and also the part of governance related to access control and privacy, and you do it in streaming, it's like following this path towards the light.
05:43
But you need to do some work; you need to implement it. The typical failure mode, if you deviate to the left and don't do it in streaming, is to have that governance done in batch.
06:00
So you can do your data integration overnight, and you can have your reports joining data from very distant parts of your company. But that creates silos in real time, because you can only join the data one hour or one day later.
06:22
In real time, you cannot use, in one service, the useful data that another service is able to emit as a real-time event. The other failure mode, when we deviate to the right, is when you don't have any access control, and every microservice can access the events
06:43
generated by any other microservice. This is the "access all the things" reckless behavior. So let's try to find a way to go straight towards the light at the end of the path. And let's also focus on the scale.
07:03
By scale, I mean this kind of thing: we can scale in the number of events and in the amount of traffic we have. That kind of scaling means that you need some platform to which you can add more resources
07:21
to handle more load, while the cost of your platform per user or per event doesn't grow. Because otherwise, when you scale in load, you are not viable as a business.
07:42
We have solved that by putting Kafka, which is highly scalable, at the center of our platform. But the other dimension is more challenging, and that dimension is the organization. You can grow 10x in traffic, but you can also grow 10x in the number of teams
08:03
and in the number of different kinds of events and different kinds of private information to manage. And the relationships between the teams are not going to grow linearly; they are going to grow quadratically. If you don't want to end up with a pile of ifs
08:21
for all the special cases, you need some kind of separation between the data platform and the teams, and some way to manage that without special-casing everything. And we will see how at Letgo we have found a way to scale linearly also in the number of teams.
08:45
I can reveal in advance that at the center of the solution we have schemas. And schemas can be used in different ways, with different effects on the flexibility, rigidity, and freedom of the teams.
09:01
So we can draw two axes. On one axis, we go from centralism to anarchy. In the extreme of anarchy, the teams can do whatever they want with the data; they don't need to talk to anyone, and they can go very fast. In the centralist extreme,
09:22
all decisions are taken in a central place, and you can imagine what that is like. On the other axis, we go from no guarantees to strong guarantees. One of the typical things you can do is to say that your events are just JSON. You don't have any schema, and the only guarantee that you give to the other teams
09:43
is that their parsing function is not going to throw an exception. That's very flexible, but it's going to make exploiting the data very difficult, and managing the privacy extremely difficult.
10:00
In this corner, we have the distributed monolith. This is when you don't really have microservices, because you need to coordinate the deployments: you need to have exactly the same version of the schema, or maybe of the library used to produce and consume the data, everywhere. If, in an architecture,
10:23
you need to deploy two services at the same time, they are not independent. It's in fact a monolith that is broken down across several machines, but it's still a monolith. I suppose you were now expecting some technology up here in this corner:
10:42
the unicorn technology that gives you infinite flexibility and strong guarantees. Unfortunately, we don't have that. At Letgo, what we use is Avro, because we can evolve schemas with it. Let's take a look at Avro. Avro is a serialization technology.
11:02
That means you can convert your messages into a wire format that you can use to send the message or to store it. And it is schema-driven: you cannot use Avro without having a schema. That's good, because we want to have
11:22
some description of the data we are going to send. It supports JSON and binary representations. The binary representation is standard, very compact, and also very fast, because you don't need to be scanning the text
11:42
to see where each field starts and ends. The main downside of using a binary serialization format is that you need tooling to inspect the message. As I was saying, you can use Avro with either encoding,
12:04
and it's pretty much a trade-off between readability, compatibility, and so on. At Letgo, we have support for both encodings, and it's a matter of what you want to use for your use case.
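A minimal sketch of that trade-off with the Apache Avro library from Scala (not from the talk; the schema and values are invented for illustration):

```scala
// The same record through Avro's JSON and binary encoders.
import java.io.ByteArrayOutputStream
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumWriter, GenericRecord, GenericRecordBuilder}
import org.apache.avro.io.EncoderFactory

object EncodingDemo extends App {
  val schema = new Schema.Parser().parse(
    """{"type":"record","name":"Ping","fields":[{"name":"user_id","type":"string"}]}""")
  val record = new GenericRecordBuilder(schema).set("user_id", "u-123").build()
  val writer = new GenericDatumWriter[GenericRecord](schema)

  def encode(asJson: Boolean): Array[Byte] = {
    val out = new ByteArrayOutputStream()
    val enc =
      if (asJson) EncoderFactory.get().jsonEncoder(schema, out)
      else EncoderFactory.get().binaryEncoder(out, null)
    writer.write(record, enc)
    enc.flush()
    out.toByteArray
  }

  println(new String(encode(asJson = true), "UTF-8")) // {"user_id":"u-123"}
  println(encode(asJson = false).length)              // a few bytes, no field names
}
```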
12:23
One of the most important points for me is that it's polyglot: you are not tied to a single programming language. In fact, at Letgo we have two main programming languages, Scala and PHP, and we can use those schemas from both. The most important thing, in this case for the flexibility,
12:43
is that schemas can and do evolve in Avro. There are some rules; not every change is compatible. If you want to keep backward or forward compatibility, there are some checks that the Avro library can do for you.
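A sketch of such a check (not Letgo's code; event name and fields are made up), where v2 adds an optional field with a default, which is a compatible change:

```scala
import org.apache.avro.{Schema, SchemaCompatibility}

object CompatibilityCheck extends App {
  val v1 = new Schema.Parser().parse(
    """{"type":"record","name":"ChatMessageSent","fields":[
      |  {"name":"sender_id","type":"string"},
      |  {"name":"body","type":"string"}
      |]}""".stripMargin)

  val v2 = new Schema.Parser().parse(
    """{"type":"record","name":"ChatMessageSent","fields":[
      |  {"name":"sender_id","type":"string"},
      |  {"name":"body","type":"string"},
      |  {"name":"client","type":["null","string"],"default":null}
      |]}""".stripMargin)

  // Can a consumer compiled against v1 read events produced with v2?
  val result = SchemaCompatibility.checkReaderWriterCompatibility(v1, v2)
  println(result.getType) // COMPATIBLE
}
```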
13:04
If you follow those rules, that gives you the flexibility, as a producer, to start producing a new version of the event, maybe with some additional field, without needing to restart or redeploy the consumers. And of course, if the change is too extreme,
13:23
you will need to make an incompatible change or, better, a new version of the schema, produced maybe in a separate topic, so that the consumers, once they need the new data, can switch to the new topic
13:42
after they change their code. In practice, the setup we have is that when a team wants to create a new event type
14:02
or wants to modify an existing schema, there is a central repository in which you can open a pull request, and that repository is connected to Jenkins. We have a CI/CD pipeline in Jenkins that makes many checks. Some of the checks are related to compatibility
14:22
with the previous version of the schema, if you are modifying one, and that's done with the collaboration of the schema registry, a piece of the Kafka ecosystem that is able to manage those schemas. Apart from that, we have many checks for conventions that we use as well.
14:41
We detect some bad practices and provide warnings, and after all those checks are green and someone has reviewed the pull request (most of the time, this is done within the same day),
15:03
we can approve it and publish the schema. In production, this schema registry is used with Kafka, and we will have producer and consumer processes, because in Kafka the brokers are not aware of the content
15:24
that you have in the messages. There is something the schema registry does at runtime: when you want to produce, the client library is going to contact the schema registry and is going to say,
15:41
"Hello, Mr. Registry, I want to produce with this schema." The schema registry is going to give the producer the ID of the schema, so you don't need to send the whole schema with each message; you send the ID and the binary data. The consumers are going to read those events, and remember that when you build your consumer,
16:04
maybe you compiled against some version of the schema. If at some point you receive a new version of the schema, the client library takes the ID, retrieves the schema, and does the evolution in place,
16:23
and that means that, with your version of the schema, maybe there is a new field you don't know about, so it is ignored, or a field was removed, and then you get the default value
16:41
of the field automatically. This is the setup that gives you flexibility while keeping some contract.
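A sketch of the producer side of this flow, using Confluent's Avro serializer (a real client class; the broker and registry addresses, topic, and schema below are placeholders, not Letgo's setup):

```scala
import java.util.Properties
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecordBuilder
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object ProduceWithRegistry extends App {
  val schema = new Schema.Parser().parse(
    """{"type":"record","name":"ChatMessageSent","fields":[
      |  {"name":"sender_id","type":"string"},
      |  {"name":"body","type":"string"}
      |]}""".stripMargin)

  val props = new Properties()
  props.put("bootstrap.servers", "localhost:9092")
  props.put("key.serializer",
    "org.apache.kafka.common.serialization.StringSerializer")
  // Registers/looks up the schema and prepends its ID to each payload,
  // so the whole schema never travels with the message.
  props.put("value.serializer",
    "io.confluent.kafka.serializers.KafkaAvroSerializer")
  props.put("schema.registry.url", "http://localhost:8081")

  val event = new GenericRecordBuilder(schema)
    .set("sender_id", "u-123").set("body", "hello").build()

  val producer = new KafkaProducer[String, AnyRef](props)
  producer.send(new ProducerRecord[String, AnyRef]("chat.message.sent", "u-123", event))
  producer.close()
}
```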
17:01
Then we have the other part of the puzzle: we also want to manage the private information, to know where the private information is. And this is domain knowledge, because if you are a data engineer like me, and there are like 1,000 event types, sometimes it's not obvious
17:20
where the email is in all those 1,000 events. Maybe the field is not called exactly "email", or maybe there is a user ID, which is also personal data, but it's not called "user ID"; it's called "sender ID", or "receiver ID", or maybe just "sender".
17:41
The ones that have that knowledge are the teams that produce and curate that information, and we also want to avoid writing a pile of ifs, so the solution is to use tagging, and that tagging can be added to the schema. If you think about it,
18:02
the teams are already writing the schemas, so it's very sensible to also ask the teams to add that tagging information while they write or modify the schema. And fortunately, Avro can be extended with arbitrary metadata.
18:23
In our case, the property we have used is called letgo properties. There are some places in a type definition where you can plug in these additional properties, and the official Apache Avro library
18:41
is going to parse the schema and give you that as a dictionary, so you can process it further. The simplest version of this is what you have on the left; it is just a tag.
19:01
On the right, we have a more complex version that allows you to link the private information to some other field. That's used because sometimes you have several user IDs: imagine you have a domain event with a chat message,
19:20
and there is the ID of the sender and the ID of the receiver, and you have the IP of the user sending the message. You know that that IP is related to the sender ID, but if you don't add that to the tags, we cannot do automatic processing of the event for the data governance.
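A sketch of what such tagging can look like; the property name "letgo_properties" and its shape are guesses, since the talk does not show the exact format, but Avro's parser does preserve unknown properties per field:

```scala
import org.apache.avro.Schema
import scala.jdk.CollectionConverters._

object PiiTags extends App {
  val schema = new Schema.Parser().parse(
    """{"type":"record","name":"ChatMessageSent","fields":[
      |  {"name":"sender_id","type":"string",
      |   "letgo_properties":{"pii":"user_id"}},
      |  {"name":"sender_ip","type":"string",
      |   "letgo_properties":{"pii":"ip","linked_to":"sender_id"}},
      |  {"name":"body","type":"string"}
      |]}""".stripMargin)

  // Walk the fields and collect the tags for automatic processing.
  for (field <- schema.getFields.asScala) {
    val tags = field.getObjectProp("letgo_properties")
    if (tags != null) println(s"${field.name}: $tags")
  }
}
```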
19:40
And of course, the Apache Avro library doesn't interpret this metadata; we need to do it. We have a library that is called wino, and it's called wino because it's used for telling apart
20:03
the grain from the chaff, like in the icon of the library. This library parses those pieces of metadata and is also able to take ACLs from a database that we call
20:20
the DPO database, which is the place where we centralize which pieces of private information each team has access to for processing. The data protection officer is the figure in the company
20:41
that rubber-stamps that a given team is able to access it. For example, imagine you have a team that is in charge of the newsletter of the company. That team will have some ACL giving them access to the email,
21:01
but maybe they don't have access to the IP of the user, because they don't need it. The kinds of actions that this wino library is able to do at the field level are, for example, removing fields or hashing fields. Hashing fields is useful because you can have a team that is interested in the email but only wants to count unique emails: with access to the hash of the email, they can do that with less risk than having the actual data.
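A minimal sketch of those two field-level actions (mine, not the wino library; the action names and event shape are invented):

```scala
import java.security.MessageDigest

object FieldActions extends App {
  sealed trait Action
  case object Remove extends Action
  case object Hash extends Action

  def sha256(s: String): String =
    MessageDigest.getInstance("SHA-256")
      .digest(s.getBytes("UTF-8"))
      .map("%02x".format(_)).mkString

  // Apply per-field decisions derived from the tags and the team's ACLs.
  def applyAcl(event: Map[String, String],
               actions: Map[String, Action]): Map[String, String] =
    event.flatMap { case (field, value) =>
      actions.get(field) match {
        case Some(Remove) => None                         // no permission at all
        case Some(Hash)   => Some(field -> sha256(value)) // count-only access
        case None         => Some(field -> value)         // full access
      }
    }

  // The newsletter team: may count unique emails, may not see IPs.
  println(applyAcl(
    Map("email" -> "ada@example.com", "ip" -> "10.0.0.7", "body" -> "hi"),
    Map("email" -> Hash, "ip" -> Remove)))
}
```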
21:21
Well, once we have this, we need to run it. And this is the life cycle of the events in real time.
21:49
The services of each team, or squad as we call them, work with a set of Kafka topics that are isolated from the rest of the teams,
22:02
because they don't have permission to access the other ones. That is tier one. Tier zero is the place where all the private information that should be governed is collected, and you can see that the topics with private information are in red.
22:24
Then we apply the gatekeeper. This process checks that the events have good data quality: they must match the schema, otherwise they are going to be rejected. We have alarms and we have some dashboards.
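A sketch of a gatekeeper-style check (invented here, not Letgo's implementation): decode the payload with the expected schema, and route failures to a rejected topic with alarms; the topic routing itself is omitted:

```scala
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory
import scala.util.{Failure, Success, Try}

object Gatekeeper {
  def validate(payload: Array[Byte], schema: Schema): Either[String, GenericRecord] = {
    val reader = new GenericDatumReader[GenericRecord](schema)
    val decoder = DecoderFactory.get().binaryDecoder(payload, null)
    Try(reader.read(null, decoder)) match {
      case Success(record) => Right(record) // forward to the validated topic
      case Failure(error)  => Left(s"rejected: ${error.getMessage}")
    }
  }
}
```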
22:44
The result is validated topics, with both a JSON version and an Avro version, so you can choose whichever. Then the data pump is the piece that uses wino to do the governance in real time, sending the events to the green topics, so the teams can access private information
23:02
from other teams, but only if they have permission. And with that, we have completed the whole cycle. I want to emphasize the four properties of this approach. First, it is self-service, because the teams configure the data and that drives the behavior of the platform.
23:23
Second, it's metadata-driven, because the schemas and ACLs drive everything. Third, it's focused on availability. And finally, there is privacy safety: if you start a new team and you don't get any additional permissions, you cannot access private information from any other team.
23:41
It is safe by default. And that's the solution we have implemented. Now, I think we'll have a couple of minutes for questions, maybe. Yeah, thanks a lot, Sebastián, for this awesome talk. I think there were some really good best practices,
24:02
and it was really, really interesting to see how you tackle personal private information and GDPR requirements at Letgo. Let's check if there are some questions from the audience. So, I have a question.
24:21
Is there anything that you would wish for from Apache Avro? For instance, is there a feature that you think would make your life easier, or do you think it's already good for what it's doing?
24:41
We should distinguish between the Apache Avro standard and the implementation. I think that the standard is nice as it is. But in the implementation, the data classes are mutable
25:00
and use inheritance. And because our code base for manipulating this is written in Scala, I would like to have a native implementation in Scala in which you have case classes, they are immutable, and it's easier to transform the values.
25:22
Because I was writing some wrapping code to work around this. We have two implementations of wino. One is for streaming; it goes with Spark,
25:42
and we need to transform that metadata into a Spark operation, and that's a little bit tricky. The other works with normal data objects, and that's a little bit easier. Imagine doing recursion over objects like that.
26:04
So that would be an improvement.