Building an open source data lake at scale in the cloud
Formal Metadata
Title: Building an open source data lake at scale in the cloud
Title of Series: FOSDEM 2020
Part: 330 of 490
Author: Adrian Woodhead (Expedia Group)
License: CC Attribution 2.0 Belgium: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/47473 (DOI)
Language: English
Transcript: English (auto-generated)
00:06
Okay, we'll get started with the next talk. Welcoming Adrian, who will be talking about open source data lake at scale in the cloud. Take it away. Hello, good morning, bonjour, my name is Adrian Woodhead.
00:22
I work for Expedia Group, as you can see. I'm a principal engineer working on the big data platform and also all things open source. So yes, I'm gonna be talking about building an open source data lake at scale in the cloud. So a lot of the components we use in our data lake are open source. And one of the things we feel very passionately about
00:43
is that a lot of companies just take, take, take and build stuff on top of open source and don't give a lot back. So we've tried very, very hard wherever we found gaps in the open source ecosystem to build tools that fill those gaps, and then we've open sourced them. So I'm gonna be talking about a few of those today. So hopefully those can, you know,
01:00
other people can use them, improve them, and we're also making sure the whole ecosystem is sustainable. Why is it not moving forward? There we go.
01:23
So the agenda for the talk today, I'll just give a little bit of background about Expedia Group, how we're structured, because that has some sort of impact on how we built our data lake. We'll talk about what we consider the foundation for that, how we store our data and the metadata. We'll talk about some options for high availability, disaster recovery, redundancy, those kinds of things.
01:43
We'll look at some options if you operate at a certain scale, it can be useful to federate access to your data. And then we'll look at how we enable event-based data processing, and we're gonna look at a concrete use case that we have for this. So before everyone runs out of the room screaming in terror, I promise you,
02:01
this isn't a marketing slide. This is actually important for how we have had to structure our data lake. So Expedia Group, we consist of a number of different companies that we've either, you know, developed or bought, acquired over the past 20 years or so. So we have, you know, online travel agencies, we have flights, we have hotel bookings,
02:21
vacation rentals, car rentals, all kinds of different companies. And these all operate at a different scale, they have different requirements, they all generate a lot of data, and some of them have data sets that they're just interested in for their own usage. But then we also, as a business, we like to be able to get a holistic view
02:41
across all of this. Another challenge is, as we've acquired them over the years, some are more integrated into our platform than others, some of their own technology platforms, so getting one single view onto all of this is actually quite complicated. And then to sort of make it even more complicated, the scale at which we operate across all of this is really, really huge. So we have literally billions of events coming in
03:02
every single day from all of these different companies via streams, batch processes, massive file dumps, you name it. And then we have thousands, possibly even tens of thousands of data processing jobs, ad hoc queries, reports running, joining all of this data together, producing even more data sets, and so on.
03:23
Also our data platform today, we didn't just build it from scratch in a vacuum in the last few years, we've got like 20 years of legacy in some cases that we need to deal with. And initially our data warehouse started off as what you would now probably call a traditional data warehouse. So this was a lot of data stored in relational databases.
03:44
Most of the data processing, data querying was done in SQL. And then about 10 years ago, there was the rise of Hadoop, big data, all of this stuff. So we built an on-premise Hadoop cluster, six to seven hundred nodes, distributed file system. And then we put Hive on top of that
04:01
as a metadata platform. And this was also very useful because Hive provides metadata services, but because of our legacy with SQL, Hive also has a SQL query engine on top of big data. So a lot of our data processing that was written in SQL could just be moved over more or less as is
04:20
and be run via Hive. What we found is that it was very hard for us running all of this on-premise on our own data center, meeting peak demand became quite a challenge. Obviously we had to send people into rack and stack machines and that becomes quite expensive. And then the whole upgrade path for the software and the hardware was very, very painful and quite expensive.
04:42
So around that time, the cloud vendors all came along and they started offering big data capabilities. So we then started migrating our primary data lake into the cloud. So in our case, we use Amazon Web Services: Amazon S3, EMR, all of this functionality. That's our primary cloud provider.
05:02
So that's what I'm gonna be, the terminology I'll be using in my talk today. But most of the concepts will apply the same if you're using Google, Azure or something else. So at very basic core foundation, what does a cloud data lake mean to us? So we obviously have a lot of data.
05:21
So we store that in a distributed file system. So in Amazon's case, that's S3. Wherever possible, we used efficient compressed binary formats like Avro, Parquet, ORC, et cetera. And then we need to store some kind of metadata about the data. So that's where we use Hive's Metastore service.
05:41
So we store the schemas in there, so all the fields, the types, et cetera. And then what's also quite important is we generally don't allow our users to go directly to the file system to access the data. And there are a number of reasons for that which I'll touch on later. But a lot of it is around the eventual consistency nature of cloud file systems because they're not really POSIX file systems,
06:02
they're actually object stores. So often you can't tell when data is complete. There are no atomic move operations and things like that. So we use the Hive Metastore as a way to register when data is available. So all our users, we direct them to the Metastore, they find the data set they're looking for, they get the data locations,
06:20
and then they can access the data on S3. The Hive Metastore has a backing relational database, but that's mainly an implementation detail you don't need to worry about too much. And then the smiley face represents all our users, data processing jobs, and so on. They don't always smile, but let's just pretend they do.
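As a rough sketch of that registration pattern in HiveQL (the database, table, and bucket names here are invented for illustration), the schema and location live in the metastore, and a partition only gets added once its files are completely written, which is how consumers know the data is ready:

```sql
-- Illustrative only: schema lives in the Hive Metastore, data lives on S3.
CREATE EXTERNAL TABLE IF NOT EXISTS bookings.hotel_bookings (
  booking_id   BIGINT,
  hotel_id     BIGINT,
  total_amount DECIMAL(10,2)
)
PARTITIONED BY (booking_date STRING)
STORED AS PARQUET
LOCATION 's3://example-data-lake/bookings/hotel_bookings/';

-- The producer only "advertises" a day's data once the files are fully
-- written; until this runs, consumers simply don't see the partition.
ALTER TABLE bookings.hotel_bookings ADD IF NOT EXISTS
  PARTITION (booking_date = '2020-02-01')
  LOCATION 's3://example-data-lake/bookings/hotel_bookings/booking_date=2020-02-01/';

-- Consumers go via the metastore rather than listing S3 paths themselves.
SELECT COUNT(*) FROM bookings.hotel_bookings WHERE booking_date = '2020-02-01';
```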
06:42
To make this setup highly available, it's actually fairly straightforward. You put a load balancer in front of the Hive Metastore service, you set up some kind of auto-scaling strategy so the nodes scale out as demand changes. The backing database, you can use something like Amazon's RDS, so it handles that for you. And then what we've found is the cloud providers'
07:02
distributed file systems generally scale very, very well, so you don't need to do much yourself. So one thing again, why we've gone with such a simple setup here, we could have chosen to use some very, very fancy, specific data warehousing technology or data lake technology. But coming back to that first slide where we've got so many different users
07:21
with so many different technology platforms, we've kind of gone for a lowest common denominator approach. So generally what we've found is if you have the data in the file system, the metadata in Hive, most upstream data warehousing technologies, things like Spark, Flink, legacy technologies
07:43
like MapReduce, Cascading, and then all kinds of other tooling like Tableau, et cetera, they can all interact with this using JDBC and ODBC. So we enable a really, really wide variety of use cases above all of this, and now we've made it highly available, so that's great. But being highly available in one geographic region
08:01
or data center is a bit risky. So you obviously want to have that redundant, and ideally you want to be able to run your entire setup in multiple regions. So in order to do that, again, you can just spin up your entire setup in another region, and then you can decide whether you want to run this in active-active mode or just have a bare minimum in another region and just fail over to it
08:21
if and when a disaster happens. So the key thing from a data lake point of view, if you start operating in another region, is you have to have your data and your metadata available in the other region. So one thing you could do is you could have all your data producers set up to produce data to both regions, but now all of them have this burden on themselves
08:41
where they have to be able to synchronize, and what if the write to one region fails and another one succeeds? So we decided to not put that burden on all our data producers and instead make that a core feature of the platform. So our platform is responsible for replicating data into another region, and the key thing is that we need to do this in a very coordinated fashion
09:02
because you want to make sure that if users are querying data in different regions, they have a very consistent view of the data, and it's always as correct as possible. And what that really means is when you're replicating data or metadata into another region, you want to make sure that users in the other region can't get partial reads of the data whilst it's in transit.
09:22
So what we generally do is we only advertise data in the Hive Metastore when it's been completely replicated over, and this means there's latency, getting the data into the other region, but we generally feel we'd rather have late, complete data than incomplete or incorrect data. So to do this, we've built a program
09:42
that we call Circus Train, and the name comes from the idea of taking the Hadoop elephant and all of these other crazy circus animals in the Hadoop ecosystem, putting them on a train, and moving them from one place to another. And one of the interesting things we found when we built it and why we built it is there were no replication tools out there
10:00
that would only advertise the availability of the data in the Metastore after it had been copied. So there was this possibility of getting these partial reads. So that's one of the main core features of Circus Train. It supports various different distributed file systems, so HDFS, S3, Google Cloud Storage, et cetera.
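A minimal sketch of that ordering, not Circus Train's actual code (bucket and table names are invented): the file copy into the replica region runs first, and only afterwards is the replica metastore updated, so readers there never see a half-copied partition:

```sql
-- Step 1 happens outside SQL: the partition's files are copied to the
-- replica region (DistCp or an S3-optimised copier) into their target path.

-- Step 2: only once that copy has fully succeeded is the partition
-- advertised in the replica region's metastore. Until then, queries in
-- that region simply don't see the new data, so no partial reads.
ALTER TABLE bookings.hotel_bookings ADD IF NOT EXISTS
  PARTITION (booking_date = '2020-02-01')
  LOCATION 's3://example-replica-us-east-1/bookings/hotel_bookings/booking_date=2020-02-01/';
```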
10:21
And by default, it uses Hadoop's sort of famous standard DistCp mechanism for copying massive data sets at scale. We've also written some optimized copiers for S3 that take advantage of some of the aspects of that. And we've architected it in such a way that it's got a plugin architecture, so you can write your own custom versions
10:41
of various aspects of it. And then we also, there's some quite advanced features in it that can analyze the data on both sides and then make sure that it only replicates the bare minimum, which is quite important when you're possibly replicating terabytes of data every day. You want to keep that to the minimum. Initially, we built Circus Train with the idea
11:00
it would be run on like a time-based schedule, once a day, once every four hours, et cetera, which works well for certain use cases, but we found if you really want to scale this out and minimize latency between data being available and replicated, you want to be able to have an event trigger for your replications. So we built a layer on top of it called Shunting Yard, which basically monitors the Hive metastore for changes.
11:23
When data gets changed, it then triggers replications automatically in the background. So what we then found is sort of an unexpected side effect of this big cloud migration that we did across our entire company, is that these different business units, different parts of the company started moving to the cloud at different speeds,
11:42
which was kind of good for them. But then what happened is they started building their own data lakes in their own Amazon accounts, sometimes even in different regions. So we basically had this situation now where we've got data silos. So when we were on-premise and we had one Hadoop cluster, one Hive metastore, most of the data was in one place,
12:03
and people could very easily do joins and queries across all of this data. But now, in this case here, I'm just showing three of the business units. We've basically got three data lakes, and Hive was never designed to be able to federate queries in this fashion. So we'd built these data silos, and this wasn't gonna be acceptable to our end users.
12:23
So we thought a little bit about how we're going to tackle this, how we're gonna break down these silos. So one option would be that we would tell all of these business units, you need to move everything back into one big central data lake in the cloud, which is probably, if you're operating on a smaller scale, that's probably actually quite a good approach to take.
12:41
It's quite a bit simpler. But we thought we were gonna have some scalability issues. There's certain limitations to how much you can do in one or two Amazon accounts. And we were also a little bit concerned about the blast radius of having the entire company, all these different users operating in one Amazon account. If something went wrong, the blast radius of that could take down the entire company,
13:01
which wouldn't be good. So instead, another option we thought about, we have a great replication tool. We could possibly look at all the shared data sets and then replicate them into each of these data lakes, so everyone has their own copy of the data. And if you only have a few data sets that you share, that is actually also probably quite a good approach.
13:20
But we had thousands, possibly tens of thousands. And this idea of setting up and maintaining thousands of replication jobs, all the transfer costs, increased storage costs, we decided that was a no-go for us too. So instead, we looked at how we could federate the Hive metastore. So we built an open source tool that we call Waggle Dance.
13:42
And the name comes from the dance that bees do when they want to indicate to other bees where to find pollen or food sources. And what this is: the Hive metastore has a Thrift API that it makes available for people who want to get hold of that metadata. So we basically built a proxy
14:00
that exposes the exact same API. And what you do then is you configure that proxy with different downstream Hive metastores. And then what people can do, they query the proxy, the proxy then goes out to all the federated Hive metastores, gathers all the results, aggregates them back and presents it back to the user as if it was a single metastore.
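As a hedged sketch of what this looks like from the client's side (the database prefixes and table names are invented, and how federated metastores are exposed depends on your Waggle Dance configuration), a query can join tables that actually live in different metastores:

```sql
-- Assumes the client's metastore URI points at the Waggle Dance proxy
-- instead of a single local metastore.

-- Databases from the federated metastores appear alongside the primary
-- ones (here with invented prefixes), so a cross-business-unit join is
-- just a normal Hive query:
SELECT b.hotel_id,
       COUNT(*)            AS bookings,
       AVG(r.review_score) AS avg_review_score
FROM   hcom_bookings.hotel_bookings  b   -- lives in Hotels.com's metastore
JOIN   expedia_reviews.hotel_reviews r   -- lives in Expedia's metastore
  ON   b.hotel_id = r.hotel_id
WHERE  b.booking_date = '2020-02-01'
GROUP BY b.hotel_id;
```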
14:22
So that's good for the metadata access. When it comes to the actual data on S3, you need to set up corresponding access permissions, which is generally fairly straightforward. And then as an end user, what you do in your client application, there's a URL, a configuration setting that you can change. Instead of pointing to your local metastore,
14:41
you point to Waggle Dance and then you get this federated functionality. So kind of what this looks like in practice, if you set up a Waggle Dance service: in this example, we've got what's considered a primary metastore, which it has write access to. We have two external metastores that are set up in read-only mode. And so if a user query comes in here,
15:01
it can basically see and do joins across all three metastores as if they were just one. So what this actually looks like in practice for us, this is just an example here showing, you know, three of our business units: Hotels.com, Vrbo, and Expedia. And they're operating here in three separate Amazon accounts,
15:20
so that's the vertical dotted lines. And we're running in two different geographic regions, US West 2 and US East 1 in this example. And so what we do then is within a region, all the data is, you know, pretty much co-located, latency and data transfer costs are low, so we can federate data access across them, we replicate data into another region,
15:41
and in that region, we federate again. So some of the best practices we've learned from operating all of this for the past few years is generally wherever possible, we expose read only endpoints to the end users. You don't want, you know, an ad hoc query writing into some unexpected place. And similarly, wherever there's critical path infrastructure,
16:01
so ETL or streaming jobs which need to operate, you really, really need to have them running with 100% reliability; again, separate all of that infrastructure from people doing queries. And then yes, whenever you're gonna federate data access within a region, you federate, and then if you're going to have a need
16:21
for the data in another region, you basically just set up one job to replicate the data into that other region, and then you federate again. What we always want to avoid is federating across a region boundary, because then you're transferring data across a region which is slow and expensive. There are other alternatives out there if you need this kind of federated query mechanism.
16:43
So there's the Presto project. That's also a distributed SQL query engine for big data. It can federate Hive, and it can also do MySQL, Postgres and many others. The big problem with it is there's been a huge disagreement in the community and a fork. There are two versions of Presto with the same name and different foundations behind them,
17:02
so good luck picking the winner. And as you can imagine, setting up and maintaining and configuring all of this, it's quite a bit of configuration that's needed. So we built an umbrella project called Apiary where we aim to put all of this stuff, componentize as much of it as possible, and then you can pick and choose the bits that you want.
17:21
So we have Docker images for all the services. We have Terraform deployment scripts for being able to set up all of the networking, load balancing, infrastructure, and so on. We have a range of authorization and various optional extensions. So I'm gonna give an example of one of these optional extensions. So it's a metadata event framework that we built.
17:42
And one of the reasons we did that is that we found when you're operating thousands of data processing jobs at scale, doing this using some kind of time-based mechanism using cron and relying on when data should be there or not is very, very painful. So instead, we prefer to be able to do everything based on metadata events.
18:02
So what we do, what this framework does is it emits events whenever there's a change to the data, they go into Kafka or SNS, and then downstream people can subscribe to those. So one of the big problems we have is rewriting data at scale. So we generally partition our data
18:21
by some kind of date, like when the data arrived or when the hotel booking was made. Ideally, those partitions would be immutable, but that's very rarely the case. You've got late arriving data, sometimes you have updates. And one of the big problems when you have massive data sets, massive partitions, is even just doing an update takes a lot of time. So what we do to ensure we have read isolation
18:42
for queries that are running across those partitions, so when updates come in, we basically write an entire new version of the partition, and then when that write is finished, we repoint the Hive metastore over to the new location. But then you have the problem of what to do with the old versions of those partitions. And because it's such a big distributed system, we don't actually know when people have finished reading that data,
19:01
so we kind of have these orphan data sets that sit around. So how do you expire them? So what we did is we wrote a tool called Beekeeper, and what that does is that sits downstream of that metadata event framework, and it watches all the time, and it detects when one of these update operations has happened that has potentially orphaned data.
19:23
And all the data owner needs to do is they put this Hive table parameter on their table, you plug in Beekeeper onto the event listener, it finds the rewrites, and then it basically schedules the data for deletion in the future, so we have a time window of three days, just to be completely sure that all people have stopped using that data,
19:41
and it will then do the data deletions. So this is one of the places where having a central platform makes it a lot easier for all of our end users, as they don't have to do all of these housekeeping operations themselves.
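A hedged HiveQL sketch of that workflow (table names and S3 paths are invented, and the Beekeeper table property shown is my understanding of its opt-in mechanism, so verify it against the Beekeeper documentation):

```sql
-- 1. The updated data is written out to a brand-new path, a new "version"
--    of the partition; readers keep using the old path in the meantime.

-- 2. Once that write is complete, the partition is repointed in the
--    metastore in a single call, so new queries switch over cleanly.
ALTER TABLE bookings.hotel_bookings
  PARTITION (booking_date = '2020-02-01')
  SET LOCATION 's3://example-data-lake/bookings/hotel_bookings/booking_date=2020-02-01_v2/';

-- 3. The table is opted in to cleanup once; Beekeeper, listening to the
--    metadata events, then schedules the old, now-unreferenced path for
--    deletion after the grace period. The property name is an assumption.
ALTER TABLE bookings.hotel_bookings
  SET TBLPROPERTIES ('beekeeper.remove.unreferenced.data' = 'true');
```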
20:02
So there are some other alternatives that you can have if you want to be able to do these kinds of updates and have consistency in your platform. Newer versions of Hive have ACID semantics like relational databases exhibit. We have Iceberg and Delta Lake, which both use metadata files on a distributed file system instead of a database. And then there's Hudi, which is another Apache incubating project that does something similar. They're all under very active development,
20:22
so your mileage may vary if you use any of them. We also find it's very, very important to test your data processing jobs. So there are a number of unit testing frameworks out there that we've contributed to or open sourced. So there's HiveRunner, which you can use to test Hive SQL. We built a layer on top of that called Mutant Swarm,
20:40
which gives you code coverage of your SQL, so you can find the parts of your SQL code that are missing tests. And then we have another project called BeeJU, which basically spins up a Thrift Hive Metastore service or the HiveServer2 service in memory so you can write unit tests against it. So things we're looking at, what are we going to do next? What we really, really want to do is have our entire data platform
21:01
built on open source solutions that we can run where we want, wherever based on performance, cost, scalability. So hybrid cloud would be very nice. We could run things on premise or in the cloud provider. Multicloud being able to use different cloud providers for their strengths, but obviously both of those come with the cost of increased complexity.
21:20
So what we're really hoping for is the combination of Docker, Terraform, Kubernetes, that we can basically take this entire platform, deploy it at scale without too much effort, either on premise or in one of the cloud providers, Kubernetes engines. These are some of the projects I've talked about, some of the links.
21:41
If you're interested, have a look. They've all got mailing lists. We'd love to hear from you. And that is the end of my whirlwind tour of data lakes. And I think I've got three or four minutes for questions. Please stay seated while we do questions
22:01
so we can actually hear the question. Any questions for Adrian? Hi, so thank you for the presentation. You talked about a distributed system within the Amazon cloud. So what particular components do you use for storage, S3 or any other bundled Hadoop cluster within Amazon?
22:24
Thank you. Sure. So yeah, so we use, for the data lake, we use S3 for long-term storage. Whilst we're running data processing jobs, we use EMR. We also use Qubole. What a lot of those do is they spin up a temporary HDFS cluster for writes
22:42
whilst the job is running. But then when the job's complete, they write everything back to S3. So the foundational piece really is S3. Any more questions? Can I throw it?
23:02
Hi, I'm wondering if your users have low-latency needs and that they don't wanna wait for the replication to finish before they can query the data. So that's a good question about users who have low-latency needs. So generally what we recommend people do,
23:21
like the people who really want access to the data should be running their jobs in the region where the data's being produced. So most of those replicas are just for disaster recovery. Those use cases can generally handle high latency. The other thing I didn't really talk about in this talk as I was focusing on the data lake aspect is we're trying to move as much as possible
23:40
to a stream-first approach. So actually a lot of the data that comes into the data lake, it arrives in real time and it goes onto, we have a big streaming platform based on Kafka. So if you have really, really, really low-latency needs, you write a streaming application that pulls the data off Kafka. So then you get kind of that low-latency use case. And then all the data that arrives in Kafka
24:01
it then gets put onto the data lake. But the latency there is usually minutes, possibly an hour. Okay, maybe one short question, or very short. Hello, do you include data catalogs in your process?
24:23
So the question is about how we catalog data? Yes. Yeah, so that's a good question. So Hive again is our basic data catalog because it has the schemas, and everything's organized into databases and tables, et cetera.
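In practice that means the basic catalog questions can be answered straight from the metastore; a minimal illustrative session (database and table names are invented):

```sql
SHOW DATABASES;                              -- which data sets exist
SHOW TABLES IN bookings;                     -- what's inside one of them
DESCRIBE FORMATTED bookings.hotel_bookings;  -- schema, S3 location, format, owner
```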
24:40
We also have views on top of this. We use a product called Alation, which can basically spider the Hive metastore and pull all of that data in. And we can then also register relational databases with that and get kind of one view over it. There are a lot of other tools out there that do that. They all have pros and cons. And I haven't seen one that I'm massively happy with.
25:01
Amazon have something called Glue, which does that for you. But then there's the whole, it's great, but there's vendor lock-in, so really, there's no one beautiful open source solution for this at the moment that I'm aware of. Okay, unfortunately we're out of time. Thank you very much, Adrian, for this talk. Yep.