Simplifying upserts and deletes on Delta Lake tables
Formal Metadata

Title: Simplifying upserts and deletes on Delta Lake tables
Series: Berlin Buzzwords 2021, part 18 of 69
License: CC Attribution 3.0 Unported. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
DOI: 10.5446/67348
Transcript: English (auto-generated)
00:07
Thanks for having me here. Today we are going to discuss simplifying upserts and deletes on Delta Lake. So that's our topic for today. A brief agenda:
00:20
first, we'll talk a bit about the challenges with data lakes. Then we'll discuss the features of Delta Lake and how it helps tackle and solve those challenges. And finally, we get into the meat of the topic,
00:41
like update, delete and upsert on a Delta Lake table, how easy and simplified it is to run any of these commands, and how to make sure your data lake stays pristine and can be used for downstream analytics. I'm also touching a bit on optimize and vacuum.
01:01
These are essential for making sure that when you delete some data, the underlying files are also cleaned up, and so on and so forth. And I round it up with a couple of demos: I'm showcasing update, delete and upsert in two different notebooks, and finally optimize and vacuum are also part of a notebook.
01:22
So that's our agenda for today. And finally, I leave a number of references so that you can go through them in detail. This is kind of a lightning talk, I would say, a 30-minute session, and there are ample videos and blog posts listed in my references section,
01:40
which will help you dig deeper into whatever you might be interested in. With that, let me get into the topic. Very briefly about me: I'm Prashant Babu, I've been with Databricks for almost three years now, and I'm the EMEA practice lead for RSAs; RSA stands for Resident Solutions Architect.
02:01
My LinkedIn profile is shown on the slide here, and I would love to connect with any and all of you. Very briefly about Databricks: if you're not aware, Databricks is a platform to unify data, analytics and machine learning workloads, basically.
02:25
You can do everything in a single platform, which is also where I'm going to showcase my demos. Databricks, as you might already be aware, are the original creators of Spark, Delta Lake, MLflow and Koalas, and we have almost 5,000-plus customers across the globe using Databricks.
02:43
Again, one simple slide to explain what Databricks and the Lakehouse platform are. You might have data in AWS, Azure or GCP, the three main cloud vendors, in structured format, semi-structured format or streaming. These are the kinds of workloads that can be processed
03:02
with Databricks for data science and engineering, BI and SQL analytics, machine learning and, finally, real-time data applications as well. All of this is underpinned mostly by Delta Lake, which is what we are going to talk about in a bit more detail in today's session.
03:21
This is a very simple sample of our customers, and you can see a couple of big German customers present as well, like Daimler or Zalando, et cetera. So that is a very brief overview of Databricks, and now we finally get onto the topic.
03:42
What are the typical challenges with data lakes, and how Databricks Delta solves these challenges, is what we are going to discuss in the next few slides. As per many surveys, in fact, the MIT Sloan Management Review says 83% of CEOs say AI is a strategic priority.
04:03
At the same time, Gartner says $3.9 trillion of business value could be created by AI by the end of next year, 2022. The future is here, but there are a few problems: it is very hard to get right, and it is just not evenly distributed. The same Gartner which predicts
04:21
$3.9 trillion of business value also says 85% of big data projects fail, and VentureBeat says 87% of data science projects never make it into production. Some companies like Uber, Google, Amazon, et cetera, are having huge success,
04:40
but a lot of them struggle, and most of the reasons are around the data: the data which is sitting in the data lake and which is causing some challenges. We are going to drill down and do a deep dive on a couple of challenges with data lakes. The first one is something very, very simple,
05:03
like appending new data using Spark into a data lake while, at the same time, some other processor or pipeline is also trying to read the same data. That usually causes a ton of issues. Users want all their data, all their changes, to appear all at once.
05:23
This is very hard to achieve, making multiple files all appear at once, or even a single file appear in full, and it's not supported out of the box with data lakes. That is the first and foremost problem with data lakes.
05:41
The second problem is that modifying existing data is very difficult. Take the classic case of GDPR: someone sends a request to one of the organizations asking for their data to be deleted. That implies you have to read all the data and then filter out that particular row
06:01
or those particular rows from the data, and then rewrite the data into the data lake. So that is, again, a big problem, with GDPR and CCPA for that matter. There are many manual techniques which are applied and which are very unreliable, one of which we are going to discuss here today in the demos.
06:21
The third challenge with data lakes is jobs failing midway. With most big data and Spark pipelines, you can easily picture this: half of the data appears in the data lake and the rest might be missing. Jobs failing midway cause this particular challenge.
06:43
Another problem is mixing batch and real-time. That usually turns out to be an uphill battle; it is very tough to mix them and it leads to a lot of inconsistency. It is a variation of the first problem with appends, but at the same time,
07:02
streaming adds a bit more inconsistency, and you are basically reading partial results, if I can say so. The fifth challenge is that it is very costly to keep historical versions of the data. Usually, regulated organizations need some or many of the versions of the data
07:22
to be available in the data lake. That is going to be costly, it also leads to a lot of auditing and governance issues, and it is very hard to do. The sixth challenge is the difficulty of handling large metadata. If you have used Hadoop HDFS, for example,
07:45
where you would have a huge amount of data in your HDFS: a huge amount of data implies a large amount of metadata to be stored at the name node, for example. And all such problems magnify the moment
08:01
you have petabytes of data in the data lake. It's very tough, and even the metadata itself runs into gigabytes and gigabytes. Then there is one of the most classic problems, I would say: too many small files. Because you are using streaming, for example,
08:21
too much data is landing, and you are processing it at breakneck speed, like every ten minutes, every five minutes, or even every minute, and saving it into the data lake. That implies you're storing too many tiny files. Too many small files, or sometimes gigantic files,
08:42
either of them is usually a big challenge, and most of the time is spent by Spark just opening and closing files rather than actually reading them. On the same note, it is very tough to get great performance; it has to be done manually,
09:01
and it is error-prone to get the partitioning right and apply the manual techniques needed to reach even decent performance, not great performance; it's more about getting decent performance here. And finally, data quality issues. Data naturally evolves,
09:22
and as the schema evolves, the underlying storage has to accept and store that data, which means downstream pipelines have a problem reading data that has different metadata, different columns, compared to the earlier data. So all these are the usual challenges
09:43
which you will face with any of the existing data lakes stored in formats like Parquet, for example. So this is where Delta comes into the picture, and why Delta Lake solves these particular problems, the main challenges we discussed,
10:01
and how it solves them, is what we are going to discuss now. First and foremost, it is built on an open format and it is open source; you can find all the code of Delta at delta.io. It is basically an opinionated approach to building robust data lakes.
10:21
What I mean by that is that it has its own transaction log mechanism; I will briefly show on the next slide what the transaction log looks like. It brings the best of data warehousing and data lakes together into one single format, and it helps ensure
10:43
that downstream reading is perfectly fine even when you're writing some data to the same table, the same table, not just the same location. Databricks Delta adds reliability, quality and performance to data lakes; how it does that is what we are going to discuss in the next few slides.
11:00
Delta Lake comprises only three important pieces. One is Delta tables, which is where the data is stored; then the Delta optimization engine, which is what lets you do merges, upserts and deletes, as well as vacuuming and optimizing,
11:20
and so on and so forth; and finally, the Delta Lake storage layer. Those are the three components of Delta Lake. Now, to add on top of what I briefly mentioned before, Delta Lake offers all these important features, like ACID transactions on Spark. You can be sure that whatever you're writing to a Delta table
11:42
will not be read by another pipeline that is reading at the same time; transaction isolation is maintained on Delta Lake. It allows unifying streaming and batch on the same table: a batch job can write to a location,
12:02
and a streaming job can also write to the same location, and both patterns are allowed at the same time. So basically, the Lambda architecture is resolved just by using the Delta format.
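To make that concrete, here is a minimal sketch of a batch job and a streaming job writing to one and the same Delta path. It assumes a SparkSession named spark with Delta Lake configured, as you would have in a Databricks notebook; the path, schema and the toy rate source are illustrative only and not taken from the talk.

    # Batch and streaming writers sharing one Delta path (illustrative sketch).
    events_path = "/tmp/delta/events"                       # hypothetical path

    # Batch append to the Delta table.
    batch_df = spark.createDataFrame([(1, "click"), (2, "view")], ["id", "action"])
    batch_df.write.format("delta").mode("append").save(events_path)

    # A streaming job can write to the very same path concurrently.
    stream = (spark.readStream.format("rate").load()        # toy streaming source
              .selectExpr("value AS id", "'tick' AS action")
              .writeStream.format("delta")
              .option("checkpointLocation", "/tmp/delta/events_ckpt")
              .start(events_path))

    # Readers, batch or streaming, always see a consistent snapshot
    # because every commit goes through the transaction log.
    spark.read.format("delta").load(events_path).show()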
12:23
Delta also provides schema enforcement, and where required you can enable schema evolution, which is what we have a simple demo showcasing today. It also allows you to do time travel: you can go back in time and look at the data, and see who processed it, who added it, on which cluster and on which date, all using time travel.
12:43
Upserts and deletes are one of the major features, one of the important options, of Delta. Structured Streaming support is also available. So, going back to the main challenges, how Delta Lake tackles those challenges is what we'll discuss in the next few slides, basically.
13:02
So, ACID transactions: the first five challenges are resolved by Delta Lake using ACID transactions. For each and every table, when you write data, it sits in cloud object storage, or HDFS for that matter.
13:21
And there is a small metadata folder which gets created, as you can see here in the location /path/to/table/_delta_log. That's the folder location. Wherever you write a table, say a customers table, for example,
13:41
within that customers table there will be a subfolder created with the name _delta_log, and within that there will be a separate JSON file created for each transaction. That is the heart and soul of Delta, basically. So whenever you write any entry, any row,
14:01
or delete, or merge, or do anything on a particular table, all of that is recorded as a transaction on that particular table. And finally, whenever the number of transactions increases, what Databricks Delta does is checkpoint them into a Parquet file.
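As a purely illustrative example (file names and counts made up, following the usual Delta layout), the folder of such a customers table might look like this:

    /path/to/customers/
        part-00000-....snappy.parquet                 data files
        part-00001-....snappy.parquet
        _delta_log/
            00000000000000000000.json                 one JSON commit per transaction
            00000000000000000001.json
            ...
            00000000000000000010.checkpoint.parquet   periodic Parquet checkpoint
            _last_checkpoint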
14:21
That checkpointing is also done implicitly by Databricks; you don't need to do it or worry about it. So this is how the hard-to-append-data problem and all these related problems are resolved just by using Delta. And as we discussed a bit about time travel: it allows time travel because all the transactions are recorded
14:41
in the transaction log, and it will let you go back and replay, basically play it forward. Then there is the difficulty of handling large metadata: as I mentioned, the metadata is stored in open Parquet format (the checkpoints) and is handled just by reading that file, and portions of it can also be cached
15:03
and optimized for fast access. Then there is the huge problem of too many small files or poorly sized files. This is where things get very interesting: with Delta, you can just run a simple command, OPTIMIZE on so-and-so table, and it will bin-pack all the files
15:22
and all the data in that particular folder into files of roughly one gigabyte each where possible, and it works within partitions as well. This is how we resolve the too-many-small-files problem. Finally, data quality issues, like schema validation and evolution.
15:40
Delta supports schema validation as well as schema evolution, even in merge, that is, upsert, scenarios. And that is exactly the topic we are going to talk about today: updates, deletes and upserts on a Delta Lake table.
16:01
After the nine challenges we discussed, I'm going to touch upon appending data being hard, modification of existing data being difficult, and finally the too-many-small-files problem as well as poor performance. So what are the sample use cases for updates, deletes and upserts?
16:21
First and foremost, whenever you want to do a delete or merge: there might be a case where someone sent a request for the right to be forgotten, so GDPR compliance might be one of the simplest use cases you can imagine. Then there is deduplication: you'd like to dedupe your entire data lake,
16:43
and even that is easily possible with Delta. And what are the challenges with this? Without Delta Lake, it is inefficient, possibly incorrect, and it is very hard to maintain any of these upserts.
17:01
More so with merges, it is very inefficient and very manual to do. With that we come to the first topic, which is update on a Delta table. Now, if you see the syntax, it looks almost exactly like what you would do in an RDBMS query, in
17:22
any RDBMS you might have used. The key feature is that it updates the columns for the rows that match a predicate; it's a pretty simple statement. And similarly for delete, it's exactly like what you would do in an RDBMS: DELETE FROM so-and-so table
17:41
WHERE some column predicate, which is what you would provide. In both cases, it updates the column values, or deletes the rows, that match the predicate; but if you don't provide any predicate, it updates all values for all rows, like UPDATE languages SET name = 'Python 3' here. If you don't give a predicate, it will just blindly update everything.
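The transcript only paraphrases the slide, but a sketch of the same UPDATE and DELETE with the Delta Lake Python API might look as follows; it assumes a Delta table named languages (the name used in the talk's example) and a SparkSession named spark, as in a Databricks notebook.

    from delta.tables import DeltaTable

    languages = DeltaTable.forName(spark, "languages")

    # UPDATE languages SET name = 'Python 3' WHERE id = 3
    languages.update(condition="id = 3", set={"name": "'Python 3'"})

    # Without a predicate, every row would be updated:
    # languages.update(set={"name": "'Python 3'"})

    # DELETE FROM languages WHERE id = 3
    languages.delete("id = 3")

    # Without a predicate, every row would be deleted:
    # languages.delete()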
18:00
The same holds for delete: if no predicate is given, it deletes all the rows. And finally we come to the important topic of upserts. A merge without Delta Lake would be very painful; just to walk through the simplest possible manual approach to a merge:
18:21
you analyze the updates on the table and find out which partitions are affected; that is the first step. Then you read all the data in the relevant partitions of the target table, join the two tables, overwrite all those partitions in the existing location, and then atomically publish. That is what a merge looks like without Delta Lake.
18:43
How merge works with Delta Lake, on the other hand, is a pretty simple statement. If you have a customers table and an updates table, and you would like to update the customers whose customer ID is present in both the source and the target,
19:01
and you have a new address for all those customers, you can just do this: when the rows match, we do an UPDATE SET, and if a row is not available, we insert it. So basically an upsert, update or insert, is what is happening. And we can, in fact, also do a delete in the same merge.
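A hedged sketch of that upsert in the Delta Lake Python API, with the customers and updates table names and the customerId and address columns assumed from the description above:

    from delta.tables import DeltaTable

    customers = DeltaTable.forName(spark, "customers")
    updates_df = spark.table("updates")

    (customers.alias("t")
        .merge(updates_df.alias("s"), "t.customerId = s.customerId")
        .whenMatchedUpdate(set={"address": "s.address"})   # update existing customers
        .whenNotMatchedInsertAll()                         # insert brand-new customers
        .execute())

    # A delete clause can be added to the same merge, e.g.:
    # .whenMatchedDelete(condition="s.active = false")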
19:20
Behind the scenes, what merge does is basically an inner join between the updates and the target. And it is not doing it over the entire data: it actually goes and looks at the min and max values stored for each file and uses those to do some intelligent pruning there.
19:41
I'm not walking through everything here, so that I can get to the demo sooner. Optimize and vacuum are very important concepts, as I said. OPTIMIZE does bin-packing compaction and also enables data skipping, with a syntax like OPTIMIZE events WHERE date equals so-and-so ZORDER BY some column.
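A sketch of that command, issued as SQL from a notebook; OPTIMIZE with ZORDER BY was a Databricks command at the time of the talk, and the events table with date and eventType columns is only an assumed example:

    # Compact small files and co-locate rows by eventType for data skipping.
    spark.sql("""
        OPTIMIZE events
        WHERE date >= '2021-01-01'
        ZORDER BY (eventType)
    """)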
20:02
And similarly, vacuum is pretty simple to do: VACUUM so-and-so table. It will clean up all the old, untracked files of Delta so that it limits the storage cost.
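A minimal sketch of the vacuum command, again with an assumed table name; the default retention is seven days (168 hours):

    # Remove files no longer referenced by the Delta table and older than
    # the retention threshold.
    spark.sql("VACUUM events")                      # default 7-day retention
    spark.sql("VACUUM events RETAIN 240 HOURS")     # or an explicit retention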
20:21
Now let's quickly jump onto the demos; they are pretty simple demos. This is a simple cluster I'm using, on the Databricks platform, by the way, and I'm using the simplest possible use case here. Sorry, I think I was sharing the wrong screen. So I'm showcasing updating columns of a Delta table
20:44
and deleting rows of a Delta table. So basically, I have a small dataset with the values Spark, Databricks and 'Dettler'; by mistake, someone's change caused the pipeline
21:04
to write an incorrect value, and as you can see here, it says 'Dettler'. This is an oversimplified example, per se, so that we can walk through the use case and explain what Databricks Delta does.
21:20
So what I'm doing here is just writing the data to a Delta table, in the Delta format, and providing a path; that implies it is writing to an external table, basically. Now I'm displaying the data from the Delta table here, and as you can see, the value comes up as 'Dettler'.
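The notebook code isn't shown in the transcript, but a rough reconstruction of this step might look like the following; the path is hypothetical and 'Dettler' stands in for the misspelled value mentioned above:

    demo_path = "/tmp/delta/demo"                                # hypothetical path

    data = [(1, "Spark"), (2, "Databricks"), (3, "Dettler")]     # row 3 is the bad value
    df = spark.createDataFrame(data, ["id", "name"])

    # Write as an external Delta table by providing an explicit path.
    df.write.format("delta").mode("overwrite").save(demo_path)

    # Display it back; row 3 still shows the misspelled value at this point.
    spark.read.format("delta").load(demo_path).show()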
21:40
Now, as we saw earlier, I have an ID column, and I know that for ID three the name is the incorrect 'Dettler'. This is where we do some magic: UPDATE the table, SET some column equal to some value, WHERE condition. You can specify either this condition
22:02
or this condition, but both are exactly the same, and Delta does everything behind the scenes automatically. It also shows the number of affected rows here. This is how Delta performs in the real world, and you can see the value got changed. The next thing I'm going to do is delete that particular row from the Delta table.
22:24
Again, it shows how many rows were affected. This is an oversimplified example, but you get the gist of it. Let me display the table again: you get Spark and Databricks, because we deleted ID three here.
22:43
Now, behind the scenes, as I mentioned, Databricks is maintaining a transaction log, and this is a visual representation of it. For every operation you can see my email ID, my user ID and the timestamp; I'm based out of London,
23:01
so it is showing GMT here. You can see what kind of operation was done, what the predicates were, the operation parameters, et cetera. All this information is in a single snapshot, like a single source of truth, basically.
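A sketch of how that history, and the time travel that follows, can be queried; it reuses the hypothetical demo path from the earlier sketch:

    from delta.tables import DeltaTable

    demo = DeltaTable.forPath(spark, "/tmp/delta/demo")

    # Every write, update and delete appears as a version with user,
    # timestamp, operation and operation parameters.
    demo.history().select("version", "timestamp", "operation").show()

    # Read the table as of an earlier version (a timestamp works as well).
    spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/demo").show()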
23:20
And if I want to do some time travel, I can go back in time and replay the data. You can see here that I initially ingested the data with the misspelled value, which is what it is showing. The next version shows 'Delta' after the update. And in the version where we deleted the third row, 'Delta' is not present,
23:42
but if you go back in time, we can still see it. And finally, you can see the most recent version. So that is how simple and easy it is to do updates, deletes and all of this with Delta. Let me go to my next notebook,
24:02
which demos schema evolution. I'm probably zipping through because I just have five more minutes. So this is schema enforcement during merge, and the use case here is: I have two columns, ID and name, while the latest data has three columns, ID, name and year.
24:22
So a new column, year, has been added to the dataset, which is what we are showcasing in this particular case. But at the same time, my requirement, my business case, is that I want to merge all the new data into the existing Delta table while also enforcing the schema. That implies I need to discard the new column
24:43
in the Delta table. So let me quickly run through the entire notebook: I have a couple of tables, source and target. Same as before, I have three rows here; again, an oversimplified example so that it is easy to explain.
25:02
Now I wrote the data into a Delta table and I'm displaying it here. The new data frame, assume after a couple of days, has a new column, year. There are two use cases here. One: you want schema enforcement strictly adhered to,
25:23
schema validation is done, and the new column shouldn't be added to the data lake, which is what we are seeing here. So basically, I'm merging into the target table using the source table, based on a predicate condition of target.id equals source.id.
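The exact notebook cell isn't shown, but one way to keep the target schema fixed during such a merge is to map only the existing columns explicitly, so the extra year column in the source is simply ignored. A sketch, with a hypothetical path and an assumed source table name:

    from delta.tables import DeltaTable

    target = DeltaTable.forPath(spark, "/tmp/delta/schema_demo")   # hypothetical path
    source_df = spark.table("source")      # has id, name and the extra year column

    # Only id and name are referenced, so the target schema stays as it is
    # and the source's year column is discarded.
    (target.alias("target")
        .merge(source_df.alias("source"), "target.id = source.id")
        .whenMatchedUpdate(set={"name": "source.name"})
        .whenNotMatchedInsert(values={"id": "source.id", "name": "source.name"})
        .execute())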
25:43
So I have ID two, which is already present in the data, and in the new data frame ID two has the year 2013. Now, when I run the merge command on this table with this syntax, you can see it's pretty simple: it will look at all the rows
26:01
and update the rows where required, and where a particular row is not present, it will insert it. Going back, you can see the schema is enforced strictly, so there is no new column here. And if I go to the third notebook,
26:21
my final notebook here, this is where I'm doing schema evolution. It's exactly the same notebook with no changes at all; the only change I'm making is one option. If you want schema evolution to be available, even in the merge,
26:41
you just need to enable the auto-merge option. The moment you set this config, what Delta does behind the scenes is allow the schemas to merge. As mentioned before, it's exactly the same first data frame and second data frame, and the second data frame also has a year column now.
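The option being referred to is the Delta schema auto-merge config. A sketch of the evolved merge, continuing the hypothetical tables from the previous block:

    from delta.tables import DeltaTable

    # Allow merge to evolve the target schema.
    spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

    target = DeltaTable.forPath(spark, "/tmp/delta/schema_demo")
    source_df = spark.table("source")      # still carries the extra year column

    (target.alias("target")
        .merge(source_df.alias("source"), "target.id = source.id")
        .whenMatchedUpdateAll()            # with auto-merge on, the year column is added
        .whenNotMatchedInsertAll()
        .execute())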
27:01
Now, with the same exact merge statement, because I enabled schema evolution, you can see the year column comes through with the value we ingested. So basically, we have updated the earlier 'Databricks' row with the year 2013. That's how the updates happen.
27:21
Behind the scenes, you can also look at the way it is working. And finally, let me go to the last thing here, which showcases time travel and optimization. So this is, as I mentioned, a create table and a merge, which is what we are doing.
27:41
At the same time, we can do time travel: we can go back to versions zero and one. Version zero didn't have the year column, but the latest version does. And finally, there is a command called OPTIMIZE, and you can see the beauty of optimizing with a single command. Here, I have the number of files as three.
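A sketch of that file-count check, using DESCRIBE DETAIL around OPTIMIZE (a Databricks command at the time); the table name is assumed:

    # Number of files before compaction (e.g. 3 in the demo).
    spark.sql("DESCRIBE DETAIL schema_demo").select("numFiles").show()

    # Bin-pack the small files.
    spark.sql("OPTIMIZE schema_demo")

    # Number of files afterwards (a single file in the demo).
    spark.sql("DESCRIBE DETAIL schema_demo").select("numFiles").show()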
28:02
The moment I run the OPTIMIZE command and do a DESCRIBE DETAIL on the same table again, I can see it has compacted all the files into a single file. In this case we are using small files, but that's how it works. And finally, the last thing I wanted to showcase is vacuum. Before I run vacuum, there are quite a few files.
28:22
Databricks doesn't allow you to run vacuum with zero retention as is: if you want to run it with RETAIN 0 HOURS, you have to enable a special flag. Once you enable that flag, all the untracked files will be deleted from the cloud object storage or the local storage.
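The "special flag" being described is the retention duration check. A sketch of the demo-only combination (not something to do in production):

    from delta.tables import DeltaTable

    # By default Delta refuses VACUUM with a retention below 7 days;
    # disabling this safety check allows a 0-hour retention. Demo use only.
    spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")

    DeltaTable.forPath(spark, "/tmp/delta/schema_demo").vacuum(0)   # retain 0 hours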
28:42
So those are the three notebooks, the three use cases, I wanted to showcase. If you have any questions or anything you would like to know, I would be very happy to answer them. And I am leaving further references so you can take a look at them.
29:02
This is a very new book; the first three early-release chapters were released just last week. You can take a look at this new book on Delta Lake. And Learning Spark also has a chapter on Delta Lake. So there are a couple of docs and webinars and so on and so forth.
29:21
I'm leaving all this for your reference. Please do let me know if you have any questions. And thanks for having me; thank you very much for giving me time today.

Thanks again for the presentation. It was really interesting to see all the things that Delta can do.
29:40
It was a pretty interesting format, and the tools you chose are pretty cool. I'm just going to check to see if there are some questions. There were not, so maybe I'll kick-start with one of mine: we talked about all these really nice things that Delta can do. What are some limitations that Delta has at the moment,
30:00
or things that you plan to improve or add in the future?

Delta is evolving continuously. Merges usually cause a lot of problems, because they create multiple small files. So as time goes on, Databricks is adding more and more features into Delta, like low shuffle merge, for example.
30:20
That is just one simple example: it will not rewrite the files it reads in a different order; rather, it retains the exact same ordering that was there before, so the Z-ordering is preserved and keeps helping with data skipping, for example.
30:41
So as time progresses, Databricks is adding more and more features into Delta; change data feed is one more new feature which is coming in, and which is in private preview, to be precise.