How to avoid columnar calamities: what no one told you about Apache Parquet
Formal Metadata

Title: How to avoid columnar calamities: what no one told you about Apache Parquet
Title of Series: Berlin Buzzwords 2021
Number of Parts: 69
License: CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/67361 (DOI)
Language: English
Transcript: English (auto-generated)
00:09
Buzzwords is one of my favorite conferences. And today I'm going to talk about one of my favorite topics, which is computing myths. And it's actually one of my favorite kinds of computing myths:
00:22
myths that people can keep using because they recognize where they break down and why they're false. So here's a little bit about me. I'm currently working on data science product strategy at Nvidia. In past roles, I've done data science, engineering management, and emerging technology in sort of the programming languages
00:43
and distributed computing space. So I came to machine learning and data science from that background, and that's gonna sort of influence the background that I provide in this talk as well. So that's sort of the high-level view
01:00
of who's talking to you. I should mention that while I work for Nvidia, I don't speak for Nvidia. These are my own opinions in this talk. But I wanna start with a question. Anyone can provide their own response to it, but I want you to think about it.
01:20
And that's: what are some myths about computing that we still believe? And to give you an idea of the kind of thing that I'm thinking about here, one such myth is, write once, run anywhere for Java, right?
01:40
It's mostly true. It's not entirely true, but it's true enough to be useful, right? And once you understand why it's true and also where it breaks down, you can really get a lot of the benefits out of the Java ecosystem. So think about the myths that you've learned were myths
02:01
and that you still believe anyway. And we'll talk about a couple of these in this talk. So here's what the rest of the talk is gonna look like. We're gonna start with some background, sort of explaining how computer systems work and talking about why things have to work
02:22
the way they do to be efficient. And we'll sort of motivate columnar formats in general in that prologue. Then we'll talk about why you might wanna use Apache Parquet in particular. We'll look at how Parquet gets good performance,
02:40
both in terms of space and speed. And then we'll see some of the limits of one of the myths we'll look at about Parquet, which are some of the potential performance and interoperability problems you might run into using Parquet in a polyglot environment. And then finally, as with any story, we want a good denouement,
03:02
we want a happy ending. In this talk, we have a demo instead. So we'll look at a demo of sort of how to identify these problems and work around them. So I wanna start by talking about why columnar formats in general, and the title of this talk does include the phrase, what no one told you about Apache Parquet.
03:20
I have to apologize because I'm gonna start by level setting with some things that people may have told you already about Parquet, about columnar formats more generally, and about some background about computer systems to illustrate why these formats make sense. But first, I have some more questions.
03:40
If we were all in the same room, this would be easier, but this is sort of something to think about. For those of you who have a Python background, let's look at this code example. We're gonna see two ways to change every element
04:02
in an array. The first one is to loop through explicitly, visiting each element by index,
04:22
and then multiply each one by four. The second way is to use a vectorized operation in Python and multiply the entire array by four at once. Those of you who are Python programmers, any thoughts about which of these is gonna be faster?
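The slides aren't visible in this recording, so here's a minimal sketch of the two approaches, assuming NumPy (the array size and function names are illustrative):

```python
import numpy as np

arr = np.arange(1_000_000)

# Option 1: explicit loop, touching each element by index.
def scale_loop(a):
    out = a.copy()
    for i in range(len(out)):
        out[i] = out[i] * 4
    return out

# Option 2: a single vectorized operation over the whole array.
def scale_vectorized(a):
    return a * 4
```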
04:46
So it's nice when it works out this way, but the shorter code is also faster in this case, because we're not doing explicit looping: we can operate on essentially every element in the vector in one library call, and we get a lot of benefits,
05:02
and we can take advantage of parallelism in hardware and all sorts of other advantages as well. So that was a question about which is faster, one for people with a Python background, maybe the data scientists in the audience. I have another one for the data engineers in the audience, which is: if we're looking at an analytic database, which one of these kinds
05:21
of queries is gonna be more common? And we have a table with three floats and a string in it. And the first option is that we're gonna do some aggregates on one of the float columns, and we're gonna group by the value in the string column. The second option is that we're gonna combine
05:43
every value in a row in some way into a single value, or that we're gonna use every value in a row. So I mean, you might say neither of these is more common. These both sort of look suspicious, but the question is, are we more likely to operate on every value in a few columns in an analytic context,
06:02
or are we more likely to operate on every value in a row? And the answer is really gonna be it depends, but I think in a lot of analytic contexts, you're much more likely to operate on every value in a column than you are to need to worry about every column in an individual row.
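As a concrete sketch of the two query shapes (the column names are hypothetical; the talk only says three floats and a string):

```python
import pandas as pd

df = pd.DataFrame({
    "f1": [1.0, 2.0, 3.0],
    "f2": [0.5, 1.5, 2.5],
    "f3": [9.0, 8.0, 7.0],
    "s":  ["a", "b", "a"],
})

# Shape 1: aggregate one float column, grouped by the string column.
# Only two of the four columns are ever touched.
per_group = df.groupby("s")["f1"].sum()

# Shape 2: combine every value in each row into a single value.
# Every column of every row is touched.
row_totals = df["f1"] + df["f2"] + df["f3"]
```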
06:23
So with that context of sort of why we want to worry about doing operations a collection at a time, and why operating on columns at a time might be faster than operating on rows, let's dive a little deeper and see why it's faster
06:43
on computer systems to operate this way, and why we can benefit when these access patterns are more common. So to do this, we're gonna look at the way memory is organized in a computer. In any computer, we have a memory hierarchy, where we have very small,
07:01
but very fast memory at the top of the hierarchy, and we have very large, but very slow memory at the bottom of the hierarchy. And what this looks like is this, and this is sort of adapted from a contemporary real processor, but the values are sort of turned into ranges in some cases so that it's a little more general.
07:22
But these are recent numbers from within the last few years. So at the top of the memory hierarchy, your processor has a bunch of registers. These are individual locations that hold individual values, and you have hundreds of these per core. So you have a few kilobytes of registers per core,
07:41
and you can access a register in a single cycle. So if you have a three gigahertz processor, this is a third of a nanosecond. If you have a four gigahertz processor, this is a quarter of a nanosecond. This is extremely fast. This is as fast as you can do anything. Some registers take more than one cycle to access.
08:01
Again, we're sort of dealing at a high level here. The next level of the memory hierarchy is level one cache, where we have tens of kilobytes, 32 or 64 kilobytes, per core. We have instructions, so our program, and data in separate caches,
08:21
and these caches are organized in what we call lines. So this means that the cache is addressed 64 bytes at a time. Now, this is not as fast as the registers, but we have more of it. So it takes four or five cycles, roughly one to two nanoseconds, to access a value in L1 cache.
08:44
Now, one thing I want to call out in particular about the L1 cache is this aspect of it being organized in lines. And what this means is that you can't just load a single value into the cache. If we want to load a value from memory and have it in our cache so that we can put it in a register or operate on it quickly,
09:02
we're actually going to load not just that value, but the 32 or 64 bytes surrounding that value, right? We have to load at that level of granularity. But again, this is very fast, and part of why it can be very fast is that we have these restrictions on how we use it and because it's small. So next up we have the L2 cache,
09:23
and modern processors have rather a lot of this. They have around a megabyte per core, and computer architects can choose whether or not the L2 cache also includes the values that are in the L1 cache. In many designs, it does include the values in the L1 cache.
09:40
Again, these are organized in lines, but they're about two to three times slower than the L1 cache. Now in a modern processor, we don't just have one core, we have several cores. And these diagrams are more or less to scale
10:00
at this point. We have several cores and we have a cache that's shared across all of those cores and that's the L3 cache. So in some designs, this includes the values that are in the smaller caches above it in the memory hierarchy. In many designs, it doesn't, but in those designs,
10:21
it can act as what's called a victim cache: if something is evicted from a smaller cache, it lands in the L3 cache so that you can bring it back, because things that expire from a cache sometimes get used again pretty quickly. And we can access this still relatively quickly, but 10 to 20 times slower
10:41
than the L1 cache, and so 40 to 80 times slower than the registers. When we look at main memory, the memory on our CPU looks tiny by comparison.
11:01
In a typical workstation or a desktop computer or a laptop or even a cell phone at this point, you have tens to hundreds of gigabytes of main memory. Obviously your main memory is shared across cores in most conventional consumer computers. And you can address this memory in pages,
11:20
a unit of memory that the operating system and the processor cooperate to manage. And this is much slower, right? Accessing a value in main memory takes between 150 and 400 cycles. So the way to think about this is that if you have a four gigahertz processor and everything you need to do
11:40
requires loading a value from memory, you're only gonna be able to use one four-hundredth of your performance, right, if it takes 400 cycles to access main memory. So we really wanna have as many of the values we operate on as possible in the caches. We want to exploit the fact that we're loading data
12:01
that are close together into those caches. And then finally, memory itself is even dwarfed by the kinds of disks that we have. A disk can be hundreds of gigabytes or terabytes, but disk accesses are actually measured in milliseconds
12:20
rather than in nanoseconds. So there's a typo on this slide. It's not 150 to 400 cycles. This is milliseconds. So this is a million times slower than memory at this point, or than caches. So we really need to be careful about accessing the disk.
12:41
If we're gonna access the disk all the time, we're really not gonna be getting the best possible performance we can get from our computer system. So we need to access the disk in a way that's effective. Now, what is effective? Well, you might think, well, a disk, I mean, I have files on my disk. I can access wherever I want. I can seek to some point on the disk and read
13:01
and I can seek to some other point, but that's not the most efficient way to do it. And an interesting quote from almost two decades ago from a database pioneer, Jim Gray, in an interview, I think still holds true today. And this is, again, 2003 context, but Gray was looking forward to a future
13:22
where we might have 20 terabyte disks in commodity hardware. And he says, well, if you have 200 accesses per second on a disk and you only read a few kilobytes every time you hit the disk, it'll take you a year to read all the data on that 20 terabyte disk. (At 200 accesses per second and a few kilobytes each, you're reading well under a megabyte per second, and 20 terabytes at that rate is on the order of a year.) But if you go to sequential access
13:43
and you read more of the disk at once and you read the disk in order, you can actually read through that disk in a day rather than a year. So it's really a remarkable advantage you get from treating the disk like a sequential access device like we were talking about with the caches
14:01
where you're reading contiguous blocks of memory, you wanna read contiguous blocks of your disk. Gray's takeaway here was that programmers have to think of the disk as a sequential device rather than as a random access device. And you might say, well, we have SSDs, we have NVMe now. This point actually still holds, maybe to a lesser extent
14:21
than it does with spinning disks, but it still holds that you're gonna get the best performance by accessing things that are in order and close together. This is sort of a fundamental principle of computer systems: if it's easy to predict what's gonna happen next, it's easy to do the right thing ahead of time. So how does this apply to data processing? Well, let's look at an example dataset
14:43
that we'll use for the rest of the talk. And our case study here is a hyper-local payment service for places that are within walking distance of Alexanderplatz. So if we look at this data, we have timestamps, we have user IDs, we have transaction amounts, and we have the neighborhood
15:00
that the transaction took place in. And if we think about these logically as a table, we might wanna picture them like this. Now, if we had these in a row-oriented format, where we pack these onto the disk a row at a time so that values in rows are close to one another, values in rows have that sort of locality,
15:21
then it might look like this. So in this row-oriented representation, our data are packed in together pretty nicely. This probably doesn't take up as much space as a more human-friendly representation would, but let's see how this representation works
15:42
for running an analytic query of the sort that we wanna do in a data processing system. So this is just a very simple one: how much money has each user spent? And in order to answer this query, we have to scan through the whole file. And for every row, we have to get the user ID
16:01
and the amount, and then we add up those amounts for each user. So we're gonna see something that looks like this, where we're only accessing some subset of the data. Now, this is actually sort of worse than it appears,
16:21
because remember all of those things we just talked about with the memory hierarchy, right? You're reading data sequentially. So you're reading a lot of data that you're ultimately not gonna use. And in fact, you're reading data into caches that you're not gonna use. So you're necessarily accessing your fastest, most precious memory in a really wasteful way.
16:41
So we can't just read the bytes that we're interested in. We have to read the disk sequentially. And to get that data from main memory into our CPU, we need to read cache-line-sized chunks. So in this case, the representation of our first row is 38 bytes long. If we have 64-byte cache lines, we're gonna use three cache lines to read five rows.
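To make that concrete, here's a quick back-of-the-envelope sketch using the figures from the talk:

```python
ROW_BYTES = 38      # encoded size of one row, per the slide
USEFUL_BYTES = 14   # user ID + amount, the only bytes the query needs
CACHE_LINE = 64     # bytes per cache line
ROWS = 5

total_read = ROW_BYTES * ROWS              # 190 bytes
lines_used = -(-total_read // CACHE_LINE)  # 3 cache lines (ceiling division)
useful = USEFUL_BYTES * ROWS               # 70 bytes we actually use
print(f"{lines_used} cache lines; {useful} of {lines_used * CACHE_LINE} bytes useful")
# -> 3 cache lines; 70 of 192 bytes useful (roughly two thirds wasted)
```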
17:03
And nearly all of that very fast memory is gonna be wasted, because we only care about the 14 bytes I've highlighted here that contain our user ID and the transaction amount. I hope this seems like we can do better, right? And in fact we can, and one of the ways we can do better
17:20
is to transpose our data. So instead of storing a record for each row, we store a file of values for each column, and that will look like this. So if we're implementing a query that accesses only two of these columns, we don't need to read the other values,
17:41
and we don't need to care about them, right? We're accessing values we care about, we're accessing them sequentially, and we're doing everything basically as quickly as the system will allow at every level. This is what computer systems were designed to do. So if you remember nothing else about columnar storage from this talk, remember this, because analytic queries are more likely to do something to every value in a column
18:01
than to do something to everything in a row, columnar formats can be far more efficient than row-oriented formats for analytic databases. So there are lots of other advantages to columnar formats, we'll talk about those soon, but for now let's talk about the high-level value proposition for Parquet. Let's look at the Parquet myths.
18:21
And there are two parts to this myth that I wanna call out. The first one is that Parquet is ubiquitous. If we think about a typical data science discovery workflow, you have a lot of different stages from sort of deciding whether or not you even have a problem to solve, to data engineering and model training, to finally building a production system
18:40
that has to sort of live and evolve with the data that you're seeing in the real world. And you also have a bunch of people working on these systems. You have data scientists and business analysts working on the sort of problem-defining and exploratory analytics part of the problem. You have data engineers focused on that early stage
19:00
of making the data accessible, available, clean and efficient. Then you have that sort of inner loop of machine learning model development that a lot of people focus on. And then finally we have a production deployment. And in each of these phases, people are gonna be using different tools that work well for their environment.
19:20
So a lot of data engineering jobs happen in the JVM and that sort of big data Hadoop ecosystem. A lot of data science is happening with tools like Python, R and Julia. And then in the production environment, it's really the wild west. It's gonna be a combination of a lot of these things, as well as some specialized tools
19:42
that are maybe written in C++, for low-latency serving, for example. Now, the fact that Parquet is available in all of these environments is a huge selling point. And that's sort of the myth, right? That you can use Parquet everywhere. We'll see where this myth breaks down in a little bit.
20:04
Another advantage is that Parquet creates smaller files and is thus more efficient. So here's a recent example from my own work that's pretty representative. If we have 50 million records of synthetic payments data, a little more interesting schema than the one we saw in our example, that's about two and a half gigabytes of CSV
20:21
and under 500 megabytes as Parquet. So CSV to Parquet is a totally unfair comparison, because CSV is a textual format and there's way more overhead to store numeric values and all kinds of values. But the amazing thing is that even if we compress this CSV file with gzip,
20:41
the output is still bigger than the Parquet file. So even if we're exploiting redundancy in the text of our records, Parquet is gonna come out ahead, and that Parquet representation is gonna be directly useful for supporting queries, whereas the gzipped CSV is not. And those smaller files will lead to faster processing. So let's look at how Parquet
21:00
actually accomplishes some of these things. Again, consider our tabular data that we've transposed into a columnar format. So if we're doing analytic queries with these data, we've already gotten some benefits just by separating things into columns, because we only have to read the values we care about and they're next to each other. So they're spatially local to one another.
21:22
But there are more things we can do as well. The first thing we can do is if we have repeated values in a field like timestamps, as this is a very common case, instead of storing each one explicitly, we can store them as runs. So I can say, instead of having this first timestamp twice, I can say two and then the value
21:41
and so on for these other examples. This can save us a lot of space and time. The second thing we can do is to store low cardinality values in a dictionary and replace each value with its key in the dictionary. These keys are typically gonna be a lot smaller than the value they're mapping to, especially if we're talking about strings.
22:01
So this can save us a lot of space. And here we've done this with neighborhoods. So instead of storing each neighborhood explicitly, we keep a dictionary of neighborhoods, and we just store the index of each particular transaction's neighborhood in the column.
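A minimal sketch of these two ideas in plain Python (Parquet's actual encodings are more elaborate, but the principle is the same; the values follow the running example):

```python
# Run-length encoding: store (count, value) pairs instead of repeats.
timestamps = ["09:00", "09:00", "09:05", "09:05", "09:05"]
runs = []
for t in timestamps:
    if runs and runs[-1][1] == t:
        runs[-1][0] += 1          # extend the current run
    else:
        runs.append([1, t])       # start a new run
# runs == [[2, "09:00"], [3, "09:05"]]

# Dictionary encoding: store each distinct string once, plus small keys.
neighborhoods = ["Mitte", "Kreuzberg", "Mitte", "Mitte"]
dictionary = {v: i for i, v in enumerate(dict.fromkeys(neighborhoods))}
encoded = [dictionary[v] for v in neighborhoods]
# dictionary == {"Mitte": 0, "Kreuzberg": 1}; encoded == [0, 1, 0, 0]
```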
22:22
Now, whether we've used these encoding tricks or not, we can also use a general-purpose compression algorithm to compress each column so that we're saving some additional space. Another thing Parquet can do to improve performance is what's called predicate pushdown. And the idea behind predicate pushdown is that we partition our data into logical files and store metadata for each of these indicating the values
22:43
that are in them. So if we've partitioned our file and it looks like this, and we only wanna consider the subset of values at the top, I don't even need to read the rest of our data: we can ignore that grouping.
23:02
Similarly, if we're interested in a value that only appears in a subset of our data, like in this case, where we only have Kreuzberg in the set of records on the left, we don't need to look at the set of records on the right. Now, I said logical files, and this is a little bit of a white lie.
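Query engines apply predicate pushdown automatically, but you can express the same idea from Python; a sketch assuming PyArrow and a hypothetical file name:

```python
import pyarrow.parquet as pq

# Logical files whose min/max metadata rule out 'Kreuzberg'
# can be skipped entirely rather than read and decoded.
table = pq.read_table(
    "berlin-payments.parquet",
    columns=["timestamp", "amount"],
    filters=[("neighborhood", "=", "Kreuzberg")],
)
```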
23:21
In some cases, we can actually have multiple logical files in a single physical file, and Parquet calls these row groups. A row group is the columns for a subset of rows in a dataset, as well as the metadata for each. So if we had a single Parquet file, we might have two row groups in that file. And this doesn't change anything
23:41
about the way predicate pushdown works because we can treat these as logical files. The query engine can seek to a particular part of the file and read the values that we're interested in sequentially. Another wrinkle is if we're dealing with Parquet files generated on a cluster from a Spark or Hadoop job.
24:01
In this case, we actually will have a directory full of Parquet files that we can treat transparently as a single Parquet dataset. And each one of these is gonna correspond to row groups generated from a partition of our original dataset. So we can inspect the metadata of a dataset stored in Parquet
24:21
with the Parquet tools command-line utility. And this is a command that provides several subcommands to examine metadata, encodings, compression, and actual data in a Parquet file or directory. It'll provide a ton of output, so we're gonna look at it a little bit at a time. So here we can see that we're dealing with a Parquet file that has multiple parts that was generated with Apache Spark
24:41
and that Spark stored some metadata about the schema. The next part of this file shows us Parquet's schema for this file. And as we read on in the output, we have the metadata for each row group. This has a lot of useful detail in it, and it can tell us about how a query engine can handle our data efficiently by examining that metadata.
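The commands themselves aren't shown in the recording; invocations like these (against a hypothetical file name) produce the kind of output the talk walks through:

```
# Print the Parquet schema for a file or directory
parquet-tools schema berlin-payments.parquet

# Print file and row-group metadata: row counts, sizes,
# encodings, compression, and per-column statistics
parquet-tools meta berlin-payments.parquet
```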
25:02
First up is the row count and the total size of values in the row group. And then for each field, we have the data type, compression type, and compression ratio. And we can see in this case that the timestamp encoding saved us over 50% of the space
25:20
relative to the raw column values. We also see the column encodings here. So we see that we have dictionary encodings for a lot of these. And finally, we have some metadata about the kinds of values that we have in each column. So these things put together can really provide us some benefits
25:44
for query processing engines. So let's see where the myths break down, though. And we really wanna focus on this handoff between data engineers and data scientists, where Parquet's ubiquity breaks down.
26:00
And if we think about taking data from a JVM-based data engineering pipeline to a Python-based feature engineering pipeline, we might be thinking about going to Pandas. So the first problem you run into is availability. Pandas has an API method to let you read a Parquet file, but you'll have to install some other libraries to use it.
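A minimal sketch (the method and engine names are real pandas API; the file name is from our running example):

```python
import pandas as pd

# read_parquet delegates to an installed engine; with neither
# pyarrow nor fastparquet installed, it raises an ImportError
# naming both options.
df = pd.read_parquet("berlin-payments.parquet", engine="pyarrow")
```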
26:23
Now, Pandas gives you two options in this error message, fastparquet and PyArrow. There are trade-offs between each. In practice, I've found PyArrow has been the best for my projects, and that's what I'll focus on in the rest of the talk. Another problem is the capabilities of your implementations. So these are the compression types that Parquet supports.
26:43
Some of these will get you really great results if you're in an environment that supports them, but Snappy and gzip are gonna be the ones that are most widely available, and those are a good safe bet until you know what the environment you're running in is gonna support. Another problem is that if we're reading a Parquet file
27:03
into something like a Pandas data frame, these dictionary-encoded strings can get materialized on read. So instead of this nice compact dictionary-encoded representation, which we might wanna use as a categorical in Pandas, we have this long list of strings, right?
27:20
This is a problem that we'll see how to work around in the demo. The last problem is related to Parquet tools itself. Parquet tools is super useful but it's been deprecated upstream and removed from the Parquet repo. So if you look for it, this is what you'll get. As a workaround, you can pull an older version of Parquet tools from Maven or install it via the Homebrew tool.
27:41
And there's also an Arrow-based version of Parquet tools that runs in Python under development. So I wanna quickly go through and see some of these problems in action with our demo. And what we're gonna see is loading a data frame from Spark
28:03
and then seeing how Pandas inappropriately materializes these dictionary-encoded fields and how we can work around that. So here we have a Parquet file in Spark
28:21
and we're gonna look at the schema here and then look at the first 10 rows and see if it makes sense. So we have our amount, we have our neighborhood, our timestamp, our user ID, we have everything we expect to see there. Now, if we look at the first 10 rows, again, this is looking about like we expect. We have a few more neighborhoods in this dataset
28:43
but the data look pretty sensible. So now we're gonna go down and look at the Parquet metadata again with Parquet tools, and then we're gonna go ahead and read that into Pandas.
29:08
As you can see, we have the size, the value count, the sort of field metadata and the encodings, all that we expect to see there.
29:27
Now, before we actually read these into Pandas, we note, crucially, that our neighborhood column is dictionary-encoded, right? And when we read these into Pandas, we're gonna see an issue.
29:49
We notice that the neighborhood column actually takes up less space than its value count would suggest, because of the dictionary encoding. So that's a real advantage there. And we have the metadata
30:00
for what values are in that column. We also see that the timestamp, again, is compressed because of the run-length encoding.
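The same metadata the demo inspects with Parquet tools can also be read from Python; a sketch assuming PyArrow:

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("berlin-payments.parquet")
print(pf.metadata.num_rows, pf.metadata.num_row_groups)

rg = pf.metadata.row_group(0)
for i in range(rg.num_columns):
    col = rg.column(i)
    # Per-column encodings, compression codec, and sizes.
    print(col.path_in_schema, col.encodings, col.compression,
          col.total_compressed_size, col.total_uncompressed_size)
```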
30:22
Okay, so let's see what happens when we try and read these into Pandas. We get a Pandas data frame like we would expect with the read_parquet method; I have PyArrow installed here. That looks pretty good. But these strings have all been materialized. So instead of having a nice Pandas categorical,
30:42
which we could use more or less directly for exploratory analysis or to train a model, we are actually storing these as Python objects. So they're even taking up more space than they would as strings in this case. So we've had a problem with the round trip here.
31:07
So we could actually convert that column to categoricals, which we're gonna do here. And we see that if we save the categorical valued column from Pandas, we actually can recover that type information that we're interested in.
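A condensed sketch of that round trip (file names are illustrative):

```python
import pandas as pd

df = pd.read_parquet("berlin-payments.parquet")
df["neighborhood"].dtype    # object: the strings were materialized

# Convert the column to a categorical and write it back out...
df["neighborhood"] = df["neighborhood"].astype("category")
df.to_parquet("berlin-payments-cat.parquet")

# ...and the categorical type survives the next read.
pd.read_parquet("berlin-payments-cat.parquet")["neighborhood"].dtype
```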
31:23
But this sort of defeats the purpose of having an interchange format, right? If you say, well, I'm gonna hand this file over to a data science team, and they're gonna have to rewrite it to make use of it efficiently. So if we look at this round trip here, we're gonna get the types we expect.
31:41
So we see that this is a category instead of an object. So we can use the PyArrow API directly to sort of have more control over how we read this in. And as we can see, we can read a table with PyArrow.
32:12
And if we look at that table, we have a string rather than a Python object, which is a step in the right direction. And Arrow is a columnar representation,
32:25
so we know that that's gonna be dictionary-encoded as well. But if we convert this to pandas, again, we would go back to an object. So what we wanna do instead is specify
32:42
that we wanna read some columns and preserve that dictionary metadata. So I'll show you what this looks like here. And we're just gonna specify a list of columns that we're gonna read as a dictionary.
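This maps to PyArrow's read_dictionary parameter; a sketch with our example file and column names:

```python
import pyarrow.parquet as pq

# Keep the listed columns dictionary-encoded as they're read.
table = pq.read_table(
    "berlin-payments.parquet",
    read_dictionary=["neighborhood"],
)

# The dictionary column converts to a pandas categorical.
df = table.to_pandas()
df["neighborhood"].dtype    # -> CategoricalDtype
```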
33:01
So now if we look at that table, we see that it's a dictionary. If we convert it to pandas, we're gonna maintain that categorical type. We're running a little short on time. So I have some more code that's available from a blog post that you can see on how to inspect Parquet files,
33:22
but we'll skip that part of the demo and we'll just go ahead to our conclusions. So thanks everyone for your time. I hope that you're thinking more about some computing myths that you've discovered are false
33:41
and that you realize are useful anyway. And I think that Parquet is not perfectly ubiquitous and it doesn't always get you perfect performance, but if you're careful about how it works, you can get really excellent results. So here's what we talked about today.
34:01
We first talked about the organization of computer systems. We talked about these principles of locality. We talked about how these things fit together and how to get the best performance out of computer systems by reading things sequentially, operating on values that are close together. And we saw how a row-oriented format
34:24
can have really bad performance for analytic queries and for caches. And I hope everyone is noticing that I'm actually choosing a column-oriented format for these summary slides. The next thing we looked at was how Parquet can get good performance by techniques
34:43
like column encodings, run-length encoding and dictionary encoding, and then predicate pushdown, to only consider the parts of files that a query is actually interested in. And then finally, we looked at some interoperability challenges
35:02
that you might run into taking Parquet from an environment like the JVM to an environment like the Python data ecosystem, for example. And we saw some solutions by using the Arrow APIs
35:20
directly to sort of work around some of those challenges. So again, I cut the demo a little bit short, but the full interactive notebook version of that demo is available from a link on my blog, chapeau.freevariable.com. If you search that site for Parquet, you'll get a link to the GitHub repo with that notebook
35:42
and there's a buzzwords branch on that repo that has the Berlin payments dataset on it. So please keep in touch. Twitter and GitHub are great ways to reach me. I'm @willb on both, and you can send me an email at willb@nvidia.com.
36:02
And since we have a break now, I think the moderators are telling me we have time for a couple of questions. So I'd appreciate any questions you have, folks. Thanks again for your time. Thank you for the great talk. It was definitely a great introduction on different pieces of this tech as well.
36:20
And we do have a few questions, while we have like a break after this one. So the first question is basically: what is your opinion on ORC versus Parquet? It's like another format. That's a good question. So, I mean, from my perspective, there are advantages to both. The question is ORC versus Parquet.
36:42
And I know some people have had great success with ORC. I've used Parquet because of its ubiquity, really in more applications. And I think Parquet has sort of a little bit better coverage in the ecosystem.
37:01
But certainly there are performance advantages to ORC in some applications and it's always worth measuring and checking. Yeah, sounds good. And maybe another question from my side. Now we're talking a bit asynchronously, right? So you mentioned like CPU and caches and everything else. I've seen some companies also advocating
37:21
and saying like, hey, your data computations should run on a GPU, right? And even not talking about machine learning kinds of computations, but like normal search and indexes and sums, right? What do you think about that? Is it something that makes things faster, or does the overhead of the GPU actually take that away? So, I mean, this is a great question. Like I mentioned, I work on Apache Spark,
37:43
and some of what I work on at Nvidia is actually the product strategy for Apache Spark on GPUs. So, I mean, again, I'm not speaking for my employer. I think that columnar formats really open up accelerated computing to a lot of these applications, right? And it's, I mean, it's a case where GPUs can accelerate
38:03
what we think of as traditional database work. I mean, it's one of these things where in a lot of cases the devil is in the details, right? You have to be careful about how you use things. But if you have data that are in columnar formats, if you have queries that operate a column at a time, and if you can do enough work,
38:22
I mean, there is some cost in getting data from main memory to GPU memory and back. If you can do enough work over a task to amortize that out, you can get good results. So, I've seen in realistic Spark workloads, acceleration of 3X to 5X on data frame jobs, right?
38:42
With Spark on GPUs.