Apache Arrow and Substrait, the secret foundations of Data Engineering
Formal metadata
Title: Apache Arrow and Substrait, the secret foundations of Data Engineering
Series: EuroPython 2023 (talk 133 of 141)
License: CC Attribution - NonCommercial - ShareAlike 4.0 International. You may use, modify and redistribute this work for any legal, non-commercial purpose, provided the author is credited and derived versions are shared under the same license.
DOI: 10.5446/68639
Language: English
Transcript: English (automatically generated)
00:04
So I would like to talk to you about two technologies that I think have the potential to shape the future of data engineering. Those two technologies are Apache Arrow and Substrait.
00:21
Apache Arrow, for those of you who don't know it, was born as an in-memory format to represent data in a way that is understood by all libraries and languages. It was a polyglot, language-agnostic format
00:41
for sharing data between different pieces of software. Over the years it evolved from just a format into a library for working with that format. So now it is a library that supports data interchange,
01:01
a library that supports modifying the in-memory data, transmitting that format over the network, saving it to files and loading it back from files, any kind of IO. It also provides a built-in compute engine that works on top
01:21
of that format, and all kinds of other capabilities. So to begin, I would like to introduce Apache Arrow a little bit, because it is a way that we can start collaborating across very different languages. In the past, when you wanted to share some data
01:42
between different libraries or environments, for example sharing your Pandas data frames with Spark or something like that, those pieces of software had to go through a process of copying the data and converting it into a format they understood,
02:03
and only then could you work with that data. What Apache Arrow is trying to do instead is to let all those pieces of software share the same data without any cost of copying or converting it,
02:22
because they all understand the Apache Arrow format. The foundation of Apache Arrow data is obviously arrays. I hope no R developers were offended by the previous slide.
02:41
The foundation of the data that we can represent in Arrow is arrays, because most of the time in data engineering and data science you care about columns of data: you want to run an operation on a set of multiple rows. In our case as Python users, that is represented as a pyarrow array.
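A minimal sketch of what that looks like (the values are illustrative):

    import pyarrow as pa

    # Arrays of numbers, strings, or nested structures
    numbers = pa.array([1, 2, 3, 4])
    names = pa.array(["Alice", "Bob", "Carol"])
    points = pa.array([{"x": 1, "y": 2}, {"x": 3, "y": 4}])

    print(numbers.type)   # int64
    print(names.type)     # string
    print(points.type)    # struct<x: int64, y: int64>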
03:06
At this layer, pyarrow is not too different from what you might already be used to with NumPy. We can perform standard kinds of operations like creating an array, and that array can contain any kind of data: numbers,
03:25
strings, or structures. And you might be asking, why would I ever use pyarrow, given that many libraries already understand the NumPy format? Well, because NumPy was born without thinking too much
03:44
about the portability of that data at the time. So the format is not very optimized for transmission over networks, or for keeping reads from memory fast.
04:01
For example, typical cases where NumPy does not perform well are when you want to work with strings, with structured objects, or with missing data. In all those cases NumPy does not provide a built-in, optimized implementation;
04:25
it mostly relies on Python. So if you store a string, it will be a Python object. That means that when we want to work over the data
04:41
in the array, we will be slower, because instead of having a contiguous area of memory we just have pointers to Python objects, so we cannot really implement a high-performance algorithm that processes them in a vectorized way. With PyArrow, or better, with the Arrow format,
05:04
everything is stored in contiguous areas of memory. PyArrow tries to give that as a general guarantee for any kind of data, and that includes strings. For example, below you can see the way strings are stored. That means that if I want to look for a string in the array,
05:23
I can just perform a linear scan over the data. If I want to perform a transformation, I can apply a vectorized function to all the strings, and so on. Also, differently from NumPy, Arrow provides built-in support for missing data.
05:42
If a row of your data is not available, you can flag it as missing through a validity bitmap that Arrow keeps alongside the values. You might say, well,
06:03
I could just store zero or something like that to signal a missing value. Yes, you could, but then what happens if zero also occurs as valid data? Or what happens if you want to know why that data is missing? A very common case when working with surveys is
06:21
that you have missing data, but there are multiple reasons why it might be missing. It might be missing because the person didn't answer the question, or because the person answered "I don't know", or because the person answered "I don't want to answer". By having both the values array and the validity bits,
06:44
we can store both that a value is missing and, in the values themselves, why it is missing, for example because the person did not answer. Then obviously you say, well, but I don't usually want to work with just one column.
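A small illustration of that validity bitmap in pyarrow (the survey-style values are made up):

    import pyarrow as pa
    import pyarrow.compute as pc

    # None entries are tracked in a separate validity bitmap,
    # so 0 stays a perfectly valid value.
    answers = pa.array([5, 0, None, 3, None])

    print(answers.null_count)   # 2
    print(answers.is_null())    # [false, false, true, false, true]
    print(pc.sum(answers))      # 8, nulls are skipped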
07:02
I usually want to work with a more complex set of data and understand the relations between the different columns, or something like that. Great. That's why Arrow also provides support for columnar data, mostly in the form of tables, but also in the form of record batches and other ways you can manage tabular data,
07:27
depending on what you need to do. Usually, as a user, you will end up using tables, because those are the most similar to what you can do in something like a Pandas DataFrame or other libraries
07:41
that you might be used to. The foundation of a table is not the array itself; it's actually a chunked array. That's because you might be appending more data, extending the columns, or performing those kinds of operations, and the fact that we can use a chunked array means
08:00
that those operations are zero cost, because we don't have to reallocate the whole array: we just append a chunk at the end. And here is a simple example of a table. In general, you just create a table from a set of columns, assigning a schema to the table,
08:21
and the schema is no more than names and, usually, types, but the types are already known by the arrays themselves. On top of the tables, you could be wondering: great, I have arrays, I have tables, so what?
08:40
Why should I use them? Well, first of all, because, as we said, if you represent your data as Arrow arrays, Arrow tables or Arrow record batches, you will be able to share the data across different libraries without incurring any cost of conversion.
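For instance, building a table from columns looks roughly like this (column names are illustrative):

    import pyarrow as pa

    # A table is built from named columns; each column is a chunked array.
    table = pa.table({
        "first_name": ["Alice", "Bob", "Carol"],
        "country": ["Canada", "Italy", "Canada"],
        "age": [34, 28, 45],
    })

    print(table.schema)                     # names and types of the columns
    print(table.column("age").num_chunks)   # columns are ChunkedArrays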
09:01
And then, obviously, in case you don't want to share the data with something else but just want to run some simple exploration of your data, Arrow implements a built-in compute engine based on the Arrow format itself. Now, let's look at how Arrow allows us
09:21
to store data on disk. Usually, people will rely on the Parquet reader, which is very optimized for loading data from Parquet into the Arrow format itself. But if you want, for various reasons, there is also the native Arrow on-disk format. That allows you to do things like memory-mapping your arrays
09:44
from disk, because the on-disk format is equivalent to the in-memory format, so there is no cost of loading or conversion. For that reason it's usually very fast, but it will be bigger: Parquet implements a whole set of compression algorithms
10:02
and dictionary encoding, another form of more effective and efficient storage. But by virtue of the fact that you can memory-map the data on disk, the Arrow native format allows you to easily work with data that is bigger than your memory, because you just let the kernel
10:22
swap pages in and out of the memory-mapped file. And we have seen that once we load data, we have it in the Arrow format, which allows us to share that data with many different libraries. You might be wondering, why should I even share that data
10:40
with different libraries? Well, one very simple use case that can give you an idea of why this might be helpful is loading a Parquet file. Now, Pandas is based on Apache Arrow, for those who don't know; since version 2.0 it uses Apache Arrow for various use cases.
11:02
But before version 2.0, if you wanted to load a Parquet file into Pandas, it was faster to load it with Arrow and then convert it to a Pandas data frame. It was an order of magnitude faster, so not just a small improvement.
11:20
That's because the Parquet loader was more optimized for loading data into the Arrow format, and then there was no cost of converting that data from Arrow to Pandas, because Pandas understood the Arrow format natively. Well, in reality there was some conversion cost: if you had strings, as we have seen,
11:42
Pandas relies on NumPy, and NumPy does not really have great support for strings. Or for some other kinds of columns, a conversion is required. But most basic types, like numerical types, incur no cost of conversion at all.
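A hedged sketch of the two loading paths mentioned above, plus the hand-off to Pandas; the file names data.parquet and data.arrow are hypothetical:

    import pyarrow as pa
    import pyarrow.ipc as ipc
    import pyarrow.parquet as pq

    # Path 1: read Parquet straight into an Arrow table, then hand it to Pandas.
    table = pq.read_table("data.parquet")
    df = table.to_pandas()

    # Path 2: memory-map the native Arrow IPC format; pages are only pulled
    # into RAM when they are actually touched, so the file can exceed memory.
    source = pa.memory_map("data.arrow", "r")
    mapped = ipc.open_file(source).read_all()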
12:04
And then, we have a compute engine in Arrow. So if I want to perform transformations or analysis on top of the Arrow data itself, yes, I can ship that data to Pandas and perform all my transformations in Pandas. Or I could just run the same transformations
12:22
in Arrow itself, using the Acero compute engine which ships together with Arrow. Generally, Acero is faster than Pandas for most use cases, because it's implemented natively in C++ with vectorized operations, while most compute functions in Pandas, even when optimized and implemented in Cython, are generally not as fast as the Acero kernels.
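Those compute kernels are exposed in Python through pyarrow.compute; a small sketch (the column values are made up):

    import pyarrow as pa
    import pyarrow.compute as pc

    answers = pa.array(["yes", "no", "yes", None, "yes"])

    print(pc.unique(answers))        # distinct values in the column
    print(pc.value_counts(answers))  # each value with its number of occurrences
    print(pc.count(answers))         # counts non-null values only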
12:40
Arrow is also not just a format for disk and memory, but also for networking. So if you want to ship the data across the network,
13:02
you can rely on Arrow Flight, which is a protocol to share Arrow data across networks. Here is a brief example of what you can do if you use the Acero compute engine to perform operations over a table. Generally, you have the basic operations like group-by aggregations,
13:23
filtering, joins, those kinds of things. Pandas has a much richer set of operations, but, as I told you, Acero is generally very fast. In this case, for example, we can see what finding the unique values of an array looks like.
13:43
This is a very common kind of operation: you want to know how many occurrences of a value you have in an array, which might mean how many people answered a specific question of my survey, going back to the example of surveys;
14:00
then you want to run a unique count. Doing that in NumPy took about one and a half seconds. Doing the same exact operation over the same exact data in Arrow took about 0.4 seconds. So you can see that it's usually three or four times faster
14:21
than running the same operations in NumPy. And I told you a little bit about Flight. Flight is a protocol that we can use to ship data, and you can easily implement a Flight server and a Flight client by relying on the classes that are built into pyarrow itself.
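A minimal sketch of such a pair, assuming an in-memory table we want to serve; the host, port, ticket and dataset are all illustrative:

    import pyarrow as pa
    import pyarrow.flight as flight

    class TinyFlightServer(flight.FlightServerBase):
        """Serves a single in-memory table for any ticket it receives."""

        def __init__(self, location="grpc://0.0.0.0:8815"):
            super().__init__(location)
            self._table = pa.table({"id": [1, 2, 3], "value": [10.0, 20.0, 30.0]})

        def do_get(self, context, ticket):
            # Stream the table back as Arrow record batches over gRPC.
            return flight.RecordBatchStream(self._table)

    # Server side (blocks, so run it in a separate process):
    # TinyFlightServer().serve()

    # Client side: fetch the stream and materialize it as an Arrow table.
    client = flight.connect("grpc://localhost:8815")
    table = client.do_get(flight.Ticket(b"anything")).read_all()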
14:40
Flight is mostly based on gRPC: it's a binary-only protocol that uses gRPC for the metadata and the Arrow format for the data itself. The reason why Flight is important is that it, like most of Arrow, is optimized for performance,
15:03
for shipping data across different libraries with very little overhead. And when I talk about very little overhead, I mean it for real. In the case of shipping something like five million or a billion records through an ODBC driver,
15:23
you can see that the difference is huge. With five million records, using ODBC takes something like 60 seconds. Doing the same exact thing using Arrow Flight takes something like five seconds; I don't remember the exact value, but that's the order of magnitude.
15:43
And if we start talking about bigger and bigger data, that difference increases. The reason is that there is never any cost of conversion: the data goes into the network layer without any conversion cost,
16:01
and out of the network layer into the memory of the receiving end without any conversion cost, and the data is generally stored in a very compact and optimized form. So we have seen that with Arrow we can load data
16:21
from disk very fast, we can manipulate it very fast, and we can ship it across the network very fast. So we have, in practice, all the building blocks we might need to work with the data. When working with data, there are two things that we have to do. One is obviously having a way to represent
16:42
and manipulate the data itself. The other is being able to express the manipulations or queries that we want to run on the data. Now, we know that many libraries can support Arrow data.
17:02
They can understand any data you give them in the Arrow format, and they can give you back data in the Arrow format so you can then forward it to something else. But that would be incomplete if we were only able to ship the data without being able
17:22
to tell the library what to do with it. That's the reason why, apart from the Arrow stack itself, we also have support for Substrait. Substrait is a language that allows you to represent query plans themselves.
17:42
So as much as Arrow is a lingua franca for sharing data across different software, Substrait is a lingua franca for sharing queries across different software. It's a way to implement an API-independent
18:04
and language-agnostic form of queries. Imagine something like SQL, but where there are no dialects: everyone understands the same exact queries. And imagine something like SQL, but binary, so optimized
18:21
for computation and for software instead of humans. And imagine something like SQL that is decomposable, so I can take one query and split it into sub-pieces. Why is that important? Imagine a world
18:41
where you send a query to a central compute engine, and that compute engine splits the query into smaller pieces and sends, for example, the filtering part to the data-loading side. For example, some people are thinking of implementing support for basic Substrait filters directly in disk firmware.
19:07
That means that when I need to load data from a file, I will just take the filter part of the Substrait query and forward it to the disk firmware, and the disk firmware will already give me back the pieces
19:22
of the file that contain only the rows matching that filter. Then I can forward those rows to the compute engine, which will run the rest of the query. Or, for example, if there is a query that performs a machine-learning operation and I want to use Spark ML
19:43
for that specific part, that specific function, I can ship the data with Arrow and that piece of the query with Substrait to Spark, and Spark will process that part of the query over my data. And all this is for free, because we don't have
20:04
to convert the data or the query into any other format: they understand Substrait queries and Arrow data natively. Well, I used Spark as an example because it was easy, but support for Substrait in Spark is actually something they're working
20:23
on right now. There is an Intel-backed project named Gluten, which is a fork of Spark that supports Substrait, but the work of implementing Substrait is now happening in the core repository of Spark itself.
20:42
So in the future we will be able to just use PySpark with Substrait. And when I have those Substrait queries and I want to ship them to different software, how can I do that? Well, that's where Flight SQL comes into play.
21:04
Flight SQL is a version of Flight which is optimized for shipping queries and shipping the results of those queries back from the compute engine. So, for example, you could use any driver that is able
21:22
to speak Flight SQL and use it to ship the query in Substrait format, or in SQL if you want; Flight SQL obviously, as you can guess from the name, also supports SQL. And once you ship that query, the endpoint,
21:40
the receiver, will answer you back with Arrow data. That means we get both the benefits of Substrait and the benefits of Arrow in terms of performance and compatibility of the data. To make things easier to work with, because Flight SQL is
22:02
in practice very low level, it sits at the network level, so yes, you could use it, it's not super hard, but it's also not very convenient, ADBC, Arrow Database Connectivity, was born.
22:21
ADBC is in practice able to connect to any database that supports Arrow data, and especially those that support Flight SQL. At the moment, you can connect with ADBC to Postgres, DuckDB, SQLite, and many more.
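A hedged sketch of what that looks like with the SQLite ADBC driver (the driver package is installed separately, e.g. adbc-driver-sqlite, and the database and table are made up):

    import adbc_driver_sqlite.dbapi as adbc_sqlite

    # DB-API style connection, but results come back as Arrow tables.
    with adbc_sqlite.connect("example.db") as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT name, country FROM people WHERE country = 'Canada'")
            table = cur.fetch_arrow_table()  # a pyarrow.Table, no row-by-row conversion

    print(table.num_rows)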
22:42
The reason why this is important is that if you go through ADBC, you benefit from all the performance improvements that you get when connecting through an Arrow-native format. In practice, you gain the same performance that we saw when using Flight.
23:03
Well, it might not be exactly on par, because there are cases where, say, the Postgres protocol might be less efficient than Flight itself. But for general cases, I think we can say that you will get as much performance improvement as you could expect from Flight itself.
23:28
So, if you are wondering, the main difference between Flight SQL and ADBC is that both are Arrow-native protocols, and ADBC can use Flight SQL underneath if you want.
23:41
But ADBC is general: as a user, you don't know and you don't care how you are connecting to the database; ADBC abstracts all of that away for you. Flight SQL, instead, has to be implemented specifically for each database.
24:01
So if you use Flight SQL, you need to speak the exact dialect of that database, and not just in the form of the queries that you send, because the query dialect is something you will have to adhere to also when using ADBC, but also in terms of how the database ships the metadata,
24:22
how the database ships back the pointer to the query result. All of that is not abstracted away for you by Flight SQL; as I told you, it's a fairly database-specific driver.
24:42
So it's lower level than ADBC. But if you use ADBC, you in practice get the benefits you are used to getting from ODBC or JDBC, but with the performance of an Arrow-native system. To recap what we have talked about so far, we have several ways
25:09
to connect and send queries to a compute engine. The most common ones you will probably want to use as a Python user are, first, pyarrow itself, which is tightly coupled
25:25
with the Acero engine. If I send a query to run over my data, pyarrow only supports Acero, so the query runs on Acero. The benefit is that you incur no
25:43
query conversion cost and no cost of shipping the query and the data back and forth: pyarrow directly shares, let's say, the memory pointer to the data with Acero,
26:00
Acero performs your query and shares the result back with pyarrow. That means that for some cases it can be very effective, but it also means that it won't be easy to scale, because it's a local engine that runs on your own system; it's not something that you can deploy remotely across,
26:23
say, 20 nodes or something like that. The reason why I'm mentioning it is that it's frequently a very good way to work when exploring data. You have data that is already in Arrow format somehow, maybe because you loaded it from disk, maybe because you got it back
26:43
from another application, and you can directly use pyarrow on your machine to view the data, perform basic manipulation, filter it, aggregate it, try to understand the shape of that data. But when working in a more production-ready environment,
27:04
you probably want to use something like Ibis. For the people who don't know it, Ibis is a front end to perform common analytics on top of different compute engines or backends.
27:24
So, for example, if I have data in Postgres, I can use Ibis to connect to Postgres, design my query, write it in Ibis, send the query to Postgres and get back the data. Ibis natively supports getting data back
27:43
in the Arrow format or in the Pandas format. So if you are used to Pandas, you can just get your results back as a Pandas data frame, and Ibis obviously also supports Pandas as a backend.
28:02
From the data you get back, you can then perform further transformations using Ibis. That is convenient because it allows you to write a code base that is agnostic to your backend. Suppose that one day you are executing your queries on Postgres,
28:21
and after a year of running on Postgres you find there are limitations and you cannot further improve your application; then you can shift to something like DuckDB, and you won't have to rewrite a single line of code, because both of them are supported by Ibis.
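A small, hedged sketch of that backend-agnostic style with the DuckDB backend (the table and column names are invented, and a recent Ibis version is assumed):

    import ibis

    # Swapping ibis.duckdb.connect(...) for ibis.postgres.connect(...) would
    # leave the rest of this code unchanged.
    con = ibis.duckdb.connect()                   # in-memory DuckDB
    people = con.read_parquet("people.parquet")   # register a Parquet file as a table

    filtered = people.filter(people.country == "Canada")
    expr = filtered.group_by("gender").aggregate(n=filtered.count())

    print(expr.to_pandas())    # results as a Pandas DataFrame
    print(expr.to_pyarrow())   # or as a pyarrow Table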
28:41
And how does Ibis communicate with those backends? By using SQL or Substrait, for the backends that understand one of them. It means that the queries you write in Ibis will be compiled to Substrait or SQL
29:03
and shipped to the backend. How do they get shipped? Well, we are in the Arrow ecosystem, so they can get shipped using ADBC, they can get shipped using Flight SQL, or, in the case of backends that don't support any of the Arrow-native formats,
29:20
they can be shipped with a dedicated custom backend that Ibis usually provides. Inside Ibis there are many backends, I think more than 10 of them or something like that. Once you have your data in Arrow format and your queries in Substrait format,
29:42
you can send them to any compute engine that is compatible with those two. So, for example, using Ibis I could ship that query to Acero itself if I want, or I can ship it to DuckDB, to Meta's Velox, to cuDF or to DataFusion.
30:02
That means that I am able to choose whichever solution works best for my use case. Once those compute engines receive the query, they will be able to execute it on any data that is supported by Arrow itself.
30:21
So, for example, you can run the query with Acero on your Parquet files or something like that, and you can have your Parquet files stored in any place that is supported by the Arrow filesystems, like S3 or things like that. So we have many building blocks that, if you think about it,
30:43
in practice are all the pieces that you need to implement a database for common cases. So one little example that I want to show you: it's just a toy, not something serious that I'm suggesting you should be using in production,
31:00
but it gives you a very concrete sense of what can be done using Arrow, Substrait and all the libraries that are part of that ecosystem. What you can see here is a small database that I've written in about 50 lines of code. For this database, we are going to use Parquet
31:24
as the storage format for our tables, we are going to use Substrait as the representation of our queries, we are going to use SQL as the language that you, as a user, can use to define your queries,
31:42
and we are going to use Arrow tables as the format in which we get back the results of our queries. Well, the first few lines of code are obviously very simple, they are just related to setting up a way for the end user to pass the query:
32:03
we just accept the query as the first argument of our program. As the next step, we want to understand which tables are available, so which files we can run the query against and what their schema is.
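That discovery step might look roughly like this; Parquet footers carry the schema, so no data needs to be loaded yet (the file layout and names are illustrative):

    import glob
    import pathlib
    import sys

    import pyarrow.parquet as pq

    query = sys.argv[1]   # the SQL text is the first command-line argument

    # Every Parquet file in the working directory becomes a "table",
    # keyed by the file name without the extension.
    schemas = {
        pathlib.Path(path).stem: pq.read_schema(path)
        for path in glob.glob("*.parquet")
    }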
32:23
So we just scan for every Parquet file and read its schema. Why do we want to do that before we even have an actual query to run? Because to be able to compile the query into something the compute engine understands,
32:41
you will usually need the schema: the compute engine needs to know whether a column holds integers or floats, so that if your query contains a filter that compares a column against a constant, say x = 10, it knows whether to use the implementation of that comparison for floats or for integers.
33:02
Obviously, as you can imagine, for integers implementing a comparison like that is very simple, because they are exact. Implementing it for floats requires a bit more work, because the value you are passing might not parse to the same exact value that the compute engine has in memory,
33:22
even though you wrote the same number. There might be an approximation occurring when parsing the string into the actual in-memory format, and that can mean that the data you have in your Parquet file is slightly different from the value you wrote in your query,
33:44
due to the approximation that is inherent in floats. Once we have our query as the argument and our tables, what we can do is parse the SQL, and we can do that using Ibis.
34:03
As I told you, Ibis is a tool to represent queries, and by virtue of it being dedicated to representing queries, it also provides a parser for SQL, because in some cases you might not be willing
34:21
to write a query in the Ibis format itself. If you have ever used SQLAlchemy, the Ibis query language is very similar. But maybe you have an old query in a legacy system, and you don't want to convert it
34:42
to the Ibis format as the first step. You first want to set everything up, and once you see that the system is working, you can start slowly converting your queries to Ibis. That's the reason why Ibis has native support for SQL, so I can directly run a SQL query by passing the string to Ibis.
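As a sketch of that step: recent Ibis releases expose a SQL parser that takes the SQL text plus a catalog of table schemas and returns an ordinary Ibis expression. The exact entry point and catalog format have shifted between Ibis versions, so treat the names here as assumptions to check against your installed version:

    import ibis

    # Hypothetical catalog: table name -> schema of the matching Parquet file.
    catalog = {"people": ibis.schema({"first_name": "string",
                                      "surname": "string",
                                      "country": "string"})}

    # ibis.parse_sql is assumed here; it turns the SQL string into an Ibis
    # expression that can later be compiled to Substrait.
    expr = ibis.parse_sql(
        "SELECT first_name, surname FROM people WHERE country = 'Canada'",
        catalog,
    )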
35:06
For the same reason, Ibis also comes with a compiler to Substrait, the ibis-substrait package. So once we have parsed the query into Ibis, we can ask ibis-substrait to compile the query
35:22
we parsed into Substrait itself. And once we have the query in Substrait form, we can run it on the Acero compute engine, which we get for free in pyarrow itself. The only two things that we need to provide
35:40
to Acero are the query itself and a table provider. The table provider is just a lookup function: when there is the name of a table in the query, it allows Acero to find the table,
36:01
that is, the actual pyarrow table that matches that name. So it's just a lookup map from names to pyarrow tables. And in this case, you can see that the way I do it is very straightforward: I just look for a Parquet file
36:21
with the same exact name as the table. So in your query you write the name of the Parquet file instead of a table name, and I read that table. Here you can see that I'm actually reading the data only for the tables that you mention in your query,
36:41
so I'm not going to read everything into memory and then run queries on top of it; we only read into memory the individual tables that we need. Once we have sent our query to Acero, we get back the answer in Arrow format. That means that the answer will be another table,
37:02
with only the columns and rows that we specified in our SQL query. In this case I also convert the table to Pandas, just for the sake of printing it in a way people are more used to.
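Putting the whole pipeline together, a condensed, hedged sketch of such a toy engine is shown below. It assumes the ibis-substrait package and a recent pyarrow, a Parquet file per table in the working directory, and, for brevity, it builds the Ibis expression directly instead of parsing SQL; the file and column names are invented.

    import pyarrow as pa
    import pyarrow.parquet as pq
    import pyarrow.substrait as substrait
    import ibis
    from ibis_substrait.compiler.core import SubstraitCompiler

    # 1. Declare the "table" we expect to find as people.parquet on disk.
    people = ibis.table(
        ibis.schema({"first_name": "string", "surname": "string", "country": "string"}),
        name="people",
    )

    # 2. The query: in the real demo this comes from parsing the user's SQL.
    expr = (
        people.filter(people.country == "Canada")
              .select("first_name", "surname")
              .limit(5)
    )

    # 3. Compile the Ibis expression to a serialized Substrait plan.
    plan = SubstraitCompiler().compile(expr)

    # 4. Acero resolves table names through this callback: name -> Parquet file.
    def table_provider(names, schema=None):
        return pq.read_table(f"{names[0]}.parquet")

    # 5. Run the plan on Acero and print the result through Pandas.
    reader = substrait.run_query(
        pa.py_buffer(plan.SerializeToString()),
        table_provider=table_provider,
    )
    print(reader.read_all().to_pandas())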
37:20
So, to see this in action, given a few Parquet files that I had, I tried to run it. The first case is a simple query that finds every person living in Canada, asking for the first name, surname and title
37:43
and limiting the results to five, because I just want to explore what the data looks like. And if I run my query through the little code base we have seen before, it will answer with that data, a preview of the data in table format.
38:02
But I can also run more complex things. Suppose that I want to know the gender of all the people that I have in my database: I could write a SQL query that does a group-by count to find the occurrences
38:22
of each gender among all the people living in Canada. That's it. Those are two very simple queries, but the purpose was to show you how, with just 50 lines of code, we were actually able to implement a working SQL database. The next step, if you want,
38:42
might be to implement network support, decoupling the client and the server by using Flight SQL, for example. At that point you would have a whole stack based on Arrow that never involves any cost of converting the data. And if I want,
39:02
I can easily replace the three lines of code that run the query with something different, like Meta's Velox. For the people who don't know it, Velox is a very optimized compute engine that Meta wrote for their own analytics; it has been released as open source
39:21
and it's going to be the foundation of Presto, which is the database system Meta will use in the future. So that's all. I hope I gave you a good overview of everything you can find in the Arrow ecosystem. There is obviously much more than just the things that I've shown you,
39:42
and I hope you will go out and experiment with it, because more and more libraries are transitioning to the Arrow format and to Substrait so that they can share data with each other. For example, you can run XGBoost on top of your Arrow data,
40:05
you can use many different libraries, or you can use GeoArrow to represent geographic data. The ecosystem is growing, so this is the right time to start trying it. Questions? Amazing.
40:21
Thank you so much. Wonderful. So, do we have any questions from the audience? Wonderful.
40:41
Thank you for the talk. I know some people who are running analytics on Amazon Athena, which lets you store files in various formats on S3 and then run SQL queries or what have you. The formats supported include ORC,
41:03
Avro and Parquet, which seem to be Apache projects. Do you know if there is any work to add the Arrow formats, or is maybe the Parquet format close enough that it would be simple for Amazon to add? Or, alternatively,
41:21
is there a path for experimenting with this, using Arrow or the Ibis stack, to try running analytics with them instead of Athena?
41:41
Well, there are solutions that will do the same things you would do with Athena, on other systems. DuckDB is a good example: it's a database designed for running analytics on top of Parquet files.
42:00
And the Arrow library provides built-in support for many formats. As you mentioned, we have Parquet, we have ORC, and we are growing support for things like Iceberg. So we are trying to cover all the formats that people might have their data in.
42:22
Obviously, it's a different world: it means moving from a cloud system to something that you might have to manage yourself. But there are cases where companies are starting to provide cloud-based solutions.
42:41
I think the DuckDB people recently founded a company, MotherDuck, and I believe their plan is to provide cloud environments for DuckDB. So yes, in theory, if you have your files on S3 and they are in Parquet format, you don't even need to touch your Athena deployment.
43:02
You can just take a piece of software that supports Arrow, have it connect to S3, and run the queries on top of your Parquet files. So you can start experimenting with an Arrow solution in parallel with your current Athena deployment. Okay, thank you. Wonderful.
43:20
Thank you. So that's all the time we had for Q&A, I'm so sorry. You can find Alessandro around in the corridors, just grab him and ask your other questions. Also, we had a request for your slides, because the talk was really good, to be uploaded on Discord. So there we go.
43:41
Thank you so much again, everyone. Thank you.