
The MariaDB Jupyter Kernel


Formal Metadata

Title
The MariaDB Jupyter Kernel
Number of Parts
637
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
The reliability and performance of Open Source database solutions can't be contested. They are fast and they are stable, but they have fallen short of integrating nicely with the latest data science and research workflows. If you store your data in MariaDB or even other database systems from the MySQL family, I am proud to let you know we have been working hard to create a beautiful Open Source friendship between the MariaDB Server and the Jupyter ecosystem. The MariaDB Jupyter kernel aims to improve the interaction with your data: it allows you to run MariaDB directly in a Jupyter notebook. You can display the results of your favorite SELECTs in Matplotlib graphics or export them into a DataFrame in another Python notebook through database-specific %magic commands. This talk covers the current state of the MariaDB kernel, the existing features, how to install and use it, and how we imagine it should look in terms of functionality in the near future. It also explains the inner structure of the kernel, how to pass different configuration options, and shows demos for some of the most common use cases. No background knowledge is expected to understand the content of this talk: if you've ever used a Jupyter notebook, a SQL database, or both, or you'd just love to learn about these technologies, you're more than welcome to attend.
Transcript: English (auto-generated)
Hi everyone, and welcome to the MariaDB devroom at FOSDEM. My name is Robert, and I'm a server developer at the MariaDB Foundation. My talk today is basically a follow-up to the MariaDB Jupyter kernel presentation that I gave during the MariaDB Fest conference.
You'll see here the main features in action, installation demos, and we will chat about the internals of the kernel. I'm happy to announce that our first beta release is out. It took a bit more than expected, but among other work in the Foundation, we managed to keep a steady pace, and here we are.
This release contains all the features we intended to have initially, and probably all the bugs we did not as well. But no worries, that's how our project started. And also, we added some more features that seemed natural to develop along the way.
After the MariaDB Fest talk, we got our first community contributions, and that was an amazing feeling. I would like to thank Seth Shonad from TileDB, who contributed several bug fixes and also automatic versioning based on setup tools, Mike Waterman for a code formatting patch,
and Jeremy from Shop.com, who gave me excellent feedback and described his planned use cases thoroughly. I'm really grateful, thank you for this. In terms of the main features we have in this release, a MariaDB notebook behaves exactly like a normal MariaDB command-line client,
with minor glitches here and there. We added magic commands that create different types of charts for the data of the last query, lsmagic, which acts as a list of available magics along with some usage information. We also provided the ability to pass a JSON configuration file
to tweak some behaviors of the kernel, automatic detection of running server, and the ability to start one if none is detected. And the last but not least, we added data frame-like rendering for tabular results. Let me show you some demos to see practically how to play around with the kernel.
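To give an idea of what such a configuration file can contain, here is a minimal sketch. The key names and values follow what the kernel's README described around the time of the talk, but the file location and exact keys are assumptions to verify against your installed version:

```json
{
    "user": "root",
    "host": "localhost",
    "port": "3306",
    "password": "",
    "start_server": "True",
    "client_bin": "/usr/bin/mariadb",
    "server_bin": "/usr/sbin/mariadbd"
}
```

The kernel would read a file like this at startup to know how to reach a running server, or, if none is detected, whether and how to start one itself.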
To install the kernel, you need to run pip install mariadb-kernel, like this. Installing. Pip is getting all the dependencies required.
And it's done. Now we need to make the mariadb kernel visible to JupyterLab. And we do this by running the install script like this. Yep, the kernel spec is installed.
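Collected in one place, the installation steps shown in the demo look roughly like this; the exact module path of the install script is an assumption based on the demo, so check the kernel's README for your version:

```
# Install the kernel and its dependencies from PyPI
pip install mariadb_kernel

# Register the kernel spec so JupyterLab can discover it
# (install-script path assumed from the demo; verify in the README)
python -m mariadb_kernel.install

# Sanity check: the MariaDB kernel should now appear in the list
jupyter kernelspec list
```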
As you can see here, I already have JupyterLab installed on my computer. Here it is. Let's open it. Some logs here. And switching to the browser window.
This is the main interface of JupyterLab. And here you have the mariadb kernel. If you look at the small circle on the right, you can see the kernel is idling. Then I type a test command, and it works.
Next we will talk about running some live commands in a notebook. If you suffer from cold fingers, like I do during live demos, notebooks are a great tool to help you cope with this. You can still show the audience that you didn't fake the results,
but you also don't have to type live, because you already have everything filled in. So let me walk you through this example notebook. We have a dataset, a table called committers,
that contains all the commits that happened in the mariadb server repository between 2009 and 2019. And this table contains, for each commit, a timestamp, the name of the author, the organization email,
how many files the user changed, how many lines the author added, and how many lines the author deleted. So we want to perform some queries and some analytics on this table.
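For reference, the table just described might look something like this; the column names and types are my own guesses from the narration, not the actual schema used in the demo:

```sql
-- Hypothetical schema for the demo dataset: one row per commit
CREATE TABLE committers (
    commit_time   DATETIME,      -- when the commit happened
    author        VARCHAR(128),  -- name of the author
    organization  VARCHAR(128),  -- the author's organization
    email         VARCHAR(128),  -- the author's email address
    files_changed INT,           -- how many files the commit touched
    lines_added   INT,
    lines_deleted INT
);
```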
So here are the columns. We can execute this statement to get the table columns. And if we want to compute the top 10 individual committers for this timeframe, this is a markdown cell. If you execute it, it will be displayed nicely, like this.
So to compute the top 10 individual committers, we select the name, we group by the name of the author, and we count the total commits in each group.
We sum the total of files changed, the number of lines added, and the number of lines deleted, and we order the results by the total number of lines deleted. I'm not sure what's a good metric here;
let's say, for example, this one. And we limit the result set to 10. And we get a nice table where we see the top 10 committers. As you can see, the table is formatted like a Pandas data frame. And the conclusion here is that Monty didn't work enough.
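The query narrated above can be sketched as follows; it matches the described GROUP BY and ORDER BY, but the table and column names are assumptions:

```sql
-- Top 10 individual committers, ordered by total lines deleted
SELECT author,
       COUNT(*)           AS total_commits,
       SUM(files_changed) AS total_files_changed,
       SUM(lines_added)   AS total_lines_added,
       SUM(lines_deleted) AS total_lines_deleted
FROM committers
GROUP BY author
ORDER BY total_lines_deleted DESC
LIMIT 10;
```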
Moving on, as you can see, I perform the select, another markdown element, and then we can check out what the kernel can offer in terms of magic commands. And we do this by using the lsmagic command.
And as you can see here, it lists all the magic commands of type line, and all the magic commands of type cell, of which there are none yet. We have some ideas there, but that's soon to follow.
So we have line, bar, pie, df, and lsmagic. You also have some help and some guidelines on the right to get you started. And that's what the lsmagic command gives you.
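As a quick reference, line magics are invoked at the start of a cell line. The option syntax below is an assumption pieced together from how the magics are used later in the demo, so check the help text that lsmagic itself prints for the real syntax:

```
%lsmagic                     -- list the available magic commands
%line x=<column> y=<column>  -- line chart from the last result set
%bar  x=<column> y=<column>  -- bar chart from the last result set
%pie  y=<column>             -- pie chart from the last result set
%df   <file.csv>             -- export the last result set to a CSV file
```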
So let's assume you want to see the proportion of total commits of the top 10 contributors we computed above. Because the kernel stores the result of the last select, the magic command here, pie, operates on that result set.
By using the total commits column as the y-axis and the name column as the index, and disabling some stuff for styling, we get this nice pie chart
that displays the proportion of the total commits for this dataset. Now we have another dataset, and this one is obtained by crawling the
information from the MariaDB server repository on GitHub, using the GraphQL API. And we have it in MariaDB; it's basically a table with a bunch of rows, where each row represents a pull request:
you have the state of the pull request (for instance, merged), an ID, the title of the pull request, the URL, when it was created, last updated, when it was closed, a bunch of labels in JSON, the pull request number,
whom it was merged by, the author of the pull request, and some other stuff like all the comments and things like that. So, this is the format of the dataset. To see who created the most pull requests,
oh, I forgot to add that this dataset also covers the time period between 2009 and 2019. So, to see the most prolific pull request creators, it is something similar to the other example: we
select the author from the dataset, group by the author, and count the number of elements in each group. Also, we order the result by the number of PRs the author created,
and we limit the result to 10, and we get this. We can see here Daniel Black is at the top, Kevin, Alexey, Ian, and so on. And, let me execute this.
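Again, the narrated query can be sketched like this, with the table and column names being my assumptions:

```sql
-- Most prolific pull request creators in the dataset
SELECT author,
       COUNT(*) AS total_prs
FROM pull_requests
GROUP BY author
ORDER BY total_prs DESC
LIMIT 10;
```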
The results, another markdown cell for the description. And now, I want to compute another metric, which is something I invented. It is called the contributor frustration metric. Basically, this query will tell you
the number of pull requests that are still open for each individual. And this kind of makes me think that the greater the number, the greater the frustration for the contributor,
because the contributor has many, many pull requests that are not being watched. So, if we run this query, we can see, for instance, the top five people and their number of pull requests. The result set is ordered by the number of
pull requests still in the open state. So, you can see here, at the top again, Daniel Black, Alexey, Ian, and so on. Good. To display this in a chart,
using one of our magic commands, we can run the bar magic command, with the author column as the x-axis and the PRs-still-open column as the y-axis, plus some options for styling,
so that the chart is displayed nicely. If we execute it, you can see the resulting bar chart. Behind the scenes, the kernel uses Matplotlib to generate these types of charts;
they are not the best looking, in my opinion, or maybe I just don't know how to configure the library properly to make them look more modern. But for use cases where you want to use some fancy libraries for displaying modern-looking charts,
you can export your dataset to Python with the df magic command. Now the last select was written to a CSV file, and you can go to Python,
in a Python notebook, and you can import this dataset and, I don't know, use a library like Altair to generate modern-looking charts. I have a Python notebook opened here. All you have to do after you exported the dataset is
import pandas and read the CSV file; if you print it, you'll see that the dataset is here. You can also go and, I don't know, run another select, let's say the one from the beginning.
And we have this. We run another export, df dataset2.csv, and execute it. We go to the Python notebook, and we read
dataset2.csv; blame the cold fingers for the typo. And if we execute this one, you get the dataset here.
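The round trip from the df export to pandas can be sketched like this. To keep the example self-contained, the snippet writes the CSV file itself with dummy rows, standing in for what the kernel's export magic would produce:

```python
import csv
import pandas as pd

# Stand-in for the %df magic: the kernel writes the last SELECT's
# result set to a CSV file (rows here are dummy data for illustration).
with open("dataset2.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["author", "total_prs"])
    writer.writerow(["alice", 3])
    writer.writerow(["bob", 2])

# In the Python notebook, read the exported dataset back into a DataFrame.
df = pd.read_csv("dataset2.csv")
print(df)
```

From here the DataFrame can be fed to any plotting library you prefer instead of the kernel's built-in Matplotlib charts.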
So, that's pretty much it. Next, we will see another demo, in which we try to exemplify how you can run MariaDB with just one click, which is a nice feature that the kernel brings.
It is possible to run SQL against a MariaDB server with just one click. It was extremely simple to integrate the MariaDB kernel with the mybinder.org platform: you can either click the link on the documentation page in the Knowledge Base, or you can click the
Try It Out badge on the kernel's GitHub repository. Let's click on this. As you can see, mybinder opened and it tries to transform the kernel repository into a running notebook. The execution for me here is mostly cached, so it is a bit faster. The first time you try, it will
take a bit longer until the notebook is opened. And here it is. This is an example notebook I wrote to get you familiar with the main kernel features. You can execute, for instance, the version statement, and it gets executed.
Or you can change it to, I don't know, any other query. That's it. So, it's possible to run MariaDB with just one click. Next time you are debating with your friends about a weird SQL statement, you can quickly try it from
your phone, for instance. That's all I have in terms of demos for this session. Let's next dive a bit deeper into how the kernel works internally. This is how the kernel fits inside the Jupyter Notebook ecosystem. In the top part of the diagram, you have your browser with
the JupyterLab or notebook web application. When you execute a cell in a notebook, the web app sends a message to the server side of the notebook, and that message contains the contents of the cell as payload. When the message hits the notebook server, it is pre-processed and immediately forwarded to the already spinning
MariaDB kernel. Behind the scenes, the kernel implements the Jupyter messaging protocol to deal with all sorts of messages like execute, shutdown, autocomplete code, and so on. Please note that the kernel is able to connect to any
remote MariaDB server, on your own infrastructure or in the cloud. This was a question one of our users had when he wanted to try the project. Getting into a bit more detail, here is the overall architecture of the MariaDB kernel. The orange color represents the kernel core part.
The blue side represents the communication with the MariaDB database, and the green side represents the parsing and the magic operations. During the startup phase, the kernel creates a client config object that parses
the JSON configuration file. Then it passes this object to a newly created MariaDB client object. The MariaDB client object then tries to connect to the server, given the credentials from the config. And if it can't connect, it lets the kernel know, and then the kernel spins up
the MariaDB server class that you can see here. And this is the startup phase. When the user executes a cell in a notebook, the kernel receives the message containing the code of the cell. When this message is received,
the kernel creates a code parser object and passes this code to it. The code parser object then does three things. One, it parses the code and finds the SQL code and the magic commands in there. The second, for each magic command that it found,
it uses the MagicFactory class to create the exact magic object that is responsible for the logic of that magic command. And third, it returns back to the kernel the SQL code that it found in the cell and the list with all the newly constructed magic objects. When the kernel receives this information, it walks over each
magic object and calls its execute method. And after that, all the magics are dealt with. The kernel then takes the SQL code from the parser, sends it via the MariaDB client class to the MariaDB server, and waits for the result back.
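The cell-handling flow described above can be sketched in simplified Python. The class and method names here are illustrative only and are not the kernel's real API; the sketch just shows the shape of the parse-magics-then-SQL pipeline:

```python
import re

class CodeParser:
    """Splits a cell into SQL code and magic commands (simplified sketch)."""
    MAGIC_RE = re.compile(r"^%(\w+)(.*)$")

    def __init__(self, magic_factory):
        self.magic_factory = magic_factory

    def parse(self, cell_code):
        sql_lines, magics = [], []
        for line in cell_code.splitlines():
            m = self.MAGIC_RE.match(line.strip())
            if m:
                # For each magic found, the factory builds the object
                # responsible for that magic's logic.
                magics.append(self.magic_factory(m.group(1), m.group(2).strip()))
            else:
                sql_lines.append(line)
        return "\n".join(sql_lines).strip(), magics

class EchoMagic:
    """Illustrative magic object exposing an execute() method."""
    def __init__(self, name, args):
        self.name, self.args = name, args

    def execute(self):
        return f"ran %{self.name} {self.args}".strip()

def handle_execute_request(cell_code, run_sql):
    """Simplified kernel-side handling of one execute message."""
    parser = CodeParser(lambda name, args: EchoMagic(name, args))
    sql, magics = parser.parse(cell_code)
    outputs = [magic.execute() for magic in magics]  # deal with the magics
    if sql:
        outputs.append(run_sql(sql))                 # then send SQL to the server
    return outputs
```

For example, a cell containing `%lsmagic` followed by `SELECT 1;` would first trigger the magic object and then hand the SQL off to the client side.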
When the SQL result is ready, the kernel sends it back to Jupyter. As shown in the diagram, we designed this kernel so that the magic logic is as independent as possible, so that people can easily come and contribute their own magics to the project without
a tremendous amount of effort or knowledge of the entire kernel. And as you've seen in the demos, the project is packaged for PyPI and it takes very little time to set it up. You can find more installation tips in the kernel documentation on the MariaDB Knowledge Base.
So that's all I have to say. I hope you enjoyed this small intro into the MariaDB Jupyter kernel. And thanks a lot for taking the time to watch.