
Wrestling Python into LLVM Intermediate Representation


Formal Metadata

Title
Wrestling Python into LLVM Intermediate Representation
Part Number
146
Number of Parts
169
Author
Anna Herlihy
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Language
English

Content Metadata

Abstract
Anna Herlihy - Wrestling Python into LLVM Intermediate Representation

The LLVM Project provides an intermediate representation (LLVM-IR) that can be compiled on many platforms. LLVM-IR is used by analytical frameworks to achieve language and platform independence. What if we could add Python to the long list of languages that can be translated to LLVM-IR? This talk will go through the steps of wrestling Python into LLVM-IR with a simple, static one-pass compiler.

What is LLVM-IR? The LLVM Compiler Infrastructure Project provides a transportable intermediate representation (LLVM-IR) that can be compiled and linked into multiple types of assembly code. What is great about LLVM-IR is that you can take any language and distill it into a form that can be run on many different machines. Once the code gets into IR, it doesn't matter what platform it was originally written on, and it doesn't matter that Python can be slow. It doesn't matter if you have weird CPUs: if they're supported by LLVM, it will run.

What is TupleWare? TupleWare is an analytical framework built at Brown University that allows users to compile functions into distributed programs that are automatically deployed. TupleWare is unique because it uses LLVM-IR to be language and platform independent.

What is PyLLVM? This is the heart of the talk. PyLLVM is a simple, easy-to-extend, one-pass static compiler that takes in the subset of Python most likely to be used by TupleWare. PyLLVM is based on an existing project called py2llvm that was abandoned around 2011. This talk will go through some basic compiler design and talk about how some LLVM-IR features make our lives easier, and some much harder. It will cover types, scoping, memory management, and other implementation details. To conclude, it will compare PyLLVM to Numba, a Python-to-LLVM compiler from Continuum Analytics, and touch on what the future has in store for PyLLVM.
Transcript: English (auto-generated)
The next presenter is Anna Herlihy. She's an engineer at MongoDB. Thank you. Can you hear me? So my name is Anna Herlihy, and I'm gonna talk about PyLLVM,
which is a small compiler from a subset of Python to LLVM IR that was built as part of a larger project while I was a student. So I'm not gonna get up here and tell you that this is the best little compiler you've ever seen and that you should all be using it in production. But I will tell you that it is a super cool project
that was built in a true open source capacity. So basically, I found someone's code that was abandoned about five years ago and breathed new life into it. So if there's a moral to this talk beyond the technical content, it is that you have to be careful with your open source projects. You may think that they are dead, but someone like me will find them on the Google code archives,
and you will once again have to answer emails. So a little bit about me. I am a software engineer for MongoDB in Stockholm, Sweden. I work on a couple different open source Python and MongoDB libraries, but this is actually some work I did while I was not employed by MongoDB.
So first, a bit about what you're in for. I'm gonna talk about the motivation behind this project and the very specific requirements that we needed to meet. I'm gonna go through the features, do a little bit of talking about related work, mostly Numba, which is a specializing just-in-time compiler from Continuum Analytics,
and I'm gonna do some analysis and benchmarking before concluding. So how did I get involved in this project? Well, Tupleware is a distributed analytical framework built at Brown University that is for users who want to run analysis on large data
and have their algorithms be automatically deployed on a distributed cluster. So I'm not gonna go too deep into the details about how Tupleware actually works, because that is an entirely different project done by entirely different people, but what I worked on was integrating Python
with the Tupleware front end. So there are plenty of distributed and analytical frameworks out there, but what makes Tupleware unique is that it aims to be both language and platform independent. So what does that mean? So language and platform independence means
that we want you to be able to write your algorithms in any language you want. Say you have an engineer who is convinced that they want to write their algorithms in Fortran: we want to support that. And if they have weird CPUs on their clusters, we also want to support them. So how do we do that? We do it with LLVM. So LLVM is a lot of different things.
It is officially a series of compiler and toolchain technologies, but what we care about is LLVM IR, which is an intermediate representation, which basically means it is a way of representing code in between the source language and the machine dependent assembly code. So as you can see in my diagram,
we have all these different languages that are all really popular, and they can all get compiled down into LLVM IR (the dragon is the LLVM logo), and then it once again gets compiled down for a bunch of different CPUs. So a huge benefit for Tupleware is that there are already a ton of existing compilers for a bunch of different popular languages. The goal of this project was basically to prove that you could tightly integrate a high-level language such as Python with Tupleware, which is written almost entirely in C++. Usually I would go into why Python, but since we are all here, I'm gonna assume we are all properly indoctrinated.
So how does it fit in with the Tupleware project itself? You have the blue boxes, which are what the user provides: an algorithm that you wanna run on your data, and a workflow, so map, reduce, combine, et cetera. The yellow boxes are what I worked on. First, we needed to be able to call the backend operators from Python. We used Boost for that, which was pretty straightforward, since we were already using Boost for a bunch of other stuff in Tupleware. And then we needed to compile our UDFs (user-defined functions) into LLVM IR.
This is, in my opinion, the more interesting and also the more difficult part. So I wanted to give you an idea of what sample Tupleware usage would look like. You have a linear regression algorithm here; don't pay too much attention, there's probably a typo in there. But the idea is that if you take a look at the actual code being compiled, you can get a pretty good sense of what subset of Python you would need to support. You basically don't need super fancy Pythonic functionality like decorators, list comprehensions, et cetera. What Tupleware is used for is machine learning algorithms, which tend to be easily optimizable mathematical functions, so I chose to focus first on code that is primarily static and type-inferable. And then in the red box, you have how you would use the Tupleware package: pretty much the easiest workflow you could come up with. First you load your data, then you map your algorithm, and then you call execute.
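To make that concrete, here is a hypothetical sketch of that workflow; the tupleware package name, its methods, and the UDF are stand-ins for illustration, not the real API:

    import tupleware  # hypothetical package name, for illustration only

    def gradient(x=0.0, w=0.0):        # a plain, statically-typeable Python UDF
        return (w * x - 1.0) * x

    ts = tupleware.load("points.csv")  # 1. load your data
    ts = ts.map(gradient)              # 2. map your algorithm over it
    ts.execute()                       # 3. deploy and run the workflow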
So one level down, if we look at the Tupleware library itself: the first thing you do is try to compile the UDF using PyLLVM. If that fails, we use a backup method that leverages the Python C API to produce an executable. The idea is that we don't want the user to have to worry about whether the code in their UDFs is technically covered by PyLLVM. We just want them to feel like it's all covered, but most of the time we want to be able to use our primary compiler. If the backup method fails, chances are you have bad Python in your UDF, and so we raise an error.
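As a rough sketch, that dispatch logic looks something like this (all of the names here are hypothetical):

    def compile_udf(udf):
        try:
            return pyllvm_compile(udf)      # primary path: PyLLVM -> LLVM IR
        except PyllvmError:                 # hypothetical exception type
            try:
                return c_api_wrapper(udf)   # backup: Python C API executable
            except Exception:
                # the backup handles any valid Python, so if it also fails,
                # the UDF itself is probably bad
                raise ValueError("UDF is not valid Python")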
So, PyLLVM itself. When I first started this project, like any good engineer, I first Googled to see if hopefully somebody had done the work for me. What I found was a project called py2llvm, which was an unfinished proof of concept that basically showed that you could take Python syntax and convert it into LLVM IR using Python. Now, py2llvm uses llvmpy. It can get kind of confusing, because there are really only so many ways you can combine the words Python and LLVM, but llvmpy is a wrapper around the C++ IR builder tools.
So it's basically just a way to call the LLVM tools from Python. What's kind of cool about this project for me is that when I found it, which was about two and a half years ago, there was pretty much only this project. There wasn't really anything else out there, and when I was writing the slides not so long ago,
I did some more research, and it turns out that a bunch of other people have forked the same repo and taken it in totally different directions. As an engineer, that's very satisfying: seeing how other people have solved the same problems I tried to solve. One second, sorry.
So, like I said before, we really wanted to be able to anticipate the common requirements of a Tupleware user, so we decided to focus on easily optimized mathematical functions. As I go through each feature of PyLLVM, I'm gonna talk about the design decisions we made in the interest of simplicity, because someone was gonna take over this project after I left and they needed to be able to read what I had written; feasibility, because we had deadlines, and mostly I needed to graduate; and usability, which I think I might have already said.
But basically we want our user to be able to write Python comfortably and not have to worry too much about the syntax. So this is an overview of what the compiler's design looks like. This is where the compiler experts in the audience might get a little bored, because this is all very standard. If you wanna compile code, your first step is to parse it and build an abstract syntax tree, which is basically a tree representation of what the code is doing. The parsing is handled by Python's compiler package, which is very nice for me. This is where your syntactic errors get caught, in the parsing and AST-building stage. So for example, say you were coding and your cat walked across your keyboard: that would be a syntactic error, unless you have an especially adept cat. The next step is semantic analysis. This is where we actually dig into what the code is doing. We use a visitor to visit all the nodes of the AST, and then we call llvmpy for the actual code generation. This is where semantic errors get caught.
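As an illustration of that visitor pattern, here is a toy code generator written against today's ast module; PyLLVM itself used the old Python 2 compiler package and emitted IR through llvmpy instead of printing:

    import ast

    # Toy code-generation visitor: walk the AST and "emit" something per node.
    class CodeGen(ast.NodeVisitor):
        def visit_BinOp(self, node):
            self.visit(node.left)       # children first (post-order), so the
            self.visit(node.right)      # operands exist before the operation
            print("emit", type(node.op).__name__)

        def visit_Constant(self, node):
            print("emit constant", node.value)

    CodeGen().visit(ast.parse("1.0 + 2.0 * 3.0"))
    # emits: constant 1.0, constant 2.0, constant 3.0, Mult, Add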
Something really important about LLVM IR and other intermediate representations is that it uses SSA, which stands for static single assignment. Basically, it means that a register can be assigned to only once before it is frozen. That means that if we implemented our Python based on registers alone, we would be asking our users to write their Python in SSA form, which is totally unreasonable.
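The standard way around this restriction, and in essence what the symbol table described next keeps track of, is to give each variable a memory slot and mutate memory instead of registers; LLVM's optimization passes can promote those slots back into SSA registers later. A minimal sketch, written with llvmlite here only because llvmpy is long dead:

    from llvmlite import ir

    double = ir.DoubleType()
    module = ir.Module(name="ssa_demo")
    fn = ir.Function(module, ir.FunctionType(double, [double]), name="reassign")
    builder = ir.IRBuilder(fn.append_basic_block("entry"))
    (x,) = fn.args

    # 'x = x + 1.0' then 'x = x * 2.0' reassigns one Python name twice. SSA
    # registers can't be reassigned, so we mutate a stack slot instead.
    slot = builder.alloca(double, name="x")
    builder.store(x, slot)
    builder.store(builder.fadd(builder.load(slot), ir.Constant(double, 1.0)), slot)
    builder.store(builder.fmul(builder.load(slot), ir.Constant(double, 2.0)), slot)
    builder.ret(builder.load(slot))
    print(module)  # textual LLVM IR full of alloca/load/store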
So what do we do? We use a symbol table. Symbols are basically variables: they contain the memory location where the data is stored, the name, the type, and some other metadata. We keep track of our symbols using a symbol table, which we implement as a stack to make scoping much easier for us. This part was implemented before I started contributing.
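A toy version of that idea, with illustrative names rather than PyLLVM's actual classes:

    # A symbol table as a stack of scopes; each symbol remembers the memory
    # slot holding its value, plus its name and type.
    class Symbol:
        def __init__(self, name, typ, slot):
            self.name, self.type, self.slot = name, typ, slot

    class SymbolTable:
        def __init__(self):
            self.scopes = [{}]                   # global scope at the bottom

        def push_scope(self):
            self.scopes.append({})               # entering a function body

        def pop_scope(self):
            self.scopes.pop()                    # leaving it again

        def define(self, sym):
            self.scopes[-1][sym.name] = sym

        def lookup(self, name):
            for scope in reversed(self.scopes):  # innermost scope wins
                if name in scope:
                    return scope[name]
            raise NameError(name)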
So LLVM IR provides numerical types as well as five derived types, and I'm gonna talk about how I represented the various Python types as LLVM IR types. Since LLVM IR is a statically typed representation and Python is not, we need to be able to infer the types of expressions before we actually evaluate them. To do this, we use the infer-type function, which was partially implemented before I started contributing. The idea is that, given a node of the AST, you recursively descend it until you find something that tells you what the type of the expression will be. Say you reach a leaf node and it's a constant: that's straightforward, you already know the type. Say you have a variable: then you look it up in the symbol table. If you have a function call, you also look it up in the symbol table, or check it against the list of intrinsic functions built into the language. Numerical values are pretty straightforward: integers are integers, and I added booleans. Nothing too exciting there.
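A toy version of that recursive descent, again written against today's ast module; the type names and the intrinsics table here are illustrative, not PyLLVM's exact ones:

    import ast

    INTRINSICS = {"sqrt": "float", "abs": "float", "len": "int"}

    def infer_type(node, symbols):
        if isinstance(node, ast.Constant):    # leaf: the type is right there
            return type(node.value).__name__
        if isinstance(node, ast.Name):        # variable: ask the symbol table
            return symbols[node.id]
        if isinstance(node, ast.Call):        # call: user function or intrinsic
            name = node.func.id
            return symbols.get(name) or INTRINSICS[name]
        if isinstance(node, ast.BinOp):       # otherwise keep descending
            return infer_type(node.left, symbols)
        raise NotImplementedError(type(node).__name__)

    expr = ast.parse("y + sqrt(2.0)", mode="eval").body
    print(infer_type(expr, {"y": "float"}))   # -> float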
Next up is vectors. We provide support for a four-element, immutable, floating-point vector type, which was implemented before I started and was a huge part of why I chose to build my project on this code base. These vectors, which are based on the LLVM IR vector type and are pass-by-value, are super common in machine learning algorithms; they come up time and time again. So having this already implemented was a huge benefit.
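For a sense of why this maps so well: LLVM has first-class fixed-width vectors, so a four-element float vector lands directly on a native <4 x float> type, and arithmetic on it is a single IR instruction. Sketched with llvmlite; PyLLVM built the equivalent through llvmpy:

    from llvmlite import ir

    vec4 = ir.VectorType(ir.FloatType(), 4)   # LLVM's native <4 x float>
    module = ir.Module(name="vec_demo")
    fn = ir.Function(module, ir.FunctionType(vec4, [vec4, vec4]), name="vadd")
    builder = ir.IRBuilder(fn.append_basic_block("entry"))
    a, b = fn.args
    builder.ret(builder.fadd(a, b))   # element-wise add of two vectors
    print(module)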
Next up is lists. Lists are a bit of a work in progress, because implementing efficient, dynamically resizable lists is an open problem. Another issue with lists, as anyone who has ever programmed C knows, is that if you create a list on the stack and return a pointer to it from a function, it immediately goes out of scope. So how do we deal with this? Well, I chose to just move lists onto the heap and then be very careful about when they get cleaned up. Garbage collection and reference counting are way beyond the scope of this compiler, but could potentially be added one day. This is one of those places where we went deeper and deeper into the problem before taking a step back and realizing this was not a battle we wanted to die on.
Next up is strings. Strings, luckily enough, are basically lists of characters, and characters are basically integers, so we got this one for free. Next up is functions. Unfortunately, Python functions don't usually indicate what type they return, if anything. So we have to be able to determine the type of a function definition before we actually generate the code. This is one of the places where we do two passes: one pass to determine the return type of the function, and then a second pass to actually generate the code, once we already know what the function signature will be.
Function arguments have the same issue, except it's actually much harder to infer the types of function arguments from a function definition. So this is one of the places where I did what is, in my opinion, kind of a dirty hack: I took advantage of the default-argument syntax to let users indicate what types they are passing in as parameters. The chances that somebody is writing an ML algorithm and doesn't know what type their parameters are, or needs a dynamic type, are pretty low. If type hints had been a thing when I wrote this, maybe I would have considered using those, but they were not.
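Here is how that trick can be read off the AST, as a toy version; the mapping rules are illustrative, not PyLLVM's exact ones:

    import ast, textwrap

    src = textwrap.dedent("""
        def scale(v=(0.0, 0.0, 0.0, 0.0), factor=1.0):
            return v * factor
    """)
    fn = ast.parse(src).body[0]
    # Default values double as type annotations: a 4-tuple of floats means
    # the vector type, a bare constant means its own type.
    for arg, default in zip(fn.args.args, fn.args.defaults):
        kind = "vec4" if isinstance(default, ast.Tuple) else type(default.value).__name__
        print(arg.arg, "->", kind)   # v -> vec4, factor -> float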
Next up is intrinsic functions. A huge benefit when writing algorithms is not having to import separate math libraries: if you can call square root, absolute value, et cetera, without having to do any importing, that's really nice. The LLVM IR builder actually provides functions for square root and things like that, but unfortunately, because they are newer features, and llvmpy, the wrapper around the C++ tools, is an older project that is no longer maintained, they're not accessible. So we could either rewrite the whole library, or just try to get around it. What I did to get around the problem was to declare the calls to these C++ backend functions in a header and add them to the symbol table; at a later stage everything gets linked, so we can call functions we don't directly have access to. This works for the math functions, and the same approach works for print.
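The same trick in miniature: declare the function with no body, so calls compile now and the symbol gets resolved later, at link time. Sketched with llvmlite; PyLLVM went through llvmpy plus a C++ header:

    from llvmlite import ir

    double = ir.DoubleType()
    module = ir.Module(name="intrinsics_demo")

    # A declaration with no basic blocks: the body is supplied at link time.
    sqrt_fn = ir.Function(module, ir.FunctionType(double, [double]), name="sqrt")

    fn = ir.Function(module, ir.FunctionType(double, [double]), name="norm")
    builder = ir.IRBuilder(fn.append_basic_block("entry"))
    (x,) = fn.args
    builder.ret(builder.call(sqrt_fn, [builder.fmul(x, x)]))
    print(module)   # note the bodiless declare line for sqrt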
Getting print working was probably one of the best feelings I have ever had. I had already spent so much time working on this compiler, and I could tell if my executables were segfaulting, but I had no idea if they were doing what I expected. So when my mystery executable started printing stuff to standard out, stuff that indicated it was working, it felt really, really good. Next, branching and loops. LLVM IR has compare and jump instructions, as well as relatively straightforward ways to define blocks of code. There are some idiosyncrasies: if you're curious, I recommend you look up what phi nodes are, but I'm not gonna go into detail, because they can be kind of tricky.
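For the curious, this is the phi-node idea in its smallest form: a branch whose merge point picks a value depending on which block you arrived from. Sketched with llvmlite:

    from llvmlite import ir

    double = ir.DoubleType()
    module = ir.Module(name="phi_demo")
    fn = ir.Function(module, ir.FunctionType(double, [double]), name="fabs_ish")
    (x,) = fn.args

    entry = ir.IRBuilder(fn.append_basic_block("entry"))
    neg_bb = fn.append_basic_block("neg")
    merge_bb = fn.append_basic_block("merge")

    # compare and jump: go to 'neg' if x < 0, straight to 'merge' otherwise
    entry.cbranch(entry.fcmp_ordered("<", x, ir.Constant(double, 0.0)),
                  neg_bb, merge_bb)

    neg = ir.IRBuilder(neg_bb)
    negated = neg.fsub(ir.Constant(double, 0.0), x)
    neg.branch(merge_bb)

    merge = ir.IRBuilder(merge_bb)
    result = merge.phi(double)               # one value per predecessor block
    result.add_incoming(x, entry.block)      # arrived directly: x >= 0
    result.add_incoming(negated, neg_bb)     # arrived via 'neg': use -x
    merge.ret(result)
    print(module)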
So, related work. The obvious comparison is with Numba, which is a just-in-time compiler from Continuum Analytics that takes code from Python to LLVM IR. The thing about Numba is that its primary purpose is not to generate LLVM IR; it is to run your Python code really, really fast. Numba actually moves the Python code through six different stages, and LLVM IR is only one of those stages, so you have to do a bit of gymnastics to pull the IR out at that stage. Another big difference is that Numba is a JIT, so it's lazy, which means it doesn't actually compile code until you run it. That makes benchmarking PyLLVM against Numba quite tricky, because you can't run code with PyLLVM, you can only compile it. So the bottom line is that Numba is a much, much more mature and comprehensive project.
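For contrast, this is what Numba's lazy, just-in-time interface looks like; nothing is compiled at decoration time:

    import numpy as np
    from numba import jit

    @jit(nopython=True)
    def dot(a, b):
        total = 0.0
        for i in range(a.shape[0]):
            total += a[i] * b[i]
        return total

    xs = np.arange(4.0)
    # Compilation (through all of Numba's stages, LLVM IR included) happens
    # here, at the first call, specialized to the argument types it sees.
    dot(xs, xs)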
Like I said, I would not recommend you use PyLLVM in production, but I would totally recommend Numba. Another huge benefit of writing our own compiler is that it was built in-house. So basically, I needed something to work on, and I've always wanted to write an LLVM compiler.
So for analysis: since the project before I started was entirely C++, and the user had to write their UDFs in C++, I wanted to make sure that by adding this Python front end I didn't make things harder on the user. So I picked four sample algorithms and took a look at what the resulting usage would look like in C++ versus Python. As you can see, it's not very different, which is actually a good thing: the huge benefit of using Python is not that you would expect your algorithms to be shorter, it's that you're freed from worrying about memory management and pointers and things like that.
For actual benchmarking, we wanted to compare PyLLVM and Numba, but unfortunately, because of the JIT thing, it's complicated. Instead of going into the data, since I would have to do a lot of explaining, I'm just going to tell you that Numba is faster and PyLLVM is slower, but only by two or three times, not by an order of magnitude. That matters because, compared to the cost of analyzing the workflow, deploying to the distributed clusters, and actually doing the analysis, the compilation from Python to LLVM IR happens once. So it's very, very small relative to everything else, and as long as we're not an order of magnitude slower, we're fine. The other thing I wanted to compare was the actual generated LLVM IR from Clang, which is what we were using before, versus from PyLLVM.
Something else about the way LLVM IR works: you generate unoptimized LLVM IR before passing it to the backend, and incredibly powerful optimization passes run before it's converted into machine code. So the idea is that no matter who generates the LLVM IR, the surface-level differences should pretty much be wiped out by those optimization passes. So we used those four sample algorithms, compared the runtimes, and the results are here.
The blue is PyLLVM, the red is Clang, and the y-axis is time. As you can see, pretty much across the board, PyLLVM is a bit slower than Clang, which is actually pretty good, considering Clang is a huge, very mature project with a lot of resources. The actual results range from quite slow to not that bad, but k-means has a particularly large spike, so I wanted to figure out why k-means was so much slower. It turns out that my ingenious header hack probably cost us that performance, because I had assumed that the calls to these header functions would get optimized away. But it looks like the optimization passes weren't actually able to get rid of that one level of indirection. So every time you call square root, instead of just calling square root like you would in Clang, you call a function which then calls square root. That's a problem, because we can't rewrite llvmpy, and we can't contribute to it since the project is long dead. So there's not much to do there, except maybe swap out llvmpy entirely, which is a big undertaking. And since I don't actively work on this project anymore, if one of you wants to do it, that's awesome, and I totally encourage you. So lastly, to conclude. Overall, we were pretty successful in proving that you could tightly integrate Python
with this distributed analytical framework. There is still a ton of future work to be done. And like I mentioned, I'm not doing it right now. I'm currently working on other MongoDB stuff. But if anybody is interested in contributing, I really encourage it. Also, if anybody is curious about a relatively simple compiler,
I recommend you look up the source code, because I find it pretty easy to follow. I have a lot of people to thank. Thank you to Tim, who is my advisor, and Alex, Andrew, and Kaihan, the grad students who wrote Tupleware. Thank you to the people who wrote py2llvm; I hope you don't mind that I've been mucking around in your code.
And thank you to all my mentors at MongoDB for encouraging me to go out and give talks. So I have some time for questions, but I'm also going to make a brief and somewhat entirely unrelated announcement. I'm currently working on a library for reading data directly from BSON,
which is the MongoDB data format, directly into NumPy. So if you use NumPy and MongoDB, or one or none of them, or you want to tell me about something, I would really appreciate any feedback. So please come talk to me or tweet at me.
Amazing. Thank you. Hi, so thanks for the talk. I just wanted to mention that in Python 2.7,
you do have type hinting, although in docstrings. Cool, I totally missed that two and a half years ago when I started on this project, but maybe it's not too late to add it.
You said strings were lists of characters. Does that make them mutable in your LLVM implementation? Pretty much. Yeah, the only time that the real distinction comes out between strings and lists
is in the print function, so we know whether to print characters or integers. Hey, thanks for the talk. I was very curious, did you try benchmarking
compared to the scikit-learn implementations of the same models? No, I didn't. I actually looked into it, but I was immediately overwhelmed by just how large the project is, and getting its Python down to unoptimized LLVM IR that I could pull out for comparison seemed very hard,
so I eventually just decided to focus on Numba instead. Did you consider using the llvmlite library the Numba guys have written for generating IR code? Yes, that's on the list of things that we wanna replace llvmpy with, but when I picked up this project it was already using llvmpy and not llvmlite, so I figured it would take a significant amount of time to swap out the existing IR builder bindings, and decided that my time would be better spent adding features. Okay, more questions?
Okay, thank you very much. Thank you.