How CPython parser works, and how to make it work better

Speech transcript
A few words about myself: this talk is about work that I did with my previous employer, although they asked me to remove any mention of their name from the slides, so I'm here with no business affiliation. Unlike most talks at this conference, this is not about writing stuff with Python: it is about Python itself, and specifically about CPython, which is the reference implementation, and even more specifically about the parser, which is a part of this implementation. So what is a parser? When you google "Python parser", this is the picture that comes up, but that is not what an actual parser is.
The parser is the first stage of the interpreter: it converts the source code into a tree, which the compiler then compiles into bytecode, and the bytecode is what the interpreter actually runs.

Python, like any mature programming language, has a formal grammar, which looks something like this. It has formally defined productions; there are three different start symbols for three different kinds of input. Every line defines a new nonterminal symbol, and the notation here is a bit more complicated than the plain BNF notation that grammars are usually written in: it is something more like regular expressions, so we have "+" modifiers and "*" modifiers for repetitions, as well as parentheses for grouping.

The other ingredient is the header file with the definitions of all the tokens. Python has 57 types of tokens, numbered 0 to 56, and the numbers from 256 up are reserved for the nonterminals, which are the ones defined in the grammar. Then we have a parser generator which, based on the formal grammar and the token definitions, produces a file which looks like this. It is just a definition of static data; it doesn't have any code in it. All it does is define a DFA (deterministic finite automaton) for every nonterminal that the grammar defines: every DFA is defined as an array of states, every state is defined as an array of arcs, and every arc has a label and a destination state.
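As a small illustration, the stdlib `token` module mirrors these C definitions: terminal token types are small integers, and nonterminal symbol numbers begin above an offset of 256.

```python
import token

# Terminal token types are small integers (e.g. NAME, NUMBER, NEWLINE):
print(token.NAME, token.NUMBER, token.NEWLINE)
# Nonterminal symbol numbers begin above NT_OFFSET:
print(token.NT_OFFSET)                           # 256
print(token.ISTERMINAL(token.NUMBER))            # True
print(token.ISNONTERMINAL(token.NT_OFFSET + 1))  # True
```

The exact count of token types varies between Python versions, but the terminal/nonterminal split at 256 is the layout described here.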
So each DFA is defined as an array of states, and each state is defined as an array of the arcs going out of it, where the first item of each arc is the arc label, which is more or less the same as the symbol number, and the second item is the destination state. In fact, there is a special table which converts symbol IDs into arc labels, because symbol number 1 corresponds to all keywords and all identifiers alike: we have to look up which specific keyword a token corresponds to, so that each of them gets a different arc label, even though they all share the same symbol ID. Near the very end of this file we have the definitions of the DFAs themselves, and for each DFA we also have its start set, which shows which tokens the corresponding nonterminal can start with.
This is sophisticated enough that I prepared a short demonstration of how it works for a very simple Python program, which looks like this. We parse it as "single input", so that is the top production, and the parse tree starts with the symbol single_input. The DFA which parses single_input is DFA number 0, which looks like this: in the initial state, state number 0, there are three arcs going out of it. The first arc, with label number 2, corresponds to the NEWLINE token; the next arc is for simple_stmt; the third arc is for compound_stmt. We look at the "if" token in the input, and according to the start sets we know that simple_stmt cannot begin with "if", and it is not a NEWLINE, but compound_stmt can begin with "if". So we transition along the last arc, add a compound_stmt node to the parse tree, and before we actually transition to state number 2, we switch to the DFA which parses compound statements.

For compound_stmt we have many options: in the initial state there are nine arcs going out of it. Only the first arc, which is for if_stmt, can start with an "if" token, so we add an if_stmt node to the parse tree, transition along this arc, and so on. Now we switch to the DFA which parses if statements; it is sophisticated enough that it didn't fit on the slide.
Its definition is complicated by parentheses, repetitions and optional parts. In the initial state we have only one arc, whose label 97 corresponds to the "if" token. Because it is a terminal symbol, we just add it to the parse tree, consume it from the input, and switch to the next state within the same DFA. This next state once again has only one arc going out of it, but this one is for the test nonterminal. Its start set says that it can begin with a NUMBER token, such as our "2", so we add a test node, switch to the DFA which parses test, and so on: it goes on for an awfully long time before we actually get back into the same DFA. Now we are in state number 2, which has one arc; it consumes the colon token from the input and transitions to the next state. Once again there is only one arc, corresponding to suite; the start set confirms that a suite can begin with a "print" token, so we add a suite node to the parse tree and switch to the DFA for suite. Once again, it takes an awfully long time before we get back here.
The interesting bit is that now we have three arcs going out of state number 4: the arc with label 98 corresponds to the terminal "elif", and 99 to the terminal "else". The next token in the input is NEWLINE, so it is neither of these; but the arc labeled 0 means that this is a final state, so if we didn't match anything else, we are simply done with this DFA, and we go back to the previous one. So we go back to compound_stmt, where we end up in a state that only has a 0 arc, which means that no matter what the next token in the input is, we are done; going back to the previous one, we return to the DFA we started from. There is only one arc, with label number 2, which is NEWLINE, so we consume the NEWLINE from the input, add a node to the parse tree, and transition to state number 1, which is final. So we are at the end of the input, and we are in the final state of the outermost DFA, which means that the parse has succeeded.

So what has just happened? We had a program which was 27 bytes long; it was tokenized into 7 tokens, and this is the parse tree which we produced for it: 64 nodes, taking 109 times as much memory as the source
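You can replay the tokenization step of such a program with the stdlib `tokenize` module; the sample string here is my own stand-in, not necessarily the exact program from the slides.

```python
import io
import tokenize

src = "if 2:\n    print('hello')\n"
toks = list(tokenize.generate_tokens(io.StringIO(src).readline))
for t in toks:
    # Print each token's type name and its text.
    print(tokenize.tok_name[t.type], repr(t.string))
```

The stream starts with NAME 'if' and NUMBER '2', and also contains structural tokens (NEWLINE, INDENT, DEDENT, ENDMARKER) that the grammar consumes.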
code itself. And it is not just this tiny example: looking at the tree, we see that we have huge chains of a dozen or so nodes with a single child each. The tree is this big because Python has a rather primitive parser which, with the grammar defined like this, builds the tree literally, repeating all the steps that I have shown. Here is the more important bit: because the parse tree can take a hundred times as much memory as the source code, if I have a moderately big Python source file of, say, 30 megabytes, the parser would not even be able to load it into memory, because the parse tree would be too big. That was my motivation for all this work: I had a 30-megabyte file, and I couldn't possibly run it through Python.
This tree is called the CST, which stands for "concrete syntax tree". There is a standard module called parser, and you can use it to serialize the CST into a structure of nested lists. Each nested list has two elements: the first one is the symbol ID, 0 to 56 for terminals and 256 and above for nonterminals; the second item is the token string for terminals, or a subtree, again a nested list, for nonterminals. The Python documentation for the parser module actually says that people shouldn't use it, because there is a better module called ast, which stands for "abstract syntax tree"; that is the second kind of syntax tree that Python uses. This example shows what the AST looks like for the same sample Python code, and you can see that it is much more compact; it is almost human-readable. The point is that, unlike the CST, which reflects how the program is represented in the source code, the AST reflects what the program actually intends to do; and it is actually the AST that the compiler runs on to generate the bytecode.
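The compact second tree can be inspected with the stdlib `ast` module; the exact dump format varies between Python versions, but the structure matches what the slide shows.

```python
import ast

# Parse a small sample and look at the abstract syntax tree.
tree = ast.parse("if 2:\n    print('hello')\n")
print(ast.dump(tree))
```

The dump reads roughly as `Module(body=[If(test=..., body=[Expr(...)], orelse=[])])`: one node per meaningful construct, with no chains of single-child wrappers.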
This second kind of syntax tree also has a formal grammar, and this is what that grammar looks like; it is in a different format. Here, each kind of AST node is defined together with what its children should be: how many, and what their types are. The funny bit is that there is no generator which takes this file as input: the file is just formal documentation, and the parser which produces the AST is handwritten.

This handwritten parser is 5,200 lines long. It is a piece of very old and dusty code, which dates from 2005 to 2007; nobody touches it, nobody remembers how it works, and the general attitude is "if it ain't broke, don't fix it": it has been working for ten years, and time has proven that it does work. Once again, the AST for our sample code is a very simple structure of only 9 nodes, which I'm showing here.
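Each AST node class in the `ast` module records its children via `_fields`, mirroring the entries of that grammar file, so you can check the documented shape from Python itself.

```python
import ast

# The ASDL entry `If(expr test, stmt* body, stmt* orelse)` shows up
# as the _fields tuple of the corresponding node class:
print(ast.If._fields)   # ('test', 'body', 'orelse')
# (Module's fields gained a type_ignores entry in Python 3.8.)
print(ast.Module._fields)
```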
This is what the second parser actually looks like. The example is the function that parses one type of node, and it has a loop organized with a label and a goto into the middle of the function, something that Dijkstra wouldn't approve of. What it actually does is this: for each node which only has one child, it knows that there is nothing interesting in the node, nothing to do, so it just skips to the next one. We have seen that two thirds of the nodes in the CST have a single child; we don't do anything with them, we just skip to the next node. So what I thought was: why do we have all these useless nodes occupying memory? If we eliminate them at the time when we generate the CST, then we save two thirds of the memory, and at the same time the second parser becomes simpler, because it no longer has to be concerned with skipping all these useless nodes. This is what the same example CST looks like compressed: it is now small enough that it fits on one slide without scrolling. And here is the modified function, which no longer has the awkward loop; in fact, it doesn't have any loops at all. In each branch of the switch it can be sure that the node has more than one child, meaning that we can do something meaningful with the node rather than just skip it.
This slide shows the serialized CST as it was before and after my fix, and you can see that the CST becomes so small that it is almost human-readable. The catch here is that the only purpose of the CST is to be parsed by the second pass into the AST; then it is discarded, and the memory is never used again. So if we have a long-running script, such as a web service, that boots in a split second and then runs for hours and days, then it doesn't really matter how much memory the CST takes, because it is only there for the first split second. On the other hand, if we have a system utility which runs for a split second and finishes immediately, then the performance of this utility will be dominated by the performance of the parser. That was my case: my script would just look up a value in the 30-megabyte dictionary and exit.
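A rough way to see this effect, as a hypothetical micro-measurement rather than the benchmark from the talk: time compiling a script, which includes tokenizing and parsing, against executing its bytecode. For a short, flat script the two can be of the same order.

```python
import timeit

# Build a script that is all statements, so parsing has real work to do.
src = "x = 0\n" + "x = x + 1\n" * 2000
code = compile(src, "<demo>", "exec")

t_parse = timeit.timeit(lambda: compile(src, "<demo>", "exec"), number=20)
t_run = timeit.timeit(lambda: exec(code, {}), number=20)
print(f"compile: {t_parse:.4f}s  exec: {t_run:.4f}s")
```

The absolute numbers are machine- and version-dependent; the point is only that for short-running programs the parse cost is not hidden by the run time.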
With my fix to the CPython parser, the memory consumption of my script was reduced by a factor of three. So I was curious how much it would affect other standard benchmarks, and these are the results: one of them has shown an improvement by a third, a few of them have shown around a dozen percent of improvement, and nothing has gotten worse. Actually, the benchmark package which I was using was deprecated in October last year, on the grounds that the results it was showing were inconsistent and meaningless; I'm no expert in performance analysis of that sort, so I don't know what exactly they found wrong with it, but it has been superseded by the performance package, and these are the results with it: something got a little bit better, something got a little bit worse, but the big picture is that nothing changed. My intuition is that the performance package intentionally runs a few warm-up loops before it starts the timer, and this way it avoids timing the loading and parsing of the source code. I didn't check inside the performance package, because it is sophisticated enough; but nevertheless, even this result confirms that my patch has no unintended side effects: it doesn't break anything, and at the same time it enables a use case which may not be relevant to most Python programmers; but at least, since I bumped into it, maybe other people would bump into it as well.
So my desire was to show this work to the community, and to get it included into Python. The way somebody does this is by opening an issue in the bug tracker, because that is where all the discussions happen and where all the reviews happen; and there is also an unspoken rule that you are allowed to bump your issue once a month if it is not getting enough attention. So I opened two issues in the bug tracker, and one of them actually got accepted and committed; it took me a three-month wait for a review. What it does is this: the second, handwritten parser knows that the input that it parses always comes from the first parser, so it is guaranteed to conform to the first, formal grammar, and it doesn't have to validate that it does. It knows for sure that, for example, for an if statement the first child is always the "if" token, and the second child is always a test. But the parser module can take an arbitrary serialized CST as input and run it through the second parser, and if that input doesn't conform to the formal grammar of Python, the interpreter might crash. So they had to implement the validation for this CST themselves, and they implemented it like this: I'm showing just the validation for three different types of nodes, but it is all more or less the same. They check that the number of the children of the node and the types of the children are correct, and then recursively descend into each child node, calling the applicable validation function for it. When I was looking at this code, which had to be updated with every change to the grammar, I thought: maybe we can make it generated from the grammar; maybe we can write some DFAs which would traverse the CST to see if it is valid. And then it occurred to me that we already have these DFAs: they are the DFAs which parse, and we can just run them on the CST instead of on the tokenized stream of the input. This way I could eliminate 2,400 lines of handwritten code with just 72 lines of code for running the DFAs, which we already have in CPython, and which are guaranteed to stay in sync with the parser, because they are what the parser runs.
My other patch, after three months in review, was rejected by Guido van Rossum himself, because it is so big and sophisticated that nobody has the time to review it; after three months, I shouldn't hold my breath for anybody to jump up and review it. And if nobody cares as much as I do about running 30-megabyte Python scripts through the interpreter, then the patch is dead. Well, that is my motivation for being here: maybe you can help with this. The issue is bpo-26415; have a look at it, maybe we can get it reviewed, and after that, get it committed. Now let me explain what my next project would be.
For example: if the only purpose of the CST is to get parsed by the second pass and then discarded, do we have to store it in memory at all? Maybe we can generate one CST node at a time, handle it in the AST parser, and discard it immediately. That way, instead of improving the memory consumption by a factor of three, we improve it infinitely: the CST takes zero memory.
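The streaming idea can be sketched in a few lines of Python, as a hypothetical model rather than the real C API: instead of handing over a finished tree, the producer emits one node at a time plus a marker saying where it sits relative to the previous node.

```python
# Nodes are modelled as tuples: (node_type, *children).
def walk_events(node):
    """Yield a preorder stream of ('node', type) events, with 'down'/'next'/
    'up' markers encoding each node's position relative to the previous one,
    so the consumer never needs the whole tree in memory."""
    yield ("node", node[0])
    children = node[1:]
    if children:
        yield ("down", None)           # first child follows
        for i, child in enumerate(children):
            if i:
                yield ("next", None)   # next sibling follows
            yield from walk_events(child)
        yield ("up", None)             # back to the parent

events = list(walk_events(("if_stmt", ("if",), ("test", ("2",)))))
print(events)
```

A consumer that builds the AST can process this stream node by node and discard each event as soon as it has been handled.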
What the current AST parser actually does is this: it has two entry points, one for strings and one for file objects, and it passes this input to the first parser function, which runs a loop. Inside the loop, it extracts one token at a time from the input, and passes each extracted token to an "add token" function. This token-consumer function maintains its state in the parser state structure, which most importantly holds the stack of the DFAs, keeping track of the current state for each DFA, and the pointer to the CST root node, through which it builds the tree. At the end of the input, it returns the CST root to the caller, and the caller passes it to the second function, called PyAST_FromNodeObject, which is essentially the second pass: it recursively traverses the tree, and for each subtree of the CST it generates the corresponding subtree of the AST. The important bit here is that the whole CST is available by the time the second parser runs, so it can analyze nodes out of order, it can analyze the same node repeatedly, and so on.
My suggestion is to replace PyAST_FromNodeObject with a different function, something like PyAST_FromTokenizer, which would do more or less the same; but to get each CST node, it would call a "next node" function, which would take the parser state and the parameters for the tokenizer (which it would just pass on to the tokenizer), and generate one node in the output, meaning only the node type and the node string, but no children. It would also return a number indicating the position of the new node relative to the previous node: whether it is the first child of the previous node, or its next sibling, or whether the previous node was the last child and we are going back up to the parent. The difficult question is what we are going to do with the parser module, because we no longer have the CST in memory; there is nothing to serialize into the structure of nested lists. I'm not sure anybody is actually using the parser module, especially since it is documented as deprecated; all the analyzers that I have seen use the AST, because it is much easier to use. Still, with some effort, we can make the parser module itself call "next node" recursively to materialize the CST, serialize it into all those nested lists, and then discard the whole thing, just for compatibility with the existing consumers. So there are some difficulties here.
The minor difficulty is that the first parser can no longer have its loop running inside it: we have to change it into a generator-like function and make it keep all of its state in the parser state structure. But the main difficulty is that we have to rewrite all the 5,200 lines of the handwritten parser. This is an example of one function rewritten in this way; this one is the parser for slice nodes. (Slices are when you have square brackets with one, two or three numbers separated by colons inside them.) You can see that the old parser knows all kinds of interesting things about its arguments, whereas the new parser can only check each node once, and it has to go over the nodes in order, which actually simplifies the code. The question is: it would obviously take some time to rewrite the 5,200 lines this way, maybe several weekends, maybe many weekends. Do we want to do this? And if, for example, it takes me several weekends to do it, will there be somebody who wants to review that patch, including the rewritten 5,200 lines of code? That is actually all I have.
I put my questions on this slide; and when you google "questions", this is the image that comes up. Thank you very much. And now, questions.
Question: When you had the problem tokenizing your 30-megabyte source file, what was the limit that you hit? Did you simply run out of memory, or was there some internal structure of a static size?

Answer: Yes, I simply ran out of memory. I was doing this on my laptop, which only has 2 gigabytes of memory, so if I had had more memory, it might not have been a problem; but then, with an even larger file, it would be a problem again. It is not going to be a problem for hardly anybody; but if you want to run Python on your phone, it may well be a problem.

Question: Rather than rewrite all the conversion from the concrete parse tree to the abstract syntax tree, is it not possible to have some richer grammar definition from which to generate that code? Because it does look awfully repetitive.

Answer: I agree that it does look repetitive. Most grammar formats, yacc for example, include a piece of code with each production, which the parser generator inserts into the parser; maybe something similar can be done with this Grammar file to help generate the parser. But this would change the syntax of the file, which is something formal, and somebody outside of CPython may rely on its exact structure. So if we keep the same structure of the grammar, it is not possible; if we give ourselves the liberty to rewrite the grammar, then it would be possible, but we might break somebody else's code.
Comment: If the entire problem is not having a maintainer, I think you could become one; you seem clearly qualified.

Answer: Thank you.

Comment: One more remark: it is worth saying that CPython moved from Mercurial to git four or five months ago, so all the hg links are now five months out of date; CPython is now on GitHub, and has been for five months.

Answer: Actually, all of these links are older than five months anyway, because this work has been going on since a year ago; I'm sorry about the outdated links. Any more questions? Then thank you all.

Metadata

Formal Metadata

Title How CPython parser works, and how to make it work better
Title of Series EuroPython 2017
Author Skrobov, Artyom
License CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
DOI 10.5446/33817
Publisher EuroPython
Release Date 2017
Language English

Content Metadata

Subject Area Information technology
Abstract How CPython parser works, and how to make it work better [EuroPython 2017 - Talk - 2017-07-12 - PythonAnywhere Room] [Rimini, Italy] The part of CPython core that parses the Python source code is some very old and convoluted code: the time has proven its reliability, but few CPython hackers understand (or care) how it works, or even what exactly it does. There is, however, a good reason to care: for short-running scripts, the performance of CPython may easily be dominated by that of parsing the source code. The talk will describe the two parsers that are involved, it will explain how these two parsers build two different kinds of syntax trees, and then show how the structure of one of the trees can be amended to reduce its memory footprint threefold, with only minor changes necessary in its consumers. It will also suggest other, more invasive improvements, which can yield even better savings. The talk will assume fluency in C and a basic acquaintance with CPython core internals, and will give the attendees an introduction into hacking the parser, guiding their way through to the very tangible end result of reducing Python overall memory consumption by up to 30%, measured at standard micro-benchmarks

