Re-Discovering Python's Regular Expressions

Video in TIB AV-Portal: Re-Discovering Python's Regular Expressions

Formal Metadata

Re-Discovering Python's Regular Expressions
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.

Content Metadata

Ilia Kurenkov - re-Discovering Python's Regular Expressions

As Armin Ronacher pointed out in a recent blog post, there is more to Python's regular expression module than meets the eye. His post made me wonder what other “hidden gems” are stashed away in Python’s `re`. In the talk I share what I’ve learned about the inner workings of this extremely popular and heavily used module.

-----

Anyone who has used Python to search text for substring patterns has at least heard of the regular expression module. Many of us use it extensively for parsers and lexers, extracting information. And yet we know surprisingly little about its inner workings, as Armin Ronacher demonstrated in his recent blog post, “Python's Hidden Regular Expression Gems”. Inspired by this, I want to dive deeper into Python’s `re` module and share what I find with folks at EuroPython. My goal is that at the end of the day most of us walk away from this talk with a better understanding of this extremely useful module.

Here are a few examples of the kinds of things I would like to cover:

- A clear presentation of `re`’s overall structure.
- What actually happens behind the scenes when you “compile” a regular expression with `re.compile`?
- What are the speed implications of using a callable as the replacement argument to `re.sub`?
- re.MatchObject interface: `group` vs. `groups` vs. `groupdict`

To keep the talk entertaining as well as educational I plan to pepper it with whatever interesting and/or funny trivia I find about the module’s history and structure.
OK, the next topic is one of my favorites, and I think a favorite of a lot of other people too: regular expressions. We've got Ilia right here, so let's have a round of applause.
Hi, thank you. I am indeed here to talk about regular expressions in Python. Let me quickly say a couple of words about myself. I'm just finishing my first year in a master's program at the University of Potsdam. I work a lot with the Natural Language Toolkit library in Python and have made very, very small but existing contributions to CPython and [unclear]. I also work at eGym, a German start-up that uses digital technology to really change how we interact with fitness. But my talk is not really related to any of those things.
In November there was a blog post by Armin Ronacher, who I'm sure you all know as the author of Flask and Jinja2 and a bunch of other useful web libraries. He wrote about how he used an undocumented feature of the regular expression module to improve his lexer. After reading that I thought: what other hidden gems are there in the regular expression module that we don't know about? So I dug through it a little bit and compiled a bunch of things that I thought were interesting, and that's what I'm going to present.
The talk will consist of the following. I'll give a very brief history of the module's development, then we'll talk a lot about compilation, then about the role of the flags argument, and finally I will talk about the match object and what to do with it.
First, history. The current implementation of the regular expression module in Python is the third attempt at tackling this problem. First came a module called `regex`, which was more similar to awk in the sense that it was a deterministic engine with very basic functionality. Then people heard about Perl and said, "We want the same in Python," so the `regex` module was phased out and replaced by a module that used PCRE as the back end. Its origin is a little bit unclear to me. Finally, PCRE was replaced by an engine written from scratch, called SRE. It's called SRE because it was written by Fredrik Lundh from Secret Labs, and that's where the name comes from. Since then, for about 15 years, it has really only seen small fixes. The only major feature that was added was Unicode support; other than that it was just basic fixes. So it's kind of old.
As far as the code is concerned, you can see that it consists of a C module and a Python component, and sometimes if you put the two next to each other it's hard to tell which one is which just from the way they're written. There is another feature that got carried over, which I'll mention later when we talk about flags. So, enough about history.
Let's now work on a real problem and tackle something that's been bothering humans for a very long time: the search for God. I think the most appropriate place to start searching is actually the Bible, so let's take the King James Version, since it's freely available. It's just a plain text file, and it's really easy to load into memory as a string and then let `re` do its thing. So we just import `re` and search. That's wonderful, we get some results. Interesting. But we might want to expand our search and start looking at other texts: what other gods can we find in other texts? Let's say we try the New American Bible, or let's try the Wall Street Journal just for the heck of it. We can keep doing this until we're blue in the face.
But you are probably all itching, because I'm rewriting "God" all the time, and it would be nice if I didn't have to. If I want to change this regular expression, I have to go and change it in 50 places. So let's reuse the pattern. The naive way to do this is to assign it to a variable and then plug that variable in everywhere we had the string before. Another way is to compile it into something mysterious called a pattern object and then use the methods on this pattern object to search.
The question is: why would you want to compile? Why would we want to do this instead of just using a string? There are several arguments. One, which the official documentation encourages, is that we can modify the search scope a little bit. The pattern object's `search` is different from `re.search` in that you can give it a start position and an end position, so you can search part of a string instead of the whole. That's cool. Some other people say it improves readability. That's kind of a question of semantics. Instead, I'm going to zero in on one argument: that it's faster. I'm not entirely sure about that. So is `re.compile` faster? That's the claim, so let's investigate it by looking at the implementations of all these methods.
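The reuse options described in this part of the talk can be sketched roughly as follows. This is not the code from the slides: the sample sentence stands in for the Bible text, and all variable names are assumptions made for illustration. It shows the pattern kept in a plain variable, the compiled pattern object, and the optional start and end positions that the pattern object's `search` method accepts.

```python
import re

# Hypothetical stand-in for a text file loaded into memory as a string.
text = ("In the beginning God created the heaven and the earth. "
        "And God said, Let there be light.")

# Naive reuse: keep the pattern in a variable and pass it to re.search.
pattern_string = r"God"
first = re.search(pattern_string, text)

# Compiled reuse: a pattern object with its own search method.
pattern = re.compile(pattern_string)
same_first = pattern.search(text)

# The pattern object's search also takes a start and an end position,
# which narrows the part of the string that is searched.
later = pattern.search(text, 30)          # start looking at index 30
none_found = pattern.search(text, 0, 10)  # only look at text[0:10]

print(first.start(), same_first.start(), later.start(), none_found)
```

The module-level `re.search` has no equivalent of the two position arguments, so slicing the string by hand would be the alternative there.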
So let's look at `re.search` first. `re.search`, as we can see, uses something called `_compile` and then calls `search` on the result. And what does `re.compile` do? It uses the same function. Just based on this evidence, we could think that it's probably better if we compile first and then use `search`, because we would be saving ourselves that repeated compilation.
But it's not that simple. If we look at the implementation of `_compile`, we notice a couple of things. First of all, it uses a cache. Secondly, before it even starts doing anything else, it checks that cache. So what we thought was two compilations really boils down to one compilation on the first `re.search`, and the second time we do a `re.search` it is just a dictionary lookup, so we're not actually saving ourselves that much. Of course, this depends on when the cache gets cleared. The limit is set to 500, and just based on playing around with it, I would not normally expect you to run into that limit. If you're loading some framework or module that uses regular expressions heavily, you might, and then the cache sort of gets zeroed out.
But realistically, for most programs there doesn't have to be a serious benefit to using `re.compile` in terms of speed. It's slightly faster if the cache gets cleared. And if you really, really care about optimizing that much, I would recommend you think hard about the regular expressions themselves, because Python, and Perl for that matter, and most advanced regular expression libraries use a nondeterministic regular expression engine in the back end, and that is entirely driven by your regular expression. So if you find a way to optimize that, you can gain lots and lots of speed. I'm not going to talk about that specifically in this talk, because people write books about it; it's a big topic. Instead I'm going to close this topic by saying: sure, use `re.compile`, but don't expect it to be super fast.
All right, let's get back to where we were. Say you were reading and came across this, and you realize: oh, my expression doesn't capture this. So you read the documentation a little bit and you find that there is a solution: you can use something called `re.IGNORECASE`, give it to `re.compile` or `re.search`, and your searches will be case insensitive. But what is `re.IGNORECASE`? If we print it, it's just an integer. But we can stack them: we can combine several flags together using this pipe, the bitwise OR, and we can do so ad infinitum.
What actually happens is that this basically takes advantage of the fact that all integers are binary underneath, obviously. If you choose your integers well, namely if you choose them all to be powers of 2, they will basically be one-hot encodings, where the only 1 in the sequence has a unique position. So combining them, chaining them with the pipe, the bitwise OR, will just tell you which ones are set, and conversely, if you use the bitwise AND, you can figure out which options are present and which are not.
Now, this was not on my radar for one simple reason: I realized I don't use this pattern almost at all.
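The flag mechanics described above can be tried out with a short sketch. The sample string and variable names are assumptions made for illustration. In recent Python versions the flags are enum members rather than bare integers, so the sketch converts them with `int()` to show the underlying values.

```python
import re

# Each flag converts to an integer with exactly one bit set.
ignore_case = int(re.IGNORECASE)
multiline = int(re.MULTILINE)

# Bitwise OR stacks several flags into a single value.
combined = re.IGNORECASE | re.MULTILINE

# Bitwise AND checks whether a particular flag is present.
has_ignore_case = bool(combined & re.IGNORECASE)
has_dotall = bool(combined & re.DOTALL)

# The combined value is passed wherever a flags argument is accepted.
pattern = re.compile(r"^god", combined)
matches = pattern.findall("God said\ngod heard")

print(bin(ignore_case), bin(multiline), bin(int(combined)))
print(has_ignore_case, has_dotall, matches)
```

Because every flag occupies its own bit, the OR of any set of flags can always be taken apart again with AND, which is what lets a single integer argument carry several independent options.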
Then I thought, well, maybe I'm crazy. I'm a linguist by training, so maybe I just write weird Python code. But I also use other people's libraries, and they don't use this pattern either, so maybe it's just uncommon in Python these days. I decided to verify that, and what better way to verify it than to check the standard library? So I read through the documentation of the standard library over a couple of sleepless nights, and I found only 2 modules, out of the 240 or so, that use these, and they are for opening and accessing files. Two things are interesting about this. (a) The standard library confirms the intuition that they're not very common, and (b) the ones that do use flags like this are very low-level stuff. So to me it seems like the `re` module somehow miraculously retained something from an old era, something that was refactored out of more or less the rest of the standard library but stayed in the modules that have to do with low-level operations.
Cool, that was fun, right? The natural progression in the life cycle of a regular expression would normally be to talk about searching and matching next, but unfortunately my C skills are just not up to par to give a coherent picture. They have improved quite a bit since I started working on this, but I have nothing that I think I can present publicly, so we're going to go straight to the match objects.
This part of the talk is a little bit unusual, different from the previous ones. I'm actually not going to try to teach you anything new whatsoever; I'm just rehashing things that everyone already knows. There is no real secret or undocumented weird stuff going on when it comes to match objects, and the documentation is actually very clear about them.
And yet I find personally that whenever I use them, I have to look up every time the difference between `group`, `groups`, `groupdict`, and all the other stuff that you can do with match objects. That kind of throws me off a little bit, and I don't think I'm the only one, because I occasionally see code like this when I read others' code. My hope was to come up with a simple and succinct rule of thumb that will encourage people to avoid that, because you're not really playing to the strengths of the match object.
So let's have an example. We compile a regular expression, and this one I chose to be a little bit complicated. It has 2 groups, and they're both named. In the first one we're searching for the string "God". Then we have a space, and then the second group follows, where we match any alphanumeric character, one or more. We'll take a text of just one sentence, run the search, look at the result, and see that we got a match object. What do we do to get more information out of it?
What I really want folks to take away from this is that the match object responds to 3 types of request, 3 questions. First of all, you can tell it to give you the whole match, so this includes the groups and the stuff in between, everything. Then you can ask for an individual submatch, so you can ask for just "God" or just the word that follows. And finally you can get all the subgroups together, ignoring the other parts of the expression; you just get the groups.
For the total match you simply call `match.group()` and you get the entire string that matched. You can also call `match.group(0)`; the 0 is implicit, so that is the more explicit way to call it. That's all there is to the total match. If you want to get individual subgroups, you can call `group` with integers starting with 1, because 0 is taken.
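The three kinds of request can be sketched with a pattern shaped like the one described: two named groups separated by a space. The group names `deity` and `follows` and the sample sentence are assumptions, since the names used on the slide are not recoverable from the transcript.

```python
import re

# Two named groups separated by a space; the group names are hypothetical.
pattern = re.compile(r"(?P<deity>God) (?P<follows>\w+)")
match = pattern.search("In the beginning God created the heaven and the earth.")

# 1. The whole match: group() and group(0) are equivalent.
whole = match.group()
also_whole = match.group(0)

# 2. One individual subgroup, by position (counting from 1) or by name.
by_index = match.group(2)
by_name = match.group("follows")

# 3. All subgroups together, as a tuple or as a dict keyed by group name.
as_tuple = match.groups()
as_dict = match.groupdict()

print(whole, by_index, as_tuple, as_dict)
```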
Or you can give it the names of the groups; that's also on the table. Finally, if you want all the subgroups, you call `groups`, and that returns a tuple. If you have named groups you can also call `groupdict`, and that returns, obviously, a dictionary. So when people call `.groupdict()` and then try to access individual keys, what they're really going for is `.group()` with that key. You only really need `groups` and `groupdict` if you plan to pass the whole data structure on to somewhere else for processing.
That's more or less it. The things I want you to take away from this talk: number one, the flags argument is old, and the way flags are used in `re` is something you don't really get almost anywhere else in Python these days. Use `re.compile`, but don't hope that it will magically speed up your code by large factors. And finally, I hope that you walk away with a slightly clearer notion of what the match object does. Thanks.
Thank you. Do we have questions?
Do you know about [unclear], a "regular expressions for humans" alternative?
I haven't heard of that one. There was an attempt to rewrite the `re` module and extend its functionality quite a bit, and a few years back there was a lot of discussion on the mailing list about adding it, but in the end they decided basically not to do it. But you can get it on PyPI.
[unclear] OK.
I've got two questions. The first is about compile: did you compare the compiled and the non-compiled calls to see if one of those is faster?
No, I haven't. After that I sort of wondered which steps it would be necessary to go into [unclear] to see if there were any optimizations that were not apparent and that would then show up at different levels.
And the second question: the 2 modules that use flags, which libraries were they?
`socket` and [unclear].
OK, thank you.
I also had a question about the flags. [unclear] I didn't think there was anything strange about them. Anyway, if you wanted to do something else, the usual thing instead of flags would be to have explicit arguments saying this is true, this is false. What I've also seen in some places is people using strings a lot for this [unclear].
So, any other questions? No? OK.