
Time flies like an arrow; Fruit flies like a banana: Parsers for Great Good


Formal Metadata

Title
Time flies like an arrow; Fruit flies like a banana: Parsers for Great Good
Series Title
Number of Parts
66
Author
Contributors
License
CC Attribution 3.0 Unported:
You may use, modify, and reproduce, distribute, and make the work or its content publicly available in unchanged or modified form for any legal purpose, provided you credit the author/rights holder in the manner specified.
Identifiers
Publisher
Release Year
Language
Producer
Production Location
San Antonio

Content Metadata

Subject Area
Genre
Abstract
When you type print "Hello, world!", how does your computer know what to do? Humans are able to naturally parse spoken language by analyzing the role and meaning of each word in context of its sentence, but we usually take for granted the way computers make sense of the code we write. By exploring the way our brains construct grammars to parse sentences, we can better understand how parsers are used for computering -- whether it be in the way Ruby and other languages are implemented or in webserver routing -- and recognize when they may be the right tool to use in our own code.
Transcript: English (automatically generated)
…Hsu. That actually is how it's spelled. All the H's are silent, so on Twitter you can find me by so many H's. So today I'm gonna talk to you about parsing. For those of you who saw Aaron Patterson's talk this morning on the Ruby Virtual Machine, my talk covers all the boring steps that his talk skipped over, so if you still feel like you wanna skip over those steps, now's your chance.
So, a little explanation about the title of this talk. I probably first started thinking about parsing when I was teaching English as a foreign language, and I had to explain why so many words in English, like flies, had different meanings, and depending on which meaning the word had, the sentence could go in really different directions. So when I started looking into how computer languages are parsed, I noticed that there are actually a lot of similarities to the way humans parse sentences in languages that they hear and read every day. I'm a relative newcomer to the programming world, so I never learned about parsers or compilers in school or anything, so I have an alt title for this talk, which is How I Accidentally a Computer Science, and So Can You. Of course, this isn't going to be a comprehensive survey of parsers, but I just wanted to share how I came about learning about them as someone who doesn't have a CS degree, because I had no idea what they were, and I sort of ran into them by accident. So I had been building Rails applications for a while,
and one day I got tired of just blindly trusting all that Rails magic, and I wanted to see how things really worked underneath the hood. Specifically, I was trying to figure out how routing worked, so I went to GitHub and started poking around the Rails source code, and I came across a file called parser.rb. And I was like, cool, I'd done an exercise that was sort of like this, so I figured I'd probably know what the code would look like. But then I looked at the file, and it was nothing like what I expected. It was just a bunch of arrays with numbers. Where was the logic? Who even wrote this stuff? So I was like, okay, that's not enough. Looking closer, I saw that the previous file was actually generated using a different file called parser.y, so I went and looked at that. It still didn't make any sense, and it barely looked like any Ruby code that I recognized. There's semicolons, like, what? So that's when I decided to figure out what all of this meant. So to get warmed up to the idea of parsing,
let's play a game. I hope that most of you have played it before, but for those of you who may not be familiar, this is the word game where one person asks for certain kinds of words, such as a noun or an adjective, to make sentences. So for example, I need a verb, anyone. Did I hear drink? Sure, drank, great, awesome. The young man drank, awesome. So that's a valid sentence, right? If you can remember your middle school grammar lessons, we can diagram the sentence into a subject and a verb, where the verb is drank, and since all sentences need a subject to actually do the verb, we know that the subject is the young man.
So, makes sense so far, right? But let's see if there are other possible sentences that could have started with the young man. For example, the young man did something. So here, I need a verb and a noun. We can stick with our original verb, drank. Any suggestions for a noun? Beer? I know my crowd. So the young man drank beer, also grammatical. This time, instead of just subject plus verb, we have subject plus verb plus object, which again abides by the rules of English grammar if we interpret the young man as the subject, drank as the verb, and beer as the direct object. But still, can we come up with even more possibilities for valid sentences that start with the young man, either using subject plus verb or subject plus verb plus object, or even something else? So for example, I need a noun.
Anyone? So for example, is this a grammatical sentence: the young man the boat? Yes, no, do we know? Based on the previous examples, at first it's easy to think that it's not grammatical, because we assume that the subject of this sentence is still the young man. So then when we see the boat, our brains are like, nope, not a sentence. And this is because we know that a subject has to be followed by a verb. But it turns out you can parse this sentence in a different way. If we interpret the subject as the young, as in young people, it still follows the same rules as the sentence the young man drank beer, which was subject plus verb plus object. So we were initially led astray because we tend to think of man as a noun, and therefore, in the previous sentence, as part of the subject and not a verb. This kind of sentence is called a garden path sentence. Garden path sentences are sentences that contain a word that could be interpreted in more than one way,
so that you think the sentence is going to follow one structure before pivoting on that ambiguous word and going in a different direction. Time flies like an arrow, fruit flies like a banana follows that pattern, as do: the man who hunts ducks out on weekends; the prime number few; when Fred eats food gets thrown; Mary gave the child the dog bit a band-aid; and the woman who whistles tunes pianos. Ambiguity in words or sentence structure can also produce hilarious, unintentional second meanings in newspaper headlines, such as: grandmother of eight makes a hole in one; man eating piranha mistakenly sold as pet fish; eye drops off shelf; complaints about NBA referees growing ugly; and my personal favorite, milk drinkers are turning to powder. So regardless of the ambiguity in natural languages like English, we can all agree that as speakers of a language, there are certain grammar rules that we are all aware of
that allow us to make decisions as to whether or not a sentence is a valid expression in that language. As we've seen from the last few examples, we know that one possible kind of valid sentence in English consists of subject plus verb plus stuff. And again, most of us probably did some kind of sentence diagramming in middle school that looked a little bit like this. At the top of the tree, we have a sentence, which can be broken up into its constituent parts, such as a noun phrase and a verb phrase, which in turn get broken down further until every word in the sentence can be added as a leaf of that tree. In this tree, the leaves are the words of the sentence, John lost his pants, but the same exact tree could apply to the young man the boat.
So these kinds of trees are constructed based on grammar rules. The convention usually used to describe grammars in the computing world is a notation called Backus-Naur form, or BNF. In BNF notation, you have a set of rules that together describe the language, or in other words, can define every possible valid sentence in that language. There are two kinds of symbols used to express each grammar rule, terminals and non-terminals, which we'll get a little bit more into later. The basic format of each production rule is that there is a left-hand side which produces a right-hand side, or rather, can be rewritten as what's on the right-hand side. On the left-hand side, you have a single
non-terminal symbol or token which produces either a terminal or a sequence of terminals and non-terminals. So for example, a super simplified subset of production rules of English sentences based on the past few slides that we saw might be something like at the top level, a sentence becoming a subject followed by a predicate, and then a predicate becoming a verb followed by stuff.
So here, we can see an example of ten production rules that describe both of the sentences we created earlier, the young man drank beer and the young man the boat. These rules make use of six non-terminals, sentence, noun phrase, verb phrase, noun, verb, and article, and seven terminals, which are simply the words the, a, young, man, boat, beer, and drank. Using this grammar, we can start at the top level and keep applying production rules until we generate the full sentence. So we start with a sentence, which can be rewritten as a noun phrase and a verb phrase.
The noun phrase then translates to an article plus noun, which in the next step hits the terminals the for the article and young for the noun. Similarly, the verb phrase gets rewritten as a verb followed by a noun phrase: the verb becomes man, and the noun phrase then becomes article followed by noun, where the article becomes the and the noun becomes boat. The same rules can apply for the other sentence, the young man drank beer.
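For the derivation just walked through, the rules look something like this in BNF-style notation (a partial reconstruction from the description; the slide itself isn't captured in the transcript, so spellings and groupings are a best guess):

```
sentence    ::= noun-phrase verb-phrase
noun-phrase ::= article noun
verb-phrase ::= verb noun-phrase
article     ::= "the"
noun        ::= "young" | "boat"
verb        ::= "man"
```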
But what does all of this have to do with computers? It turns out the process by which we parse and understand a sentence in natural languages is not too different from what goes on with a computer language parser. That's somewhat debatable, but it works for our purposes. The basic flowchart of what happens to your code can be seen here: your code gets fed into a lexer, or tokenizer, which breaks up each sentence into recognizable units.
These get fed to the parser, which for most computer languages outputs an abstract syntax tree, which can then be walked by the compiler or interpreter to output CPU instructions. Of course, most of us aren't in the business of implementing new programming languages, but by better understanding the lexer and parser steps, we may be able to see how we can use parsers for simpler situations. Rather than a Ruby script, for example, you can feed anything, a log file, a markdown document, or even just a plain string, into your parser and get some kind of output, which doesn't even necessarily have to be a tree. It could be a different data structure, for example, or even something as simple as a boolean: you feed your string into your parser and you get true or false to tell you whether or not it was valid. So let's look at lexing, or tokenizing. All this means is turning your input into a sequence of tokens that your parser can recognize. These tokens will then be matched against the grammar rules that describe the language you are parsing. Let's look at a simple example.
Let's build a parser that can handle super basic binary operations in math, such as addition, subtraction, and multiplication. The entire language will consist of just an integer followed by an operator and then another integer. So the grammar rules for this language are pretty simple. We only need three of them to describe every possible sentence in our mini math language. The start symbol here is an expression, which consists of three non-terminal symbols or tokens: a number token, an operator token, and then a number token. The number token can then be rewritten as any digit, notated above using a regular expression, and the operator token can be rewritten as a plus, minus, or times symbol in production rules two and three.
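Written out in BNF-style notation, those three rules might look like this (a reconstruction from the description; the slide itself isn't in the transcript):

```
expression ::= number operator number      (rule 1)
number     ::= /[0-9]+/                    (rule 2)
operator   ::= "+" | "-" | "*"             (rule 3)
```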
So here's a simple implementation of a tokenizer in Ruby. It just uses regular expression matching, pretty much identical to the regex we saw in our production rules, to grab tokens of either type num or op, and then it returns them all in an array. So when we input the string 3 + 7, we should get a three-element array of num, op, and num.
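The slide with the Ruby tokenizer isn't captured in the transcript, but a minimal sketch along the lines described might be (names are illustrative, not the speaker's actual code):

```ruby
# Scan the input left to right, emitting [type, value] pairs
# for integers and the three operators.
def tokenize(input)
  input.scan(/\d+|[+\-*]/).map do |tok|
    tok =~ /\d/ ? [:num, tok.to_i] : [:op, tok]
  end
end

tokenize("3 + 7")  # => [[:num, 3], [:op, "+"], [:num, 7]]
```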
So now we want to pass that to the parser. Ultimately we want to end up with a tree that looks something like this, so that when you feed the tree to the interpreter, it sees the head of the tree and knows to execute addition on the left child and the right child. A super trivial implementation of a parser might look something like this: you have a parser that only expects three tokens in a specific order, and outputs a tree with a single node containing the operator and two children, which are the number tokens to be operated on. So, very easy.
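A sketch of that trivial three-token parser, assuming the tokenizer above and a simple Node struct (hypothetical names, a guess at what the slide showed):

```ruby
Node = Struct.new(:value, :left, :right)

# Expect exactly [num, op, num]; build a one-level tree with the
# operator at the root and the two numbers as its children.
def parse_binary_op(tokens)
  n1, op, n2 = tokens
  unless tokens.size == 3 && n1[0] == :num && op[0] == :op && n2[0] == :num
    raise "parse error"
  end
  Node.new(op[1], Node.new(n1[1]), Node.new(n2[1]))
end

tree = parse_binary_op(tokenize("3 + 7"))
tree.value       # => "+"
tree.left.value  # => 3
```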
But now for some slightly harder math. Take for example the expression 2 * (3 + 7). With this expression, we need slightly more complicated rules to describe it. Whereas before our first production rule was expression equals num followed by op followed by num, now at the top level, expression becomes num followed by op followed by expression, where expression can then either be rewritten as another expression in parentheses or as a number. The number and operator non-terminals still translate as they did in the previous production rules.
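Spelled out, the revised grammar might read (again reconstructed from the description):

```
expression ::= number operator expression
             | "(" expression ")"
             | number
number     ::= /[0-9]+/
operator   ::= "+" | "-" | "*"
```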
So we tokenize our input and get an array of seven tokens. And this is ultimately the tree that we want to build. But how do we tell the parser to construct the tree correctly? One common approach is to use a top-down parser that looks ahead one token to decide which grammar rule to follow. Like with garden path sentences, we need to know what comes after a given token to know what kind of rule to use to understand the sentence. So for example, knowing that the young man is followed by a noun, the boat, allows us to know what the subject and verb of the sentence are. If you can adhere to a grammar rule each time you consume a token, then you're good to go. But if you can't, then you reach an error state and the parser throws a parse error.
So here we start with the first token, which is a number token. Since peeking ahead at the next token tells us we have an operator coming up, we know that the only possible rule we can follow is the one that says an expression is a number followed by an operator followed by another expression. So we can start building our tree with the node 2. Shifting over to the next token, we have an operator. Peeking ahead, we see that the next token is a parenthesis. The only grammar rule that has an operator followed by a parenthesis is still that first one, so we know that so far we're still in the realm of grammatical correctness. And since there's no parse error, we can use that trivial parser from before and set the operator token as the head of our tree. Now we consume the parenthesis and see that the next token is a number. Since we know that a valid expression can either be a number or a number plus operator plus expression, we still haven't violated any grammar rules,
so we can continue parsing and update the right child of our tree to be evaluated later. Now we consume the number token, which holds the value 3. Previously we knew that we had two options: either interpret the expression as becoming just a number token, which then terminates into 3, or as part of the first production rule, which states that an expression can be a number followed by an operator followed by an expression. Since we can peek ahead to the next token, which is an operator, this gives us enough information to know which rule to follow, which is in fact the second one. So now we know that the expression we have as the right child of our tree will in turn be a binary expression, and using the same parser, we know that the 3 will be the left child of that subtree. Now we consume the operator token and see that it's a plus. Peeking ahead to the 7, we know that this adheres to the same production rule as before,
the one we decided on in the previous step, which tells us that we're good to keep parsing, so we can add the plus operator as the head of our subtree. Now we reach the 7, which we already know from the previous step has to be an expression of some sort. The lookahead tells us that our current token must terminate into a number. When we consume the final parenthesis, this completes the sub-expression of expression becoming expression in parentheses, so now we can build our complete tree. If, on the other hand, there had been no parentheses, then the second rule would have been violated and we would not have been able to construct a valid syntax tree.
So looking at a summary of the steps we took to build our syntax tree, you may have noticed that we often had to call expression from within an expression call. In other words, the process we went through to parse the sentence was recursive, and since we were building the tree from the top down, this method of parsing is called recursive descent. This is probably the most common type of parser that most people try to write by hand, but there are a number of problems with recursive descent. For one, the recursive nature of the process makes it relatively inefficient, so you may end up checking a token against every single grammar rule in your language before you can actually parse it. Or, if you're not careful in how you write your grammar rules, you'll end up with infinite recursion. For example, if you have a production rule that says an expression becomes expression operator expression, then when you hit the first token you go back to expression, which produces expression operator expression, and so forth ad infinitum.
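To make that left-recursion trap concrete (my notation, not the speaker's slide): a top-down parser expanding the first rule below never consumes a token before recursing, while the second version always consumes a number first.

```
expression ::= expression operator expression    (left-recursive: loops forever top-down)

expression ::= number operator expression        (consumes a token before recursing)
             | number
```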
Finally, sometimes you have limitations in the way or the order you write your grammar rules for your language. In the case of a calculator language like this one, depending on the order you write your grammar rules, you'll potentially run into issues with associativity. That is, if you say 2 - 3 + 5, how does the parser know whether to give precedence to the addition or the subtraction?
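For reference, here's what a minimal recursive-descent parser for the parenthesized grammar above might look like, reusing the Node struct from earlier and assuming the tokenizer is extended to emit [:lparen, "("] and [:rparen, ")"] for parentheses (a sketch, not the talk's code):

```ruby
Node = Struct.new(:value, :left, :right)

class RecursiveDescentParser
  def initialize(tokens)
    @tokens = tokens
    @pos = 0
  end

  def parse
    tree = expression
    raise "parse error: trailing input" if peek
    tree
  end

  private

  def peek
    @tokens[@pos]
  end

  def consume
    tok = @tokens[@pos]
    @pos += 1
    tok
  end

  # expression ::= num op expression | "(" expression ")" | num
  def expression
    if peek && peek[0] == :lparen
      consume                                   # eat "("
      inner = expression
      raise "parse error: expected )" unless peek && peek[0] == :rparen
      consume                                   # eat ")"
      inner
    else
      raise "parse error: expected number" unless peek && peek[0] == :num
      left = Node.new(consume[1])
      if peek && peek[0] == :op                 # look ahead one token
        op = consume[1]
        Node.new(op, left, expression)          # num op expression: recurse
      else
        left                                    # just num
      end
    end
  end
end
```

Note that this sketch is right-associative: RecursiveDescentParser.new(tokenize("2 - 3 + 5")).parse builds 2 - (3 + 5), which is exactly the associativity pitfall just described.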
So another approach to parsing instead builds the parse tree from the bottom up. The key difference here is that this approach makes use of a stack that you push tokens onto, until there are enough tokens on the stack to fulfill a grammar rule, at which point they get reduced and popped off. For that reason, this kind of approach is also called shift-reduce parsing. So let's go back to our slightly harder math example.
Here we start with the first token, 2. We push it onto the stack for the moment. Based on the grammar rules, we know that 2 is a terminal for a num token, so the stack reduces it to the left-hand-side symbol, number. Then we set aside the 2 as a building block for our eventual tree. Now we shift over and consume the next token, which is a times sign. This gets pushed onto the stack, and then we look at the grammar. The rule that matches is operator becomes multiplication, so we reduce the token on the stack and move it over. Now we shift again, grab the parenthesis, and add it to the stack. Since this looks like it could be covered by the rule where expression becomes a parenthesized expression, we just keep it on the stack for now and keep shifting. The next token is a 3, which we push. Again applying the first grammar rule, we reduce the 3 to a number token in our stack and set it aside. Shifting over to the plus sign, we add it to the stack as usual. We look at our grammar rules and know that this is an op token, so the next action is to reduce, and once again we set it aside. Now our next action again is to shift. We push the next token onto the stack and apply our production rules, which recognize a number token, but also that a number is a type of expression, so the stack executes another reduce, and we put our 7 aside. So at this point, the top tokens on our stack have fulfilled a rule that our grammar recognizes,
namely that an expression can be a number followed by an operator followed by an expression. So now the stack takes the three tokens involved in the rule and reduces them to simply an expression, now shown on the stack in green. Our parser knows what to do with the elements involved and can create a subtree from the 3, plus, and 7, in the same way that we previously saw in the top-down parser. That's all the reducing that can be done for now, so again we shift to consume the last token. Again we push it onto the stack, and now the tokens at the top have fulfilled another grammar rule, which is expression becomes expression in parentheses. Since the three tokens at the top of the stack fulfill a grammar rule, the stack reduces the parenthesis, expression, and parenthesis into an expression token, as per the rule. So now the tokens we have left on the stack are num, operator, and expression. But now we're at a state again where the tokens at the top of the stack have met another grammar rule, which is expression becomes num operator expression. This allows the first tokens we set aside to be added to the tree at the proper level, above the subtree constructed from the previous reduction. Again we reduce the tokens at the top of the stack into the top-level expression non-terminal.
So now we've reached the end of our tokens, and since we were able to perform a valid reduction, we know that the input was a valid grammatical sentence in our math language, and our parser can accept it and create the final tree.
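To make those mechanics concrete, here's a hand-rolled sketch of the shift-reduce loop for the same grammar and token format. It's a simplification: the one-token-lookahead check that decides when a bare number reduces to an expression stands in for the parse tables a real bottom-up parser would generate.

```ruby
Node = Struct.new(:value, :left, :right)   # same shape as before

# Try to reduce the top of the stack by one grammar rule; true if it did.
def try_reduce(stack, lookahead)
  top = stack.last(3).map(&:first)
  if top == [:num, :op, :expr]                     # expression ::= num op expression
    right = stack.pop[1]
    op    = stack.pop[1]
    left  = Node.new(stack.pop[1])
    stack << [:expr, Node.new(op, left, right)]
  elsif top == [:lparen, :expr, :rparen]           # expression ::= ( expression )
    stack.pop
    expr = stack.pop[1]
    stack.pop
    stack << [:expr, expr]
  elsif stack.last && stack.last[0] == :num && [nil, :rparen].include?(lookahead)
    stack << [:expr, Node.new(stack.pop[1])]       # expression ::= num (no op follows)
  else
    return false
  end
  true
end

def shift_reduce_parse(tokens)
  stack = []
  input = tokens.dup
  until input.empty? && stack.map(&:first) == [:expr]
    next if try_reduce(stack, input.first && input.first[0])
    raise "parse error" if input.empty?
    stack << input.shift                           # shift the next token
  end
  stack.first[1]                                   # the finished syntax tree
end
```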
So those are two types of approaches to parsing. Top-down, or recursive descent, parsers are the ones most commonly written by hand, whereas bottom-up parsers, which you might sometimes hear called LALR, are much harder to write. Unsurprisingly, the more grammar rules you have, the harder it is to construct a parser. But fortunately there are tools to make parsers for you.
These tools are called parser generators, and all they need is a grammar file with all the rules that define the language you want, and bam, it'll spit out a parser for you. Those grammar files are like that wacky .y file we saw at the very beginning of the slide deck, but now that we understand how grammar rules are written, we can see it's not that different from the BNF notation that we've been using. If you're using something like Racc as your generator, it expects a class with a rule section, and that's where you define your grammar rules. So here you can see that the rule for the non-terminal symbol expression consists of a num followed by an operator followed by another number. When this sequence gets hit by the parser, it executes the Ruby code that you specify in the block; we saw earlier in the bottom-up parser that when a grammar rule is fulfilled, the parser performs some action. The grammar file also comes with a tokenizer, which is how the parser generator knows what the terminal tokens you use in your rule set actually are, but it's pretty similar to the one we saw earlier, so I'm not including it on the slide.
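A Racc grammar file along those lines might look roughly like this; this is a hedged sketch of Racc's .y format, since the talk's actual file isn't reproduced in the transcript:

```ruby
class MiniMathParser
rule
  expression
    : NUM OP NUM { result = Node.new(val[1], Node.new(val[0]), Node.new(val[2])) }
end

---- header
Node = Struct.new(:value, :left, :right)

---- inner
  # Racc calls next_token repeatedly; it should return [type, value]
  # pairs like [:NUM, 3] and [:OP, "+"], and [false, false] at end of input.
  def parse(tokens)
    @tokens = tokens
    do_parse
  end

  def next_token
    @tokens.shift || [false, false]
  end
```

Running something like racc mini_math.y -o mini_math_parser.rb would then produce the generated, table-driven parser file.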
So all that's left is to pass your grammar file into Racc or Bison, and it should generate a file that can now parse whatever language you specified. You don't even need to do anything with it, or even look at it, but if you did, it would look something like this, which now makes a little bit more sense: the numbers in the various tables here simply correspond to the different actions the parser needs to take, but for the most part you don't even need to worry about it. And now you have a parser. But now you might ask, when would I actually use a parser? And that's a great question, because there are a lot of situations in which you can use one. For example, you could use one if you're validating a string such as a URL or an email address, or maybe extracting information from the body of an email
or a log file, or even converting some sort of formatted document into another form, for example Markdown to HTML. But now you might be thinking, well, I've validated strings before just using regular expressions. And it's true, for some simple string input you probably could just use a simple regex. But has anyone here actually tried to validate, for example, a URL using regexes? I'm sorry, because if you did, following the rules for how URLs are constructed, you might have ended up with something like this, which is terrible. You pretty much never want something like this in your code, because it's hard to read. Don't read it. You're probably trying to figure it out right now, right? And it probably doesn't even cover all the cases that you need it to. The reason for this is that there's a really important distinction in languages, even string languages, to take into consideration, and that is that all languages belong to a hierarchy. The established hierarchy for languages,
which was first introduced by Noam Chomsky, has four types. The first type, type zero, is unrestricted, and this is the category that most natural languages like English fall into. These are languages that do not have regular or consistent grammars and which can be ambiguous. Type one is called context-sensitive, which we don't really care about for the purposes of this talk, but then we get into types two and three. Type two languages are what are known as context-free languages, and most programming languages are type two, but you don't even need to be a language as complicated as, say, Python to be considered context-free, as we'll see in a minute. Finally, type three languages are what are called regular languages, and these are the only kinds of languages that can be properly parsed by regular expression engines, hence the name. But what does it mean for a language to be either regular or context-free? For the most part, it all boils down to the grammar. For regular languages, all grammar rules can be expressed as a single non-terminal symbol on the left-hand side, and either a terminal or nil on the right-hand side. This terminal can sometimes have a non-terminal before it or after it, but not both. Context-free languages, on the other hand, have more flexible grammar rules: while the left-hand side is still a single non-terminal, the right-hand side can be a sequence of terminals
and non-terminals. And as it turns out, most languages aren't regular. So let's take a really simple language, one you'd think is so simple that surely you could just use a regular expression to parse it. Let's consider the AB language. The rule for this language is that you have a series of As followed by the same number of Bs. So a valid string could be ab, or aabb, and so forth, whereas invalid sentences are strings of all As, or abb, or alternating abab. If we were to try to write grammar rules for this, we might start with something like: the non-terminal A becomes the lowercase a, and the non-terminal B becomes the lowercase b. Then we could construct a sentence, S, as the sequence of non-terminals AB. But this rule only gives us one possible sentence, which is just ab. So in order to produce all possible sentences, we'd have to add S becomes AABB, and then AAAABBBB, et cetera, and so on forever. And this obviously doesn't work. This is kind of the same problem as figuring out whether you have balanced parentheses. So what is the grammar? Thanks to parser generators, coming up with the grammar is the hardest part of writing a parser, but sometimes coming up with that grammar is a bit tricky, since it has to cover all possible sentences.
So this particular problem reminds me of a logic problem that maybe some of you have heard before. There are a bunch of gnomes who've been kidnapped by an ogre and put in a cave. Some of them have their foreheads painted red, to be eaten right away, and some of them have their foreheads painted blue. However, the ogre gives them one chance at saving themselves. The gnomes are allowed to exit the cave one by one, so that they can actually see each other but not communicate, and they still can't see the color of their own forehead. If they can sort themselves into a single line where all the red-painted gnomes are on one side and all the blue-painted gnomes are on the other side, then they'll all be freed. So does anybody actually know the answer to this one? No? The answer is that if each gnome who comes out of the cave inserts himself in between the red and blue gnomes, they'll automatically sort themselves. And if you think about it, this is actually the same way that the AB language works. To use BNF notation, that means a sentence or statement S
is basically an a and a b with another statement token inserted in between the a and the b. If we specify that S can also be nil, then that provides the only terminal you need to cover all cases of our deceptively simple-seeming AB language.
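In BNF that's S ::= "a" S "b" | nil, and a tiny Ruby acceptor that mirrors the rule makes the recursion visible (a sketch; like the boolean-output parser mentioned earlier, it just answers valid or not):

```ruby
# Accepts exactly n a's followed by n b's (n >= 0), by peeling one
# "a" and one "b" off the ends around a recursive call, like the
# gnome inserting himself between the reds and the blues.
def ab_language?(str)
  return true if str.empty?          # S ::= nil
  str.start_with?("a") && str.end_with?("b") &&
    ab_language?(str[1..-2])         # S ::= "a" S "b"
end

ab_language?("aabb")  # => true
ab_language?("abab")  # => false
```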
Remembering that a grammar rule for a regular language can't have a sequence of terminals and non-terminals on the right-hand side, we see that this language is in fact a context-free language. So as I mentioned, not everyone is in the business of implementing a programming language from scratch, but using grammars is useful in everyday work, such as understanding internet standards. For example, here are the rules that describe the syntax of HTTP auth headers from RFC 2617, which specifies HTTP authentication. A Ruby implementation of this standard can be found in the net-http-digest_auth gem, but you could also implement a parser using these rules just to check input for validity. Except for email. To validate email, just send the email. So as you've seen,
you don't really need that much complexity before you get out of regular-expression-parseable territory. And because of that, you're going to be better off figuring out the grammar rules for the thing you're trying to parse and using a parser instead. This also ensures greater accuracy in what you're trying to parse. For example, there are a number of web servers that use a real parser for HTTP requests, such as the Ragel-based parser in Mongrel, or Jetty. But even as recently as a few months ago, there was a vulnerability uncovered in Jetty, which is used in Java servlets, that leaked debug information in the response. So that was bad. The other thing about parsers is that they're often a faster way of getting what you need done, done. For example, in a web framework, if you use a parser for your routes, then you can build a tree, which makes routing much faster than a linear look-through. Bonus points if somebody can tell me the big-O complexity for that. Of course, parsers are harder to write, but thanks to parser generators, all you need to figure out is the grammar. There are many resources that I used to prepare for this talk, and I couldn't list them all on one slide,
but here are a few links if you want to see a full implementation of a recursive descent parser or a shift-reduce parser. I also highly recommend the book Constructing Language Processors for Little Languages by Randy Kaplan, as well as Pat Shaughnessy's Ruby Under a Microscope, which Aaron also recommended in his virtual machine talk. So thank you very much.