We're sorry but this page doesn't work properly without JavaScript enabled. Please enable it to continue.
Feedback

Understanding parser combinators: a deep dive

00:00

Formale Metadaten

Titel
Understanding parser combinators: a deep dive
Serientitel
Anzahl der Teile
96
Autor
Lizenz
CC-Namensnennung - keine kommerzielle Nutzung - Weitergabe unter gleichen Bedingungen 3.0 Unported:
Sie dürfen das Werk bzw. den Inhalt zu jedem legalen und nicht-kommerziellen Zweck nutzen, verändern und in unveränderter oder veränderter Form vervielfältigen, verbreiten und öffentlich zugänglich machen, sofern Sie den Namen des Autors/Rechteinhabers in der von ihm festgelegten Weise nennen und das Werk bzw. diesen Inhalt auch in veränderter Form nur unter den Bedingungen dieser Lizenz weitergeben
Identifikatoren
Herausgeber
Erscheinungsjahr
Sprache

Inhaltliche Metadaten

Fachgebiet
Genre
Abstract
Traditionally, writing parsers has been hard, involving arcane tools like Lex and Yacc.An alternative approach is to write a parser in your favourite programming language, using a "parser combinator" library and concepts no more complicated than regular expressions. In this talk, we'll do a deep dive into parser combinators. We'll build a parser combinator library from scratch in F# using functional programming techniques, and then use it to implement a full featured JSON parser.
Syntaktische AnalyseKombinatorParserDreizehnCodeFormale SpracheZahlzeichenMereologieBruchrechnungZeichenketteWitt-AlgebraProgrammbibliothekGebäude <Mathematik>Message-PassingFehlermeldungFunktion <Mathematik>ProgrammierungSinguläres IntegralWiederkehrender ZustandGoogolEin-AusgabeVersionsverwaltungEin-AusgabeZeichenketteMereologieFunktionale ProgrammierungElektronische PublikationVersionsverwaltungFunktion <Mathematik>Syntaktische AnalyseZeiger <Informatik>QuaderStreaming <Kommunikationstechnik>Formale SpracheArithmetisches MittelKombinatorVideokonferenzProgrammbibliothekMAPMessage-PassingParserPuls <Technik>Abstrakter SyntaxbaumCodeOrdnung <Mathematik>TropfenLesezeichen <Internet>AbfrageObjektorientierte ProgrammierungEndliche ModelltheorieProgrammierspracheSuchmaschineSpeicherabzugSoftwareindustrieDämpfungTypentheoriePunktCASE <Informatik>Leistung <Physik>SkalarproduktPoisson-KlammerWinkelWeb SiteEinsMultiplikationsoperatorTwitter <Softwareplattform>KnotenmengeVerzeichnisdienstRechenschieberProgrammierungSymboltabelleMailing-ListeInnerer PunktFehlermeldungSoftwareSchlüsselverwaltungComputeranimation
ZeichenketteEin-AusgabeMessage-PassingVersionsverwaltungSyntaktische AnalyseAuswahlaxiomMotion CapturingFunktion <Mathematik>Wrapper <Programmierung>ParserInnerer PunktSyntaktische AnalyseEndliche ModelltheorieRechter WinkelMereologieAutomatische IndexierungTypentheorieHardy-RaumCASE <Informatik>ResultanteUmkehrung <Mathematik>FehlermeldungMultiplikationsoperatorParserFunktionale ProgrammierungMathematikEin-AusgabeVerzweigendes ProgrammObjektorientierte ProgrammierungZeichenketteHyperbelverfahrenCompilerCodeKombinatorStreaming <Kommunikationstechnik>AuswahlaxiomFunktion <Mathematik>AppletGenerische ProgrammierungInformationBitVersionsverwaltungTupelParametersystemMatchingMessage-PassingSampler <Musikinstrument>LaufzeitfehlerComputeranimation
Syntaktische AnalyseEin-AusgabeZeichenketteRechenwerkCodeParserEin-AusgabePolstelleTypentheorieOrtsoperatorResultanteMessage-PassingZeichenketteComputeranimation
CodeKombinatorProgrammbibliothekGanze ZahlSyntaktische AnalyseParserEin-AusgabeFehlermeldungFunktion <Mathematik>Nichtlinearer OperatorParserMAPInnerer PunktKombinatorProgrammbibliothekAdditionSyntaktische AnalyseDatenbankMultiplikationsoperatorResultanteZweiLogische ProgrammierungPoisson-KlammerDemo <Programm>Ein-AusgabeCodeWort <Informatik>Nichtlinearer OperatorFunktionale ProgrammierungGeradeStreaming <Kommunikationstechnik>Vorzeichen <Mathematik>SkalarproduktMailing-ListeFolge <Mathematik>WinkelZeichenketteHardy-RaumMatchingKnotenmengeDelisches ProblemBitWeb-SeiteGanze ZahlWeg <Topologie>ÄhnlichkeitsgeometrieSymboltabelleObjektorientierte ProgrammierungMereologieDigitalisierungApp <Programm>Message-PassingArithmetisches MittelXMLComputeranimation
ZeichenketteParserEin-AusgabeSyntaktische AnalyseParserZeichenketteSyntaktische AnalyseInnerer PunktGruppenoperationSchreib-Lese-KopfMAPWeg <Topologie>SchnittmengeResultanteVersionsverwaltungCodierungMatchingGanze ZahlMessage-PassingCodeComputeranimationProgramm/Quellcode
Komplex <Algebra>Gebäude <Mathematik>Syntaktische AnalyseMailing-ListeOrdnungsreduktionAuswahlaxiomZahlzeichenUmsetzung <Informatik>Folge <Mathematik>ParserZeichenketteZeichenketteSymboltabelleMatchingDifferenteElement <Gruppentheorie>CodeEinsQuick-SortFolge <Mathematik>Mailing-ListeFunktion <Mathematik>PunktMAPMapping <Computergraphik>ParserFunktionale ProgrammierungSyntaktische AnalyseDigitalisierungKombinatorGeradeAuswahlaxiomNichtlinearer OperatorOrdnungsreduktionMessage-PassingReelle ZahlComputeranimation
Folge <Mathematik>AuswahlaxiomParserMailing-ListeZeichenketteSyntaktische AnalyseZahlzeichenRechenwerkHIP <Kommunikationsprotokoll>ZeichenketteFehlermeldungKombinatorCodeGeradePunktDigitalisierungOrtsoperatorMereologieParserComputeranimation
KombinatorZahlzeichenKombinatorMinkowski-MetrikParserSyntaktische AnalyseMatchingMailing-ListeGeradeProgrammbibliothekInnerer PunktGrenzschichtablösungZeichenketteDelisches ProblemDigitalisierungRechter WinkelGüte der AnpassungSchnittmengeProzess <Informatik>EinsMessage-PassingMereologieEinfache GenauigkeitComputerunterstützte ÜbersetzungBoolesche AlgebraOrtsoperatorComputeranimation
ZeichenketteSyntaktische AnalyseResultanteZahlzeichenMailing-ListeDigitalisierungProgrammbibliothekCodeMailing-ListeInnerer PunktEigentliche AbbildungBitGanze ZahlMAPFunktionale ProgrammierungParserObjektorientierte ProgrammierungRechter WinkelGeradeComputeranimation
ParserEin-AusgabeFehlermeldungSyntaktische AnalyseVersionsverwaltungZahlzeichenMessage-PassingStreaming <Kommunikationstechnik>GeradeInformationKontextbezogenes SystemGanze ZahlGebäude <Mathematik>DiagrammZeichenketteZahlenbereichNichtlinearer OperatorBoolesche AlgebraKontrollstrukturUnicodeSymboltabelleProgrammbibliothekCodeInhalt <Mathematik>FehlermeldungZeichenketteTypentheorieKonstruktor <Informatik>Funktion <Mathematik>ResultanteSchreib-Lese-KopfMereologieMailing-ListeZahlenbereichBoolesche AlgebraDämpfungGeradeSchlüsselverwaltungData DictionaryNP-hartes ProblemGanze ZahlMessage-PassingDigitalisierungKombinatorParserGamecontrollerGüte der AnpassungMatchingKategorie <Mathematik>Prädikat <Logik>Funktionale ProgrammierungEin-AusgabeSyntaktische AnalyseStreaming <Kommunikationstechnik>Quick-SortMultiplikationsoperatorMaskierung <Informatik>KontrollstrukturBitAuswahlaxiomDelisches ProblemKonditionszahlTaskEinsObjektorientierte ProgrammierungNichtlinearer OperatorZusammenhängender GraphCASE <Informatik>Zeiger <Informatik>UnicodeSystemaufrufMathematikElektronische PublikationXMLComputeranimation
SystemaufrufHorizontaleZahlzeichenMaskierung <Informatik>UnicodeZeichenketteMailing-ListeAuswahlaxiomParserVorzeichen <Mathematik>Ganze ZahlMereologieExponentBruchrechnungZahlenbereichMaskierung <Informatik>FehlermeldungVorzeichen <Mathematik>ZeichenketteProgrammierspracheMereologieCodeFunktion <Mathematik>Mailing-ListeAuswahlaxiomCASE <Informatik>Funktionale ProgrammierungSyntaktische AnalyseParserObjektorientierte ProgrammierungKonfiguration <Informatik>DiagrammDigitalisierungSystemzusammenbruchGanze ZahlProgrammbibliothekDemo <Programm>ExponentBitWellenlehreTypentheoriePunktBruchrechnungEigentliche AbbildungEinsGanze FunktionExpertensystemUnicodeZahlenbereichResultanteDatensatzDelisches ProblemEinfache GenauigkeitMessage-PassingServerMAPVerschlingungHochdruckImaginäre ZahlKontrollstrukturNeunÄquivalenzklasseStandardabweichungMinimumComputeranimation
Syntaktische AnalyseZeichenketteMinkowski-MetrikPrädikat <Logik>GeradeKonvexe HülleSteuerwerkFolge <Mathematik>ParserEin-AusgabeZahlzeichenStellenringSystemprogrammierungProgrammbibliothekObjektorientierte ProgrammierungRechenwerkFehlermeldungZahlenbereichGammafunktionWMLWidgetObjektorientierte ProgrammierungZeichenketteMailing-ListeCodierungCodeMAPBildgebendes VerfahrenDämpfungGanze FunktionNatürliche ZahlSystemprogrammZahlenbereichMultiplikationsoperatorPoisson-KlammerSummierbarkeitGeradeQuadratzahlBoolesche AlgebraFehlermeldungMessage-PassingWeb SiteSyntaktische AnalyseKategorie <Mathematik>Data DictionaryGraphfärbungGanze ZahlWinkelLesezeichen <Internet>SoftwaretestOrtsoperatorMinkowski-MetrikParserImaginäre ZahlWidgetProgrammbibliothekReelle ZahlHilfesystemBitrateComputeranimation
Funktion <Mathematik>KombinatorSyntaktische AnalyseKomponente <Software>ProgrammbibliothekGeradeParserVideokonferenzSoundverarbeitungNeuroinformatikTypentheorieObjektorientierte ProgrammierungFunktionale ProgrammierungVideokonferenzEinsProgrammierungPunktEin-AusgabeStreaming <Kommunikationstechnik>Folge <Mathematik>AuswahlaxiomCodeKomplex <Algebra>ZeichenketteGeradeParserProgrammbibliothekFormale SpracheSyntaktische AnalyseVerzeichnisdienstHilfesystemKombinatorCodecQuick-SortLeistung <Physik>Schreiben <Datenverarbeitung>Rechter WinkelCASE <Informatik>Hook <Programmierung>ÄhnlichkeitsgeometrieMereologieComputeranimation
Transkript: Englisch(automatisch erzeugt)
Alright, well I think I'm going to start a minute early because it's pretty much time. So this talk is understanding parser combinators. My name is Scott Voloshin, that's my Twitter handle. I have a website fsharpforfundedprofit.com and the code and the slides and the video
will be on a directory called parser at some point. So this is one of my talks where I try to squeeze like a whole day's worth of stuff into like 60 minutes. So as usual I'll be going very fast and covering a lot of ground. I don't really, as always, I don't
really expect you to remember everything, but if you can get just some of the concepts and if it gets demystified, that's the main thing. So it just becomes less intimidating. So I'm going to be using F sharp for code examples. The concepts will work in pretty
much any programming language, other than COBOL or something. So here is some typical code using parser combinators and when you look at it, when you first come across code like this, it looks kind of intimidating. The main thing is there's all these strange symbols, there's like a vertical bar thing and angle brackets with dots and stuff, and
I think that's one of the things which is particularly intimidating. And so I guess the goal of this talk is if you can understand this code or at least not be intimidated by this code, that will be a success in my book. Obviously, it's really hard to learn everything in 60 minutes, but if you can just look at this and say that kind of looks
vaguely familiar, let me look that up and see what it does. So first of all, I'm going to talk about what is a parser combinator library, and then we're going to build a very, very simple parser which will be the foundation for the rest of the talk. We're
going to build three very simple parser combinators and then we'll start using those combinators to build more complex combinators from the simple ones. We'll have a little side track, a little side excursion on improving the error messages, and then finally we'll get to the core thing which is how to build a JSON parser using these techniques.
So, what is a parser combinator library? So when you write a parser, there's something you're trying to match, it's a keyword or it's an int or a string or a float or something, you're trying to match this thing and in this model you create a step in a recipe, a parsing recipe, and you end up with this object which is a parser of
something. So this parser of something is really a recipe that when you run it later on it will give you back whatever is the thing you're trying to look for. So you have these parser things and then you combine these parser things with other parser things to make new kinds of parser things. So this combination of combining
things together, that's what a combinator is, that's all it is, it's not particularly mysterious, but the whole point is that when you combine things and you get things of the same type, you can then cascade that, you can build on that and make bigger and bigger things from smaller and smaller things. This is the whole concept of composition, why composition is so important, and especially in functional
programming why people are so strong on using that technique. So here we have, in this case, a recipe to make a thing C from an A and a B. And then finally when we've got the recipe, the parser, we have to run it and when we
run it we get a success or a failure depending on whether we succeeded in matching the thing, and in order to run it we also need some input, which is the stream of characters that we're running over. So that's it, that's parser combinators in a nutshell. Now why parser combinators as opposed to something like Lex and Yak or Atlo or the various other techniques?
So the first thing is that they're written in your favourite programming language, which is nice, you don't have to kind of drop out into a different language to do it, which means also you can use your favourite tools to do the programming in, which is nice. There's no pre-processing needed, so the lexing and, you know, in traditional parsers you have a
lexing stage and a parsing stage and you transform the AST and all this kind of stuff. In the parser combinator model that's all one thing. And because there's no pre-processing it's very REPL friendly which you can use it interactively, which is kind of nice. Because it's such a small kind of thing, you can use it to create little DSLs really quickly,
like Fog Creek, the software company, they used the fparsec, which is the F sharp library for this, to write a little DSL for parsing query strings, a search engine, and, you know, very, very simple DSL, do you say, you know, something and something or quoted string and, or, you know, whatever, just like Google has a very
simple DSL. And finally, from my point of view, it's a fun way of understanding functional composition. So even if you're actually not going to use it for anything, it's a fun thing to learn because it gives you insight into what a nice functional library looks like. So let's start with a simple parser and I'm going to create four
versions of this parser, starting with something really simple and getting more and more complex. So the first version of the parser is going to be just parsing the character A, that's all it is. So in order, this is going to be a function, there's this parsing function, this gray box, and there's going to be an input, which is a list of characters or a string or a stream of
characters, whatever. It's going to return true or false if it succeeds in matching the character and it's going to return the remaining input. Now, if it matches the character, it's going to consume that character from the stream and returning and return the remaining characters. What's really important is that the inputs and the outputs are immutable, just like
in all functional programming. In particular, that's very useful because that means if the parsing fails, you can take another parser and apply it to the same input. You don't have to worry that the file point has moved around and you're not starting at the same place. So that's a key aspect of the combinator design. So let's show you the
code. So here's my Pchar. I'm going to call it parser character A. If the input is empty, I'm going to return false. If the first character is A, then I'm going to return true. And I'm also going to take the remaining characters just by starting from index one and returning
the rest of the string. And if it doesn't match A, I'm going to return false. So that's really a brain-dead kind of parser. If you're not familiar with F sharp, these are actually the return values. You don't need a special return keyword in F sharp. So that's what we're returning. Okay, that's version one. Version two is a
little bit better. The problem with version one, of course, is it's hard-coded for character A. Let's make it a bit more flexible and allow us to parse in any character. So I'm going to parse in an extra parameter now, which is the character to match. So it could be an A or a B. It could get really exciting here. The
other thing is the output is going to be a bit more complex this time, because if it matches, I need to return the character that I matched, because I don't know whether it matched or not, and the remaining input. And on a failure, I want to return a nice error message. I don't just want to return true or false. I want to return a message like, you know, I was
looking for an A and you gave me a B, something like that. So let's look at the code for that. Very similar. This time, I'm returning a nice error message, no input. If it matches the character, I return the match character and the remaining, and if it doesn't match, I have a nice little error message. I was expecting this and I got something else. The
problem with this code is it doesn't compile, because the return values here are different types. That return value is different from the other one. On the failure case, I'm returning a string, but on the success case, I'm returning a pair, a tuple of the matched character and the remaining string. So this won't
compile. So the way to fix this in a functional programming language is to create a union type, a choice type, I like to call them, which is a choice of either case. So we're going to create a type, I'm going to call it results, and it's got two cases, it's either success or it's a failure, and if it's successful, it's going to have a tick A,
that's the F sharp way of saying it's a generic type. In C sharp or Java, that would just be like a capital T or something. So those are the two choices, and then our output now has two choices as well. On the success branch, it's the pair, and on the failure branch, it's the message. So let's see
how the code has changed. So instead of returning the string, we're going to return the failure of the string, and instead of returning the pair, we're going to return the success of the pair, and again, on the failure case, we're going to return a failure branch as well. So that's our
version 2 code. Hopefully this is, I know I'm going through it very quickly, but this hopefully is really kind of obvious, roughly, again, the concepts are really obvious, the details of the code you don't have to worry about too much. So the return values are now the same type, when the compiler is very happy.
Right, so that's version 2. How can we make it more complicated? So one of the things about this model is that currently, we have this character to match, and we know this in advance, when I'm looking for a particular character, I know that I'm looking for a quote or a semi-colon or something, I know that when I'm designing the parser. The input, we don't
know this in advance, we don't know the input until later on. So what I really want to do is be able to work with all the stuff, with the information that I do know, and kind of delay working with the rest of the stuff until I actually have it at run time. So the way to do that in a functional programming language is to return
a function. So instead of having a two input function like this, I'm going to turn it into a one input function which returns another function. So you see there's before, and there's after. So this thing is now going to return a function. That function is just like the original one, it's got this little input stream, and it's got
the result and so on. So now what we've done is we've got a way of building a function that when we run it with an input stream would actually do the work. In version four, we're going to take this function, which is kind of awkward to use, and we're going to wrap it in a type. We're going to just wrap it up so it becomes a thing. And we're going to call this thing a parser,
and in this case it's a parser which returns a char. And here is the type, this is where it starts getting a little bit ugly, so the type of parser, it contains a function inside the type. So this is definitely functional programming here, it's a function which has been treated as an object.
And the function in there takes a string, and it returns a result, and then we're wrapping it up in this parser type. So that's our basic parser. So let's see, we'll go back to our recipe model and see how it works. We've got this character we want to match,
we run it through this thing, we now have this parser, and then we're going to combine all these parsers. And remember that a parser just basically means it's something that if you give me some input later on, I will then give you a success or failure later on. And when I want to run it, I give the input, and again, if you look inside the parser,
basically it's just a function inside there, so all I have to do is run that function with the input, and it gives me the success or failure. So let's look at how the run code works. So to run a parser with an input, basically you just unwrap the parser to get the inner function, and then you call the inner function with input.
So the piece of code that runs it is really trivial. All right, let's see some code. That's enough talking. So this is the real code. There's the result type, there's the parser type, there is the code that matches the thing,
there's the run code. Let's see if this works. And so here is a parser for the character A. There's some input. If I run that character,
whoops, no I don't, thank you very much. If I run that parser on that input, I get this down here, it's a success. This is the character that was returned, that's the character that was matched, and this is the remaining string. But if I run it on a bad input,
like the first character is a Z, when I try and run it, I get this failure message. I was expecting an A, but I got a Z instead. So that is it, that is our parser done and dusted. Pretty straightforward. So hopefully that makes sense.
That parser is, that's it, we're done with the parser logic. Everything else is now combining these parsers. So let's look at some basic ways of combining the parsers. So this word combinator, it has a technical meaning, which is basically any function that depends only on its inputs,
which is not really a very helpful function. So typically when we talk about combinators, we talk about combinator libraries, and a combinator library is a library designed around combining things to make new things, okay? That's a combinator library. Very common in functional programming, you have a parser combinator library, you have an HTML combinator library,
you might have a database combinator library. It's a common way of designing libraries. So here's an example of a combinator. Addition is a combinator. Adding two integers, you get a new integer. If you concatenate two lists, you get a new list. And in F sharp, the at sign is the list concatenation.
And finally, if you add two parsers together, you get a new parser. So here's the question. What are the different ways you can add two parsers together? What can go there between the two? So the first thing is you can chain them in sequence. So you can say, match this thing and then match this next thing.
So I'm calling it the and then combinator. Or you can say, match this thing, and if that doesn't work, match this thing instead. So I'm calling it the or else operator. And another really useful one is a map combinator, which basically says whatever you've parsed, transform it in some way into something else.
So you might have parsed a string and you wanna transform it into an int just because you know it's a string of digits, for example. So let's look at these three basic combinators. So the and then one. So this is the logic for it. You run the first parser. If it fails, give up. Otherwise, you take the remaining input and you give it to the second parser.
If that fails, give up. If they both succeed, we're gonna return a pair because we got a result from the first parser, we got a result from the second parser, we're gonna combine them into a pair and return it. So let's look at the code. So we're gonna define an inner function because we've got this inner function all the time. We run the first parser.
We check the result. If it's a failure, return. If it's a success, we go to parser two. And now I'm gonna page down to the next page. So there's the inner function. There's the remaining. And we're gonna use that remaining input for the next bit. We take this remaining bit. We run it on the second parser.
We test the second parser results. If it's a failure, we return. If it's a success, we now know that we have these two pieces that we need to combine, so we create a combined value, which is the pair. We call that a success. And we also, we've now got the remaining thing after the two parsers, so it's the second remaining stream. And then we finally take this inner function
and we wrap it up in a parser and return it. So that is a really fundamental combinator written in, you know, 15 lines of code. So that's just important. We have to keep track of which is the remaining code. There's the combined value and there's the inner function that gets wrapped up again.
The or else is kind of similar. Run the first parser. If it succeeds, we're done. If it fails, we take the original input, which hasn't changed, and we run the second parser on that input. And if the second parser succeeds, we return. And if the second parser fails, we return anyway. So either way, we return the result of the second parser.
And finally, the map, similar kind of thing. We run the parser. If it succeeds, we take the parsed value and we run it through this function to transform it into something else. And if it fails, we just give up. So this is where these funny operators come from because we don't really write and then and or else.
It's quite nice to use infix operators. So in the F sharp parsing world, we use the dot angle brackets angle brackets dot to mean the and then. And the dots are important. We'll see why the dots are important in a minute. The or else is a vertical bar
and the map is a vertical bar with double angle brackets. So when you see these symbols, that's what they are. And then, or else, and map. So let me actually give you a demo. Whoops, parser two.
There we go, there's the and then code. There is the infix version of and then. There's the or else code. There's the infix version of or else. There's the map code. And there is the infix version of map.
All right, so here's some little parsers. A and then B, right? So we take A and then we try and get B and we want both of them. And if we try and run it on ABC, we get a success and we return a pair
and the remaining string is C. If we run it on ZBC, the A is gonna fail. So it's saying I'm expecting an A and I got a Z. Now if we run it on a different string, the A is gonna match but the B is now gonna fail and you actually get the nice O message.
I was expecting a B and I got a Z. So you can actually tell you where it actually knows about the different parsers and it keeps track of where you are in the string. So it's not bad. Here's the or else thing, very similar. So it can match an A, Z, Z, that works.
It can also match a B, Z, Z, that works. And then you can start combining them. You can say it's a B or a C and then it's an A or L and then a B or a C. And here is the map in action. So I'm gonna take an A and a B
and then I'm gonna map it. I'm gonna take this pair which is a pair of characters. I'm gonna turn each character into a string and then I'm gonna add the two strings together. And when I run that, I now get a string back. Rather than a pair of characters, I get a string with two characters in it.
So there we go. And let's do an integer one. So I'm matching these two characters, one and two. Again, I got a pair of characters. I'm gonna turn them into strings and then I'm gonna turn the resulting string into an int. So this should actually be an int, like this.
And let's see if it is. Oh, there it is, it returns an int. You can actually see it says it's a result which returns an int down here. All right, so that's it for the basic combinators.
So already we've actually done quite a lot. We've got a basic parser, we've got a couple of basic combinators and now we can really just go to town and start combining these basic ones in more complex ways. So let's look at some of the ways you can do that. There's a function in functional program languages which is reduce, the reduce function.
And what it does is it basically, given a list of things or a sequence of things, it takes some sort of operator and it sticks it between every element in the list. So here is one, two, three, a list with three elements and it's gonna be reduced by the plus symbol. And that's exactly the same as writing one plus two plus three, okay, that's what reduce does.
Now, if we deal with parsers, here we have a list of parsers. So parsers are things and we can put them in a list. So here's a list of three different parsers and we're gonna reduce them using the and then operator. So that's exactly the same as writing pars at character A and then pars character B and then pars character C.
Or we can use the or else operator and combine them that way. So that's the same as pars character A or else pars character B or else pars character C. So that's quite nice. And this vertical bar with the thing, that's the F sharp pipe operator.
If you're not familiar with it, it's just kind of like the Unix pipe. It takes the one thing on the left hand side and it feeds it into the function on the next side. Okay, so here's our first kind of compound combinator choice. You give me a list of parsers and I will make a single parser that matches any of them.
So I'm just doing list reduce. It's very nice. But then we can take that and build on that. So let's say we have a list of characters we wanna match. We take that list of characters. We run the normal map which is built in and we map each of those characters to a parser for that character.
So I'm mapping using Pchar. So now we have a list of parsers and then we run choice and now we have a single parser which matches any of the characters. And here's a real example. Let's say I want to parse any lowercase character. I just say any of and I give it a list of characters that I'm looking for.
Or let's say I wanna parse any digits. I just say here's the list of characters, zero to nine and any of those things will match. So already you can see I've built up, beginning to build up some complex stuff just for some basic things. Another important combinator is a sequencing thing.
So I have a list of parsers and I wanna do them in sequence. And the code for this is a little more tricky. And first I'm gonna write a helper function that given a pair of parsers, it takes the output of the pair and it basically lists concats them together. So I have a list of parsers.
I match each parser to a list singleton and then I use this reduce with this little helper function. Don't worry about how this works. The point is that in six lines of code I've got quite a powerful combinator. And I can use that combinator now to make a parser that matches a string.
So a string is basically a sequence of characters and I wanna match each of those characters in turn. So if I'm matching the string true for example, the string literal true, T-R-U-E, I wanna match each of these and if any of them doesn't match then the parser fails. So I take each character in the string, I map it to a parser,
I use sequence to turn it into a parser which returns a list of characters, I then convert that into an array of characters and then I convert the array of characters back into a string and now I have a parser for strings. And like I say, the code, I wouldn't worry about the code,
really understand the code but the point is that this is like a few lines of code. Once we've got the basic combinators done, the other combinators become very simple to write. So there's parse lowercase, there's parse a digit. So if I run the lowercase parser,
it says yes, I succeeded and I found an A. If I parse a lowercase and it's the first letter, uppercase A, it's a failure. I was expecting a Z. Now it's interesting because it's expecting any of these characters and it's the last one. It gave up on the A, it gave up on the B, when it got to the Z, when it gave up,
that was the error. So say I'm expecting a Z. That's a kind of ugly error message and we will deal with that shortly. Similarly with parsing a digit, let's move down to parsing a string. So I wanna parse the string, ABC.
If I parse in a proper string, that's successful, it matches ABC and the remaining string is DE. If I put a bar in the second position, it knows that it's expecting a B there and got a vertical bar. If I put a bar in the third position,
again, it knows it's expecting a C there. So the error message is already pretty good. It could be better, but that's not bad, considering we've literally written like 30 lines of code so far. All right, so we're not done. More combinators. Many, many more combinators. You can build a huge library of combinators,
a combinator library. So the first set of combinators is what I call the more than one combinators. So sometimes you have something where you wanna match more than one of a certain thing, more than one comma, more than one character, whatever. So there's a many combinator and a many one combinator,
which is one or more, zero or more, and then optional, which is zero or one. Those are really common. And here's a good example, white space. A white space character is any of a space or a tab or a new line. And then white space in general is one or more white space character.
So there we now have a white space parser and we can then combine that white space parser with other parsers. Another one is what I call the throwing away combinators. We saw this one with a dot on both sides. That's the and then combinator. Sometimes you wanna parse something and then ignore what you've parsed. You just wanna match it.
Things like double quotes in a quoted string or list literals or something. You wanna match it and then throw it away. So in this one, you put the dot on the left hand side. That means you keep the left hand side and you throw away the right hand side. In the second one, you keep the right hand side and you throw away the left hand side.
And another useful one is a between. So you have three parsers and you parse the first thing, but you throw it away. You keep the second one and you throw away the third one. And again, that's something like a double quoted string. So let's look at that. There's a double quote, parse a quote. And a quoted int basically says between a double quote and parse an int and a double quote.
So it will throw the double quotes away and just return you the int that is parsed. There you go. That's another one. And separators is another one. Really, really common. You know, a comma separated list or a semicolon separated list or something. Same kind of thing. One or more, zero or more.
Here's a comma, parse a comma, parse a digit. And then parsing one or more digits in a list is digits separated by commas and you have to have at least one of those. So I can demo that. And here we go, some more.
And you can see this is a little bit, some of this code is a bit more complicated. The overall parser library is about 500 lines altogether. So I can show you the full thing later on. All right, so there's digits.
So define a digit and then digits is one or more digit. And then an integer is basically one or more digits. And then we're gonna convert, we find the digits and we're gonna convert this using map into an int somehow with a little helper function.
And then when we run that, it finds the one. But when we have two digits at the beginning, it picks up both digits. And when we have three digits at the beginning, it picks up three digits. So it is picking up, it'll basically pick up as many digits as you have before it hits a non-digit.
And similarly here's a list, same kind of thing. So this is a comma separated list and it returns, I didn't pass it in properly.
So you can see it's returning a list of digits that it found. And if I change this to one, two, three, it successfully picked up a new one.
There you go, that's that. So I know I'm whizzing through all this stuff, but like I say, it's really the concepts of how you build these things together. All right, now we're doing for time here. Good. So what can we do now to improve the parser?
The first thing we can do is name the parsers because as you saw in one of those cases, when it gave an error message, it said I was looking for a Z or I was looking for nine. And what you're really doing is you're really looking for a digit. And because it doesn't know that it's a digit,
it will just give you the last thing it was looking for which is kind of unhelpful. So what you can do is you can take this parser object because it's an object now, you can just add a property which is a name property. And in this case, we'll give it a name, we'll give it digit. And so when we use it like this, we're gonna again, as always, we'll create some sort of cryptic operator
to make our lives harder, to hide it from other people. But in this case, it's the question mark. So this is what the code looks like when you run it without a label. It's trying to parse the digits. The digit is defined as zero up to nine. When it can't find a nine, it'll say I can't find a nine,
which is a really unhelpful message. If you take that same parser and you give it this label, digit, when it fails now, it will now say I was trying to parse a digit, I couldn't find a digit. And that again is much more helpful.
And the other thing you can do obviously, which is really nice, is to give you the line number and the column number where the parsing failed. So if you have a large file, obviously you don't wanna just say, I was expecting a comma here, and it doesn't tell you where it's expecting a comma, that's not at all helpful. So again, we just have to change our inputs, take the input object,
instead of just having a stream of characters, we have a stream of characters along with the current line and the current column. And then every time we parse a character, we increment the column number and the line number. And we can write a nice little, the error handling, the messages become much nicer. So here we're trying to parse an integer.
The minus is okay if the Z is not a valid integer. And so it says column one is an error, and we can put the even little caret saying, here we are in the line. Here's a float. This is Z is wrong. So on column four is an error. And again, we can write the error line
and we can also write the little caret telling you where it is. So that makes the error handling much more friendly. So this is not very hard to do. I'm not gonna show you the code because this is where it does get a little bit ugly, but it's not that hard. You can see it would be obvious how to do it. Okay, so building a JSON parser.
I'm gonna use the JSON spec at JSON.org. And that has lots of pretty pictures. And this is the first picture, which is that the JSON value is one of the following things. It's a string, or it's a number, or it's an object, or it's an array, or it's true or false, or it's null.
So how can we represent this in F sharp? Well, in F sharp, we can use a choice type. So here's our choice type. A JSON value is either a string with a string inside it, a number with a float inside it, an object, which is a dictionary of key value pairs,
where the keys are strings and the values are other JSON values, an array is just a list of JSON values, a boolean is just a boolean, and a null has no value at all. So that's a type that represents a JSON value. All right, let's start parsing it.
So let's start with some easy ones. The true and the false and the null, these are just literals. So we can just write, we've already got a thing that parses string literals already, so that's easy. Let's start with null. Yet another little helper operator, and more cryptic symbols.
And so I apologize for these cryptic symbols, but this is part of the reason why these libraries look complicated, because they have all these obscure symbols. You just have to get your head around them, make sure you just have a reference sheet on hand. There aren't that many, though. I mean, we've seen five or six of them, and once you get the hang of it,
there's not that many you have to memorize. So what this one is gonna do is it's gonna run the parser, and it's gonna take the output, but it's gonna ignore the output of the parser, assuming it succeeds, and give you back some result that you specify. And that's very common if you have a string literal. So in this case, I'm looking for the string null.
If I find it, I don't really care what the characters are, because I already know it's a null. So I'm just gonna give it the null value, which is this J null is an F sharp type, or a type constructor. So I don't need to process the contents of the string in any way.
So I can just ignore it. So that's how you parse a null. And then I'm gonna give it a label, so that when it fails, rather than saying I couldn't find an L, it's gonna say I couldn't find a null, which is a much nicer message. Similarly for booleans, I have a parser for the true literal,
and then I'm gonna map it to a boolean value called true. I have a parser for the false literal, and I'm gonna map it to the false. And then my boolean parser is just a choice between the true parser and the false parser. So there it is. I'm just using the choice, the or else operator,
and then I'm gonna give it a label, boolean, so that if it errors out, I get a nice error message. All right, what about strings? Okay, now it starts getting a bit more complicated. So a JSON string can be any Unicode character, blah blah blah, or it can be various escape characters,
or it can be a hexadecimal. So let's start with this one. I'm gonna break it into components. So one of the nice things about parser combinators like this is you can break the task into small pieces and build the bigger one from the smaller ones. So I'm gonna just work on the individual subsections and then combine them later on.
So the first subsection I'm gonna work on is this one that says any character other than double quotes or backslash or control characters. And I'm gonna call this the unescaped character parser. And here is the code. Unescaped character is something that satisfies a character parser.
So satisfies the one you haven't seen yet. It basically says, give me a function which is a predicate for characters and if it matches that character, it's okay and if it doesn't match the character, it fails. So in this case, it'll work as long as the character is not a backslash and is not a double quote.
And I'm gonna forget about control characters for now. So if it satisfies that condition which is any other character, that parser will succeed. We will get an unescaped character. Okay. The next one, I'm gonna call this set the escaped characters. And there's a lot of them.
There's like eight of them. So what I'm gonna do is I'm gonna create a list of all the eight possible things in the list. And for each thing, I'm gonna say, here's the string that I'm looking for and if I find that string, here is the output that I'm gonna return. So if I, this is unfortunately,
we've got the escapes. So in this case, I'm looking for a backslash followed by a double quote and I have to escape them both. And if I find it, I'm gonna return the quote character. If I get a backslash followed by a backslash, that's gonna return a backslash
and everything's doubled up. If I have a backslash followed by a forward slash, that's gonna return a forward slash. A backslash followed by a B is gonna return the backspace character and so on and so forth. And then I take all these pairs and for each pair, I'm gonna convert it into a parser and the parser is gonna take the first item in the pair,
the string to match, and that's where it says P string to match at the bottom in red. And if I find that thing, I'm gonna return the results which is the second part of the pair. So I can process basically a whole list of items in one go and this is one of the nice things about using a programming language
to write your parsers in is that this is standard F sharp code. This is nothing special about the parser library. So if I was using C sharp, I could use link, for example. This would be the equivalent. Instead of using list map, I'd use dot select and that would give me the same thing.
And now I have a list of parsers. How do I combine a list of parsers into one parser? I use choice. And so I've combined all my parsers into a single parser. All right. And then I'm gonna bring a label, make the error messages nicer. And this final one is the Unicode character.
I'm not gonna show you the code for that but basically you define a parser for an individual Unicode character and you have four of those parsers in a row starting with a U and that's how you define that. And now I have my three sub pieces and I wanna combine them together. So the string is basically a double quote character
followed by zero or more of these characters followed by another double quote character. How do I define that? Here's my quote character. So it's a P-char of a quote and I'm gonna give a label so it says I'm looking for a quote. The adjacent character is either an unscathed character
or an escaped character or a Unicode character. And then the main parser for that is gonna be a quote character followed by zero or more adjacent characters followed by another quote character. And I'm gonna throw away the quote character. So you can see the dot is on the inside. So this is basically the same as the between, okay?
So that's the important thing is I can build up more complex parsers by combining the simpler parsers. So it's tedious to write a parser. I mean the JSON spec, you know, it's not that hard but it's kind of tedious. You have quite complicated little things but if you just follow, you can just follow the railway diagrams and write the code.
It pretty much corresponds exactly to the railway diagram. It's one of the really nice things about this style of parsing. And then a proper JSON string is a quoted string and then I have to map it into one of this JString objects which is one of those cases in the main type we had and give it a nice name.
All right, numbers. Okay, this is a bit more intimidating. So break it into smaller pieces. There's a sign which is optional. There's what I'm calling the integer part which is either a zero or digits one to nine followed by zero or more normal digits, okay?
So let's look at that. So there's the optional sign. I match the hyphen and I say it's optional. I can be zero or one of these hyphens. The zero part is just matching the string zero, okay? The digits one to nine is basically
any parser that satisfies where the character is a digit but the character is not zero. The normal digits are anything where it satisfies where the character is a digit. So, and I'm using the chars digit rather than zero to nines just in case there are some Unicode digits
that are not an ASCII digit, that I'm not an expert in Unicode but almost certainly there's some weird numbers out there somewhere. All right, so non-zero integer, we said is a one to nine digit followed by zero or more normal digits. So I'm gonna combine them using that little and then.
And then I wanna turn that into something useful. So I'd map that, so I've got a little pair. I turn the first one into a string because the first one's a character. The second part is gonna be a string and I combine them into a new string. So I've now got a string there. And then finally the entire integer part is either the zero bit or the non-zero bit.
So it's tedious but it follows the design. You can see it follows the railway, it's not hard to write, it's just boring to write. Same thing for the fractional bit. A fraction is a decimal point followed by one or more digits. That's easy to write, there's a decimal point.
It's a decimal point followed by one or more digits. The exponent part is either a lowercase e or an uppercase E followed by an optional sign followed by one or more digits, blah, blah, blah. Same kind of thing. So you know, it kind of gets boring after a while writing all this stuff. But there you go. The good thing, by the way, is that when you do write this, it will do type checking for you.
So you won't be able to mess it up too much. If you try and combine something that's not the right type, it won't compile. So you're pretty much guaranteed that at least whatever you type in is gonna work. It might not parse exactly what you want but it won't crash with some weird error. So now we have our four different parts. We're gonna combine them.
So we say it's an optional sign combined with an integer part with an optional fractional part, an optional exponent part. We're gonna take this whole thing and convert it into a J number and give it a label. And so on and so on and so on. We're gonna go through the whole JSON parser. But you get the idea.
And so finally, we have all these parsers for these individual pieces now. We wanna combine them into a parser that parses any JSON function. And here's the code. To parse a JSON value, you have a choice of something that parses a null, something that parses a billion, something that parses a number, something that parses a string,
something that parses an array, and something that parses an object. So the code really, really matches the parsing diagrams which is very nice. So it's pretty easy to write this kind of stuff. All right, so let's actually see the demo. So before we just get,
I think I've got enough time just to show you one thing here. Yeah. I just wanna show you. Here's the entire parser library including all the line number handling and stuff. Quite a lot of, it's a little bit complicated. But even with all this line number handling and some utility stuff,
the code itself altogether is, there's parsing an integer, there's parsing a float, 496 lines. Okay, so less than 500 lines for the entire parsing library which includes utility things for parsing a float,
parsing integer, parsing spaces, parsing white space, parsing a string, all that stuff, all these parsers are in less than 500 lines. The JSON parser itself, here we go, here's the JSON value. Like I said, this is the real code.
There's the escaped character code. There's the number code, tedious, tedious, tedious. But the whole JSON thing is 295 lines. So let's actually test that.
Will it come up? Yes, it does come up, okay. So here's the example of the null. So there's the J null parser and it succeeds and it returns a J null. If I parse in null P, rather than just saying I was expecting an L, it gives a nice message, column three, error parsing null,
unexpected P at that position. So that's a much more helpful error message. Similarly, parsing a boolean. If I try and parse trucks, it says I'm trying to parse a boolean, you gave me trucks, there's an unexpected T.
I guess it gives up on the, it can't find the true, so it tries to find the false, it can't find the false. So that's what that error message is, okay. It's backtracking. Here's the string stuff. Let's go down to a real piece of JSON code.
So here is a real JSON fragment and you can see it's got a string, it's got a boolean, it's got a birthday which has another object with three properties in it and then the favorite colors is a list of strings. So if I parse this, show this up here,
you can see it successfully parsed it. It returned a JSON object. The JSON object contained a map, a dictionary. Inside the dictionary there's a property called birthday which in turn is a map which in turn has the day property which is a J number, a month which is a J number,
a year which is a J number, favorite colors is a JSON array which contains a JSON string and another JSON string and so on and so forth. If I change this to a JSON number, let's see if this works. It should say it's a JSON number and there's the JSON array
and the first thing is a JSON string and the second thing is a JSON number. Now if I put a bad character in there like an angle bracket or something, a square bracket and I try and parse it, again line five, column zero, parsing an object, unexpected square brackets
and here's some of the JSON.org site has some other examples. This is one from their site so it's quite a more complicated one and if I run it, I get the complete thing.
I successfully parsed it. It's a map containing a widget. The widget is a map containing debug and image and so on and so forth. So there's a full JSON parser in 300 lines of code. Not bad, I think.
Like I say, the details can get quite complicated but I think you can see the concepts, how you build up the more complicated parsers, some simpler ones. So what have we got? I think we've pretty much done. The most important thing is we're treating functions like objects.
So the original parser returned a function, right? That's a very functional programming kind of thing to do. We treat functions as things in their own right and in this case, we wrapped that function in a type and once we'd wrapped it in this parser type, we could then manipulate it. Those types I was calling recipes.
People normally call them effects or computations but the point is you go to these recipes, they don't actually work until you actually give them the input stream but what you can do is combine them before you actually have the input stream. You can basically do a little programming with these recipes. You can take two recipes and combine them to make a bigger recipe
and that's a very cool thing. So actually kind of programming with types. You're programming with combinators. So you have a little program and then you run the program with the input. And I think hopefully you get the idea of the power of these combinator libraries. We just started with three basic combinators and from that we could build a choice
and the any of and the sequence and the string and the all sorts of other ones built from those basic ones. And that's this whole thing of building complex things from small things. This is really the essence of composition, the essence of functional programming. And they're very small but they're very powerful.
Like I said, the combinator library is 500 characters. 500 lines, not bad. And with that library we could actually write a JSON parser in three headed lines. And people using these kinds of things, you can write binary parsers as well. You don't have to write string parsers. So you could write, you know,
they wouldn't be particularly efficient. By the way, this code that I've shown you is not at all efficient. It's really an example to get the concepts across. If you want efficiency, I would use a commercial, you know, a more serious library. So for example in F sharp, the F parser library is the one to go for and there are similar libraries for all other languages.
So thanks very much. The code will be at the parser directory there. Contact me if you've got any questions, slides and video. If you want help with F sharp, F sharp works consulting. And if you want more about F sharp itself, go to fsharp.org. Thanks very much.
And if you've got any questions, just come and see me afterwards and don't forget to fill out the thing at the back. Thank you.