Haskell Weekly


Parser Combinators

Listen on Apple Podcasts
Listen on Google Podcasts

You can also follow our feed. Listen to more episodes in the archives.

Are you curious about how Parsec is implemented behind the scenes? Cameron Gera and Taylor Fausak follow Antoine Leblanc’s walkthrough.

Episode 32 was published on 2020-12-14.



>> Hello and welcome to the Haskell weekly podcast. I'm your host, Taylor Fausak. I'm the lead engineer at ITProTV. And with me today is one of the engineers on my team. Cameron Gera. Thanks for joining me today, Cam.

>> Thanks for having me, Taylor. I'm excited to be here. It's been a minute since we've done this, you know? So I'm glad to get back into it.

>> Yeah, we took a month off there around Thanksgiving. I guess we needed some time to recuperate from those large meals.

>> Mm. All that turkey. Mm. But yeah. I mean, today we're in December. It's, uh, here in Florida. It's actually a little chilly for once, which has been really nice. Um, and, you know, the holiday season is upon us, and there's this awesome project that's been out there for a couple of years now called Advent of Code. And so I've been working on it. Taylor, you've been kicking my butt at it. But you know, Taylor is also working on it on, and I just wanted to, you know, let our listeners know that that's available to you. If you're interested, you can solve it in any language you want. But, you know, I'd encourage Haskell Taylor's doing, Elm so there's, You know, it's all about getting to the right answer. Uh, and so that actually kind of leads us in today's topic because, you know, in Advent of Code, we're doing a lot of parsing. And in this edition of Haskell Weekly, which is edition 241 there is a blog post called parser combinators walkthrough and I was like, Oh, sweet, This is awesome. I need this, um, and you know, the author Antoine Leblanc did a great job of explaining this I'm really excited to kind of chit chat about it. So the parsing expert ourselves from least not it's not Antoine, but it's Taylor. So he's the expert here, so I'm probably gonna be pinning him with a lot of questions. Um, so, yeah, I mean,

>> I'm ready.

>> Obviously, I know what a parser is. Luckily, Antoine talks about that which is just, you know, the ability to read through some sort of stream and transform it into some other data type. The most common one in Haskell is generally a string. You parse the string by each character, and You go parse it into something else. Um, and here he focuses on, you know, understanding what Parsec does and what it is and kind of rebuilds it to Then kind of make a JSON parser

>> Yeah, You've touched on a lot of things that I wanna build on just a little bit. So yeah, a parser at a high level is something that takes some input and produces some output. But that basically describes a function as well. So that's a little too general. Usually when we think of parsers, we're thinking of parsing strings and in Haskell returning them into more structured data types. So one thing that we've talked about before on this podcast is the difference between parsing and validating. So parsing like you consume your input and produce something that tells you more about it. You know, more information at the end versus validation is like you already have some data on hand, and you just check something about it,

>> right.

>> Um and like you mentioned, string is a really common input type for parsers, but it doesn't have to be strings. You can parse byte string. You can parse, JSON. I mean, and when I say parse JSON, I mean, like, you could use JSON as your input type and then produce something else at the end. So with the library, like Aeson, when you're doing the FromJSON instance, you could think of that as like, I'm taking a JSON input and producing some custom type as the output. That's a helpful way to think about it. But

>> yeah, it makes a lot of sense to me. I just, you know, have a really excuse me. I've really never taken the time to understand what's happening underneath. I just kind of say Oh, yeah, it's It's parsing JSON into our new data type and that's it's fine. But when I'm, you know, working on Advent of code Problem where I'm saying All right, let me take in this stream of input, and I need to make sure you know I need to parse it and then also validate it afterwards, which you since I have the data type and I could parse it into a known you know ADT for say, I think hair color is something I'm trying to parse right now, and you know. Do you have a certain op number of options for that? Um and so I could parse that into a specific, you know, data type. That would be easier to understand. It's a check against.

>> Yeah, that's a really natural fit for, uh, parsing into a data type. Because instead of matching on a string and having some, you know, underscore match that says this should never happen You know, this is an impossible case. You can actually just rule it out. Um,

>> you should see the code. Right now it's I'm using data dot text, and I'm doing a lot of infix isInfixOf and any and alls and it's no good. So

>> yeah, and hopefully after we finish talking about parser combinators and you can apply this to your advent of code solution, you'll be able to look and see like Oh, yeah, using a parcer here makes code that's both easier to read and easier to maintain and easier to see what's going on versus the like, poking around and plucking sub strings out of stuff like you might do in some other languages.

>> Yeah, so I mean, that's a great little segue like where do we start with parsecs like, What's a good place to start understanding and how Parsec and parser combinators work?

>> Yeah, so this post obviously, is a great place to start, But it's not the first resource for something like this. And it's been a while since I was learning how parser combinators work. So I want to say that like the Parsec documentation or Megaparsec or Attoparsec any of those are probably find to understand, like the API that they expose. But to figure out how it works behind the scenes, you're going to need a resource like this or Stephen Diehl wrote one a while ago where he goes through the same thing of building up a parser combinator from scratch. And I'm sure other people have written things as well. So I'll try to include links to those things in the show notes. But yeah, this is a great resource. So let's let's dive right in.

>> Yeah, so the first thing he does is obviously give you a type signature and show you that Parsec is really just a function. Uh, you know you're taking some string and transforming into some a, um and you know, that's not always gonna be successful. So he actually creates it, you know, it returns an either. So you can kind of catch a parsing error and display that to your users so that they can understand. Oh, my input. Incorrect. Let me change that. Uh, and so, you know, that's a good little, you know, place to start. And then he's like, All right, well, enough of that we're done. Cool. Let's move on. We've got to really crux parsers, right? That are super low, low, low level or high level.

>> They're low level, I think.

>> Right. That's what I would have thought

>> They're like primitives almost

>> right. And that's the idea of, you know, an any parser and an end of file parser. So an any parser just says, keep giving me stuff while there is stuff. And then the end of file partner says, Oh, there's nothing left to parse. We're done. Here is your output. Here's your result.

>> And both of these things are quote unquote combinators. I know that when I was learning functional programming in Haskell and parser commentators that term kind of threw me off like, What does it mean exactly? So from this post that any function that is a parser and the EOF sorry, I called it function. It's a parser value, which itself is a wrapper around a function. But any and EOF are both people would refer to them as combinators. So if you hear parser combinator, that's kind of the thing they're talking about.

>> Gotcha. Okay, that makes sense. So, you know, that covers the two base cases

>> I just wanted to touch on. Why those two are base cases because our the function type that we have says you give me a string and I'll give you effectively either a result or an error, right? And these two combinators, any and EOF Let us step through the entire input so we could go one character at a time. Just give me each character as it comes along, and I'll do something with it, and then I'll hit the end. Um, we're gonna need a couple more things, and what we're building here may not be very efficient. You can imagine going one character at the time isn't the fastest thing in the world. Sometimes, you know, you can skip a bunch of characters. Um, but this is enough to show you how these things work.

>> Gotcha. So how does it I guess any is recursive, right? Is that Is that what's happening, or how do we continually get through? You know, a string? Like what? What in parsec does that?

>> Yeah, it could be a little tricky to wrap your head around. And I think throughout this episode we'll probably say Parsec And we may not actually mean the Parsec library. We may be talking about what they're showing here. Um, but what? What he's showing here the way that you carry that string around and like, quote unquote update your state as you might think of it. In a normal imperative, language is you take a string as input and what you return is a tuple that has a string and then that either I talked about before. So you take your input and then you produce like the rest of the input after that. So with any combinator, you're taking one character off the front of your input and then passing the rest of it, Uh, back to the caller.

>> Okay. Makes sense.

>> It could be a little tricky to wrap your head around because you're right. Like using this will typically kind of feel recursive, but the any function itself doesn't use recursion.

>> Okay, that's that's good to know, because that's where, like, you know, looking at the Parsec library and kind of looking at the API exposes. I'm like, Okay, but how do I use these things? Like I know I can see what they are, and I understand the type signatures, but how do I use them? Um and so, I think, diving down a little bit into the his version, you know, of any and EOF you know, is helpful to just kind of grok That idea.

>> Yeah. And I would have liked seeing a little more high level motivation at the top. I think he's kind of taking for granted that you already know what it will look like to use this thing at the end. And you're curious about how is it implemented? But you know, towards the end, when he gets into parsing JSON, he mentions that or he shows like this is how you will parse a JSON value will say, like this constructor f mapped over this parser, Or this constructor over that one on so. That's, I think, kind of what you were touching on of, like, Okay, I can see how all these what the type signatures of all these pieces are. But how do I actually use them together?

>> So yeah. I mean, it is also very nice to know what's happening underneath? I don't discount that at all, but as someone who hasn't put into practice like I read this today knowing that okay, I wanna take this tomorrow and run with it for ah, advent of code and parsing. You know, this hair color thing that I'm trying to figure out what it is to make sure it matches the rules that they have placed upon it. So, yeah, I definitely feel like I got a little bit more value at the end of this article, but, uh, moving along. You know, I think talking about sequencing these parsers is, um yeah, parsing is sequential in its nature, right? Like you're moving all through something to translate it into some other data type. And so you know, he kind of dives in a little bit of like how a parser might help and how it allows you know you to be in the parser monad that than makes it a little easier to to comprehend and follow.

>> Yeah, I feel like this is maybe it's intentional. Maybe it's accidental, but it seems kind of like a good introduction or Good monad tutorial, which is kind of a meme in the Haskell community. But he starts by showing what it would look like to use one of these parsers if you had to pattern match on the result every time. And it sucks. You know, you keep indenting more and more, and you have to match every time you capture a value. And, uh, it feels to me, or it looks to me a little bit like handling errors in golang where you say, if there was a problem, then panic. Otherwise, keep going and you just do that over and over again. And then he shows how you can take that and improve it a little bit by writing another combinator called andThen where you give it a function with the result you expected, and it'll string all these things together for you so you can hide away some of that complexity. And this, to me, looks a lot like the elm way of doing things where there's no special syntax or anything. And it's all very straightforward what's going on? But because of that, there's a lot of extra noise where you have to plug all these things together. And then you get to, like, kind of the whole reason. One of the reasons that Haskell exists this special syntax around monads where you can make it look like the same code you would write for IO Or maybe or either or any other monad that you use Haskell boom. You can use it for a parser to, and I like that progression. I think that's cool.

>> Mhm. Yeah, honestly, I think you know, I know there's fancy ways to do you know this parsing, but I'm probably gonna start with, like, Alright, let me use do notation and get this piece of it and then say all right now I've got this piece of it. I think that will be super helpful. and read easy to reason about.

>> I think it's a natural fit because, like you said, parsing is sequential or at the very least we can think of it as sequential. Sometimes you can jump around. So, like if you're parsing a JSON String, you could skip to the next double quote and just say, Okay, well, everything in there is probably a string and I'm gonna jump ahead if you're looking for something else. But again, that's an optimization. So, yeah, looking at the sequentially of do notation like Look for this character, then look for a space, then look for this other character. It's really easy to wrap your head around that

>> Mhm. Yeah, No, I think that's awesome. And like I said, there's more complicated processes, which is like, you know, there's a lot of infix operators here that make your life easier. But if you're not familiar with, you know, their type signatures and what they're actually doing, it can get a little bit confusing. Get backwards easily.

>> Do notation definitely has. The advantage of you could point pretty much anybody at it, and they could probably guess what it does. These other operators are very convenient, and once you're familiar with the grammar they form, then it's a really succinct way to describe these things. Um, and a lot of Haskellers are already familiar with some of these operators, in particular the like angle bracket, dollar sign angle bracket that infix fmap operator. We use that one all the time, and we also use the I forget what what people call it, like the spaceship or whatever. Which is the angle bracket asterisk. Yeah, the applicative Um, and we use that to, like, build up records or something out of a bunch of different fields. But then there's there's the versions of those operators that only have one of the angle brackets, so they're, like, missing a side, and that treats one side as special. So, like it will ignore the result from one side and give you the result from the other. Um, and again, those are super convenient. But if you see these and you don't already know what they do, you might not be able to find out,

>> right? Yeah. So, yeah, he kind of chit chats about some of those infix operators as well. Eh? So I do definitely see their value It's just kind of a little blurry, but I'm glad that this blog post is giving me insight.

>> Well, my my prediction is that as you get more comfortable with parsers, you will start to be a little annoyed by the common structure that you often do where you're like parse one thing, and then do a void parse something else and then parse some other thing. And that's very neatly captured by some of these operators.

>> Mm. Like oh, yeah, I want just the left side of this. Don't give me whatever comes from the right.

>> And I also wanted to circle back so earlier when I was talking about combinators. The way that I think of parser combinators is in my head is stuff that you use with these goofy operators. So, like, if you would plug it into this weird sequence of star angle bracket, you know, dollar sign, blah, blah, blah, then it's probably a parser combinator That's a That's a pretty wishy washy definition. But it works pretty well.

>> Yeah, seems to cover the bases. Hmm. All right. So you know, we've learned about this stuff and, you know, we How do we know we're getting the right thing. Like, how do we know that you know this string? You know, once we've quote unquote parsed it out of, you know, the larger string How do we know what to transform it into, like, how does that work?

>> Yeah, that one is tricky a little bit. So we have to match on what we captured, right? So, like, let's say we grab some string and if that string is a double quote, then we know we're doing it. Or that might be a bad example. If that if that character is a you know, square bracket, then we know we're going to start parsing an array at that point. And this is another place where recursion I think comes in because usually what you're going to do is look for some special identifying character, like a square bracket or a curly bracket, and then call some other parser and that parser is gonna parse a couple of more characters and produce a value for you. So, like, um, like, null in JSON, that's a That's a good example. So you may look for, like, hey, parse the letter n and then parse the letter u and then l and then l. And after you've done those four things return or pure in modern parlance, pure, the like JSON null value. And that's how your quote unquote converting this string into a more structured value. You're doing the parsing. And then, after the parsing has succeeded, you're saying this is the value I actually want to use,

>> right? So you're Yeah, you're just getting that chunk of data and saying all right, now that I had this piece of information, does it turn transform into this proper

>> right? And null is kind of, ah, weird example because you're not actually doing anything with your input. You're just making sure that you were given some input. If you want to parse a number, you may parse like let's say you want an integer, a positive integer. So you just parse some number of decimal digits in a row. Um, you still have to turn that into a number. And that part just depends on what data type you're trying to build. So, you know, for numbers you would like, uh, take each digit individually and convert it from a character into a numeric value and then add all those up with their respective, you know, multiplication for the whatever base you're in. Um so to answer your question, it depends. That's how you turn stuff in structured data.

>> Sounds like Haskell. Everything. It just depends. Which was totally cool. I love that flexibility. And they, you know, the safety that it gives us. Um, it's great. Uh, cool. Well, thank you for your kind of diving and a little bit on that bit. So, like I was saying earlier, I'm gonna be parsing something that could be one of I think eight things is the problem I'm currently trying to solve. Mm. What happens if, like, I think I think choice is what I want here. But like, if I have a list of, you know, match strings matches and I want to say, All right, it's got to match. At least one of these is like is like, the idea of picking one, making a choice that is successful. Is that what I would want to do?

>> Yeah, almost always and choice itself. I don't think the Parsec library probably provides this. Um, but the operator that it builds off of Is this another one of these special ones? That is angle bracket, pipe angle bracket. So it kind of visually looks like this or that.

>> The ax

>> the battle ax

>> I don't know. Is it like a hammer. Maybe, uh maybe pick ax Yeah. There we go.

>> Um, and choice just glues together. A bunch of things that you provide as a list so that you can not use that operator a bunch. Um, but yeah, you normally would say I want this thing or if it's not that thing, I want this other thing or if it's not that other thing, I want this yet another thing and you have to be a little careful here because the parser as we've talked about you can think of it as being sequential. So if you put something less specific higher up in your list, it may prevent your more specific thing from matching. So, uh, to use an example, let's say you want to parse numbers. So you say I'll take any any digit any number of times. Okay, But then later, you specifically want to parse the number zero because you do something special with it in your language. You're gonna have to put zero before your general number parser. Otherwise, it will never parse it because it will already have been grabbed by the earlier one.

>> Hmm, okay.

>> So when you're doing that angle, bracket, pipe angle bracket, that choice operation, you have to put more specific stuff at the top. Yeah, and then less specific stuff below. Order matters. And usually or often I should say your like attempts at parsing stuff are gonna have different first characters. So it won't matter too much. Like with JSON. You're parsing null or true or false. All of which start with a different character or you're parsing a number which doesn't start with any of those characters or a string or an array or an object. So they all start with different things. So this isn't super important for that particular case.

>> Gotcha. Okay, cool. Yeah. So seems like choice is going to give us, you know, based on which one it matches, which everyone that matches first, it's gonna give us that data,

>> Yeah It has. It has no concept of, like, a best match or like a most specific match,

>> which yeah, that would be really tough to do. I mean, I'm sure it's possible, but seems not worth it. Just order things the way you want.

>> Then you get out of that mental model of thinking of it as sequential. Because then your model becomes, Well, it's going to try to parse everything quote unquote at the same time. And then it will give me back the best one. And, you know, there are parsers that can work like that, the one that we're building in. This isn't like that. And I'm pretty sure Parsec isn't like that. Um, maybe you can muscle it into being that way, but I'm not sure.

>> Yeah, I mean, I'm the newb here, so, you know, don't quite new. Okay, so what if we are iterating? Like so say we have a list, for some reason not some reason it's a valid thing. We have a list, and we all we know it's separated by commas. Like and we just wanna get each piece of data. How would we do that? What would that look like? Um, yeah,

>> yeah. So this is another one of the combinators, and I should have said this earlier but at the top when we had any and EOF those are like the primitives, they're absolutely necessary to do anything. But everything we've been talking about are also primitives. And in fact, many of them are defined in the control dot applicative module for either applicatives or a related type class called Alternative. An alternative is the one that defines that choice operator, the Pick ax Um, and these are generally useful. So, like, um, choosing the first of many options or parsing are selecting many things in a row or saying I want lots of things separated by things. These are all general concepts, and they happen to apply to parsers. Uh, but they're not specific to parsers, So I should have said that earlier. But first, for separating by So like, this is really common, like with JSON parsing an array, you say, give me a square bracket or I want a match on a square bracket and then some JSON Value and then a comma, and then some other Jason value. And just keep doing that until you hit a closing square bracket. Um, and the sepBy operator, which Parsec, provides in this little library he's building up also provides Does that for you. Where it. Um uh, This this is finally where we run into actual recursion because sepBy has a companion function called sepBy1. And the difference between those two is that sepBy1 needs at least one element to match inside where as sepBy I can give you back an empty list. And these two functions take turns calling each other. They're mutually recursive. Or maybe not these two. But he has, ah, similar pair of functions called many and manyOne. where many will give you could give you an empty list. And manyOne will always guarantee there's at least one thing in it. And this is a challenging thing to wrap your head around because many calls manyOne or it returns an empty list and then manyOne calls the parser and then calls many. But then manyOne comes back and calls like they just bounce back and forth between each other. Um, so I know I struggled with that the first time I saw it, but it's it's a little weird.

>> Yeah, I was like, I'm still trying to grok it, and it's like three or four times now. So it's like, Yeah, I mean, I see the benefit, but it's and like the need for it, but it's just a little only confusing side.

>> So I think many and manyOne are a little easier to understand than sepBy and sepBy1. So I'm gonna focus on those. And just so you know, I was talking about the alternative type class. It defines many and manyOne but it calls manyOne some s o M e, which is super confusing because I can never remember. Does many mean it can be empty? Or does some mean it can be empty? And the answer is that it's many, but I like that he went with many and manyOne. So the way that they work is that for many, it tries to call manyOne and then if that fails, it returns an empty list. So it has that pickaxe operator there. It's doing a choice. So that means that if if the parser that you're trying to match many of did not succeed at least one time, then you're going to get back the empty list And then we can consider manyOne. So what manyOne does is it runs that parser You gave it. So it'll look for, let's say, a JSON Value. And then it'll run that parser zero or more times after that by calling many. So it's just a way to say, like, it's clever. It doesn't need to be done this way. You could write them, not in terms of each other, but it is clever.

>> Got so soon as you get a Failing manyOne, you would then return an empty list.

>> So it just keeps trying that parser over and over until it can't anymore and then says, Okay, I'm done. You get back your empty list and that'll be consed. You know, all the elements you found up until then, we'll be consed onto it. Yeah.

>> Nice.

>> Yeah. And if, uh if you're looking for, like some experimental insight into how this works, you could use the debug dot trace module to put a debug at the first like line of each of these implementations to see how it bounces between them and what it has built up. So far and what it's parsed so that may be illuminating as well.

>> Yeah, I saw that. That actually parsec that has, like, its own trace functions and stuff.

>> Yeah,

>> put straight up in line was just like, Okay. I mean, they're probably the same as debug dot Trace that Probably just

>> and Parsec and probably the other related libraries. One of the benefits of using them is that they have some utility functions that will print out a lot of debugging information as they go along. So you can. There's some function in Parsec. That's like, you know, try to run this parser against this input and also spit out all the debugging info you generate along the way. And it may not give you down to the level of which combinator am I calling at any given time. But it'll be like I'm examining this bit of input, and this is the decision I made based on it. So that's useful.

>> Yeah, yeah.

>> Um, but yeah. Sorry. Bringing it all the way back around to sepBy, um, the way that set by works is essentially by delegating to many and manyOne, but it will because you're separator means that you have to parse that separator between parses right? So like I'll parse a JSON value and then I'll parse a comma. And then I have to parse another JSON value after that. Um, and so sepBy just builds those two parser together, the separator in the value in the right order and then calls many with it. I feel like I'm doing a poor job explaining this, but it conceptually isn't that different than many. And manyOne.

>> Gotcha. Okay, cool. Yeah. I mean, that makes sense. I would definitely go have people check it out if they're curious.

>> Yeah. Agreed.

>> Yeah. Um well, I know we've been chatting about this for a minute, and I don't want to keep our listeners too long, and I wanna respect you know, all of our times. So if you're interested in the parsing JSON section of this, check out the the post it will be in the show notes a swell, as in, uh, issue 241 of Haskell weekly. So there's lots of places to go find it, but that that bit kind of puts it all together, you know, shows you how to take a set of information and then parse it all the way through and figure out which choice is going to succeed on. They start with the most, um, most descriptive, which is, like, object. And then they move down as far as like what things could be parsed into. So, you know, if you're interested in that, go check it out.

>> Yeah, it's a great resource, and I feel like some of the things that we've talked about our a lot easier to look at on screen. So, like saying angle, bracket dollar sign angle bracket is super clunky, but reading it's not that bad. Um, and on the flip side, reading something like the many definition, uh, just kinda rolls past your eyes and you're like, Wait a minute. What? What did I just read there? Yeah, yeah, it's a great resource. Go check it out. It'll be in the show notes.

>> Yeah, and I mean, who knows maybe next week? We'll see if I actually was able to figure out how Parsec works and actually succeed at Day four part two of Advent of Code which I've been stuck on since December 4th, so if you're curious, that's where I'm at and It's currently December 11th. So we're struggle bussing here.

>> Good luck.

>> Thank you. I need it. I need it, But yeah. So thanks for being on the show, Taylor. It was Thanks for having me as well. It was fun to chat and catch up. It's been a little bit since we've got to do this. So,

>> yeah, I'm not sure which one of us is on the show in which one of us is having the other on the show. But, you know,

>> I think it's our show

>> We're co-hosts we're both doing each.

>> It's like our house in the middle of the Haskell weekly. Yeah.

>> Can we sing that? Do we have to pay royalties?

>> I don't know. I mean, I made a spoof off of it. So do a half. You think so? I didn't want to use it directly. Um, yeah. So it was fun to be here today. Uh, you know, I want to also thank ITProTV our employer for sponsoring this podcast and allowing us to, you know, chit chat, catch up and spend some time, you know, sharing with the community what we're learning and what we're doing. So I appreciate them a lot. They also want to thank all of you guys, Um, for, you know, listening. And they wanted to offer you guys a promo code for 30% off the lifetime of your subscription using a Haskell weekly 30 again, that is Haskell Weekly 30 for access to ITProTV the e-learning platform for IT professionals.

>> Yeah, man, go check it out. itpro.tv Also, if you wanna learn more about us, you can visit our website Haskell weekly dot News were also on social media Twitter, Reddit, YouTube, all that stuff and if You have suggestions for stuff you would like to see in the newsletter or you'd like to hear us discuss here on the podcast Excuse me. Podcast, please let us know. You can email us info at Haskell weekly dot news or open an issue on GitHub or ping us on Twitter.

>> Whatever. Or if you wanna be on the show is a guest just hit us up.

>> Yeah, we'd be happy to have a guest. And Yeah, I think that will

>> Just two lonely souls here.

>> Yeah, um, that will do it for us this week. With any luck, we'll see you again next week. And, uh, Happy holidays, everybody.

>> Happy holidays. Stay safe.