Podcast
Async Exceptions
Special guest Cody Goodman walks us through an interesting PostgreSQL bug. Handling async exceptions properly is trickier than you might expect!
Episode 42 was published on 2021-03-29.
Links
- https://www.parsonsmatt.org/2021/03/17/async_control_flow.html
- https://github.com/codygman/tech-roam/blob/master/20210326113249-haskell_persistent_issues_postgres_connections_are_returned_to_pool_too_quickly.org
Transcript
>> Welcome to the Haskell Weekly podcast. This is a show about Haskell, a purely functional programming language. I'm your host Taylor Fausak. I'm an engineer at ITProTV. With me today is Cameron Gera, one of the engineers on my team. Thanks for joining me today, Cam.
>> Of course Taylor, glad to be here. Uh, we've got an exciting topic to talk about today and something that's impacted us here at ITProTV. Uh, and so, you know, I'm really excited cause we have a special guest with us today our Haskell wizard himself, Cody Goodman is here. Um, and Cody's going to help us kind of talk about, um, async control flow, because we discovered some issues with some of our code base because of an issue with an underlying, you know, the async control flow in Haskell. So, um, welcome Cody.
>> Thanks, Cam. Yeah. Uh, like you're saying, um, we had an issue with postgresql-simple. I guess I'm jumping right into it as the YouTubers say.
>> That's okay.
>> Let's get right into it.
>> Um, so we, we got to work. We saw a nice little Sentry error in our Teams channel. It said libpq query in progress. And we're like, what? What was going on there? Um, so we, we go and we dig a little deeper and eventually we find out that, um, a long running batch process was sometimes dying and it was leaking a bad connection back into the pool. Then if something else tried to use that pool, say a web process or something, it would get a libpq failed, another command is already in progress, error, uh, which is kind of shocking.
>> Yeah. We would only get this sometimes, right? Cause like the bad connection would get put back in the pool, but it would be in the process of cleaning up. So if enough time passed, the connection would be good again. So you wouldn't see this error.
>> Yeah, exactly.
>> Yeah. And, uh, you know, and we had a lot of various libraries to look at, you know, we had libpq, postgresql-simple, resource-pool, resourcet, persistent, persistent-postgresql. So there was a lot at play here. And Cody, I know you've spent a lot of time on this, so I'm really excited that, you know, you can be here and talk to us. Cause I know you worked with Matt Parsons, um, after filing an issue in, um, persistent, that then led to our discovery of what was really going on underneath. So, um, you know, talk through, you know, the minimal case to reproduce the bug, if you don't mind.
>> Yeah. Uh, so the first thing is, uh, you know, how do we get that high level description of, we have some process that's inserting some things and it's causing the connection pool to become bad. How do we get that into code? And then, you know, you have something try to pull that back out of the pool and reuse it. Uh, one of the first things, which I, uh, actually overlooked at first, is just having a pool with only one resource. Um, because otherwise it's sort of like gambling, uh, not a good use of your time.
>> You don't want to play slots at work? Just, pull it. Oh, triple sevens, I win!.
>> It might've been closer to like Russian roulette with the connection pool
>> Yeah. Yeah. All money on red.
>> Um, but yeah, we had, like Cody mentioned, this was like a background process that was doing this and. As is so common when you want to open a bug against one of the open source libraries that you're using, it's not very helpful for the maintainer to say, you know, this happens in our closed source application and we think it's your fault. So please fix your library. We wanted to have something public we could point to and say, look, this is the problem we're running into. So we were trying to take, you know, our tens of thousands of lines of code that are under consideration for our app and point it to one of those libraries that Cam mentioned earlier, one of the seven different libraries that could be at fault here.
>> Yeah. So Cody, you were talking though about, yeah, you probably, you started with more than one, uh, you know, a pool with more than one resource and then you kind of narrowed it down, right. To test with that one, like a one resource pool.
>> Right. Cause you know, that, that single mistake there, not just taking a second to say, okay, let's just make sure we get this right. That cost me some time because sometimes it wouldn't fail. And I was like, wow, why is this not reproducible? I can't just put this example that only sometimes fails up there. I guess I could. Matt's a nice guy. He probably would have ran it once. Okay.
>> Yeah. But you want it to be easy for the maintainer to reproduce this problem so that, you know, maybe you'll nerd snipe them and they'll be like, huh, that's weird. That shouldn't happen. And ideally that'll happen the first time. The first time they try it.
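The single-resource point can be illustrated with a toy model. This is a minimal sketch, not the resource-pool API: `Conn`, `Pool`, `checkout`, `checkin`, and `nextCheckoutIsBusy` are hypothetical names, and the FIFO idle queue is an assumption made only so the behavior is easy to follow. Real pools pick idle connections by their own policy.

```haskell
import Control.Concurrent.MVar
import Control.Monad (replicateM)
import Data.IORef

-- Hypothetical stand-in for a database connection. It only tracks whether
-- a command is still in flight on it.
newtype Conn = Conn (IORef Bool)

-- Toy pool: a FIFO queue of idle connections (an assumption of this sketch).
newtype Pool = Pool (MVar [Conn])

newPool :: Int -> IO Pool
newPool size = do
  conns <- replicateM size (Conn <$> newIORef False)
  Pool <$> newMVar conns

-- Take the connection at the front of the idle queue.
checkout :: Pool -> IO Conn
checkout (Pool var) = modifyMVar var $ \idle ->
  case idle of
    conn : rest -> pure (rest, conn)
    [] -> ioError (userError "pool exhausted")

-- Put a connection at the back of the idle queue.
checkin :: Pool -> Conn -> IO ()
checkin (Pool var) conn = modifyMVar_ var (\idle -> pure (idle ++ [conn]))

-- Simulate a job that is cut off mid-query and hands its connection back
-- while it is still busy, then report whether the next checkout gets it.
nextCheckoutIsBusy :: Int -> IO Bool
nextCheckoutIsBusy size = do
  pool <- newPool size
  Conn busy <- checkout pool
  writeIORef busy True
  checkin pool (Conn busy)
  Conn next <- checkout pool
  readIORef next

main :: IO ()
main = do
  nextCheckoutIsBusy 1 >>= print -- True: the only connection is the bad one
  nextCheckoutIsBusy 3 >>= print -- False: a healthy connection is served first
```

With one connection the failure shows up on every run. With three, the bad connection can sit behind healthy ones, which is the "sometimes it wouldn't fail" situation described above.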
>> Yeah, which seems to be eventually what happened, right. And you know, you partnering and working with Matt a lot on this, you know, helped and, you know, obviously Matt's got a post about it. You know, you have a lot of notes and stuff from it. So, you know, I think there was a great experience for, you know, anybody who's working with a library that, you know, starting to create some issue. Like it's a good example of say, hey, like it's okay to create an issue. And if you can make that reproducible bug, you can work with the maintainer to create a solution and help everybody that uses the library, not just your own team.
>> Right. And it's funny, you bring up nerd sniping, Taylor, because really I'm, I'm hoping that this nerd snipes some people and we can actually answer this question and improve Haskell as a whole. Uh, so I'll, I'll say. I'll say something provocative here. I don't think many people actually know the root cause here. There's still a lot of unanswered questions. And I don't think a lot of people know about asynchronous exceptions in Haskell. I hope to be proven wrong. And we can figure this out and add some documentation and improve everything as a whole.
>> Yeah. And I think you, you made a good point is like, what is like asynchronous exceptions and what are, you know, what's the difference between synchronous control flow and asynchronous control flow? Um, could you give us kind of high level overview of that Cody or Taylor?
>> Cody, I think you're the one to answer that. And I'll put you on the spot
>> Uh, yeah, kind of been drowning in them for two weeks. So hopefully I understand, uh, what they are well enough to describe it. Um, so synchronous exceptions are basically just on the same thread. Uh, it's something like, uh, if you use the unsafe head function and, um, the list was empty. Uh, an asynchronous exception is easier to think about in terms of like your computer telling you, telling your process that it ran out of memory and then canceling your thread, where your computer here is more specifically the GHC runtime.
>> Okay. So I normally conceptualize an async exception as coming from a different thread, but it sounds like if you think of the GHC runtime as a separate thread, then that kind of fits into that understanding as well.
>> Right. And analogies are lossy, but hopefully that's a good one to sort of start with and, uh, be aware there's more subtleties.
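The distinction can be sketched with nothing but base. The names `unsafeHead`, `syncDemo`, and `asyncDemo` are made up for this example. The worker catches its own `ThreadKilled` only so the result can be reported, which is not something production code should normally do.

```haskell
import Control.Concurrent (forkIO, killThread, threadDelay)
import Control.Concurrent.MVar
import Control.Exception

-- Partial function in the spirit of the unsafe head mentioned above.
unsafeHead :: [a] -> a
unsafeHead (x : _) = x
unsafeHead [] = error "empty list"

-- Synchronous: the exception is raised by code this thread is evaluating.
syncDemo :: IO String
syncDemo = do
  r <- try (evaluate (unsafeHead ([] :: [Int]))) :: IO (Either SomeException Int)
  pure (either (const "raised by our own code") (const "no exception") r)

-- Asynchronous: another thread throws the exception at the worker.
asyncDemo :: IO String
asyncDemo = do
  started <- newEmptyMVar
  outcome <- newEmptyMVar
  worker <- forkIO $ do
    -- Catching an async exception like this is for demonstration only.
    r <- try (putMVar started () >> threadDelay 10000000)
    putMVar outcome $ case r of
      Left e -> "interrupted from outside: " ++ show (e :: AsyncException)
      Right () -> "finished normally"
  takeMVar started
  killThread worker -- throws ThreadKilled into the worker
  takeMVar outcome

main :: IO ()
main = do
  syncDemo >>= putStrLn
  asyncDemo >>= putStrLn
```

The worker did nothing wrong in its own code. The exception was delivered to it while it was sleeping, which is what makes it asynchronous.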
>> Um, and is there any difference here? You mentioned using unsafe head, which throws either undefined or error or something like that. Is there really any difference between that and Control.Exception.throw? Or for these purposes, are those basically the same?
>> Yeah, I'm a, I'm a little spotty on that you might have to help me out, but, uh, I think it's precise versus imprecise exceptions.
>> Something like that. And maybe the, the thing we should be comparing against is throwIO or like the MonadThrow constraint, that type of thing. Um, but yeah, I think for our purposes, as far as this bug is concerned, it didn't really matter if it was a precise or imprecise exception. It mattered if it was synchronous or asynchronous.
>> Right, right.
>> Okay, so Cam, are you satisfied with that, uh, description of asynchronous control flow?
>> Yeah, I appreciate, as the, the least Haskell-ish wizard here, getting that understanding. And I appreciate your analogy and wording of how that relates to the GHC runtime versus not. So thank you Cody for that explanation, Taylor as well. Um, you
>> And maybe one other thing we should mention actually, before we move on from async control flow is how you manage them, right? So with normal synchronous exceptions, you can add a catch or a handle around it. And if you evaluate everything at the right time, then you can deal with that exception. And with async exceptions, normally you don't catch them that way. Is that right? Cody?
>> Right. Uh, you normally, don't, that's, that's actually a really big piece of, this is the whole problem of bracket and resource finalization. Uh, you really struck a chord here, if you can't tell. Um, uh, basically there was a big thread on this a while back, uh. Bracket does not use, uh, what was it called? Uninterruptible blocking, uh.
>> Uninterruptible.
>> Masking. There we go. Yeah. Which wasn't named blocking because of reasons. I don't remember the reason. Uh, but yes, masking, interruptible masking versus uninterruptible masking. Uh, that basically means can the runtime system interrupt, uh, what you're doing inside of this code or can it not in
>> Okay. And let's, let's unwind that a little bit. So starting with the basics. Bracket is a function where you tell. You, you set up how to acquire a resource and how to release a resource, and then you use the resource within that. So the most common example I think is with file. Like open a file is acquire, close a file is release. And then once you have that file handle, you can write to it, read from it and do whatever you want. And masking, uh, like you mentioned, it has to do with the runtime. Can it interrupt what's going on there or not? Um, maybe interrupt is the wrong word to use, but, uh, I typically conceptualize masking as, uh, can an exception be thrown into like, while this code is running, can it receive an exception or is it, should it avoid, should the runtime avoid sending exceptions while this code is running? Should it wait until it finishes.
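A small sketch of how `bracket` from Control.Exception applies masking to its three phases. The helper name `bracketStates` is made up here, and a unit value stands in for the file handle. Each phase just records the masking state it runs under.

```haskell
import Control.Exception (MaskingState (..), bracket, getMaskingState)
import Data.IORef

-- Record the masking state seen by acquire, use, and release, in that order.
bracketStates :: IO [MaskingState]
bracketStates = do
  seen <- newIORef []
  let record = getMaskingState >>= \s -> modifyIORef seen (++ [s])
  bracket record (\_ -> record) (\_ -> record)
  readIORef seen

main :: IO ()
main = bracketStates >>= print
-- prints [MaskedInterruptible,Unmasked,MaskedInterruptible]
```

Acquire and release run masked, but only interruptibly, so a blocking operation inside either one can still receive an async exception. The body in the middle runs unmasked.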
>> Yeah, I think that's a, that's a good way of looking at it. Um, and for here specifically, what we're concerned with is, uh, the release portion of that bracket. That is when that, that file handle that you took, you acquired from the file system. When you're releasing it back to the file system. So all the other things can make use of it. Uh, can the GHC runtime cancel that process? Are you just going to have, uh, these file handles out there that can't be reused?
>> Right. And the problem could be that you start releasing the file handle, and then GHC interrupts that release process with an async exception, for instance, saying you're out of memory. Something like that. And then because of that, you don't finish releasing that handle, but whatever code around that call to bracket thinks that the release has been run successfully. So you have this kind of like a, I dunno, Schrodinger's file handle of, it's not open, it's not closed. It's in this weird state that it wasn't meant to be in. Okay. So hopefully that kind of explains what we're dealing with here with bracket and masking. Right, Cam? You feeling good with that explanation?
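The interrupted release described here can be reproduced with base alone. This is a sketch under assumptions: `releaseCompleted` is a made-up name, and `threadDelay` stands in for any slow, blocking cleanup such as telling a server to cancel a query.

```haskell
import Control.Concurrent (forkIO, killThread, threadDelay)
import Control.Concurrent.MVar
import Control.Exception (bracket, finally)
import Data.IORef

-- Returns True only if the release action ran to completion.
releaseCompleted :: IO Bool
releaseCompleted = do
  finished <- newIORef False
  releasing <- newEmptyMVar
  done <- newEmptyMVar
  let release () = do
        putMVar releasing ()     -- signal that cleanup has started
        threadDelay 200000       -- stand-in for slow, blocking cleanup
        writeIORef finished True -- never reached if we are interrupted
  worker <- forkIO $
    bracket (pure ()) release (\() -> pure ()) `finally` putMVar done ()
  takeMVar releasing
  killThread worker              -- async exception arrives mid-release
  takeMVar done
  readIORef finished

main :: IO ()
main = releaseCompleted >>= print
-- prints False
```

The release started but never finished, because the blocking call inside it is an interruptible operation even under bracket's mask.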
>> Uh, I'm feeling okay. Yeah. Our communication channel here broke up in the middle of your last statement. So I lost a little bit of what you said, but overall, I think it's helpful. Um, and you know, obviously I think, you know, async exceptions aren't talked about enough, kind of like Cody mentioned earlier, which, you know, makes it seem like you're, you're all alone in the situation. It's like, wait a second. There are other people who face it. They just don't talk about it either because they don't, they're like the community just doesn't think about and talk about async exceptions in a regular format. Like they just, it's kind of the, um, yeah, kind of the black sheep of Haskell, maybe. Like the thing that everybody knows is wrong, but there just hasn't been the bandwidth or time to take care of it and find the perfect solution. But I think now with the community growing, creating a foundation, you know, with that, we'll start to provide more resources and maybe more, uh, platforms for discussion and debate about what async control flow should look like and how we should handle async exceptions, we'll move the language forward. And so I'm really looking forward to seeing how that comes. You know, and maybe make it more approachable to other people. Because if you, you know, aren't super invested and then you all of a sudden come across this kind of issue and you're like, oh, well, now I'm backing away. I'm not going to go any closer to that because it's not worth it for me because I'm not sure what the heck is going on here. Or nobody seems to be talking about it. So there's not really support for this and I'm out.
>> Yeah, it could be demoralizing. Like you mentioned, if you're new or newer to the language and you run into this problem and nobody seems to be talking about it. So you're like, well, this is just some insane edge case that I happened to run into. Well, maybe, maybe not, maybe it's pretty common, but nobody talks about it or everyone just hopes that it doesn't happen to them. And we got unlucky and it happened to us.
>> Right. You know, you talk about Haskell adoption. Uh, if somebody at a big company starts trying to use Haskell, they convince their team to use Haskell, and then they get a Postgres error like this, you know, an issue with what's supposed to be a core library, uh, a binding to Postgres that everyone relies on, you know, Data.Pool, something everyone thinks of as bulletproof. Um, that's pretty scary. You know, that's gonna make you really rethink things. That's not gonna make that person who, who put their neck out to, uh, get their team to adopt Haskell look very good.
>> Right. Doubly so since Haskell has a reputation of being focused on correctness, And if you can immediately run into a showstopping bug with, like you mentioned, one of the most popular libraries, then that's not a good look. Um, it is worth mentioning that there are library level solutions to this problem. I think the unliftio library, um, does resource finalization differently with the bracket function that it exposes. So if the persistent library or the whole, you know, menagerie of packages we rely on here, happened to use unliftio. We wouldn't have run into this problem. Um, or if unliftio was just part of the base library, same situation.
>> Yeah, that, that was my understanding as well. I have a little bit of doubt though. And the reason for that is I, I replaced pretty much everything in the Postgres part of persistent with unliftio, including, uh, forking Data.Pool and, uh, or not. Yeah, yeah. Using someone else's PR that replaced all of it with unliftio, and it still didn't solve the problem. I had a lot of things running. I could have got something wrong in there, but it at least deserves to shake that confidence a little, I think.
>> Uh, and the reason I mentioned unliftio is that I'm reasonably sure its bracket release uses an uninterruptible mask, which means that the runtime wouldn't be able to interrupt this release. Um, but some people don't like that as a default choice because then if your release sits there and like has an HTTP timeout and takes 30 seconds to do something, your program is completely unresponsive for 30 seconds because like hitting Control-C is an async exception thrown from the runtime to your program. So you wouldn't be able to respond to that until that release wraps up.
>> Uh, I think the answer there, um, it seems kind of simple. Maybe someone else has recommended it, someone else probably has: make bracket use an uninterruptible mask by default, and take a required timeout.
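To show the trade-off, here is a hand-written bracket variant whose cleanup runs under `uninterruptibleMask_`. It is a sketch written from scratch with base only. `bracketUninterruptible` and `releaseCompleted` are hypothetical names, and this is not a copy of any library's implementation.

```haskell
import Control.Concurrent (forkIO, killThread, threadDelay)
import Control.Concurrent.MVar
import Control.Exception
import Data.IORef

-- Hypothetical bracket variant: cleanup cannot be interrupted by async
-- exceptions, even while it blocks.
bracketUninterruptible :: IO a -> (a -> IO b) -> (a -> IO c) -> IO c
bracketUninterruptible acquire release use =
  mask $ \restore -> do
    resource <- acquire
    result <- restore (use resource)
      `onException` uninterruptibleMask_ (release resource)
    _ <- uninterruptibleMask_ (release resource)
    pure result

-- Returns True only if the release action ran to completion.
releaseCompleted :: IO Bool
releaseCompleted = do
  finished <- newIORef False
  releasing <- newEmptyMVar
  done <- newEmptyMVar
  let release () = do
        putMVar releasing ()     -- signal that cleanup has started
        threadDelay 200000       -- stand-in for slow, blocking cleanup
        writeIORef finished True
  worker <- forkIO $
    bracketUninterruptible (pure ()) release (\() -> pure ())
      `finally` putMVar done ()
  takeMVar releasing
  killThread worker              -- blocks here until release has finished
  takeMVar done
  readIORef finished

main :: IO ()
main = releaseCompleted >>= print
-- prints True
```

The cleanup always completes, but `killThread` has to wait the full delay before it returns. With a 30 second release, that wait is the unresponsiveness being described.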
>> That would make me happy.
>> Same.
>> Um, okay. So, so we've been talking about masking and bracketing and interrupting and all this stuff, but let's, um, let's come up a little bit and let's talk about how we ran into this problem in the first place, because like we mentioned not many people talk about this and I think it's because most of the time, in usual circumstances, people aren't going to run into this and we were doing something a little unusual. Cody, could you explain what we were doing?
>> Yeah. If I recall we had an outer left join and we didn't have a distinct in combination with that. So we were doing like, uh, maybe it was Cartesian, uh, of a thousand or 10,000 rows. And it turned into like 50,000 or a hundred thousand queries, one of the two.
>> Uh, results, not queries, but yeah, yeah, yeah. We had a join and then an aggregation and we were duplicating the, uh, that field over and over again. And then we were iterating over that aggregation. So, we, we tried to batch something up into like 10,000 rows, but each of those rows contained several thousand aggregated together fields within it. Uh, so the data set, we were looping over was really large. So that, that was a big select, right? And then we're also doing an insert at the same time?
>> Right. Yeah. We were, uh, we hadn't moved to streaming with persistent yet, and we were doing a solution where we did those selects and then inserts right after. So it added up to a lot.
>> Yeah. So we kind of like accidentally got into this situation and, you know, there was a bug in our query and we have since fixed it, but that doesn't mean we wouldn't have run into this problem otherwise, just that it would have been much less likely.
>> Right. We would have had to have more processes going that were requiring that thing. I think here, you know, this could happen to anyone. Like that's really what we're trying to say here is that, you know, you're not alone if this happens to you, or if you've struggled with this or anything like that, like we we're all in this together as a community and we're trying to like, you know, really you know get, get everything figured out, you know, create the best version of Haskell we can create and all have happy, fantastic, uh, job satisfaction because we're
>> It sounds like we may need a support hotline for this. Are you or someone you know affected by async exception handling in Haskell? Call this number now.
>> The async helpline.
>> Yeah,
>> We get, uh, Michael Snowman to say some words of support?
>> I think it would just be, it would be his personal phone number.
>> There it is. Sorry. Yeah, uh, Snoyman, but there you go. You're on the hook now.
>> But yeah, so, so we ran into this problem in a way we were kind of fortunate to have this bug because Cam like you mentioned, maybe we, we would have run into this only as our dataset got larger and then it would have been like, well, this thing was working fine for months and months and then it just exploded, what happened? So that was kind of lucky. Um, And it was also in a way, a little encouraging to see that other libraries that we use are also susceptible to this problem. And on the flip side of that coin, it's a little discouraging because like, well, you know, this seems like it's kind of pervasive. Um, but yeah, the queue library that we use has, it keeps track of how many times it tries a job so that it doesn't try something over and over again. And. That bit of logic wasn't working because the like, uh, I, I assume is implemented with bracket behind the scenes. I don't remember. Do you, do you know Cody?
>> I'm fairly sure it was bracket, but I wouldn't bet money on it.
>> But yeah, so it would, you know, check the count and then if the count was too high, it would put it into this failed state rather than waiting to retry. And we, we have a limit of 10 retries on our jobs, and these jobs that were failing in this particular way with this async exception were getting retried hundreds of times instead. And so that was like, huh, that's weird. So we haven't chased that bug down yet, but I bet it's going to be the same or a similar root cause.
>> Yeah. And when, when I saw that I was, I thought, you know, am I going crazy? Did something else change? You know, what other millions of things could I have done wrong?
>> Yeah. Um, but yeah, Cody, I think you had a good kind of thought experiment here for, um, we, we were lucky in a way to run into this bug, but is there a way this could have been prevented in the first place, right?
>> Right. And I was thinking about that. The only real way I think this could have been prevented is if, uh, when writing postgresql-simple, you, you write functional tests that presume whatever pattern. It, it should be used in a, uh, I guess, a professional space, which would be with, uh, the pool library. So that's kind of a big overhead to ask of somebody who's already creating your PostgreSQL bindings. Uh, so it's kind of a hard problem, but that that's what would have been necessary to prevent this.
>> Right. And we should say that, while that test did not exist, it does now exist. So as part of Matt chasing down the root cause here, and fixing it in persistent, he wrote a test case. So we're pretty confident there won't be a regression there. Um, but you know, it would have been nice to have it at the outset.
>> Right. Uh, and there's a related PR in postgresql-simple that hasn't been merged yet or reviewed, I don't think. Uh, but hopefully there can also be a test to put in there since pretty much all the other database libraries, you know, opaleye, uh, selda, uh, and I think maybe beam, they depend on postgresql-simple too.
>> Right. Yeah, it was interesting because we reported or Cody, you reported this bug and then there was this kind of synchronicity going on where two other people. Reported the bug almost at the same time in different libraries. It's like what's happening here? Weird to see.
>> Yeah. And, uh, I tried to weave a, a tangled web that you can follow and see all the related things there. Um, I also included a link to some notes where I'm trying to figure out the real reason, the real root cause of everything here.
>> Yeah. And we'll add those links into the show notes. And like you mentioned a while ago, Cody, um, If you're listening to this podcast and you're like, man, those guys are dummies. The solution is so easy. It's this thing. Uh, please tell us, we would love for somebody to just waltz in and tell us what the, what the answer is. That'd be great.
>> Yeah. And I think it'd be insightful for everyone, you know.
>> Yes, exactly. Uh, we'll, we'll release a follow-up episode if that happens with, uh.
>> With you as a guest.
>> We were dummies. Here's what it is. Yeah.
>> Yeah. Yeah. You'd be a first-class guest here. We'll pay for your flight and everything. Come in, studio. Fancy stuff, you know, All paid expense trip. I'm
>> writing some checks. I don't know if we'll be able to cash them or not.
>> I'm just joking, but we would love to know if there is a better solution or the right solution. Um,
>> Um, well, one, one solution is using a different library, right? So Cody, you mentioned that most of the ecosystem ultimately relies on postgresql-simple, it kind of underpins everything, but there is an alternative.
>> Right. Um, there is a library that. No one seems to know how to pronounce in my circle called hasql or hasql. I'm really not sure.
>> H A SQL, if you want be verbose.
>> There, there we go. H A SQL, um.
>> I thought it was just laughing SQL, like ha SQL.
>> That's cute. I'll take that.
>> Yeah. Maybe that's what it's supposed to be.
>> Um, so I, I was led back to hasql. I'll go with that. Um, because when I was searching for this issue, they had libpq command in progress in their test suite. So they had a regression test for it. And I was like that immediately made me think, hey, they've been here before, what would it take to, to replace postgresql-simple with hasql. And, um, yeah, that was, that was a thought, uh, part of me wishes I would have just tried to replace the PostgreSQL persistent stuff with hasql.
>> Right.
>> Yeah, that could have been cool.
>> Yeah, that could be an interesting thing to chase down, make a persistent hasql binding library. And, you know, the whole point of persistent is that hopefully we would be able to switch that out behind the scenes without having to change our code. Um, but we did not do that. Not yet.
>> Um, what's what's even funnier is I had that thought and then Matt Parsons, while I was talking to him, he actually said, you know, unprompted, maybe it would have been faster to rewrite this with hasql.
>> That's funny.
>> So yeah, if you're listening to this podcast and that sounds like a fun project to you, please take it on.
>> Yeah, we would adopt, be early adopters.
>> I was just going to say that hasql is a, um, it's by Nikita Volkov, and it's very focused on correctness and it uses, I think, the binary protocol to talk to Postgres and it tries to represent as much as possible on the value level. So instead of like throwing an exception if something goes wrong, it'll give back an Either and then you have to deal with that however you want. So that's one of the reasons, I think, that it does have this test case, cause Nikita was really going through this with a fine tooth comb, trying to find stuff like this.
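The general idea of surfacing failures as values can be sketched with `try` from base. This is not hasql's API. `safeDiv` and `describe` are made-up names, and a division stands in for a database call.

```haskell
import Control.Exception (ArithException, evaluate, try)

-- Surface a failure as a value instead of letting the exception escape.
safeDiv :: Int -> Int -> IO (Either ArithException Int)
safeDiv x y = try (evaluate (x `div` y))

-- The caller has to handle both outcomes explicitly.
describe :: Either ArithException Int -> String
describe (Left err) = "failed: " ++ show err
describe (Right n) = "ok: " ++ show n

main :: IO ()
main = do
  safeDiv 10 2 >>= putStrLn . describe -- prints "ok: 5"
  safeDiv 1 0 >>= putStrLn . describe  -- prints "failed: divide by zero"
```

Because `try` is used at the narrow `ArithException` type, async exceptions such as `ThreadKilled` are not turned into `Left` values and still propagate as usual.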
>> I'm going to have to actually look how he handles things like, uh, asynchronous exception, you know, like stack out of memory error or whatever, uh, because one of the common criticisms of handling exceptions like that at the value level is, well what are you just going to case on some exception?
>> Yeah, why not?
>> Well, I mean, we kind of have some spots in our code base, but.
>> Yeah.
>> Uh, well, awesome. You guys have anything else you want to chat about in regards to async control flow and async exceptions?
>> That's it for me.
>> I think that's everything. I'll be writing books on books in my notes.
>> Perfect. Hey, we look forward to seeing that stuff. Uh, thanks Cody for being on the show and thank you all for listening to the Haskell Weekly podcast. I've been your host, Cameron Gera. And with me today was Cody Goodman and Taylor Fausak. Uh, to find out more about Haskell Weekly, check out our website, Haskell Weekly dot news. And if you enjoyed the show, please, please, please rate us and review us on Apple Podcasts. It just helps more people find us. It'd be awesome. And if you have any feedback, always feel free to tweet us at Haskell Weekly on Twitter.
>> We are brought to you by our employer, ITProTV, which is an ACI Learning company. They would like to offer you 30% off the lifetime of your subscription. You can redeem that by checking out, going through the normal flow and adding the promo code Haskell Weekly 30 at checkout. Um, so please go to it pro dot tv and sign up for a subscription. But that'll about do it for us today. Thanks for joining us and we'll see you next week. Bye.
>> Peace.