Byte string, text, and vector, oh my! This week we review Michael Snoyman’s proposal to unify vector-like types. Learn about boxed versus unboxed values, pinned versus unpinned memory, and more.
Episode 41 was published on 2021-03-22.
>> Welcome to the Haskell Weekly podcast. This is a show about Haskell, a purely functional programming language. I'm your host Taylor Fausak. I'm an engineer at ITProTV. With me today is Cameron Gera, one of the engineers on my team. Thanks for joining me today, Cam.
>> Thanks for having me Taylor, I'm excited about today. I think, uh, last couple of weeks for a Haskell Weekly and interviewing some incredible people in the community. And, uh, today we have something, you know, back back more to our roots, what we're used to, what we do day in and day out, which is a look at one of the, um, articles from the past edition of Haskell weekly. And then we dissected and jumped in a little bit. Um, today I'm really excited because, you know, we've covered a lot of Michael Snoyman's articles before. Um, and we're going to cover another one today, but this one is honestly was a little brain stretching for me. And so I'm excited to kind of ask some questions I had from it and get some clarity, um, you know, as well as just. Bounce ideas around with you. Um, but today's article is titled Haskell base proposal, unifying vector like types. So we're going to just dive right in, right?
>> Yeah. And as you mentioned, it's by Michael Snoyman, who we frequently talk about on this podcast. Uh, but in relation to this discussion today, I think one of the reasons this is even coming up again is that, Michael Snoyman is involved with the Haskell Foundation, particularly kind of the more tech side of it. And, uh, this is a proposal to augment or change the base library to have like a common vector type that everything else can use. So I think that's a little bit of the broader context under which this article was published.
>> Yeah. And I think, you know, that, you know, his. Kind of focus and driving force is just to like help drive that technical agenda for the Haskell Foundation and really find a way to make the quality of life of Haskell a little easier, um, you know, and kind of clear up some of the confusion about, you know, boxed versus unboxed, pinned versus unpinned memory. Uh, what vector is, why there's so many different types and you know, all the variety of libraries we have for them and like, what does, what, and what's the side effects. So I think he like really dove in deep. Um, and so I'm looking forward to diving in.
>> Yeah, it's a lot to cover, so we better get started. And I think, uh, again, to contextualize just a little bit more actually, before we get in, is that one of the goals of the Haskell Foundation is to increase like, uh, industrial or commercial adoption of Haskell and try to find some of those things that could be warts or paper cuts that new people may run into and minimize those as much as possible. And one of the motivations for this discussion is strings. And it's a common meme or complaint about Haskell that there's, you know, five string types you have to care about. And that isn't really true. There's string from the prelude, which is a linked list of characters, which is a pretty bad representation. And then there's text, which is kind of the de facto library for representing textual data in Haskell. Um, and often byte string gets involved in the mix there because you can represent strings as byte strings, but you lose some information there cause you don't know what encoding it is. Uh, anyway, uh, just, just to contextualize, that's like the main motivation here, I think, but behind the scenes, regardless of, you know, what you do with strings, they need to be represented in memory somehow. And that's what, this is more focused on.
>> Right. And so, you know, you've talked about string being that meme wart thing that, you know, because it's the only textual basis in the base library. Right?
>> Yeah. Uh, in the prelude and in the base library, the string type is really the only choice and so much of the base API uses strings as input or output because it's the only choice. Somewhat recently, like, I don't know, in the past couple of years, um, the text library has become a wired in library for GHC, which means that when you install GHC you get text for free. So that has helped and that's made it more available, but it's still not like actually in base or better yet in the prelude where you don't even have to import anything to be able to use text.
>> Yeah, yeah. And I know we've, we've struggled against that. We've have our own custom prelude now that has that, you know, available to us. Um, but what, what is kind of the and you know another kind of thing going on? Um, That's like related to like all these options, right? Everything. There's so many options with different libraries and everything. Um, but nothing really shares anything behind the scenes. It's all different representations.
>> Right. So that's the key point here. So like, like I've just mentioned, string is a linked list of characters and linked lists are baked into the language. You could write your own, but there's one in the language. So a lot of people use that. Uh, but linked lists have a lot of downsides. They aren't very compact in memory and they're not easy to iterate over or rather they're easy to iterate over, but you might get a lot of cache misses along the way, and they. Aren't good at random access or mutation in place. So that's why there are other options for this. The most popular one is probably byte string. So if you wanted to represent just a, you know, a bunch of memory in your program, you're probably gonna use a byte string. Um, but it's not the only option. You could also use the vector package, like a vector of Word8 Uh, values or you could use the array package, which is part of the Haskell language report, like the language standard, but it's also not a very good library In a lot of ways vector is just the array library, but done better. Um, and then there are a lot of other options. Um, but those, those are the main players. Uh, and maybe another one that's it's worth mentioning at this point is the foundation library. Not to be confused with the Haskell foundation. Organization. Um, but there's a library called foundation and it has a sub library called basement. And that contains a lot of these more primitive array, vector like types. So it's got one of these as well.
>> Gotcha. So you talked about, you know, byte string and vector Word8 being quote-unquote like synonymous. Um, but you know, in the article he talks to that byte string is a pinned foreign pointer. Um, you know, what does that mean for someone who hasn't. A lot. It doesn't have a lot of programming knowledge or they're still learning the ins and outs of pointers and stuff like that.
>> Yeah, so pinned is. A, um, property of the memory that GHC will allocate for this thing. And what pinned means is that if you need to call out to C or just call, call out to anything through Haskell's, FFI the foreign function interface. Then if your memory is pinned, that means that the garbage collector won't move it around, which means if you hand a pointer off to some C program through the FFI. That C program knows that that pointer isn't going to move, which is really useful because it's probably going to want to do something with that pointer. Um, and of course the other, uh, you know, the alternative is to use unpinned memory, which the garbage collector is free to move around. And, um, you may wonder why would you want to use pinned versus unpinned memory? Like I mentioned, if you're making an FFI call, you probably need pinned Um, but another. Or, or one of the downsides of pinned memory, I guess the reason you don't use pinned memory everywhere is that since the garbage collector can move stuff around, it can, uh, compact your heap or, uh, you, people may be familiar with this back in the old days when we had spinning rust hard drives, uh, defragmentation where you may have a bunch of small files all scattered over your hard drive and you can. Pack them all together, real close next to each other. And, um, then you end up with more contiguous free space. And in this case, I'm talking about GHC's heap. Um, and that's beneficial because then if you need to allocate a big chunk of memory, you can do that rather than having to, you know, scan over all of your available heap or, or whatever. I don't know too much about garbage collection, but that's my, my understanding of what's going on here.
>> Yeah, no, I think that was a, a great explanation. Um, you know, it makes sense why we'd want pinned for the FFI because if that pointer points to something that's not there anymore, then we're really, uh, Sol. So, you know, it makes sense for, for that situation. And I think, you know, the idea of unpinned memory that allows the garbage collector to. Sort of organize the memory more or less, um, is great. Cause then yeah, that big giant heap or not, it's not actually a heap of memory, but this chunk of memory that needs to be allocated no matter what the size can generally be found, because everything isn't in the way.
>> yes. And it's worth noting that the byte string library provides a type that uses unpinned memory. And that type is called short byte string. This I haven't seen this used very often and compared to the like data dot byte string API, that's exposed, the short byte string exposes very little. It effectively lets you convert in and out of normal byte strings. And then just kind of keep it around. So I think the intent is to use this for like, uh, interned strings that other programming languages use or like symbol. And I don't know, just something that you really only need a reference to. You don't want to do anything with it.
>> Okay. Yeah.
>> Um, what kind of kicked off this discussion was, uh, we were talking about vector also and the other. Representations for this stuff behind the scenes. So I mentioned that byte string is kind of the same as a vector of Word8 And I think maybe a reasonable question is why isn't it that, you know, we have this vector library, why isn't byte, string, just a wrapper around that. And that's the crux of a Snoyman's post here is that maybe it should be right.
>> Maybe it can just be a new type wrapper.
>> Yeah. Um, but in order to do that, we need to solve this pinned versus unpinned problem, because vector is unpinned I think, um, I should check that, but I think it's unpinned and, uh, yeah, so that's one reason why it's not there. And the other reason, or one other reason is that, um, vector is a little expensive to compile and isn't a wired in library. So the GHC team. Can't just like depend on it, Willy nilly that, so it's a big deal to get a new library involved.
>> Right. I mean, yeah, and Snoyman said Like the factor package itself is huge, so it takes a long time to compile. Um, yeah. And, and kind of to your point of vector being unpinned um, I know it's, I don't, I can't remember if pinned or unpinned but I do know it's boxed or unboxed, which. I don't know exactly what those mean So I'm going to ask you a second with those mean, but my hypothesis of what those mean is the boxed is a range of memory in which it exists. And then unboxed is kind of, it's just wherever it wants to be within the heap.
>> Um, I think that's close. So again, I, I'm not too versed in the details here, so I may get it wrong. But my understanding is that boxed is what we think of for normal Haskell data types. Um, like, and I guess maybe most people don't think about the memory representation of data types. day to day I know I, that, that I don't, but, uh, boxed means that box means that it's like a. Uh, pointer to some struct that has all the fields on it that you're interested in. Uh, and sometimes there's only one field on the thing. So like a boxed Word32 is a pointer. To a 32 bit unsigned integer, somewhere in memory. Um, and the alternative unboxed means that there is no pointer, so it's just stored directly on the thing. There's no indirection there. Um, and this is important when you're trying to write high performance stuff with arrays because if you want to iterate over an array or a vector or whatever, if it's boxed every time you get to a new element, you have to chase another pointer and go to somewhere else in memory and get it. Whereas, if it's unboxed, you can just loop over, you know, one contiguous chunk of memory. And this is very good for cache hits. So your CPU will load like, you know, that line of memory into it's cache and you'll just rip over that thing super fast.
>> okay. So it's, I was a little bit backwards maybe then on my
>> yeah. So box is like, quote, unquote, normal for Haskell where. Everything is like a pointer to somewhere else. Um, and unboxed is it's stored right there. You don't have to chase down a pointer to get to it.
>> Okay. Yeah. And then what about, um, so there's also different types of vectors store, like there's just vector there's storable vector. And then there's like, obviously unboxed vector, which we kind of talked about it
>> yeah. And this is something that Snoyman wants to try to clear up. Uh, he doesn't have a very. Concrete proposal for this, but if you go poke around with the vector package, you'll see that there's data dot vector data dot vector dot unboxed dot. Storable mutable. Like there's lots of different variants. And I think that contributes at least a little bit to why this takes so long to compile. There's a lot of code because this stuff gets duplicated and it would be nice if we could have one thing, just data dot vector, and that can be boxed or unboxed, or it could be pinned or unpinned, or it could be storable or primitive or any of these kinds of varieties, because really the interface should be the same for all of those. And behind the scenes. We either may not care how it's represented or we want it to automatically pick the most efficient representation for us. And I, my understanding is that with Haskell's type system right now, that may not be possible. Uh, and I can't remember the specifics, but I know that when you go to unboxed, you are unable to do certain things and everything behaves a little bit differently. Uh, the main thing I can think of right now is that like lazy values. I don't think can be unboxed or rather if you have an unboxed value, it's not lazy because if it's not a pointer to somewhere else, that means you have that value. So it hasn't, it has to
>> It doesn't just live in a thunk anywhere. It's it is evaluated and available. Okay. Yeah, that makes sense. And I think, um, there's something that you said just a second ago and now it's escaping me of course. Uh, but the, wow. I really can't remember, but anyways, we'll keep moving on the, um, You know, underlying vector type. Now I remember what it is, uh, that, you know, existing vector types now have really complicated and convoluted instances for like boxed or unboxed or, and such. Um, because it's, it's not. Necessarily trivial to create those things. I know I wouldn't, I haven't looked at a type signature, so I don't know, but I would most likely have a hard time developing a instance of that thing. Um, because it's dealing with memory, it's dealing with things that are just kind of, you know, it's. It's behind the curtain, right? So it's, you know, you have this pretty view of what Haskell is, and then you have to put, you know, if you pull back the curtain, you'll say, Oh, okay, this is how all these things are represented in memory and how they all work together, how GHC manages them all, how the garbage collector does things, you know, that's something I haven't had necessarily a ton of time to dive into is the behind the scenes. Um, and so I think this was a great kind of step, um, for me to get into that direction. Um, and I appreciate Snoyman taking the time to kind of just share. Um, you know, some of the thoughts and feelings, um, and then you also being able to explain a little bit of the
>> Yeah, hopefully I've done a good job. And there's one other thing that I want to touch on. Um, that is one of those behind the scenes things. Uh, if you've ever looked at the documentation for data dot text or data dot vector, you may have noticed in a lot of places, it says subject to fusion and. Yeah, that, uh, if you aren't already familiar with fusion, those may be a little cryptic. Like, what does that mean? Is it good to be subject to fusion? Is it bad? Should I want these functions or should I avoid them? Um, and to give a high level overview, uh, fusion is where you may have a pipeline of operations that you apply to something and. It would be expensive to like take your input and apply the first change and then take, you know, produce an output from that. And then use that as the input to the second thing. And you'd have all of these intermediate, uh, text values or vectors or whatever, and fusion is where you fuse all of those things together. Hence the name. Um, and I bring this up because this is one of the, or let me answer the questions I posed earlier. I guess fusion is good. And you want to use functions that. Are subject to fusion. Um, but it can be very difficult to tell when that works or when it doesn't work and text and vector both have their own fusion mechanisms behind the scenes and they are different. So one of the upsides potential upsides of unifying these vector like types is to have one fusion mechanism behind the scenes. Or alternatively have no fusion mechanism at all. And instead rely on some type of streaming library, a la pipes or conduit to do that fusion more explicitly for you. Um, so that's a super quick overview of fusion, but I wanted to bring it up because, um, one of the kind of gotchas with text and vector is that if. All of your functions fuse together. You can get really good performance, but if you accidentally miss fusion somewhere along the way, then your performance can tank, which is very difficult to debug.
>> Hmm. Hmm. Yeah, it sounds a little tricky, a
>> Yes. Yeah. Um, but yeah, so I just wanted to get that in as a quick note. Um, but yeah, overall, um, Again, the point here is we have all these different representations of bytes really byte string versus vector or whatever. Um, let's have one and then everything else can build on top of that.
>> Yeah, and include it in base so we can get rid of the uses of string in base
>> Yes, that would be great. And so many people
>> Yeah. The first comment on the blog post uh comment section was, Oh, I'm so glad you want to get rid of string in base
>> yes so
>> be, it can stay in base, but it shouldn't be the base, uh, data type, you
>> yeah. And like you mentioned earlier, cam uh, for us as application authors, we can, we're in kind of a better spot because we can just say we're going to use text everywhere and. Most libraries have a text based alternative that you can use, or you can reach for an alternative prelude entirely like relude by Kowainik that uses text pretty much everywhere, I think. Or there are other alternatives too. Um, but as a library author, if you want your package to be used in as many places as popular or as possible, you probably want to depend on the smallest number of things as you can. So, if you don't really need text, then you may not depend on it. In which case, your only choice is to use string. And then that just perpetuates using string everywhere because you know, if somebody else wants to write, so I'll write a library that integrates with yours, they are kind of forced to use string. So it's a
>> Sounds like it. Yeah. And I mean, as you speak of Kowainik I wanted to do a quick little shout out to them for developing the, uh, They created a diagram of Haskell ideas and things to learn over time and kind of the difficulty of those, um, you know, on a, on a scale, like a graph. So it's really cool. It's in this week, week's edition of Haskell weekly, which I think is episode or, uh, edition 255 Um, so you can find it there. It's a really cool, like, Just kind of visual, um, and GHCJS is the hardest and longest thing to learn. So, you know, we talked about that last week, so it's, you know, everything's so related yeah, just wanted to do a quick shout out there. for that
>> Yeah, I thought that was really well put together. And like you mentioned, it's sort of a two-dimensional graph of these concepts in Haskell and how they relate, uh, in difficulty and kind of when you may be exposed to them. So if you feel like you're looking for something new to learn about Haskell, you could. Find, you know, kind of the edge of where you are and pick one of those things. That's just beyond it. Or if you feel like, you know, everything, you could look back and say like, wow, I learned this thing way sooner. than I should have.
>> right. Yeah. I mean, it's good to just kind of see what's out, what's out there that you may not know. Um, or be like looking at me, I'm a ball Haskell baller. If you're a Haskell baller, please call us. We would love to, uh, give you an opportunity.
>> for sure. Um, so yeah, I think I've said everything I wanted to say about, uh, this shared vector, like type I'm sure that, um, I've messed some things up, but those are the broad strokes. I encourage listeners to go read this post. It's good. And go discuss it on the Haskell discourse if you've got opinions, which I'm sure you do.
>> no, I think that's it. I think that it was uh overall great. Um, to kind of dive into this, even though it was a little brain melty for those who aren't as technically savvy. Um, and haven't looked behind the curtain enough. Uh, but I do think, yeah, but I do think it's, you know, really really great knowledge and something that, you know, I think the foundation will continue to think about and yeah, GHC contributors and all that, will kind of really evaluate and look at, you know, um, is this being something they seriously consider?
>> agreed a hundred percent. All right, well, uh, that will do it for us this week. Thank you so much for listening to the Haskell weekly podcast. I've been your host Taylor Fausak And with me today was Cameron Gera You can find out more about Haskell Weekly at our website, which is HaskellWeekly.News If you enjoyed the show, please rate and review us wherever you found us. And if you have any feedback for us, you can tweet us. Our handle is HaskellWeekly
>> And Haskell weekly is brought to you by ITProTV an ACI learning company and also our employer. They would love to extend an offer of 30% off the lifetime of your subscription. Uh, all you gotta do is go to ITPro.TV purchase, or you can even start a free subscription if you're skeptical. Um, all you gotta do is purchase. And at checkout, if you use. Promo code HaskellWeekly30 you will get that 30% off and you can see it, um, in the checkout process. Um, but I think that about does it for us. Thanks for joining us on the Haskell Weekly podcast and see you next week!