Episode 489: Sam Boyer on Package Management

Filed in Episodes on December 8, 2021

Guest Sam Boyer, author of “So You Want to Write a Package Manager,” talks about package management with host Robert Blumen. The discussion covers: what is a package? what does it mean to manage a package? package metadata; package versioning; the quantity of packages in modern applications; examples from popular programming languages; where do packages live? how does the package manager locate packages? how are package versions queried? what are package dependencies? versions, dependencies, and version selection algorithms; dependency hell; resolving version conflicts; lock files; reproducible builds; what does the developer workflow look like?

Transcript brought to you by IEEE Software magazine.
This transcript was automatically generated. To suggest improvements in the text, please contact content@computer.org and include the episode number and URL.

Robert Blumen 00:00:17 For Software Engineering Radio, this is Robert Blumen. Today I have with me Sam Boyer. Sam is a principal software engineer at Grafana Labs. He contributes to open source projects, especially in the Go community, and is the author of the article, “So You Want to Write a Package Manager,” which will be the subject of our conversation today. Sam, welcome to Software Engineering Radio.

Sam Boyer 00:00:42 Thank you for having me. I’m delighted to be here.

Robert Blumen 00:00:45 Is there anything else you’d like listeners to know about your background?

Sam Boyer 00:00:48 I can do the quick things. I’m a self-taught engineer. I started in the industry about 12 years ago. I’ve been working in open source for a long, long time, and it was in large part my interest in the way the communities work that drove me to my interest in package management in the first place.

Robert Blumen 00:01:07 Great. Before we talk about package management, what is a package?

Sam Boyer 00:01:13 So, this is a totally natural place to start, and it is a surprisingly subtle question. Most people, most of the time, think of the packages they work with in terms of the actual systems they’re in, right? So, if you’re writing in JavaScript or TypeScript or whatever, you’re going to think of NPM packages and package.json, and the contents of that. And you might also think of system packages or system package managers, your apt or your RPMs, or more modern, like, Kubernetes-y things. But if you actually want to give a sufficiently general definition of what a package is, it’s kind of frustrating and useless. The best one I can give is that a package is a collection of stuff with a name and some kind of logical boundary that separates the stuff that it contains from other packages of the same kind and their stuff, within some larger context.

Sam Boyer 00:02:10 And that is, like I said, frustratingly abstract. “Stuff” can be source code or compiled code or configuration files or a bunch of cat GIFs. What the contents of the package are, and what it needs to be a valid package, is a function of the ecosystem the package is from. If it’s an NPM package, it should have certain properties: it should have a package.json. A Rust package should have a Cargo.toml and, I think, a non-zero number of .rs files. This is one of the weird subtleties of package management, though. It starts right at the beginning. It’s hard to actually pin down a really precise definition of what constitutes a package.

Robert Blumen 00:02:43 It might help people to understand, if you could pick one language and talk about one or more packages in that language to illustrate it.

Sam Boyer 00:02:52 Okay, offhand. Oh man. So, we could talk about Go, where the things that we think of as packages elsewhere are referred to as “modules.” We could talk about the Go kit package. Sure. Popular microservices framework. And so, there’s a git repository where the actual process of developing the code is done, but what we would think of as a package itself is a git tag that is attached to a particular commit to make a release, which then refers to the snapshot of all the files in the git repository. And the identifying features of that Go package would be all the Go files themselves, and then the metadata files that communicate information at the package level: the go.mod file and the go.sum file, in the case of Go.

Robert Blumen 00:03:48 Okay, so packages. So, it’s a bunch of code that I might want to use in my program so I don’t have to go to the trouble of writing it all myself. And then there’s some metadata in — if it’s too hard to answer this in general, talk about some things you might find in the metadata. You could pick a particular language.

Sam Boyer 00:04:07 Yes. Well, among the most important things in the metadata, you’re always going to find a name, and you may or may not find a version. Sometimes that is considered to be something that is held inside of the package, but it always has to be something that’s also outside of the package so that other things can refer to it. But that’s sort of your two-point coordinate system in the universe of packages: you need to have a name and then a particular version of that name.

Robert Blumen 00:04:37 Yes. I don’t only want to use a Go kit package. I want you to know that I want 2.1, and they might release 2.2 and 2.3, but I’m not ready to upgrade to 2.3. So, I want to stay on 2.1, something like that?

Sam Boyer 00:04:53 Yeah, yeah. Those are the two absolute essentials. The other major thing that you will typically see in these metadata files is a list of dependencies, a list of other packages required by that package, and possibly also some constraint on the versions that are supposed to be used with those.
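As a rough illustration of the metadata described so far (a name, a version, and a list of dependencies with version constraints), the following Python sketch models it as plain data. The class names, the example.com package names, and the constraint string format are all hypothetical; real formats such as package.json or go.mod each define their own.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PackageId:
    """The two-point coordinate: a name plus one specific version."""
    name: str
    version: str


@dataclass
class Metadata:
    """Hypothetical package-level metadata: identity plus dependencies."""
    ident: PackageId
    # Maps a dependency's name to a version constraint (format is made up).
    dependencies: dict[str, str] = field(default_factory=dict)


kit = Metadata(
    ident=PackageId("example.com/kit", "2.1.0"),
    dependencies={"example.com/logging": ">=1.4, <2.0"},
)
```

Making the identifier immutable and comparable by value reflects the point that other packages need to refer to it from the outside.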

Robert Blumen 00:05:12 The answer to this is going to be in very broad ranges. If I’m building a product, let’s say it’s a commercial product that is running for a business somewhere, and so I’m going to build a server. If we look at the packages and their dependencies, and we go all the way down to the bottom, are we talking tens, hundreds, thousands? How many packages does a typical modern software application need?

Sam Boyer 00:05:46 So it really depends on the ecosystem. You’re going to see a lot more in the NPM ecosystem than you would in the Go ecosystem or the Rust ecosystem, for example. I’m not sure that I would be able to meaningfully construct an average number. If I had access to all the private software written everywhere in the world, then I could construct an average number, but there’s a selection bias involved with polling only the open-source ones. So, I think the best I can say is, yes, you might see thousands in JavaScript land and you might see tens to hundreds in, like, a Go or a Rust. JavaScript is really the one that is like an order of magnitude more than most.

Robert Blumen 00:06:36 This means that if I am shipping software, most of what I’m shipping is stuff written by other people. My contribution is only a tiny amount. You get into this a little bit in your article when you talk about…

Sam Boyer 00:06:51 Yes, maybe we want to talk about like what a package manager does first. That actually might be a better way to segue into this.

Robert Blumen 00:06:55 Let’s move on to our main topic, which is that we need some kind of a tool. I think that’s motivated by the fact that if there are thousands of these things, it goes beyond human comprehension to keep track of all the packages. What problem does a package manager solve?

Sam Boyer 00:07:15 So, I mean, the obvious thing that it does, right, is it downloads a bunch of stuff from the internet and puts it in places on disk so that you can run programs. So, in a sense it’s an automation tool. But I think that, as with a lot of automation, the value is not so much in keeping you from having to do rote tasks over and over again, although certainly there’s value in that. It’s in the uniformity of the way that a system is created on the other side of running a package manager command: all of the same packages at all the same versions. It’s part of a reproducible build concept. When you have a tool that grabs everything in a uniform way and lays it out in a uniform way, the goal is usually to remove unnecessary choice from the operator in terms of how all these dependencies are arranged.

Robert Blumen 00:08:10 Okay. So, it’s going to ensure that you have reasonable versions of all the packages that you need, so you can do a build. Is that a fair statement?

Sam Boyer 00:08:19 Yes, and actually the word “reasonable” is incredibly important here, because version selection is the core of all of this. Which versions of the packages do you actually want to use? And my stance is that we should think about the job of a package manager this way: at least by default, its job is to pick the correct versions. “Correct” is itself a term that we need to unpack, but it sets us in the right frame of mind: the real job of a package manager is to be a system that has an opinion about what it means for the artifact created on the other side, the composition of your code and all your dependencies’ code, to be correct. And to enforce that, and to tell you when it can’t meet that.

Robert Blumen 00:09:10 Okay. We’re going to come back to the dependency versioning issue and discuss that in depth in a bit. Could you toss out the names of some popular languages and frameworks and the package managers that go along with them?

Sam Boyer 00:09:27 Sure. There’s Node and NPM or Yarn, and then there’s Rust and Cargo, and Go, where Go modules is the way you should probably refer to it.

Robert Blumen 00:09:39 We’re talking about commonalities to something that exists in nearly all different software. Is a package manager usually issued or built by the same team that built the language itself, and it comes as a part of a toolkit? If I download Python, I would get the Python package manager. Or does there tend to be competition among package managers, even within the same language community where different developers have a different idea of how package management should be done for language A and so you would have a range of choices even within language A.

Sam Boyer 00:10:20 Yes, nowadays I think it’s pretty well accepted that you really can’t ship a language without shipping a package manager, at least not if you want any kind of community to exist. But that’s not always been the case, certainly. So there is a shift in this over time: with older languages, there’s more separation between the language tooling and the package manager; newer ones tend to bundle.

Robert Blumen 00:10:47 Will the package manager be written in the same language as the primary language?

Sam Boyer 00:10:52 Yes, almost always. The main limit here is that you absolutely have to have your package manager be executable wherever your language is executable. And as nice as it might be if we could write the mother algorithm for all package management in one language, you end up in a really awkward situation as a language designer if you’re having to juggle between your compiler, or whatever, and what platforms it can run on, and then your package manager not being aligned.

Robert Blumen 00:11:26 For example, Python itself is built in C, but I wouldn’t necessarily want to have to have a C compiler just to build a Python package manager. It should be written in Python. Okay. Now take Python. It’s a language I’m familiar with. There are many packages which are built into Python. I don’t need to manage them because when I get Python, I get a bunch of common packages. Is it a requirement that the package manager only depend on the built-ins so that we don’t have to package manage while we’re bootstrapping the package manager?

Sam Boyer 00:12:03 I think that mostly has to do with the previous question, right? Is the package manager a part of the toolchain that’s distributed with the language? If it is, well, probably yes. We’re not going to depend on anything external. If it’s not, then you have more degrees of freedom.

Robert Blumen 00:12:23 You mentioned that the package manager goes out to the interwebs and finds things. Where does it go to find the packages?

Sam Boyer 00:12:34 Yes, so, there are a few approaches. Most package managers rely on some kind of central repository. I would say there are really four models. You can have the model where there is only one central repository. You can have the model where there’s, like, a search algorithm that picks between multiple repositories that can be specified. You can try to go fully distributed: I’ve seen a couple of blockchain-related package managers that want to store the packages on the blockchain. So, your addresses are not going to a server; they’re instead going to this distributed address resolution system. (Not finding good words there.) And then you can go for something more strictly URL-shaped, where the actual names of your packages are effectively URLs. And so you’re relying on DNS to give meaning to names, at least at the highest level, instead of a central service.

Robert Blumen 00:13:32 If it’s built into the package manager that there’s one repository and everything’s there, it goes there. We could put the name of the repository in the code. If there are several, or I want it to be configurable, would there be a setup step where I add some command like “add remote repository,” followed by a URL?

Sam Boyer 00:13:54 Yes, if that’s the model that has been chosen by the creators of the package manager. The scope of this ends up being really important, right? For a system package manager, you might configure at a system level, you know, in your /etc directory, some additional repositories that get used, right? That way all users share them. One of the main issues is that, since we’re ultimately talking about name resolution here, in order to have multiple repositories, you have to have a way of distributing the names of all of the name-resolution services to any client that’s going to end up trying to resolve all those packages, because otherwise you don’t have the same uniformity across all of your users that is so important.
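To make the multiple-repository model a little more concrete, here is a minimal Python sketch of one possible search rule: consult an ordered list of repositories and take the first one that knows the name. The repository names, URLs, and the first-match rule itself are assumptions for illustration, not the behavior of any particular package manager.

```python
from typing import Optional

# Hypothetical ordered repository list; every client must share the same
# list in the same order, or the same name may resolve differently.
REPOSITORIES = [
    ("internal-mirror", {"logging": "https://mirror.example.internal/logging"}),
    ("public", {
        "logging": "https://packages.example.org/logging",
        "kit": "https://packages.example.org/kit",
    }),
]


def resolve(name: str) -> Optional[tuple[str, str]]:
    """Return (repository, location) from the first repository that knows the name."""
    for repo_name, index in REPOSITORIES:
        if name in index:
            return repo_name, index[name]
    return None


print(resolve("logging"))
print(resolve("kit"))
```

Here "logging" resolves to the internal mirror because it is listed first, which is exactly why a client with a different list would get a different answer for the same name.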

Robert Blumen 00:14:39 Within an enterprise, there could be security reasons or availability reasons why you might not want to pull everything from a remote repository, or perhaps the amount of bandwidth going over the network. Suppose we want either to run our own repository or have caches or mirrors on an internal network. Is that a common use case? And do all of these package managers support that?

Sam Boyer 00:15:07 There’s a range, but it’s certainly common. There’s any number of reasons, whether it is performance or security; I guess those are the two major broad categories of reasons that jump to mind right away. Actually, I suppose it would be performance, security, and then also correctness, in the sense of ensuring uniform name resolution. Those would be the major reasons that people want some kind of intermediary, to not just have things touching the interwebs. The manner in which these are supported varies. I’m not aware of any general framework for being able to talk about it, mostly because the way that you would inject an intermediary is largely a function of how your name resolution works in the first place. And so, it’s dependent upon something that itself varies across package managers.

Robert Blumen 00:15:52 How does the package get pushed up into the repository?

Sam Boyer 00:15:57 So, it’s very clear that package managers are responsible for the read step. They have varying degrees of responsibility for the write path. It’s more common, when you have a repository that you maintain yourself that is specific to your ecosystem of packages, that the tool will bundle in the write path. But this also gets into the nature of what your package actually contains, right? I mean, you’re making a release, so you generally want to hook this kind of thing into a CI workflow, something that gives you assurances that the thing you’re about to release is correct with respect to whatever rules you’ve devised for your package. Then, you have to figure out how to map the… I think what I’m trying to say is, with respect to the tooling, the job of the people writing the package manager is probably to be as non-invasive as possible into people’s workflows. Because people are going to put a lot of different things together when it comes to being sure that they’re actually ready to make a release. And trying to bake too much of that into what you would write in a standard tool for package management ends up hindering, because you’re cutting off certain workflows.

Robert Blumen 00:17:17 So, if I have a compiled language, maybe my source code is in GitHub. I’ve decided this code has passed all the validations, and we’re ready to release 2.4. There would be some workflow I would go through, which might involve compiling and creating libs or jars or whatever the archival format is, and then pushing those up into an account that I have for our project on the repository server. Is that what the workflow looks like?

Sam Boyer 00:17:49 Yes, more or less, yes. Especially nowadays, I would expect that this kind of thing is something you would create an automation for, which is very important. I mean, again, the goal with packages in general is this uniformity of the package objects, and removing as many unnecessary sources of user error as possible in the process of both creating and consuming these things is essential to reliability. So yes, automate that, but in general, yes, what you described is the way one should imagine using it.

Robert Blumen 00:18:23 Do the package managers support an API that looks like, “give me all the versions you know about of this package by name”?

Sam Boyer 00:18:34 Usually, yes. It’s almost impossible to write a version selection algorithm without having knowledge of the universe of versions.
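A query of the form “give me all the versions you know about of this package by name” might look roughly like the sketch below. The InMemoryRegistry class and its list_versions method are hypothetical stand-ins, not the API of any real repository, and versions are assumed to be purely numeric dotted strings.

```python
class InMemoryRegistry:
    """Hypothetical stand-in for a package repository's version index."""

    def __init__(self, index: dict[str, list[str]]):
        self._index = index

    def list_versions(self, name: str) -> list[str]:
        """Return every known version of a package, oldest first."""
        versions = self._index.get(name, [])
        # Compare numerically so that 1.10.0 sorts after 1.5.0.
        return sorted(versions, key=lambda v: tuple(int(p) for p in v.split(".")))


registry = InMemoryRegistry({"logging": ["1.10.0", "1.4.0", "1.5.0"]})
print(registry.list_versions("logging"))  # ['1.4.0', '1.5.0', '1.10.0']
```

A version selection algorithm would call something like this once per package name to learn the universe of versions it is allowed to choose from.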

Robert Blumen 00:18:41 Are there validations, like code signing to ensure that what you get is the trusted package that you asked for?

Sam Boyer 00:18:51 Yes. The tricky thing here is, since we’re down at the level of giving meaning to names, you can easily get into canonicality issues when it comes to verifying the original upstream source. You know, if I have a repository in some configuration file that I have added, then that thing needs to be the thing that I check against for the validity of the package name that I’m asking it to resolve. But if someone can attack the config file that I have about which repository I talk to, then how do I know that my verification is correct? The whole field of software supply chain security is invested in trying to formalize a lot of the answers to this, so that we at least don’t have a lot of unforced errors in this area. But yes, certainly that can be a part of the process and can be considered an aspect of what constitutes the correctness of a package.

Robert Blumen 00:19:48 That is a great topic. I would like for us to do an entire hour or, I mean, Software Engineering Radio to do an entire hour on supply chain security at some point. To continue with our discussion, what does the developer workflow look like when using a package manager? And if there isn’t a general case answer, then pick one language that you can talk about. What are the commands a developer runs?

Sam Boyer 00:20:17 I mean, there is a pretty general set of commands, right? They’re essentially CRUD-ish. It’s: I want to add a package; I want to change the version of a package; I want to remove a package.

Robert Blumen 00:20:30 Add a package…

Sam Boyer 00:20:31 Here we’re talking like this is the developer working on a project, talking about the dependencies for the project that they’re in. Correct?

Robert Blumen 00:20:39 So, yes, let’s drill down into that. So, I’ve decided I need to do some feature in my program and there’s a package for it. I know the name, I know the package I want. So, I’m going to add a package. Do I have a config file that tells the package manager, here are all the packages I want? I go edit that config file and then I run a command like sync or add?

Sam Boyer 00:21:06 It can be. It varies. In general, the two interfaces to package managers are a command line interface and then a file, because you certainly have to record state in order to be able to do anything with the package manager; you have to record dependencies. It’s typically far more convenient to offer people a command line interface to do most of these things: it’s more scriptable, et cetera. The other thing that you can do then, which is especially important for automation purposes, is have the tool take its action and then update the written file. This is usually the most preferable flow.

Robert Blumen 00:21:44 I’m not sure I follow what you mean by that. Do you mean if I add a package to the code that the package manager might detect this package is missing and I need to add it to the package list?

Sam Boyer 00:21:55 Yes. If I say, you know, add dependency Foo, so in my terminal I do “package add Foo,” then when the command completes, if the command completes successfully, I should be able to expect that that package is present and available on my file system, or whatever it means for the package to be available, such that when I go to execute the program, the linker, the loader, whatever it is, is able to get at the contents of the package. So, I can expect that that requirement is satisfied. And also that a record that the requirement of the package exists is written into this thing which we usually call a manifest file.
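The two guarantees described here (the package is available on disk, and the requirement is recorded in the manifest) can be sketched as a single hypothetical “add” function. The vendor directory layout, the manifest.json file name, and the fake download step are all assumptions made for this illustration.

```python
import json
import tempfile
from pathlib import Path


def add_package(project: Path, name: str, version: str) -> None:
    """Hypothetical 'add': make the package available, then record it."""
    # Step 1: stand-in for downloading; a real tool would fetch and unpack here.
    install_dir = project / "vendor" / name
    install_dir.mkdir(parents=True, exist_ok=True)
    (install_dir / "VERSION").write_text(version)

    # Step 2: only after the install succeeds, update the manifest file.
    manifest_path = project / "manifest.json"
    manifest = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    manifest.setdefault("dependencies", {})[name] = version
    manifest_path.write_text(json.dumps(manifest, indent=2, sort_keys=True))


project = Path(tempfile.mkdtemp())
add_package(project, "foo", "1.2.0")
print((project / "manifest.json").read_text())
```

Writing the manifest last means a failed install leaves the recorded state untouched, which is one simple way to keep the file and the disk in agreement.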

Robert Blumen 00:22:40 I realize I don’t like this package, I’m going to remove it from the code. What does the remove workflow look like?

Sam Boyer 00:22:48 So, the main thing that’s really tricky here, right, is that usually the question is what the relationship is between the metadata file and the actual code base. Like, if you have import statements that continue to reference a package which you are trying to remove, I’m not really sure why you’re trying to do that. There are reasons to do it, but there isn’t a general case for it. So, what we’re really talking about is a synchronization relationship between the dependencies as expressed in the code and the dependencies as expressed in the package-level metadata. Usually what this entails is that the developer needs to go in and excise any references to the package from their code. And the final step is just to remove it from the metadata file.
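The synchronization relationship between imports in the code and entries in the metadata can be checked mechanically. The sketch below does this for Python source using the standard ast module. It makes a simplifying assumption that does not hold in general: that an imported module’s top-level name is the same as the package name in the manifest, and that no standard-library modules are involved.

```python
import ast


def imported_names(source: str) -> set[str]:
    """Collect top-level module names imported by a piece of Python source."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names.add(node.module.split(".")[0])
    return names


def out_of_sync(source: str, manifest: set[str]) -> tuple[set[str], set[str]]:
    """Return (imported but not declared, declared but not imported)."""
    used = imported_names(source)
    return used - manifest, manifest - used


code = "import foo\nfrom bar.utils import helper\n"
missing, unused = out_of_sync(code, {"foo", "baz"})
print(missing, unused)  # {'bar'} {'baz'}
```

In the removal workflow, a package is safe to drop from the manifest once it shows up in the second set, meaning nothing in the code refers to it any longer.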

Robert Blumen 00:23:37 Yes. Okay. Now I think it would be a good time to start going into dependencies. You talked about this a little earlier, where I like this package. It does something useful. I’m going to add it to my package manager. This package, because it’s software, relies on other packages, which are its dependencies. So, you have a recursive process. So, before I ask you some more questions, is there anything you’d like to say generally about dependencies, and then we’ll get into some more detailed discussion?

Sam Boyer 00:24:15 Sure. I mean, the graph of connections between software packages and the packages that they depend on, and certain key rules about this graph, are some of the most important fundamental properties of any package ecosystem. And I’m sure the next questions you have will drive us right at those.

Robert Blumen 00:24:36 Clearly, we don’t want circular dependencies, but a cycle is a property of the entire graph. How does the developer of a package ensure that they’re not depending on something that is in the wrong layer and would potentially create a cycle?

Sam Boyer 00:24:54 Generally, you can’t. I mean, because as you noted, it’s a property of the graph, right? I’m not sure that there’s a general answer to cycles. Most package managers do allow cycles of some kind, primarily because it’s mostly not up to an individual developer to resolve that. It’s literally kind of the definition of a cycle, right? It’s no one person’s fault if there’s a cycle, because the arrows go in a circle. Whether or not you can construct a sane logical system on the other side of there being dependency cycles tends to be the challenge, and the reason why you wouldn’t allow cyclic dependencies at the package level.
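Because a cycle is a property of the whole graph, detecting one requires the whole graph rather than any single package’s metadata. A minimal sketch, assuming the dependency graph is already available as a hypothetical dictionary from package name to its direct dependencies, is a depth-first search that reports the first cycle it finds:

```python
from typing import Optional


def find_cycle(graph: dict[str, list[str]]) -> Optional[list[str]]:
    """Return one dependency cycle as a list of names, or None if there is none."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    state = {node: WHITE for node in graph}
    path: list[str] = []

    def visit(node: str) -> Optional[list[str]]:
        state[node] = GREY
        path.append(node)
        for dep in graph.get(node, []):
            if state.get(dep, WHITE) == GREY:
                # dep is already on the current path, so we have come full circle.
                return path[path.index(dep):] + [dep]
            if state.get(dep, WHITE) == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        state[node] = BLACK
        return None

    for node in graph:
        if state[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None


print(find_cycle({"a": ["b"], "b": ["c"], "c": ["a"]}))  # ['a', 'b', 'c', 'a']
print(find_cycle({"a": ["b", "c"], "b": ["c"], "c": []}))  # None
```

In the first graph no single package did anything unusual on its own; the cycle only appears once all three edges are considered together.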

Robert Blumen 00:25:37 Okay. Now, we’ve established, if I want a package, I have to specify the version I want. The package itself is going to specify versions or maybe a range of versions of its dependencies. How do versions and dependencies interact?

Sam Boyer 00:25:56 So, yes, this is one of the most important properties for any ecosystem. First I’ll talk about what Yehuda Katz likes to call the Highlander rule, and then talk about version ranges. So, given that you have A and B, and they both depend on C, every package ecosystem, or rather the thing consuming the package ecosystem, needs to decide whether there are two instances of C or one: one for A and one for B, or one that they both have to share. And the semantics of that are complicated. They have to do with what actual kind of artifact you’re trying to create on the other side of putting A, B and C together. It tends to be that compiled languages, or statically typed languages, have a lot of difficulty allowing that duplication. It’s necessary to come to a compromise on the version so that there can be exactly one instance of C, so that names have proper meanings, so that you can let types flow through the program and not have your type checker blow up. Dynamic languages can get away without it. And sure enough, NPM until recently, at least until NPM 7, basically namespaced everything. And most other languages don’t. They follow the rule that there can be only one. This is the source of the Highlander reference: there can be only one C.

Robert Blumen 00:27:28 Okay, maybe an example. If I’m bringing in different packages and all of those packages use logging, it might happen that package A says, I want 1.4 of the logging package, and package B says, no, I want 1.5 of logging. You’re saying there are two solutions. One is: great, A uses 1.4 and B uses 1.5, and we have two versions of logging. Now that might not work, because it could depend on the language runtime that logging is some kind of a unique name in the language. You can’t namespace it. So, now you have to agree. Or, just because of the nature of what logging is, maybe the log aggregator you use, you can only have one of them. Those are the different options you’re talking about. Correct?

Sam Boyer 00:28:20 Yes. So, okay, let’s imagine that A depends on B and A depends on C, and B and C both depend on logging. The question is, from a type checking perspective, is it possible that any of the logging types flow back through B and C, back up into A, and need to be the same type, because they actually rejoin in A and the type checker needs to prove that all these types are valid? If you’re in a language where you do have a type checker, and we have to be sure of that in order to compile, then it’s generally not feasible to have multiple instances of the shared dependency, because they have different names. They’ve been namespaced somehow, and the type checker is going to blow up. In a dynamic language you can get by without it. The other thing is what you’re talking about, which is usually the global state or external resources question, but it’s effectively the same thing. The conventional thing that gets talked about here is more like a database connection. If the shared dependency has a database connection embedded in it somehow, you don’t necessarily want to have two copies of it, because you don’t want two copies of a connection to your database.
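The type identity problem with duplicated dependencies can be demonstrated even in a dynamic language. The Python sketch below simulates two namespaced copies of a tiny, hypothetical logging package built from identical source, and shows that an object created from one copy is not an instance of the other copy’s class. The load_copy helper and the namespace names are invented for this illustration.

```python
import types


def load_copy(namespace: str) -> types.ModuleType:
    """Simulate installing a separate, namespaced copy of a 'logging' package."""
    module = types.ModuleType(f"{namespace}.logging")
    exec("class Logger:\n    pass\n", module.__dict__)
    return module


logging_for_b = load_copy("b_deps")
logging_for_c = load_copy("c_deps")

logger_from_b = logging_for_b.Logger()

# Same source text and same class name, but two distinct types.
print(isinstance(logger_from_b, logging_for_b.Logger))  # True
print(isinstance(logger_from_b, logging_for_c.Logger))  # False
```

A dynamic language only notices the mismatch if code performs a check like this at run time, whereas a static type checker has to reject the program as soon as a Logger from one copy flows into code expecting the other.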

Robert Blumen 00:29:37 In some languages there’s a concept of a global variable. The language would not allow you to have two global variables that are different but have the same name. So, in that case, you have to agree on, say, an object called logging, and it’s a global variable, so there can only be one. Okay, now what am I going to see if I run a package add or sync command? If it detects an issue of this type, what am I going to see in the output?

Sam Boyer 00:30:13 So, this is where it gets very interesting. The first case you want to talk about is a language which does not follow the Highlander rule, right? You can duplicate logging. I mean, in that case, we don’t have a problem, right? What you’re going to see is success. Great. You know, we picked this version of logging for A, and this version of logging for B. In a language where you do have to solve the problem, there are a few different options, and they really depend on the way that the version selection algorithm for that package manager works. So, there are some package managers which use a version selection algorithm that essentially can’t fail, perhaps even when it should. There are advantages to this approach: mostly, algorithmically it’s faster.

Sam Boyer 00:31:01 And the behavior tends to be more predictable. So even though you can plausibly get an outcome which won’t actually compile, you at least don’t have odd failures, because the problem is the failures get really complicated too. So, if you have a version selection algorithm which sees the problem and tries to do something about it, we’re now effectively talking about a search problem. Can we find versions of B and C which can agree on a version of logging? And then we get into the question of what agreement means. The output that you may see in that case will vary widely. It varies depending on whether or not it was possible to find a version of logging that B and C agree on. Because if it was, then the package manager probably won’t tell you there’s a problem. It’ll just say, I picked this version because it appears to satisfy constraints. If it did fail to find a version, the output can be potentially very complicated, because in order to exhaust the possibility space of all the different versions of B and C, it may have a whole lot to tell you about all the things that didn’t work. So, there’s a range of output you may see.
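The search problem described here (can we find versions of B and C that agree on a version of logging?) can be sketched as a small backtracking search. The universe of versions below is invented for the example, and real solvers are considerably more sophisticated. String comparison is used to order versions only because every version in this toy data has single-digit components.

```python
from typing import Optional

# Hypothetical universe: each version of B and C lists the logging versions it accepts.
UNIVERSE = {
    "B": {"2.0": {"1.5"}, "1.0": {"1.4", "1.5"}},
    "C": {"3.0": {"1.6"}, "2.0": {"1.4"}},
}


def solve(packages: list[str], allowed: Optional[set[str]] = None,
          chosen: Optional[dict[str, str]] = None) -> Optional[dict[str, str]]:
    """Depth-first search for one version per package that shares a logging version."""
    chosen = chosen or {}
    if not packages:
        return {**chosen, "logging": max(allowed)} if allowed else None
    head, rest = packages[0], packages[1:]
    # Try newer versions first; fall back to older ones on conflict.
    for version in sorted(UNIVERSE[head], reverse=True):
        accepts = UNIVERSE[head][version]
        narrowed = accepts if allowed is None else allowed & accepts
        if narrowed:
            result = solve(rest, narrowed, {**chosen, head: version})
            if result:
                return result
    return None  # every combination was exhausted


print(solve(["B", "C"]))  # {'B': '1.0', 'C': '2.0', 'logging': '1.4'}
```

In this toy universe the newest B and the newest C cannot agree, so the search backs off to older versions of both before it succeeds. A failure report would have to explain every one of those rejected combinations, which is why the output can get long.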

Robert Blumen 00:32:19 In the example we gave a few minutes ago, we were talking about package B wanting logging at 1.4 and package C wanting logging at 1.5. You could imagine package B says, I am comfortable with 1.4 or greater up to 2.0, and package C says, I want 1.7 or greater up to 3.0. There is clearly an intersection of things they could both agree on, but there could be more than one right answer. How does the package manager choose? You could have a range of acceptable versions. How does it choose?

Sam Boyer 00:32:58 So, that question effectively reduces to: how does a SAT solver work? You have a range of possible versions. And in general, the simplest way to think about it is that the latest version acceptable to all parties is the one that you should get. But that is actually a deeply insufficient answer in itself, because just because you’re running version selection doesn’t actually mean that you want to update. Like, if B requires logging 1.5 and C requires logging 1.4, and you’re A, and you’re pulling in B and C both for the first time, well, if there’s a version of logging that is 1.6, does that mean you should get that version of logging that’s 1.6? Because that’s not the version that either B or C asked for in the first place.
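
The "latest version acceptable to all parties" rule from Robert’s example can be sketched as follows. This is a simplified illustration under stated assumptions: constraints are half-open ranges `[low, high)`, versions are dotted integers, and the list of available logging versions is made up.

```python
def version_key(v):
    """Turn "1.10" into (1, 10) so versions compare numerically."""
    return tuple(int(p) for p in v.split("."))

def satisfies(version, constraint):
    """True if version lies in the half-open range [low, high)."""
    low, high = constraint
    return version_key(low) <= version_key(version) < version_key(high)

def pick_latest(available, constraints):
    """Return the newest version that every constraint accepts, or None."""
    candidates = [v for v in available if all(satisfies(v, c) for c in constraints)]
    return max(candidates, key=version_key) if candidates else None

# Hypothetical published versions of the logging package.
available = ["1.4", "1.5", "1.7", "1.9", "2.0", "2.5"]
b_wants = ("1.4", "2.0")  # B: 1.4 or greater, up to 2.0
c_wants = ("1.7", "3.0")  # C: 1.7 or greater, up to 3.0

chosen = pick_latest(available, [b_wants, c_wants])
print(chosen)  # 1.9
```

Both 1.7 and 1.9 satisfy B and C here. Picking 1.9 is a policy choice rather than a necessity, which is the point Sam makes about this answer being insufficient on its own.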

Sam Boyer 00:33:49 There’s a range of considerations here that are not easy to resolve in a straightforward way. The thing that I think is important to point out is that it’s easy to look at these numbers and almost be like, oh, this is fun, I get to write an algorithm that solves constraints. But the reality is that those constraints that are written down are almost certainly wrong. If your version constraint system is asking the user to make a precise statement about which versions of its dependencies the software actually works with, then for a range of reasons, that is almost never going to actually be correct. So, this is why some people talk about dependency hell, right?

Sam Boyer 00:34:40 The problem where you’re trying to essentially negotiate between B and C over logging, right? You’re A, and B and C depend on logging, and they disagree about which version works. I think it’s actually more interesting to talk about dependency limbo and dependency hell. Because I think we spend a lot of time in dependency limbo, where we’re confused about the output from our package manager, and we are unsure as to whether or not there exists a version of logging that will satisfy B and C. And there are two branches of that question. First, is there a version of logging that will satisfy B and C according to what they’ve written down? And second, is there a version of logging that will satisfy B and C according to reality? That is, will my code actually compile and be correct, whether or not the things that B and C wrote down about logging are satisfied? So, dependency hell, I would say, is when there’s a real problem and it won’t compile or your code is incorrect, and dependency limbo is when you’re just not sure.

Robert Blumen 00:35:36 Let’s delve a bit more into dependency hell. Would this be where I run the thing and it says, you have two conflicting versions of logging: one of your packages needs 1.4 or less, another needs 1.5 or greater. So, I say, okay, I’m going to upgrade some things to get on a higher version of logging, and that should resolve that. And then it breaks something else.

Sam Boyer 00:36:03 Sure. I think there are many circles in dependency hell. And the problem is that ultimately we’re at this 10,000-foot view of the actual behavior of our software, and we’re trying to make really coarse-grained choices to arrange things in our dependency tree, just in order to get things to work. Yes, certainly: in order to deal with B and C’s issue with logging, I update other things, and then things break over there. It ends up being a problem where you can suck in the entirety of your application just trying to fix one problem. And it’s difficult to tell which way is even up.

Robert Blumen 00:36:39 I could imagine this happening. We’re talking about logging, which is something pretty concrete. I usually know which logging system I want to use. But it could also be that I’m pulling in some very high-level packages to do something like machine learning, and seven levels deep, they give me the name of something that’s libgrpnf0. I don’t even know what that does. I never asked for it. I don’t want it. I don’t know what the risks are if I say, just go ahead and use this anyway. And I wouldn’t really know how to resolve it. What do I do, Sam?

Sam Boyer 00:37:20 The dependency hell is real. I mean, this is why, again, I kind of like the limbo framing more, because what is hellish about the experience so often is our uncertainty about what is okay to do. You’re often solving problems that arise in some deeply nested set of packages. Why did they make these decisions? Are these constraints real? Can I violate them and still have things be correct? How can I be sure that things are correct on the other side of violating them? How can I be sure they’re correct, even if I haven’t violated them? It’s this series of questions. Usually these systems work well enough that we all get by with them, but there’s this cliff, and much of it involves just the gray uncertainty about how to fix problems when they arise and whether problems exist in the first place.

Robert Blumen 00:38:08 Is there typically a command or setting where I can address what you mean by dependency limbo? I can make the call that this package says it needs 1.4 or less, but I am pretty sure it’s going to work with 1.5; the developer did not upgrade, or is a bit too strict. So, the package manager says you have these two conflicting versions, 1.4 and 1.5. I can say, go with 1.5, I’ll take the consequences.

Sam Boyer 00:38:40 Usually, yes. The general term that’s used to refer to these is some kind of override, right? Like, I, from the root of the package graph, want to supersede something further down. I have to fix a problem, so I’m going to supersede something that one of my dependencies has set. The classic problem with overrides, though, is that it’s really only a power that can be given to the root package in the graph. The reason that we have these issues in the first place is because B issues a constraint on logging and C issues a constraint on logging, and we error out if we can’t find anything where their constraints can be made to agree, right? The point of an override is, well, this has higher priority than everything else.

Sam Boyer 00:39:28 But as soon as you’re down in the depths of, like, B had an override and C had an override, it wouldn’t matter, because now we can’t pick which one supersedes. So, what this creates is, if you end up having to rely on an override as the author of a package, something I refer to as contagion failure. Because if you have to write down an override, then anybody who depends on you has to copy that override. They have to hoist it into their own metadata in order to get the graph to resolve in the same way. And it’s usually invisible to whoever’s depending on you that they need to do so. So, it’s just this spread of overrides that goes through the package graph, and the deeper it gets, the worse it gets.
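
The contagion effect can be shown with a toy model. Everything here is hypothetical: the `resolve_logging` function, the manifest shape, and the constraint sets are invented for illustration, with the single assumption (taken from the discussion) that only the root package’s overrides are consulted.

```python
def version_key(v):
    """Turn "1.10" into (1, 10) so versions compare numerically."""
    return tuple(int(p) for p in v.split("."))

# Hypothetical constraints: B and C accept disjoint sets of logging versions.
CONSTRAINTS = {"B": {"1.3", "1.4"}, "C": {"1.5", "1.6"}}
AVAILABLE = ["1.3", "1.4", "1.5", "1.6"]

def resolve_logging(root_overrides):
    """Pick a logging version; only the root package's overrides are honored."""
    if "logging" in root_overrides:
        return root_overrides["logging"]  # the override outranks every constraint
    agreed = set(AVAILABLE).intersection(*CONSTRAINTS.values())
    return max(agreed, key=version_key) if agreed else None

# A depends on B and C and carries an override to make the graph resolve.
a_manifest = {"overrides": {"logging": "1.5"}}
# Z depends on A. When Z is the root, A's override is no longer consulted.
z_manifest = {"overrides": {}}

as_root_a = resolve_logging(a_manifest["overrides"])
as_root_z = resolve_logging(z_manifest["overrides"])

# Z has to hoist A's override into its own metadata to get the same result.
z_manifest["overrides"].update(a_manifest["overrides"])
after_hoist = resolve_logging(z_manifest["overrides"])

print(as_root_a, as_root_z, after_hoist)  # 1.5 None 1.5
```

Nothing in Z’s failed resolution says that copying A’s override is the fix, which matches the point that the need to hoist is usually invisible to the dependent.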

Robert Blumen 00:40:06 It sounds very fragile. If then you upgrade something, it probably breaks all the overrides you did.

Sam Boyer 00:40:13 Possibly, yes. Depending on how overrides are phrased, it certainly can. Like, if you were effectively saying, use version 1.7 instead of version 1.8 of logging, but then the version of logging that was being used changed, now you’re not replacing 1.8 anymore, because 1.8 wasn’t the version that was there in the first place. And you have override statements that are affected by that. How do you even know that that happened? So, yes, the problem metastasizes.

Robert Blumen 00:40:43 I think now would be a good time to delve a bit into the internals of a package manager. We’ve talked about some of the components: there’s the project code that has import statements, and some kind of a manifest file. There are at least two more components. The next one would be a lock file. Please talk about that.

Sam Boyer 00:41:05 Sure. So, a lock file is really the result of a version selection computation that you went through, and the idea is that I want to record the exact set of dependencies at a given moment in time, transitively. So, you have the entire graph written out in a file. These are all the dependencies. These are the versions. It is not strictly necessary, depending on the package manager, to have a lock file, because it depends on the algorithm. If you have a stable algorithm, then you don’t necessarily need to have a lock file. This is why Go modules doesn’t have a lock file: because the algorithm is stable. Stable is an imprecise word, but because the algorithm never picks the latest version, it means that given a set of specifically versioned inputs, you can walk all the way down through the graph and know that you’ll always get the same set of versions, up to certain details. But version selection algorithms which do default to latest need to have a lock file, because otherwise there’s no way to have reproducible builds at all. Ultimately, the goal of the lock file is this reproducible builds property.
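
The kind of stable algorithm mentioned here can be sketched as follows. This is a heavily simplified model in the spirit of Go’s minimal version selection, not the Go toolchain’s implementation: the `REQUIREMENTS` table and module names are hypothetical, and each requirement is read as "at least this version".

```python
def version_key(v):
    """Turn "1.10" into (1, 10) so versions compare numerically."""
    return tuple(int(p) for p in v.split("."))

# Hypothetical requirement graph: (module, version) -> list of
# (dependency, minimum version) pairs declared by that exact version.
REQUIREMENTS = {
    ("A", "1.0"): [("B", "1.2"), ("C", "1.0")],
    ("B", "1.2"): [("logging", "1.4")],
    ("C", "1.0"): [("logging", "1.5")],
    ("logging", "1.4"): [],
    ("logging", "1.5"): [],
    ("logging", "1.9"): [],  # published, but nobody asks for it
}

def minimal_version_selection(root):
    """Walk every reachable requirement; keep the highest minimum per module."""
    selected = {}
    seen = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        name, version = node
        if name not in selected or version_key(version) > version_key(selected[name]):
            selected[name] = version
        stack.extend(REQUIREMENTS.get(node, []))
    return selected

build_list = minimal_version_selection(("A", "1.0"))
print(build_list)
```

Because the result depends only on versions that some manifest in the graph names explicitly, publishing logging 1.9 does not change the outcome. That is the sense in which the manifests alone are enough to reproduce the selection.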

Robert Blumen 00:42:15 If you had something which could pick latest, then latest could change because the package developer has pushed an update. And so, okay, you don’t have a reproducible build now because it depends on something external?

Sam Boyer 00:42:29 Right. People often refer to this, sort of ambiguously actually, as version pinning: I want to use exactly one version. The ambiguity tends to be whether it’s pinned because it’s the version that you have written down in the lock file, or pinned because it’s the version you’ve written in the manifest file. The meaning of having it in the lock file tends to be, this is what I’m using right now, and we want it to be reproducible, but it could change. Whereas a manifest file says, I will use this version and only this exact version. It’s a constraint statement, right?

Robert Blumen 00:43:03 Depending on the language, if I have the lock file, then would the build system typically support the ability to do a build where you say, here’s the lock file, don’t run the package dependency algorithm again? We’re starting in midstream because we’ve already run it.

Sam Boyer 00:43:23 That’s a good way to think about the lock file. Like I said, it’s the result of a version selection computation. So, the presence of the lock file is usually used as an indication to use the versions that are here. We don’t need to rerun version selection. We’ve done it, and this is the version that we committed to. So, continue forward in your build system. I do think there are interesting possible universes where we could do a lot more version selection further down in our pipelines, in our CI pipelines or delivery pipelines, but it requires some new algorithms and some new thinking about how we encode this information.
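
The "lock file present means skip version selection" behavior can be sketched like this. It is a minimal illustration under assumptions: the file name `example.lock`, the JSON format, and the `install` and `fake_resolver` functions are all invented here and do not correspond to any particular tool.

```python
import json
import os
import tempfile

def install(project_dir, resolver):
    """Return (versions, ran_selection), reusing the lock file when it exists."""
    lock_path = os.path.join(project_dir, "example.lock")  # hypothetical file name
    if os.path.exists(lock_path):
        # The lock file is the recorded result of an earlier selection run.
        with open(lock_path) as f:
            return json.load(f), False
    versions = resolver()
    with open(lock_path, "w") as f:
        json.dump(versions, f, indent=2, sort_keys=True)
    return versions, True

calls = []

def fake_resolver():
    """Stand-in for version selection; "latest" moves after the first call."""
    calls.append(1)
    return {"logging": "1.5" if len(calls) == 1 else "1.6"}

with tempfile.TemporaryDirectory() as project:
    first, ran_first = install(project, fake_resolver)
    second, ran_second = install(project, fake_resolver)

print(first, ran_first)
print(second, ran_second)
```

The second install returns the same versions as the first even though the stand-in resolver would now answer differently, which is the reproducibility property the lock file exists to provide.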

Robert Blumen 00:44:05 What would be the advantage of doing that?

Sam Boyer 00:44:07 So, you can imagine the context of perhaps A/B testing. If you had the ability to change the set of dependencies being used for a given piece of software, perhaps not on the fly, but somewhere later in your build pipeline, then instead of having multiple build pipelines where you expressly commit a particular version to be tested, and now you’re back in git land, trying to think about this maze of branches that you’ve created, you could express a combinatorial set of build variations that you might want to test. Those would be a logical consequence of a matrix of versions or a matrix of constraints that you pass to the algorithm, and you would have some knowledge of which versions will actually be tested. That’s one thing. Anyway, this is a relatively distant concept, so I’m not sure how it would end up being used.
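
One way to picture the combinatorial set of build variations is as a plain cross product of candidate versions. This is only a guess at the shape of the idea, which Sam describes as distant: the `CANDIDATES` table, the `netlib` package, and the `build_matrix` function are hypothetical.

```python
from itertools import product

# Hypothetical candidate versions a pipeline might want to test against.
CANDIDATES = {
    "logging": ["1.4", "1.5"],
    "netlib": ["2.0", "2.1", "2.2"],
}

def build_matrix(candidates):
    """Expand per-package candidate lists into every combination to build."""
    names = sorted(candidates)
    return [
        dict(zip(names, combo))
        for combo in product(*(candidates[name] for name in names))
    ]

matrix = build_matrix(CANDIDATES)
for variation in matrix:
    print(variation)
```

Each entry could then be handed to a build step as its dependency set, instead of each variation living on its own branch.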

Robert Blumen 00:45:08 Could there be an example of this where there’s some library that’s very low level and has tremendous throughput, like networking, and I want to see, if I bump a version, whether I have a performance regression or a performance improvement? I could do that by just changing the version in the manifest or the lock file, putting it on a branch, and building that branch. But you’re saying...

Sam Boyer 00:45:36 Yes, I think so. You can imagine designing checks which want to continuously check against some range of versions of some library for some particular property, like the throughput of some networking library. The advantage of this is hard to talk about. This is something that I’m working on, but it’s hard to talk about because there are a few concepts that you need to have before you can get to the value of it, and explaining the concepts that get you there is kind of out of scope. The simple way of saying it is that you need a language that you can use to express which versions are acceptable to select, which is then usable on the other side to talk about which versions have been selected relative to what is committed to in a lock file. And that can then be used to parameterize your concept of what your build actually looks like. I think that’s probably the most comprehensible way to say it, and I don’t think it’s very comprehensible.

Robert Blumen 00:46:45 Last component would be the dependencies themselves. Where does the dependency manager put them after it grabs them off the internet?

Sam Boyer 00:46:56 Yes. So, again, of course, this varies significantly from package manager to package manager. Given that packages are tied to a particular ecosystem, and there is an intent of using packages in a particular way, whether it’s an interpreter or a compiler or an intermediate stage to something larger, there is some program that is going to tie together all these dependencies, and it has to know how to locate them. So, this is really a how-do-I-locate-things problem. The OG package managers, when we think about system package managers, are actually C package managers, because we didn’t really make the distinction between all of the code in our lib directory, which was going to be used by C, and pkg-config, which would actually go and locate packages for you on disk. For most languages, though, this is a more self-contained problem.

Sam Boyer 00:48:03 And the way that packages get arranged is mostly a question of lookup convenience for the language. The main thing to note is that what gets weird about package lookups is that this is where you tend to want to do something like aliasing or some sort of override method. And some of it is important, but if you give too much flexibility to the user to be able to just rename packages all over the place, people will consistently use it for things that end up getting their program into an incomprehensible state, because we’re effectively overriding names upon overriding names, or messing with where things are on the file system. In general, where the packages live should be the most boring question that you can possibly ask of a package manager. It should be rote and simple and dumb, because doing anything else makes things really confusing.

Robert Blumen 00:49:01 I was going to ask you this question about the lock file, which I forgot. So, I’m going to combine it with the question about dependencies: should the lock file be committed to your SCM? And what about the dependencies when they’re downloaded? Should those be committed?

Sam Boyer 00:49:17 I’m agnostic. There are reasons to do it. It’s easy to point to things like left-pad, which is, what, a decade old now? I don’t even know how old it is. You know, cases where people have pulled packages from public repositories, and you’re trying to guard against that. Sure. But most of the strategies for trying to defend yourself against the world changing tend to have a limited shelf life. I’m not saying that they’re wrong, and everybody has a different way of thinking about the risk profile here. In general, I tend not to commit my dependencies, but I view it as mostly a risk assessment argument that reasonable people can disagree on.

Robert Blumen 00:50:03 With languages like Go or the C languages, the end of the build system produces a binary, which may incorporate all of the libraries to the point where you don’t need them. I understand there are some exceptions, such as dynamically loaded libraries. Compare that to Ruby or Python, where you need the packages because it’s interpreting them at runtime. That would be a stronger case for why you should possibly check them in, because you can’t deploy without them.

Sam Boyer 00:50:37 I mean, you know, the question there is whether you want failure to happen at deployment or like failure to happen with your developers, failure is still happening. I’m not sure how meaningfully different those things are. And that may depend on your context.

Robert Blumen 00:50:53 So, we have a bit of time left. You had a discussion in your article, and if you can condense this in the remaining time, of the steps that happen. We’ve talked about the components. Broadly, what does the thing do? What are the steps that happen when the package manager runs, referencing the manifest, lock file, dependencies, and other state?

Sam Boyer 00:51:19 The steps. So, when it runs, are we referring to a get, or add, remove, or change? Which one?

Robert Blumen 00:51:28 Let’s start with, let’s do add, that’s probably what we have time for.

Sam Boyer 00:51:32 Sure. Yes, so, mostly what’s happening here is we’re trying to reconcile a few different states, right? We are, depending on the language, possibly trying to scan over source code. You can do this in Go: scan over the source code to look at import statements and try to ensure that all of those are satisfied. That’s something you may be doing, and you may be doing that independent of which command you’re running. But with an add command, the basic validation that you have to do initially is, well, is there actually anything to add here? If you try to add foo and foo is already in your manifest, then what are we doing? It doesn’t make any sense. But once we’ve established that it’s a valid request, then you have to resolve the name. How do I know what foo is?

Sam Boyer 00:52:17 And that’s going to depend on the things we were talking about earlier for the package manager, about what you use to resolve names. And if the thing resolves, you’re going to fetch it down, and at that point you have to kick off version selection. That’s the thing about version selection: because it’s all a graph, you have no way of knowing a priori which packages might depend on which other packages. In order to add A, you have to reconsider the entire graph. So, you run through the version selection process to basically pick a new version for all of your dependencies, including A. Then, assuming that was successful, you get a result out, and that ends up getting piped back to the user. Ideally, the manifest file gets updated and the lock file gets updated as well, and you give a successful return to the user.
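
The add steps just described (validate, resolve the name, rerun version selection over everything, update manifest and lock) can be sketched as follows. This is a toy model with invented names: `add`, `select_latest`, the manifest shape, and `REGISTRY` are hypothetical, fetching is omitted, and `select_latest` is a stand-in that ignores the dependency graph a real selection algorithm would walk.

```python
def version_key(v):
    """Turn "1.10" into (1, 10) so versions compare numerically."""
    return tuple(int(p) for p in v.split("."))

def select_latest(deps, registry):
    """Stand-in for version selection: newest published version of each dep."""
    return {name: max(registry[name], key=version_key) for name in deps}

def add(package, manifest, registry, select_versions):
    """Return (new_manifest, new_lock) after adding a package."""
    # 1. Validate: adding something already in the manifest makes no sense.
    if package in manifest["deps"]:
        raise ValueError(f"{package} is already in the manifest")
    # 2. Resolve the name to something the registry knows about.
    if package not in registry:
        raise LookupError(f"cannot resolve {package}")
    # 3. Rerun version selection for every dependency, not just the new one.
    new_manifest = {"deps": sorted(manifest["deps"] + [package])}
    new_lock = select_versions(new_manifest["deps"], registry)
    # 4. Hand back the updated manifest and lock for writing to disk.
    return new_manifest, new_lock

# Hypothetical registry of published versions.
REGISTRY = {"foo": ["1.0", "1.2"], "logging": ["1.4", "1.5"]}
manifest = {"deps": ["logging"]}

manifest2, lock = add("foo", manifest, REGISTRY, select_latest)
print(manifest2)
print(lock)
```

Returning new values rather than mutating the inputs means a failed selection leaves the existing manifest and lock untouched, which is one reasonable way to handle the "assuming that was successful" condition.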

Robert Blumen 00:53:11 I think that’s a good place to stop, Sam. We’ve covered a lot of ground here. I hope listeners have a better understanding of this somewhat murky area than they did an hour ago. Before we wrap up, is there any place listeners could find or follow you that you’d like to point them at?

Sam Boyer 00:53:30 I’m on Twitter, @SDBoyer, nowadays, but yes.

Robert Blumen 00:53:38 Any open source projects?

Sam Boyer 00:53:40 So, I have a new thing that I’m working on, which is not unrelated to package management. The repository is, well, in a week it should be much more consumable, but I’m calling it Scuemata. It’s written in the new language CUE. It’s a system for writing schema, but one that lets you evolve the schema over time, including across breaking changes, and encapsulate the logic for writing the schema and dealing with breaking changes in one place. And that’s at github.com/grafana/scuemata. And I assume that we can put, well, should I spell it out? How should it be?

Robert Blumen 00:54:19 Oh I’ll get it from you and I’ll put it in the show notes.

Sam Boyer 00:54:23 Cool. I think that’s probably the most interesting thing. Like I said, it’s not unrelated to package management, but it has much more to do with the way that you deal with breakage, which is certainly a major topic.

Robert Blumen 00:54:36 Sam Boyer, thank you very much for speaking to Software Engineering Radio.

Sam Boyer 00:54:41 Thank you. It’s been great.

Robert Blumen 00:54:42 This has been Robert Blumen and thank you for listening.

[End of Audio]

 

 

SE Radio theme: “Broken Reality” by Kevin MacLeod (incompetech.com — Licensed under Creative Commons: By Attribution 3.0)
