SE Radio 538: Roberto Di Cosmo on Archiving Public Software at Massive Scale

Roberto Di Cosmo, professor of Computer Science at University Paris Diderot and founder of the Software Heritage Initiative, discusses the reasons for and challenges of the long-term archiving of publicly available software. SE Radio’s Gavin Henry spoke with Di Cosmo about a wide range of topics, including the selection of storage solutions, efficiently storing objects, graph databases, cryptographic integrity of archives, and protecting mirrored data from local legislation changes over time. They explore details such as ZFS, CEPH, Merkle graphs, object databases, the Software Heritage ID registered format, and why archiving our software heritage is so important. They further consider how to use certain techniques to validate and secure your software supply chain and how the timing of projects has a great impact on what is possible today.

Show Notes

Roberto di Cosmo Bio
Twitter: @rdicosmo
Twitter: @swheritage
LinkedIn: @roberto-di-cosmo
Software Heritage documentation
Software Heritage.org Archive
Software Heritage.org People
What is Software Heritage?
Features
Intrinsic-vs-extrinsic-identifiers
Growing adoption of Software Heritage Identifiers
Software Heritage Ambassadors
Submit an Origin Save Request to SWH
https://ceph.io
Wikipedia: Ceph_(software)
Wikipedia: Merkle tree
Wikipedia: Ralph Merkle
spdx.dev
www.swhid.org
Wikipedia: Salt_(cryptography)
Software Heritage Newsletter

Transcript

Transcript brought to you by IEEE Software magazine.
This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number and URL.

Gavin Henry 00:00:16 Welcome to Software Engineering Radio. I’m your host, Gavin Henry, and today my guest is Roberto Di Cosmo. Your bio is very impressive, Roberto. I’m only going to mention a very small part of it, so apologies in advance. Roberto has a PhD in Computer Science from the University of Pisa. He was an Associate Professor for almost a decade at Ecole Normale Supreme in Paris. You can correct me on that. And in 1999 you became a Computer Science full professor at the University Paris, Diderot, I think.

Roberto Di Cosmo 00:00:49 The first school is École Normale Supérieure. The university is now Université Paris Cité.

Gavin Henry 00:00:56 Thank you, perfect. Roberto is a long-term free software advocate contributing to its adoption since 1998 with the best seller Hijacking the World, running seminars, writing articles, and creating free software himself. He created in 2015, and now directs Software Heritage, an initiative to build the universal archive of all the source code publicly available, in partnership with UNESCO. Roberto, welcome to Software Engineering Radio. Obviously, I’ve trimmed your bio, but is there anything that I missed that I should have highlighted?

Roberto Di Cosmo 00:01:29 Well no, I can just sum up, if you want. My life fits in three lines: 30+ years doing research and education in computer science, and a quarter of a century advocating for software and the use of free software in all possible ways. And for the last 10-15 years I have just been trying to lend a hand in building infrastructure for the common good and software, which is the main work at hand for me today.

Gavin Henry 00:01:32 Thank you, perfect. So for the listeners, today we’re going to understand what Software Heritage is. Just a small disclaimer: I’m a Software Heritage ambassador, so that means I volunteer to get the message across. So we’re going to talk about what Software Heritage is. We’re going to discuss some of the issues around storing and retrieving this data at global scale. And then we’re going to finish off the show talking about Software Heritage IDs and where they come in and what they are. So let’s get cracking. So Software Heritage, Roberto, what is it?

Roberto Di Cosmo 00:02:29 Well, okay, to put it in a nutshell, Software Heritage is something we are trying to build that is at the same time a “Library of Alexandria” of source code — a place where you can find the source code of all publicly available software in the world no matter where it has been developed or how or by whom. And it is at the same time a revolutionary infrastructure at the service of different kinds of needs. So the needs of cultural heritage preservation because software is part of our cultural heritage and needs to be preserved.

Roberto Di Cosmo 00:02:59 It is an essential infrastructure for open science and academia, which need a place to store the software used for doing research and for the reproducibility of that research. It is a tool for industry that needs to have a reference repository for all the components of software that are used today. And it is also in the service of public administration that needs a place for safely storing and showing the software that is used in handling citizen data, for example, for transparency and accountability. So, in a nutshell, Software Heritage is trying to address all these issues with one single infrastructure.

Gavin Henry 00:03:38 When we talk about publicly available software, is this typically things that would be on GitHub or GitLab or any of the other free open-source Git repositories or is it just, is it not limited to Git?

Roberto Di Cosmo 00:03:50 Yeah, the ambition of Software Heritage is actually to collect every piece of publicly available software source code, no matter where it is developed. So, of course, we are archiving everything that is publicly available on GitHub or GitLab or Bitbucket, but we’re going much broader than that. So we’re going after tiny small forges distributed around the world, and we’re going after package managers, we’re going after distributions that share software. There are so many different places where software is developed and distributed, and we actually try to collect it from all these places. In some sense, one infrastructure to bring them all in the same place and give you access to mankind’s software in a single place.

Gavin Henry 00:04:36 Thank you. So if you didn’t do this, what problems arise here?

Roberto Di Cosmo 00:04:40 Very good question. So, why did we decide to start this initiative? We need to go back seven years ago when this was started. In our group here we were doing some research on how to analyze open-source software, finding vulnerabilities, or checking whether it is of better quality, etc. So the question came up at one point: okay, let’s see. Would we be able, for example, to scale some software analysis tools to the level of all the publicly available software? And when you start discussing this you say, okay, but where do we get all the publicly available software? So we started looking around and we discovered that we, like everybody else, were just assuming the software was safely archived and maintained on the public forges like Gitorious or Google Code or Bitbucket or GitHub or GitLab or other places like this. Remember, this was seven years ago. And then we realized that actually not one of these places was an archive. On any collaborative development platform, you can create a project, you can work on it, you can erase a project, you can rename it, you can move it elsewhere. So, there is no guarantee that tomorrow you will see the same thing as today because somebody can remove things.

Roberto Di Cosmo 00:05:57 And then in 2015 we had this incredible shock of seeing very large — at the moment, very popular — code hosting platforms shutting down. It was the case of Google Code, where there were more than 700,000 projects. It was the case of Gitorious, where there were 120,000 projects. Then later on, remember, in 2019 Bitbucket phased out support for Mercurial, and there were a quarter of a million projects endangered. You see the point? So, what happens here is that somebody, by snapping a finger, can remove hundreds of thousands of projects from the web, from the internet. Who takes care of making sure that this stuff is not lost? That it is preserved, that it is maintained for people that need to reuse it, to understand it later on? And so, this was the core motivation of our mission: making sure we do not lose the precious software that is part of our technological revolution and our cultural heritage. So, motivation number one: being an archive in some sense. Without an archive, you take the risk of actually losing an incredible amount, or a significant part, of our technology today.

Gavin Henry 00:07:09 Thank you. And were there other things that you explored — for example, like the Wayback Machine? Is that something that they were interested in helping with, or did you just think ‘we have to do this ourselves?’

Roberto Di Cosmo 00:07:21 Yeah, very good question because we are kind of software engineers here, so the good point is to try not to reinvent the wheel. If there is already a wheel, try to use it. So we went around and we looked at the different initiatives that were involved in some sort of digital preservation. So of course, there are archives for maintaining videos, for maintaining audio, for maintaining books. For example, the Internet Archive does an incredible job of actually archiving the web. And then you have people that maintain archives of video games, for example, but looking around, we found nobody actually doing anything about preserving the source code of software. Not just the binaries, not just running the software, but actually understanding how it is built. Nobody was doing this, and so that was the reason why we decided to start a specific operation whose goal is to actually go out, collect, preserve, and share the source code of software. Not the webpages, this is the Internet Archive; not the mailing lists, you have initiatives like GNU mailing lists that do this; not virtual machines, you have other people doing this. The source code — only the source code, but all the source code. And that was our vision and mission, and the mission we are trying to pursue today.

Gavin Henry 00:08:36 Thank you. Is it only open-source free software that you archive? You mentioned operating systems and…

Roberto Di Cosmo 00:08:42 Well, actually no. The point of the archive is to collect everything which is publicly available, which is much broader than just open-source software and free software. This has some consequences. For example, if you come to the archive and you visit the content of the archive, you can find a piece of software, but the fact that it is archived does not mean that it is open-source and you can reuse it as you want. You need to go and look at the license associated with the software. Some is just made available publicly, but you cannot reuse it for commercial use. Some is open-source — actually, a lot is open-source, luckily. Our point as an archive is making sure we do not lose something which is precious and valuable that has been made public at some moment in time, independently of the license that is attached to it. Then the people visiting the archive, even if it is not open-source, can still read it; they can still understand what is going on; they can still look at the story of what is going on. So, there is value even if you’re not allowed by the license to fully reuse and adapt it as you want.

Gavin Henry 00:09:47 Interesting. Thank you. And how does this archive look? What does it look like? Is it a portal into different mirrors of these places, or you know what are the particular features that you offer that are attractive to use once something’s archived?

Roberto Di Cosmo 00:10:01 Very good question. So when we started this, there was a lot of thought going into: well, how should we design the architecture of this thing? So how do we get the software in, how do we store it, how do we present it, how do we make it available for people to use? Then we faced some very tough initial difficulties because when you want to archive software that is stored on GitHub or stored on GitLab, or in the distribution of a package manager like PyPI or npm or any other place like this one — and there are thousands of them — unfortunately, there is no standard. There is no standard just to list the content of a repository: on GitHub, you need to plug into the GitHub direct feed, which is not the same as the GitLab direct feed, which is not the same as the Bitbucket one, which is pretty different from the way you can ask the Ubuntu distribution to give you the list of the source packages, which is different again from the way of interacting with npm or PyPI.

Roberto Di Cosmo 00:11:04 You see the point. It’s a Babel tower here. So we need to build adapters to these contents and then the complexity still is there because even when we have the list of all the projects, then these projects are maintained in different ways. So some projects are developed by using Git, others are developed using Subversion, others use Mercurial, I mean different version control systems. Then the package formats are not the same, they’re pretty different. So the challenge was how should we go? I mean, how would you — you who are listening — how would you go about preserving these for the long term? So the apparently easy choice would be to say, well okay, I make a dump of the Git repository, a dump of the Subversion repository, I keep it, and then when somebody wants to read it they run Git or they run Subversion, or they run Mercurial, or some other tool on this particular dump that we maintain. But this is a very fragile approach because then what version of the tool are you going to use in 5 years, or 10 years, 20 years, etc.? So it’s complicated.

Roberto Di Cosmo 00:12:07 So we decided to go the extra mile and do this work for you. So actually we run these adapters, we decode all the history of development, we decode the package format, and then we put all these in a single gigantic data structure that keeps all the software and all the history of development in a standard uniform format on which we will probably spend a little more time later in this conversation. But just to make the point clear, I mean, it’s not an easy feat. And the advantage is that now when you go to the archive, you go to archive.softwareheritage.org and you land on a very simple landing page, with just one simple line where, like Google, you type in what you’re looking for, and this allows you to look through 180 million archived projects. Actually, you are not searching inside the source code, you are searching in the URLs of the projects that are archived. And when you find one project that is interesting to you, it doesn’t matter if it was from Git, or from Subversion, from Mercurial, from GitHub, or from Bitbucket, et cetera, everything is presented in the same uniform way, which is very familiar to a developer because it is designed by developers for developers. So it gives you the possibility of visiting, navigating inside the source code, and seeing all the version control history, identifying every single piece of software there. So it looks like a code hosting platform, but it is a uniform archive, independent of where the software comes from.
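
Illustrative sketch (not part of the conversation): the adapter idea described above can be pictured as small converters that each read one tool-specific history and emit the same tool-independent record. The `Revision` class, the two loader functions, and the input shapes below are hypothetical names chosen for this example; they are not the Software Heritage data model or its loader API.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Revision:
    """Tool-independent record of one state of a project (hypothetical model)."""
    origin: str
    message: str
    files: Tuple[Tuple[str, bytes], ...]  # sorted (path, content) pairs


def load_git_like(origin: str, commits: List[dict]) -> Iterable[Revision]:
    """Adapter for a Git-shaped input: every commit carries a full tree."""
    for commit in commits:
        yield Revision(origin, commit["msg"], tuple(sorted(commit["tree"].items())))


def load_svn_like(origin: str, changesets: List[dict]) -> Iterable[Revision]:
    """Adapter for a Subversion-shaped input: every changeset is a delta."""
    tree: Dict[str, bytes] = {}
    for change in changesets:
        tree.update(change.get("put", {}))      # files added or modified
        for path in change.get("delete", []):   # files removed
            tree.pop(path, None)
        yield Revision(origin, change["log"], tuple(sorted(tree.items())))


# Two different source formats describing the same project state.
git_revs = list(load_git_like(
    "https://example.org/a.git",
    [{"msg": "init", "tree": {"main.c": b"int main;"}}],
))
svn_revs = list(load_svn_like(
    "https://example.org/svn/a",
    [{"log": "init", "put": {"main.c": b"int main;"}}],
))
```

Once both histories are expressed as `Revision` records, a reader no longer needs the original version control tool, which is the long-term preservation argument made in the answer above.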

Gavin Henry 00:13:45 So just to summarize that, so I can understand that I’ve got this correct in my head, so all the different places you archive, you’re not mirroring, you’re archiving it. So you mentioned npm, you mentioned other package managers, different source control projects like Git and Subversion, which could live on GitLab, GitHub, Gitorious, all these types of things. It’s not as if they all have an FTP access point to get in and get the software. You might have a read-only view through a web browser through https. You might then have to use the Git tools or the Subversion tools to get the actual source code out that you’re interested in to archive. So you mentioned that you’ve developed adapters to pull them all in and then effectively create kind of like a DSL — domain-specific language — to get all that data in a format that you can work with that is more agnostic and isn’t reliant on the different versions of tools that would need to change over the next 5-10 years. Is that a good summary or a bad summary?

Roberto Di Cosmo 00:14:46 No, it’s a pretty good summary. The idea is actually, you know, our first driver was how to make sure we can preserve everything needed for the development in 20 years, for example, to restore our laptop (or whatever it will be instead after whatever happens in the next 20 years) to the exact state of a software project source code as it was at a given moment in time, so you can work on it. And so, the best approach was exactly as you described to do this conversion in a uniform data structure, which is simple, well documented, and that’ll be possible to use later on but independently of the future tools that would be developed or outdated or forgotten.

Gavin Henry 00:15:27 Did any sort of standards come out of this work that would help other people? Has there been any adoption of the techniques that you’ve created?

Roberto Di Cosmo 00:15:35 Yes, basically for people who use tools like Git, you can think of the archive we have developed as a gigantic Git repository at the scale of the world. So all the projects are in a gigantic graph that keeps them forever. And so, there we needed one standard, and this standard is the standard of the identifiers that are attached to all the nodes of this particular graph — these identifiers you can use to pinpoint a particular file, directory, or repository or version or commit that you are interested in, while making sure that nobody can tamper with it, so you have integrity guarantees, you have permanent persistence guarantees. And these are the Software Heritage identifiers on which we’ll spend a little more time later on in the conversation. So this is a needed standard, and the work of standardization is starting right now. We hope to see this helping our colleagues and fellow engineers to have a better mechanism to track the evolution of the software across the full software supply chain in the future.
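
Illustrative sketch (not part of the conversation): for file contents, a Software Heritage identifier has the shape `swh:1:cnt:<hex digest>`, where the digest is computed the same way Git computes a blob hash (SHA-1 over a short header followed by the bytes). The function name below is made up for this example, and only the content case is shown; directories, revisions, releases, and snapshots use other type tags and other hashing rules.

```python
import hashlib


def swhid_for_content(data: bytes) -> str:
    """Return a content identifier built from the Git-style blob hash of data."""
    header = b"blob %d\0" % len(data)            # same header Git uses for blobs
    digest = hashlib.sha1(header + data).hexdigest()
    return f"swh:1:cnt:{digest}"


# The empty file has a well-known Git blob hash.
empty_id = swhid_for_content(b"")
```

Because the identifier is derived from the bytes themselves, anyone holding the file can recompute it and check that it matches, without asking a registry.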

Gavin Henry 00:16:45 Yes, we’re going to chat about that in the last section of the show, the IDs that you’ve referenced there. Okay, so I’m going to move us on to the middle part of the show. We’re going to talk about storing all this data and retrieving it at a global scale. Because obviously it’s a ton of data. So my first question is going to be what sort of scale and data volumes are we talking about? And obviously that changes every day, every minute.

Roberto Di Cosmo 00:17:09 Absolutely. Indeed, if you go to the main webpage of the archive, which is archive.softwareheritage.org, you will see a few diagrams that show you how the archive has evolved over time. So today, we have indexed more than 180 million projects. I mean origins, I mean places on the web where you can find the projects. And this boils down to over 12 billion unique source code files. So, 12 billion source code files looks like a lot, but actually remember these are unique files, so if the same file is used in 1000 different projects, we count it only once. So we keep it only once and then we remember where it comes from. And it also contains a little more than two and a half billion revisions, different versions or states of development of a particular software project. This is huge. The overall storage that we need to keep all this, you know, it depends on how you look at it. It’s one petabyte today, more or less. So one petabyte is big for me — if I want to put it on my laptop, it is too big.

Roberto Di Cosmo 00:18:21 It’s pretty tiny when you compare it to what Google or Amazon need to have in their data centers, of course. At the same time, having one petabyte which is composed of 12 billion very small and tiny little pieces of source code poses significant challenges when you want to actually develop an efficient storage system to keep all these data over time. And then if you look at the graph — I mean, not just the files but all the directories, the commits, the revisions, the releases, the snapshots, and all the other pieces in the graph, with all the links that say that inside this directory this particular file content is called include.h, but in this other directory the same file content is called something-else.c. This whole graph today has 25 billion nodes and 350 billion edges. And so, where do you store such a graph? Because you could imagine you can use some graph-oriented database, but graph-oriented databases for graphs of this size, which have specific topologies, are not easy to build. Where do you store this? How do you store this in a way that is efficient to archive, because our first objective is being an archive so we should be able to archive quickly, and at the same time also efficient to read? Because there’s a moment when everybody is going to use software, so we’ll need to face an increasing demand of being able to provide results efficiently and quickly to people that want to visit and browse the archive. So these are big challenges.
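
Illustrative sketch (not part of the conversation): the two points made above, counting each unique file once and letting different directories give the same content different names, can be pictured with a tiny content-addressed store. The `ContentStore` class is hypothetical, and SHA-256 is used here only as a stand-in hash for the example.

```python
import hashlib
from typing import Dict


class ContentStore:
    """Keeps each distinct byte string once, keyed by its hash (toy model)."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def add(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(key, data)  # a second copy is not stored again
        return key

    def __len__(self) -> int:
        return len(self._blobs)


store = ContentStore()

# Two directories from two projects; names live on the edges, not in the file node.
dir_a = {"include.h": store.add(b"#define N 1\n")}
dir_b = {"something-else.c": store.add(b"#define N 1\n")}
```

Both directory entries point at the same key, so the store holds one node for the content and two named edges leading to it, which is why the graph has far more edges than nodes.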

Gavin Henry 00:20:01 Obviously, this isn’t done for free. What sort of costs are we talking about here, and how do you fund this project?

Roberto Di Cosmo 00:20:06 Yeah, indeed that’s a big question. So when you start something like this — so when we started some seven years ago, there was a significant time we spent on thinking about how you would go about building such an infrastructure in a sustainable way. So, there were different possibilities because I mean there is a cost of course; imagine just running the data center, and if you look at our webpage today, you will see all the members of the team — we are 15 people full time on the project right now, okay? So of course, it is not as big as a large company, but it is pretty significant, and of course you cannot just do it in your free time or as a volunteer. It requires significant funding to keep it up. So possibility number one would’ve been to create a private company. Okay, a kind of startup, and try to raise funding to sell services to particular stakeholders. But you remember, in 2015 we saw Google Code shutting down and Gitorious, which was another popular forge back then, shutting down after an acquisition by GitLab.

Roberto Di Cosmo 00:21:17 And then this summer we have seen that GitLab was more or less considering removing all the projects that were inactive for more than a year. Going into the business space for such a kind of infrastructure was not the right approach. We have seen that, for different reasons which are pretty legitimate — making money or satisfying your stakeholders or stockholders — companies may decide to switch off or to change the service they provide. So, we didn’t want to go in that direction. So the point was to actually create a nonprofit, multi-stakeholder, international organization with the precise objective of collecting, preserving, and sharing the source code — of creating and maintaining this archive. And this is the reason why we have this agreement — we signed an agreement in 2017 with UNESCO, which is the United Nations Educational, Scientific and Cultural Organization — and the reason why we started going around and looking for sponsors and members. And so, basically, the project is run today by using money that comes from some 20 different organizations that can be companies, can be academic institutions, can be universities, can be ministries in different countries that provide some money in the form of membership fees to the organization in exchange for the service that the organization provides to all the stakeholders. So, this is the path we are trying to follow. It has been a long time. In seven years, we moved from zero supporters to 20, which is not bad, but we’re pretty far from the number that we need to have a stable organization and we need help going in that direction.

Gavin Henry 00:23:04 So it’s a pretty global project, which matches the goals you’re trying to achieve.

Roberto Di Cosmo 00:23:08 Absolutely.

Gavin Henry 00:23:09 Thank you. So I’ve got to dig into the storage layer now. We’ll touch upon I think in the Software Heritage ID section about the graph protocol or the graph work that you’ve done, as well. You did just mention that briefly. So how frequently do you archive this data? You know, how many nodes do you have?

Roberto Di Cosmo 00:23:27 Well, if you look — if some of our listeners here are curious, if you go to docs.softwareheritage.org, one of the first links in there brings you to a nice webpage that describes the old architecture, more or less. The architecture that was used up until a few months ago. So, how would you go about archiving everything which is out there? We actually have three ways of doing this. One is a regular and automated crawling of some sources, where the sources are not all equal. They do not have the same throughput, of course, so you have much more activity on GitHub than on a small local code hosting platform that has just a few hundred projects; it’s not the same activity, of course. So, what we do is we regularly crawl these places; we do not archive everything on GitHub as soon as you make a commit. Technically it could be possible, right? I could listen to the event feed from GitHub, and every time somebody makes a commit I could immediately trigger an archive of it. But this is just not doable with the resources we have today.

Roberto Di Cosmo 00:24:37 So, we have a different approach, so we regularly list — at least every few months — the full contents of GitHub. We put in the queue of the projects that need to be archived all the projects that have changed over that lapse of time. The projects that didn’t change we do not archive again, of course. And then we go through all this backlog slowly. This is the ‘regular’ way. Then the other solution we have put in place is a mechanism that is called ‘save code now.’ So, imagine that you find that there is a project that is important to archive today, not in three months or when it gets to the top of the crawling queue. Then it is possible for you to go to save.softwareheritage.org, point our crawlers to one particular version-control system that is supported, and trigger archival immediately. And then, the third possibility is having an agreement with some organizations or institutions or companies that actually want to regularly archive their software with specific metadata and quality control. And this is a deposit interface, and of course, to use this deposit interface you need to have a formal agreement with Software Heritage for doing that. I hope this answers the question a little bit. So, regular crawling that is not as quick as you could imagine, plus a mechanism for you to bypass this queue and say ‘hey please do save this now because it’s important right now.’ Or another mechanism that allows people to actually put content into the archive. Then we need to trust the people that do this. So we need an agreement with them.
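
Illustrative sketch (not part of the conversation): the regular crawl and the ‘save code now’ fast lane described above behave like a priority queue in which only changed origins are enqueued and explicit requests jump ahead of the backlog. The `ArchivalQueue` class, its method names, and the use of a last-update marker are assumptions made for this example, not the actual Software Heritage scheduler.

```python
import heapq
import itertools
from typing import Dict, List, Tuple

FAST_LANE, REGULAR = 0, 1  # lower number is served first


class ArchivalQueue:
    """Toy scheduler: skip unchanged origins, let urgent requests go first."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str]] = []
        self._order = itertools.count()      # keeps first-in, first-out order per lane
        self._last_seen: Dict[str, str] = {}

    def record_listing(self, origin: str, last_update: str) -> None:
        """Called for each origin found by a periodic listing of a forge."""
        if self._last_seen.get(origin) != last_update:
            self._last_seen[origin] = last_update
            self._push(REGULAR, origin)

    def save_code_now(self, origin: str) -> None:
        """Called when someone asks for immediate archival of one origin."""
        self._push(FAST_LANE, origin)

    def _push(self, lane: int, origin: str) -> None:
        heapq.heappush(self._heap, (lane, next(self._order), origin))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


queue = ArchivalQueue()
queue.record_listing("https://example.org/a", "2022-10-01")
queue.record_listing("https://example.org/b", "2022-10-02")
queue.record_listing("https://example.org/a", "2022-10-01")  # unchanged, skipped
queue.save_code_now("https://example.org/urgent")
pending = len(queue)
served = [queue.pop() for _ in range(pending)]
```

The third route mentioned above, the deposit interface, is not modelled here because it depends on an agreement with the depositor, not on queue order.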

Gavin Henry 00:26:13 So, do you regularly hit API limits with the big guys, like GitHub or GitLab, or do you have to contact them and say this is what we’re doing, can you give us some type of special …?

Roberto Di Cosmo 00:26:23 Yes, indeed. And so, for example, we are very happy that we managed to sign an agreement with GitHub in November 2019, and the objective of this agreement was exactly to have specific elements in the API that they actually provide us to simplify the archival process, and to have some rate limits raised for our own crawling. Now, it is typical that people do things without saying anything to anybody: they just, I mean, bypass the limitation by spawning tons of clients from different organizations, but we would rather not do this. We prefer to have direct support from, and direct contact with, the forges. But consider that we are a small organization, so setting up an agreement with all possible forges around the world is not something we can do. We would like to, but we are not able to. So we made this agreement with the biggest one, which is GitHub, and we do not have agreements with the others, but we would love to have an agreement with GitLab.com or with Bitbucket. For the moment, we manage to crawl them without hitting too many rate limits, but it would be better if this could be written down in an agreement.

Gavin Henry 00:27:35 Yeah, I’d imagine it would be better doing something on the back end somewhere with big guys in the countries where they have most of their storage. And you mentioned anyone can submit data. So you’ve got save.softwareheritage.org. I’ll put these links in the show notes anyway, and then the main archive one. I added my own personal software project to it and it’s there. Did I miss any of the entry points?

Roberto Di Cosmo 00:27:58 No, it’s just a little extra information on ‘save code now.’ When you trigger the archive of a project that is in a platform that we know, then it goes immediately into the archival queue in this quicker type of fast lane — fast track, if you want. But if it comes from a platform we’ve never heard of — I mean, fu.bar.z or something — this goes into a waiting queue where one of our team members regularly checks that it’s actually not a copy of some porno video or something, you know? We try to check a little bit what people submit. But once it is vetted, it goes in.

Gavin Henry 00:28:37 I have another question about verifying data. Okay, you mentioned before a sort of 5-10 year or 20-year timeline you’re trying to preserve things for. What’s sort of realistic, do you think?

Roberto Di Cosmo 00:28:50 Well first of all, as you know, we don’t know if tomorrow we will be alive. But the point is that we really try to set up… all the design of everything we do has been thought out in such a way as to maximize the chances that these preservation efforts will last as long as possible. So, this means different things. For example, all the infrastructure — absolutely every single line of source code of our own infrastructure in Software Heritage is free software or uses free software and open-source software. Why? Because otherwise you could not trust us with preserving our own infrastructure if we used proprietary components over which we have no control and that nobody could replicate if needed. That is one point. The other point is the organization, again thought of as a non-profit, long-term foundation trying to maintain it over time. But then there are also technical challenges. How can we be sure that these data will not be lost at some moment in time? Because imagine one of us in the team makes a mistake and erases all the data on one of the servers, or we get hacked, or there is a fire in one of the data centers, or many different things.

Roberto Di Cosmo 00:30:06 Or — it has happened many times — some legislation is passed that actually endangers the mission of preservation. How do we prevent this? Because if you want to last 10, 20, 100 years, these are all the challenges you need to seriously take into account. And so, to avoid the more technical dangers, our approach today is to actually have replication all over the place. So, we have a mirror program in place. A mirror is a full copy of the archive, maintained by another organization, in another country, potentially on another technology stack, in such a way that if something happens to the main node, the mirror nodes can take over from there and all the data is preserved. This is one possibility. But this mirror program also has the advantage of protecting a bit from this potential legal challenge because, imagine, if tomorrow there is a directive… actually let me tell the real story.

So a few years ago, here in Europe, we had a change in copyright law through a directive of the European Commission that made a lot of noise back then. What people probably don’t know is that one tiny provision in this directive endangered all the code hosting platforms for open-source, massively. And so it took us, in collaboration with many other people from other organizations, from free software organizations, from open-source organizations, from companies like Red Hat, GitHub, or Debian, an enormous amount of time to get a change into this legislation, this directive, to actually protect open-source software and protect platforms like GitHub on one side but also archives like ours, or distributions like Debian. This went kind of unnoticed in the whole discussion because it is just software and not videos, images, culture et cetera. But it was a real, real challenging danger. So imagine if it happens again at another moment in time; then it is important to have copies of the archive under other jurisdictions that would be protected from these kinds of provisions. So this is the way we try to minimize the risk of failing over time.

Gavin Henry 00:32:23 Yeah, that’s a very good point because at the point of archive or mirror, everything’s legal, but when it changes it’s only restricted by that part of the world and the laws there. So, if we dig into generic storage, lots of us are involved with data centers or network attached storage, that type of things. And we know the rule of thumb where storage devices fail generally around every three years or so. My question was how do you handle this? But I think you’ve just explained that by the master nodes and the mirror nodes, is that correct?

Roberto Di Cosmo 00:32:55 And actually, the mirror node is kind of an extreme solution to the issue. Of course, inside our… Maybe I can tell you a little bit more about what is going on under the hood. Today, we actually have three copies of the archive under our own control, so not on the mirrors. One copy is fully on our bare iron that we have in our own data center, hosted by the IRILL organization that hosts us, and then we have two full copies: one on Azure, which is sponsored by Microsoft, and one on AWS, which is graciously provided by Amazon. So, you see we are separating things: we have the backups and checks and whatever on our own infrastructure, but we also have a full copy on Amazon that does the same thing with different technology, and on Azure that does the same with different technology. So of course, nothing is fully fail-safe, but we believe this particular setup today is relatively reassuring, okay, against, I mean, losing data through corruption on disk.

Roberto Di Cosmo 00:34:01 We also have some tools that run regularly on the archive to check integrity. It’s called SWH scrub, because it scrubs the disks and checks that nothing has gone wrong. And the extra point which is interesting for us is that (we’ll come back to this later on) the identifiers that we use, and that are used all over the architecture, are cryptographic identifiers. Actually, each identifier is a very strong checksum of the contents, so it’s pretty easy to navigate the graph and verify that there was no corruption in the data at every level; at every single node, we can do this. And then, if there is a corruption, we need to go to one of the other copies and restore the original object.
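
The check-and-restore loop described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual Software Heritage scrubbing tool: the store is a hypothetical in-memory dictionary keyed by the plain SHA-1 of each object's bytes, and the function names `find_corrupted` and `repair` are made up for the example.

```python
import hashlib


def find_corrupted(store: dict[str, bytes]) -> list[str]:
    """Return the identifiers whose stored bytes no longer hash to the identifier."""
    corrupted = []
    for object_id, data in store.items():
        # The identifier is itself a checksum, so verification is a recomputation.
        if hashlib.sha1(data).hexdigest() != object_id:
            corrupted.append(object_id)
    return corrupted


def repair(store: dict[str, bytes], replica: dict[str, bytes], corrupted_ids: list[str]) -> None:
    """Restore corrupted objects from another copy, but only if that copy verifies."""
    for object_id in corrupted_ids:
        candidate = replica.get(object_id)
        if candidate is not None and hashlib.sha1(candidate).hexdigest() == object_id:
            store[object_id] = candidate
```

Because the identifier doubles as the checksum, no separate checksum database is needed, and a replica is only trusted after its copy passes the same check.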

Gavin Henry 00:34:41 So you’re constantly verifying and validating your own backups and your own archive. You mentioned you use a very good model, which a lot of people that use the cloud try to do but sometimes costs get in the way: having multiple Cloud providers duplicating that way — you said you’ve got your own bare metal in your own data centers, and you’ve got Azure and you’ve got AWS.

Gavin Henry 00:35:05 Yeah AWS. So, for your own metal, just because I’m interested, and I’d really like to know.

Roberto Di Cosmo 00:35:10 Absolutely.

Gavin Henry 00:35:11 What sort of file system do you run? You know, is it a RAID system, or ZFS, or all that type of stuff?

Roberto Di Cosmo 00:35:17 Yeah, okay. What I can describe to you is the current architecture, but we’re changing all this, I mean moving to a more resilient solution. So, the architecture is based on two different things. One thing is, ‘where do you store the file contents’, okay? The blocks, the binary objects that hold the file content. And the other part is where do you store the rest of the graph? I mean the internal nodes and the relationships. Now for the file contents, these 12 billion and counting file contents, we use an object storage, and this storage was… you remember our constraint is that we decided to use only open-source software in our own infrastructure. So I cannot use solutions that are proprietary or behind closed doors. Unfortunately, when we started this, the only thing that we managed to make run was a ZFS file system with a two-level sharding on the hashes of the contents. This is a poor man’s object storage, right? I mean it’s not particularly efficient in reading; it’s not necessarily particularly efficient in writing. But it was simple, clean, and we could use it.
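
To make the "two-level sharding on the hashes" idea concrete, here is a minimal Python sketch. The exact layout Software Heritage used is not given in the conversation, so the choices below (plain SHA-1, two hex characters per directory level, the `sharded_path` name and the root path) are assumptions for illustration only.

```python
import hashlib
from pathlib import PurePosixPath


def sharded_path(root: str, content: bytes) -> PurePosixPath:
    """Map a file content to <root>/<first 2 hex chars>/<next 2 hex chars>/<full hash>."""
    digest = hashlib.sha1(content).hexdigest()
    # Two directory levels keep any single directory from holding billions of entries.
    return PurePosixPath(root) / digest[0:2] / digest[2:4] / digest


if __name__ == "__main__":
    print(sharded_path("/srv/objects", b"hello"))
```

With two hex characters per level there are 256 x 256 = 65,536 leaf directories, so the objects spread evenly across them because cryptographic hashes are uniformly distributed.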

Roberto Di Cosmo 00:36:25 Now we’re hitting limitations in this kind of thing because it’s too slow, for example, to replicate data to another mirror. And so we are moving slowly to another solution that uses Ceph, which is very well known as an object storage; it’s open source; it’s actually pretty well maintained by an active community backed by Red Hat etc., so it seems nice. The only point is that these kinds of object storage are usually designed to archive fairly large objects, not huge ones, say 64-kilobyte objects. They’re optimized for this kind of size. When you are storing source code, half of our file contents are less than three kilobytes, and there are some that are just a few hundred bytes. So there is a problem if you just use a bare Ceph solution to archive this, because you have what is called storage expansion. To store one petabyte, you need much more than one petabyte because of the block size etc. So now we have been working with experts in Ceph that we collaborate with, from a company called Mister X, and with support from Red Hat people themselves, to actually develop a thin layer on top of Ceph that allows us to use Ceph efficiently.
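
The storage expansion effect can be illustrated with simple arithmetic. The sketch below assumes a simplified model in which every object occupies a whole number of fixed-size allocation units (64 KiB here, echoing the figure mentioned above); real Ceph behavior depends on its configuration. Packing small objects together before storing them is shown as one common way to reduce the overhead; whether the thin layer mentioned in the interview works this way is an assumption.

```python
def allocated_bytes(size: int, unit: int) -> int:
    """Bytes actually consumed when `size` bytes are rounded up to whole units."""
    return -(-size // unit) * unit  # integer ceiling division


def expansion_factor(sizes: list[int], unit: int) -> float:
    """Physical/logical ratio when each object is stored on its own."""
    return sum(allocated_bytes(s, unit) for s in sizes) / sum(sizes)


def packed_expansion_factor(sizes: list[int], unit: int) -> float:
    """Physical/logical ratio when all objects are packed into one contiguous blob."""
    return allocated_bytes(sum(sizes), unit) / sum(sizes)


if __name__ == "__main__":
    sizes = [300, 3_000, 70_000]  # hypothetical file contents, in bytes
    unit = 64 * 1024
    print(expansion_factor(sizes, unit), packed_expansion_factor(sizes, unit))
```

In this toy model a 300-byte file costs as much as a 64 KiB one when stored alone, which is why a workload dominated by files of a few kilobytes expands so badly.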

Roberto Di Cosmo 00:37:42 So it’s a very well-known, very well-maintained open-source object storage, but we add these extra layers that make it okay for our particular workload shape, which is different from the things that our friends probably have to handle. That’s for data storage; for the object storage. Then if you look at the graph: again for the graph, when we started we used PostgreSQL as a database to store graph information. As many of you well know, a relational database is not the best solution when you have graphs and you need to traverse the graph, of course. But it is reliable and has transactions, which ensured that we didn’t lose data at that time, and now we’re slowly moving to other solutions that will be more efficient in traversing the data. We have developed a new technology that is not yet visible (it will be visible, I hope, next year) that allows us to traverse the graph efficiently without hitting the limits of SQL approaches. But you see, the complexity of this task is also on the technology side. When we commit to using only open-source components that we can actually understand and use, we are raising the bar of what we need to do to actually make all this work.

Gavin Henry 00:38:59 So just to summarize that, we’ve started off with ZFS on your own bare metal — I’m not sure what AWS or Azure will be doing — then you’ve hit the limitations of that and you’ve moved to Ceph, is that C-E-F or C-E-P-H?

Roberto Di Cosmo 00:39:15 It is C-E-P-H.

Gavin Henry 00:39:17 Yeah, that’s what I thought. I’ll put a link in. And you’re working with the vendors and all the open-source experts to make that specific to your use case. So that’s for the actual files, and you only store one instance of a file because you check the contents of it, so there’s no duplication. And the graph, what sort of graph are we talking about? Is that how to relate those binary blobs to metadata or…?

Roberto Di Cosmo 00:39:42 Actually, you know, when you look at your file system, any usual file system, in this file system you have a directory; inside the directory you have other files, etc. etc. So, if you look at the pictorial representation of this file system it’s actually a tree, usually a directory tree. But actually, it is more than a tree; it is a graph because there are some nodes that are shared at some moment, okay? You can have the same directory that appears in two other directories under the same name, so technically it is more of a graph than a tree. So this is actually the graph that we’re talking about: the representation of the structure of the file system that corresponds to a particular state of development of the source code, plus the other nodes and links that correspond to the different phases of its evolution. Every time you mark a version, a release, a commit, this adds a node to the graph pointing to the state of the source code at a particular moment in this directory tree. So this is the graph we’re talking about.

Gavin Henry 00:40:37 I did a show on B+ tree data structures where we spoke about graphs and things like that. I’ll put a link into the show notes for that. And we also did a show quite a few years ago now, back in 2017, with James Cowling on Dropbox’s distributed storage systems; there might be some good crossovers there. Okay, so the graph that you’re talking about, I think during my research it’s a Merkle graph. Is that correct?

Roberto Di Cosmo 00:41:03 Yes. This is the solution we decided to adopt to represent all these different projects and to make sure we can scale up with the modern approach to development, where every time you want to contribute to a project today you start by making a copy locally in your space and then you add the modification, then you make a pull or merge request et cetera. That means that, for example, if you look at GitHub, there are thousands of copies of the Linux kernel. So, archiving each of them separately from the others would be silly; you would be using the space in an inefficient way. So what we do is build this graph as a Merkle graph (we will go into the details a little bit later) that actually has the ability to spot when two file contents are the same, when two directories are identical, when two commits are actually the same, and by using these properties, using these cryptographic identifiers that allow you to spot that a part of the graph is a copy of another part of the graph, we actually manage to compress and de-duplicate everything at all levels. So if a file is used in different projects, we keep it only once, but also if a directory that may contain 10,000 files is the same in three different projects on GitHub, we keep it only once. And we just remember that it has been present in this and that and that project, and so on all the way up. By doing this, according to statistics we made a few years ago (it takes time to compute the statistics; we do not do it every time), we had a compression factor of 300, okay? So instead of 300 petabytes, we have only one petabyte by avoiding copying and duplicating the same file, or the same directory, over and over again every time somebody makes a fork or another copy somewhere else on the planet.
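
The de-duplication described here, keep each object once and remember where it was seen, can be sketched as a tiny content-addressed store. This is a toy model, not the archive's implementation: the `DedupStore` class, its in-memory dictionaries, and the use of a plain SHA-1 as the key are all assumptions made for the example.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store: identical bytes are kept once, origins are recorded."""

    def __init__(self) -> None:
        self.objects: dict[str, bytes] = {}     # identifier -> content, stored once
        self.origins: dict[str, set[str]] = {}  # identifier -> projects it was seen in

    def add(self, project: str, data: bytes) -> str:
        object_id = hashlib.sha1(data).hexdigest()
        # setdefault stores the bytes only the first time this identifier appears.
        self.objects.setdefault(object_id, data)
        self.origins.setdefault(object_id, set()).add(project)
        return object_id


if __name__ == "__main__":
    store = DedupStore()
    for fork in ("torvalds/linux", "alice/linux", "bob/linux"):
        store.add(fork, b"int main(void) { return 0; }\n")
    print(len(store.objects), "object stored for", len(store.origins), "identifier")
```

Only exact duplicates collapse in this scheme; two files that differ by a single byte get unrelated identifiers and are stored separately.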

Gavin Henry 00:43:01 I suppose it’s a very similar analogy to creating a zip file. It removes all that duplication and compression.

Roberto Di Cosmo 00:43:07 In some sense, but in one sense it is less intelligent than a zip file because in a zip file you look for similarities. But here, we’re happy with identical contents. We de-duplicate only when something is identical to something else. It would be interesting to push a bit further and say hey, there are many files that are similar to one another, even if they are not identical. Could we compress them against each other and gain space? The answer is probably yes, but it involves another technological layer that will take time and resources to develop.

Gavin Henry 00:43:43 Perfect, thank you. That’s a good place to move us on to the last part of the show. We’ve mentioned these terms quite a few times so it would be good to finish this off. When you build the graph and when you take the binary data or the blob of data, you then have to validate whether it’s changed or whether you need to go and archive it, things like that. And I think this is where the cryptographic hashes for long-term preservation, otherwise known as the Software Heritage ID, come in. Is that correct?

Roberto Di Cosmo 00:44:13 Yes, absolutely. The S-W-H-I-D, Software Heritage ID, so we just call them ‘swid’ if you want to pronounce it quickly.

Gavin Henry 00:44:21 I came across in my research a blog post in 2020 about you exploring and presenting what an intrinsic ID is versus an extrinsic ID and where the SWHID, or the S-W-H-I-D fits in. Could you spend a couple minutes on explaining the difference between an intrinsic ID and an extrinsic ID?

Roberto Di Cosmo 00:44:43 Oh absolutely. And this is a very interesting point. You know, when you need to identify something (I mean an object, a concept, etc.) we have been used for ages, long before computer science was born, to using some kind of identifier. So for example, think about your passport number; that is an identifier. The sequence of letters and numbers is an identifier of you that is used by the government to check that you have the right to cross borders, for example. How does it actually work? At some moment in time you go and see somebody, you say ‘I am here’, and they give you a number, which is actually put in a register, a central register maintained by an authority, and this central register says ‘oh, this passport number, which is a number here, corresponds to this person.’ The person is the name, the last name, birthplace, and other potentially relevant biometric information that is stored in there. Why do we call this identifier ‘extrinsic’? Because this identifier has nothing to do, I mean your passport number has nothing to do with you, except for the fact that there is a register somewhere that says this passport number corresponds to Gavin Henry, for example.

Roberto Di Cosmo 00:45:54 And so, if in some moment the register disappears or is corrupted or is manipulated, the link between the number — the identifier that uses the number, the number that’s used as an identifier — and the object that it denotes as the person corresponding to the passport number is lost. And there is no way of recovering it in a trusted way. I mean, yes of course, I can read what is inside the passport; the passport could be fake, right? We have been using extrinsic identifiers for a very, very long time. So social security number, passport number, the number of a member of a local library, or whatever. But also, before computer science we have been used to actually using identifiers that are better linked to the object they are supposed to be identifying. Maybe one of the oldest identifiers of this kind, we call them intrinsic because the identifier is actually in some sense computed from the object; it is intimately related to the object.

Roberto Di Cosmo 00:46:58 So one of the oldest of these things is musical notation, okay? You agree on a standard; you say, well, there are an infinite number of musical notes, but for this infinite number of musical notes we just agree that there are eight basic frequencies, the A-B-C or do-re-mi depending on how you name them. And then you have the scales, the pitch, and once you agree on this, it is pretty easy: out of a sound, you can get the identifier, and out of the identifier you can reproduce exactly the sound. And similarly in chemistry, we agreed on a standard for naming things that is related to the object. When we are talking about table salt, you know it’s chlorine and sodium, and this is NaCl in standard international chemical notation. So, this is the difference between extrinsic identifiers, where if you don’t have a registry you’re dead, because there is no link maintained, and intrinsic identifiers, where you do not need a registry; you just need to agree on the way you compute the identifier from the object. These are the basic things that were available even before computer science. Now with digital technology you find extrinsic identifiers in digital systems again, when you’re looking for a name on GitHub, or your user account somewhere, and this depends on a register. But you also find intrinsic identifiers, and these are typically these cryptographic hashes, cryptographic signatures that all of our listeners use daily when they do software development in a distributed way using distributed version-control systems like Git or Mercurial or Bazaar etc. So, I wonder if this is clear enough to set the stage, Gavin, at this moment in time?

Gavin Henry 00:48:49 Yeah, that was perfect. Although with ‘extrinsic’ I think like ‘external.’ So you mentioned you’ve got the external register. But with the chemical engineering or chemical sector example and music, there is a third-party standard that’s been agreed that you potentially need to look up to understand. Which is kind of like a register.

Roberto Di Cosmo 00:49:09 Well, it’s more difficult to corrupt or to lose. Once you have a tiny standard that you agree upon and that’s okay, then everybody agrees. But with a register, who maintains the register? who guarantees the integrity of the register? who has control on the register? and this for every single inscription you make there.

Gavin Henry 00:49:27 And also the register is not going to be public, whereas the way to interpret the intrinsic ID and that data will be public because of the standard. So it’s more protected. Thank you. So let’s pull apart the Software Heritage ID, the use of cryptographic hashes, and how that ties back to the Merkle graph so we can understand how changes are mapped, integrity’s protected, tampering’s proven not to happen.

Roberto Di Cosmo 00:49:48 Absolutely. But let me start with a preliminary remark. I mean, if some of our listeners are familiar with the plumbing that is under modern distributed version-control systems, that is Git, Mercurial, etc., the too-long-didn’t-read summary is that we’re doing exactly the same. Okay? So we’re piggy-backing on that particular approach that has been successful. But for those of our listeners that actually never took the time or had the opportunity to look into the plumbing underlying these version-control systems, let’s explain what is going on. So, imagine you need to represent the state of the project in front of you. Okay, so you have a few files, a few directories, maybe you made a commit in time, so okay, this is the state as of today; how can you identify the state of your project? If you only need to identify a single file content, I mean that’s pretty easy, right? Okay, you compute a cryptographic checksum. For example, you run the command sha1sum on the file; it does some cryptographic computation, and it spits out a string of a few dozen characters that is a cryptographic signature which is strong, which is to say that for two files which are actually different, there’s an infinitesimally small chance of getting the same hash.

Roberto Di Cosmo 00:51:18 So, you can take this cryptographic signature as a representation, as an identifier of this particular file. It doesn’t matter if the file is two gigabytes; the identifier is always a short, small hash. That’s easy. Everybody has been doing this for a long time. Now, the big question is, what if I want to represent not just a single file but a full directory? The state of the full directory. How can I do that? The approach is, well let’s see, what is in this directory? There are many files, okay, they have file names, some properties, and I know how to compute the hash, the identifier of these files. Ah, so nice idea: let me put in a single text file a representation of the directory that contains, on every line, the name of the file, the hash of this file in this directory, the type of object (typically a binary object, a blob, but it could be another directory), and the basic properties. I put them all one after the other, put them together, I sort them in a standard way; this is where we need agreement, like for chemistry, I mean on how we sort them.

Roberto Di Cosmo 00:52:31 And this is a text file now that represents the directory. So on this particular text file, I can compute again the same hash; I run the same command, I get the hash. Now this hash is intimately related to this text file that represents all the subcomponents of the directory. So if somebody changes a bit in one of the many files that are in the directory, then all this construction will produce a different key. A different identifier. So you see we’re extending the property of a cryptographic hash from a single file to a directory. Or again, if you look at the original paper of Ralph Merkle at the end of the 80s, he was describing an efficient method of computing a hash of a big chunk of data by using a tree representation. That’s why we call these kinds of things Merkle trees. Okay? You recompute the hashes on the internal nodes by doing this little process of representing the different components in a single text file, which you then hash again. And you can push this process up through all the higher levels of the graph, up to the root of the graph.
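
The directory construction walked through above can be sketched recursively. The manifest format below (one `type name hash` line per entry, sorted by name, hashed with SHA-1) is a simplification invented for this example; it is not the byte-exact format used by Git or by Software Heritage, and it ignores permissions and unusual file names. Directories are modeled as nested Python dictionaries.

```python
import hashlib


def hash_content(data: bytes) -> str:
    """Identifier of a single file content (simplified: plain SHA-1 of the bytes)."""
    return hashlib.sha1(data).hexdigest()


def hash_directory(tree: dict) -> str:
    """Identifier of a directory: hash of a sorted manifest of its entries."""
    lines = []
    for name in sorted(tree):  # a fixed order is part of the agreed convention
        child = tree[name]
        if isinstance(child, dict):
            lines.append(f"dir {name} {hash_directory(child)}")
        else:
            lines.append(f"file {name} {hash_content(child)}")
    manifest = "\n".join(lines).encode("utf-8")
    return hashlib.sha1(manifest).hexdigest()


if __name__ == "__main__":
    project = {"README": b"hello\n", "src": {"main.c": b"int main(void) { return 0; }\n"}}
    print(hash_directory(project))
```

Changing a single byte in `src/main.c` changes the hash of `src`, which changes the manifest of the root, which changes the root hash; that propagation is the Merkle property the rest of the discussion relies on.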

Roberto Di Cosmo 00:53:45 And so, for example, if you are looking at the Software Heritage identifiers, this is how they are split up. You have a small prefix, SWH, that says okay, this is a Software Heritage identifier; then there is a colon; then there is a version number, because I mean standards can evolve, but for the moment we have one. Then you have another colon, then you have a tag that says ‘hey, this is an identifier of a file content, of a directory, of a revision, of a release, of a snapshot of the full system.’ We put in a tag; it would not be strictly needed, but it is better to clarify what you are identifying. Then you have another colon, and then finally you have this hash, which is computed by the process I just tried to describe, and I know it’s much better with an image, but I hope it was clear enough to give you the gist of what is going on. The end of this story: by doing this process in the graph, you are able to attach to each node of the graph a cryptographic identifier that fully represents the full content of the subgraph that sits below it. So if somebody changes anything in the subgraph, the identifier will change.
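
The four colon-separated fields just described (prefix, version, type tag, hash) can be pulled apart with a small parser. This sketch covers only the core identifier; it assumes the published three-letter tags (`cnt`, `dir`, `rev`, `rel`, `snp`) and a 40-character hexadecimal hash, and it does not handle the optional qualifiers that can follow a core identifier. The `Swhid` type and `parse_swhid` function are names chosen for the example.

```python
import re
from typing import NamedTuple

# Three-letter tags and the kind of object each one denotes.
OBJECT_TYPES = {
    "cnt": "content",
    "dir": "directory",
    "rev": "revision",
    "rel": "release",
    "snp": "snapshot",
}

_CORE_SWHID = re.compile(r"swh:(\d+):(cnt|dir|rev|rel|snp):([0-9a-f]{40})")


class Swhid(NamedTuple):
    version: int
    object_type: str
    object_hash: str


def parse_swhid(text: str) -> Swhid:
    """Split a core SWHID into its version, object type, and hash."""
    match = _CORE_SWHID.fullmatch(text)
    if match is None:
        raise ValueError(f"not a core SWHID: {text!r}")
    version, tag, object_hash = match.groups()
    return Swhid(int(version), OBJECT_TYPES[tag], object_hash)
```

For instance, `parse_swhid("swh:1:dir:" + "0" * 40)` yields version 1 and object type `directory`; anything that does not match the pattern raises `ValueError` rather than being silently accepted.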

Roberto Di Cosmo 00:54:57 This means that if you get a Software Heritage identifier for an object in the Software Heritage archive, you can put it in a contract for a subcontractor saying ‘I need you to use this particular version because it has security guarantees’, or use it in a research article to tell your peers that if they want to get the same result, they need to get exactly this version, etc. You only give this tiny identifier, then you go to the archive with this identifier. The archive will tell you, ah, you want this directory, you want this commit, etc. You extract the source code from there; you can recompute the identifier locally by yourself, with no need to trust anybody else. If the identifier matches, it means it is exactly the same source code in exactly the same version. So you are safe using it. So, this is a super big advantage of using this kind of identifier. And again, for our friends who today know something like Git or other tools, they are used to having Git hashes etc. Yes, it is the same approach. The difference is that the way we compute these identifiers in Software Heritage does not depend on the version-control system used by the people who developed the software at a given moment in time. Whatever they used, anything in the archive is identified in exactly the same way. So the big advantage that you have is that, in the archive, something that is here will stay there, and these identifiers are universal. They do not depend on a particular version-control system; they apply to every single one of the contents of the archive.

Gavin Henry 00:56:34 Thank you that’s a very good summary. I’m just going to pull some bits apart to get it clear in my head. Because I bet the listeners have the same set of questions. So, you would have a SWHID, S-W-H-I-D for each file, each directory, and then potentially the top of the project of the archive one that encompasses all those different IDs in the text file that you’ve made another hash of?

Roberto Di Cosmo 00:56:55 Yes, absolutely. You have these several levels: the file content, the directory, the revisions, which correspond to commits, the releases, and the snapshot of the whole project, and for each of them you have the Software Heritage identifier.

Gavin Henry 00:57:11 And is there any limit on the number of nodes of a directory, or is that down to the file system?

Roberto Di Cosmo 00:57:15 Not at all. There is no limit whatsoever that is imposed by the standard. You can apply this construction to any kind of… and by the way, if you’re curious, one of our engineers, who actually finished his PhD thesis and has now moved to Google Research and to mp3(?), under the direction of a brilliant researcher in our team, did a study of the shape of this graph, and there you discover that, for example, of course the nodes that correspond to the commits, the releases, and revisions can create chains that are extremely long. So, imagine that the Linux kernel has millions of commits. So you have this long, long chain, which actually has no limit on the number or the depth of this thing. On the other side, the directory part is also kind of unbounded: you have places where you have tens of thousands of files in the same directory, and we represent all of this in exactly the same way; it just scales up.

Gavin Henry 00:58:17 With the hashes, you mentioned we often think about hashes when we talk about password hashes and how the new recommendation comes out to use this format and that type of hash. When you’re talking about proving the integrity of a file, you mentioned SHA-1 somewhere there could be a potential of a clash. What type of hash do you use?

Roberto Di Cosmo 00:58:39 That’s an interesting question, but first of all a little remark on the theory behind this, okay? So when you use cryptographic hashes, of course there will be conflicts. There will be objects that end up having the same hash, for the very simple reason that the input space of the hashing function is much bigger than its output space. But when the number of hashes we are storing is much smaller than the upper limit of the output space, the big question is whether your hashing function is able to actually avoid random conflicts. What is the probability that you pick two different objects at random and they end up with the same hash? And over the history of cryptography, you have seen many, many different hashes evolving over time. So we had CRC32, which was just a small checksum on social memories(?), and then MD5, which ended up being useless, then SHA-1 was developed, which was pretty safe until a few years ago, when Google funded the project to actually fabricate two different files with the same hash, and now people are moving to SHA-256, et cetera, et cetera.

Roberto Di Cosmo 00:59:51 It’s a constant process. This is the reason why we have this version number in the standard, in the identifier. Remember, SWH version 1, for today. Now, version 1 corresponds to using exactly the same hashing function used by the Git version-control system. This is a SHA-1 on the salted version of the file. So you do not just compute SHA-1 on the file itself, you compute SHA-1 on the file prefixed by a little bit of information, typically the type of the object and the length of the file, which makes it more complicated to have a hash conflict. But in the future, we plan to follow what the industry standard will be. So at some moment in time we will need to move to a stronger hashing function. For the moment, it is not necessary, but we’re following what is going on and eventually we will provide a version two or version three of this identifier standard to cope with the needs that will evolve over time.
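
The salted SHA-1 described here is the same computation Git performs for a blob: the header `blob <length>` followed by a NUL byte is prepended to the content before hashing. The sketch below computes a version 1 content identifier that way and then checks a piece of content against an identifier, which is the local verification step mentioned earlier in the conversation. The function names are chosen for the example, and only file contents (`cnt`) are covered.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Version 1 identifier of a file content, using Git's blob hashing."""
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()


def matches(data: bytes, swhid: str) -> bool:
    """True if `data` is exactly the content that `swhid` identifies."""
    return content_swhid(data) == swhid


if __name__ == "__main__":
    print(content_swhid(b"hello world\n"))
```

The hexadecimal part of the result should agree with what `git hash-object` prints for a file holding the same bytes, which is a convenient way to cross-check an implementation without trusting any third party.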

Gavin Henry 01:00:56 Thank you. As I understand it, the Software Heritage ID is — the Prefix, anyway — is registered with IANA, so it is a standard?

Roberto Di Cosmo 01:01:02 Yes. Well, actually the prefix is registered with IANA, which is the first step; then we have a recent property in Wikidata that corresponds to the Software Heritage identifier. There is an industry standard, SPDX, the Software Package Data Exchange, maintained by the Linux Foundation, that mentions the Software Heritage identifier starting from version 2.2, and actually we are now in the process of creating a real ISO standard for these identifiers, which will take several months, where all the precise technical details on how the identifiers are computed and what precise syntax needs to be used, I mean, everything needed for anybody else to rebuild their own system to compute identifiers for the software they have, are being worked out. If you are curious, there is now a website dedicated to this called SWHID.org, where anybody who is technically knowledgeable and wants to come in, lend a hand, and participate in this standardization can do so; the process is open to everybody. Just go to this website; you’ll see the pointers to the specification, which is undergoing review, and all the information to join the team that works together on improving the standard.

Gavin Henry 01:02:22 Thank you. Best take us on to wrapping up the show. It’s been really good. Just to close off this section for the last minute or so before we wrap up, what was the Software Heritage ID before? You know, what did you try before you got to that?

Roberto Di Cosmo 01:02:37 When we started this we didn’t have a very clear idea what to use, so before starting the project we looked at other identifiers. For example, in academia, which is my work, we’re used to identifying publications using something called the digital object identifier. But then we looked at how this digital object identifier is designed, and we found that it was not the right solution. It is an extrinsic identifier, with a register etc., and you have no guarantees of the integrity of the content. But we were already regularly using Git and Mercurial and these kinds of distributed version-control systems without asking ourselves how they work, okay? Just using them. And then we decided to look into how that was working, and so we understood the underlying technology etc. and we said okay, this is the way of doing things, it’s exactly this. But then we didn’t want to be stuck with one particular version-control system. We wanted to have something universal. And that was the reason to actually propose these identifiers as an independent, orthogonal approach to the identification of software source code, independently of the version-control system that was used. Just saying, ah, put it in Git and then get an identifier, was not a solution for us. We needed to have something that would work with software coming from anywhere.

Gavin Henry 01:04:02 It’s something that happens time and time again where you ended up thinking around the subject, or I do personally, where you think this must have been invented somewhere or in use somewhere else for what I’m trying to solve. Let me go and look at a different, put a different hat on, think about the subject, go for a walk, and then like you just said, been using it in Git, so let’s pull this apart and see how to apply it for something else.

Roberto Di Cosmo 01:04:23 Yes, if I may add something, let’s say we have been very lucky up to now in this initiative, because if we had decided to start 10 years earlier, so instead of 2015 we had decided to start in 2000 or something, this technology would not have been available, so we would probably not have had the idea of using it, and who knows what kind of mess we would have made. Okay? So, we were kind of lucky in starting the project sufficiently late to have access to the right technology, and then you remember what we mentioned here, like for example Ceph, was not available then. And different other tools we’re using were not available. So we were kind of lucky to have started the project sufficiently late to be able to stand on the shoulders of giants, as every good engineer should do, and sufficiently early to be present when the big, big dangers arrived: when Google Code shut down, when Gitorious shut down, when Bitbucket removed a quarter million projects, we were already there, and this is the reason why we archived all that and you can find it in the archive. Now the big question is how long our good star, our luck, will stay.

Roberto Di Cosmo 01:05:38 It also depends on our listeners today. If you find the project interesting, have a look at it. You can contribute; it’s open source. Or if you work for big companies that do not know it exists, tell them. I mean, if you want to support an important, common, joint platform that can be useful, Software Heritage is probably something you should look at and see how to join this mission at this moment. Again, you see, you have probably heard in this conversation how much passion we put into this project. This is the reason why all the people in the team actually work overtime, because we are passionate about creating all this. But what we are telling you about is not the end of the story; it’s not even the beginning of the end of the story. It’s the start of a long adventure where all of us, in particular those of us coming from computer technology and computer science, bear the responsibility of making this archive exist in the long term.

Gavin Henry 01:06:33 We often talk about software engineering, software development being an art form, you know art, and we need to protect art. So that’s what we’re doing here. Okay, I think we’ve done a great job of covering why the Software Heritage initiative exists, the challenges you’ve already faced and the ones that are coming up, and the various stages of the techniques you’ve developed to make it successful at the moment. But if there was one thing you’d like a software engineer or one of our listeners to remember from our show, what would you like that to be, Roberto?

Roberto Di Cosmo 01:07:04 A couple of things. One, what we are doing, I mean, developing software, is not just tools, it’s much more. I mean, software is a creation of human ingenuity that needs to be recognized, and the only way to actually showcase it is to keep and show the source code of the software we develop. The quality work we are doing day after day developing this kind of technology is a form of art, as Gavin said. We made this clear in many statements, and remember that when you work on software it’s not just for the money, not just for the technology, it’s because you are contributing to a part of our collective knowledge as humankind today. So that’s essential. And this is not just Software Heritage, it’s software in general. But then about Software Heritage: well, Software Heritage is an evolving infrastructure, a revolutionary infrastructure in the service of research, of industry, of public administration, of cultural heritage, and actually we need you to help us in building a better infrastructure and making it more sustainable. Then there are many use cases for industry we didn’t have time to cover here, but if you look at the archive, you will see there are probably many ideas you’ll have on how to use this to build better software.

Gavin Henry 01:08:27 Thank you. Was there anything we missed that you’d like to mention before we close?

Roberto Di Cosmo 01:08:31 Sure, there are too many things, you know; with seven years in a few dozen minutes there will always be something that we’re missing. But maybe as a last point: you have seen the rising worries about cybersecurity that we’re facing today. Well, this was not the original mission of Software Heritage, but actually the Software Heritage Archive, due to the way it was built (if you’ve seen the Merkle trees, the identifiers, de-duplication, traceability of the graph, etc.), is actually providing a fantastic infrastructure to help secure this open source software supply chain. So, we’re just again at the beginning of this, but next time you view a project or you discuss with people that ask questions like, where does this project come from? Can we trust this particular project? How can you ensure it has not been tampered with? Etc., it’s nice to have in the back of your mind the fact that there is a place where actually some people are building this universal, very large telescope for the house to look at the way software is developed worldwide, using cryptographic identifiers that let you actually track and check the integrity of every single component contained therein.
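
The integrity check Roberto describes can be illustrated with a small Merkle-style sketch in Python. This is a simplified scheme under stated assumptions, not the actual Software Heritage or Git directory format: it uses SHA-256 and an invented serialization, and the names `leaf_id`, `node_id`, and `verify` are hypothetical. What it shows is the general principle that a directory's identifier is derived from the identifiers of everything beneath it.

```python
import hashlib


def leaf_id(data: bytes) -> str:
    """Identifier of a file: a hash of its content (simplified, assumed scheme)."""
    return hashlib.sha256(b"file\0" + data).hexdigest()


def node_id(entries: dict) -> str:
    """Identifier of a directory: a hash over its sorted (name, child id) pairs."""
    h = hashlib.sha256(b"dir\0")
    for name in sorted(entries):
        child = entries[name]
        # A nested dict is a sub-directory; anything else is file content (bytes).
        child_id = node_id(child) if isinstance(child, dict) else leaf_id(child)
        h.update(name.encode("utf-8") + b"\0" + child_id.encode("ascii") + b"\n")
    return h.hexdigest()


def verify(tree: dict, expected_root: str) -> bool:
    """Recompute the root identifier and compare it with the one we were given."""
    return node_id(tree) == expected_root


if __name__ == "__main__":
    project = {"README": b"hi\n", "src": {"main.c": b"int main(void){return 0;}\n"}}
    root = node_id(project)

    tampered = {"README": b"hi\n", "src": {"main.c": b"int main(void){return 1;}\n"}}
    print(verify(project, root))   # the untouched tree matches its identifier
    print(verify(tampered, root))  # a change deep in the tree alters the root
```

Since identical sub-trees always hash to the same identifier, the same construction also explains the de-duplication mentioned above: a file or directory shared by many projects only needs to be stored once.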

Gavin Henry 01:09:46 Yeah. It could be that people prefer to come and get the archive from Software Heritage of their own project rather than trust it where they normally work. So, it’s a very good point. Where can people find out more? People can follow you on Twitter? How else would you like them to get in touch?

Roberto Di Cosmo 01:10:02 Well, there are many ways of knowing more. I mean, you can go to the main webpage, which is softwareheritage.org. Look there: there are dedicated webpages for different people, there is a webpage for developers, there are webpages for users, there are FAQs with tons of information. There are different ways to use the archive. If you want to get a feed of news, our Twitter feed is SWHeritage (Software Heritage with SW at the beginning), and we have a newsletter that goes out every three or four months, so it is not very likely to clog up your email. You can subscribe by going to softwareheritage.org/newsletter, where we try to summarize the news and provide you pointers to the things that are happening around. And last but not least, as Gavin mentioned, there is a growing number of ambassadors willing to help spread the word about the project; they get direct access to the team and help us explain to others what is going on and create a large community. So, you can contact them; they are on the webpage softwareheritage.org/ambassadors. Thanks a lot, Gavin, for being one of those ambassadors, by the way. And there is space for many others, so do not hesitate to contact them if you want to learn more.

Gavin Henry 01:11:22 Roberto, thank you for coming on the show. It’s been a real pleasure. This is Gavin Henry for Software Engineering Radio. Thank you for listening.

[End of Audio]

SE Radio theme: “Broken Reality” by Kevin MacLeod (incompetech.com — Licensed under Creative Commons: By Attribution 3.0)

Join the discussion

3 comments

Srinivasan Ramakrishnan says:

November 20, 2022 at 3:15 am

Superb initiative. Enjoyed listening to it. Yes, you have motivated me to pass the word around and contribute from India.
Chad Dougherty says:

March 2, 2023 at 7:47 am

All of the links in the references section of this page are broken. Could you please fix them?
Thanks…
SE Radio says:

March 29, 2023 at 4:22 pm

Thank you for letting us know about the link problem. All should be correct now.
