SE Radio 546: Dietrich Ayala on the InterPlanetary File System

In this episode, Dietrich Ayala of Protocol Labs speaks with host Nikhil Krishna about the InterPlanetary File System (IPFS), which is a protocol for distribution of data similar to HTTP. The major difference compared to HTTP is that IPFS uses content addressing to uniquely identify the data itself, so that you can identify and access it from any location that might host it. They discuss how anyone can set up an IPFS node and host and publish content that can be consumed from different HTTP gateways by anyone who has the content’s unique address. The conversation turns to the technical details, starting with how IPFS encodes and hashes files to make them available on the network, and then looks at the CID, which is the key identifier for a file block, and how we can use user-friendly addresses to access this content. Ayala describes the boundary of the IPFS protocol specification and what would be considered layers above the protocol, and how IPFS could potentially be used independently from the world wide web and HTTP. They close with a look at the libp2p package, which bundles much of the network stack (WebRTC, TCP/IP, etc.) so that it can be leveraged by any other application. Dietrich describes it as a “language-agnostic toolkit for building transport-agnostic applications.”


Show Notes

Transcript

Transcript brought to you by IEEE Software magazine.
This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number and URL.

Nikhil Krishna 00:00:16 Hello and welcome to Software Engineering Radio. My name is Nikhil. I’m your host for this episode. Today I will be speaking with Dietrich Ayala about IPFS. Dietrich leads the browsers and platforms group at Protocol Labs, making a more trustworthy underlying web through the adoption of IPFS, Filecoin, and libp2p in browsers, open-source libraries, developer tools, mobile apps, operating systems, and space communications. Before Protocol Labs, he spent over a decade at Mozilla building Firefox, shipping a smartphone OS, and running programs to scale developer relationships globally. Dietrich’s first computer job was as a webmaster at indie music label Sub Pop Records, doing anything and everything digital. He has since worked at small startups and also household names like McAfee and Yahoo. Before computerizing, Dietrich was a barista and chef. Welcome to the show, Dietrich. And is there anything I might have missed in your bio that you’d like to add?

Dietrich Ayala 00:01:20 Thanks for having me. No, I think that pretty well covers it. I started programming pretty late, in my mid-to-late twenties, and did all kinds of fun things to get there. Flash 3, PHP 3, going way back.

Nikhil Krishna 00:01:37 Wow. Yeah, it sounds like you’ve run the gamut. So, let’s jump into the topic of the day, which is IPFS or, to quote its full form, the InterPlanetary File System. So Dietrich, could you give us an overview of what IPFS is?

Dietrich Ayala 00:01:55 Yeah, IPFS is a protocol for the distribution of data, similar to how HTTP is a protocol for the distribution of data. Things that it’s used for today quite often are publishing webpages, the availability of very large data sets, and also things like local subnet communication between applications. One of the differences between IPFS and HTTP that’s important is that HTTP uses a trust model of SSL and DNS combined with HTTP to be able to find and locate data, whereas IPFS uses content addressing — using the unique signature of the data itself as the address that we request it by. And this means that your phone can be a server, or another computer on the subnet can be your server, or a remote computer on the other side of the world can be a server of the data that you’re looking for on the IPFS network.
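
To make that difference concrete, here is a minimal Python sketch of the two addressing models. It shows only the core idea: real IPFS CIDs wrap the hash in multihash and multibase encoding, which comes up later in the conversation.

```python
import hashlib

data = b"<p>Hello from the distributed web</p>"

# Location addressing (HTTP): the address names a host, not the data.
# If example.com goes away or changes the file, the address breaks or lies.
location_address = "https://example.com/pages/hello.html"

# Content addressing (IPFS): the address is derived from the bytes themselves,
# so any peer holding these exact bytes can serve them, and the fetcher can
# re-hash what it received to verify it got what it asked for.
content_address = hashlib.sha256(data).hexdigest()

print(location_address)
print(content_address)  # same data -> same address, on any machine
```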

Nikhil Krishna 00:02:52 Awesome. So you mentioned that IPFS is a protocol similar to HTTP. So HTTP obviously has a long pedigree. It’s been there, it’s been adopted as a standard. What is the status of IPFS in terms of adoption as a protocol spec?

Dietrich Ayala 00:03:10 IPFS itself has been around, wow, I think for seven years at this point. So it’s not necessarily new, but I think it definitely was in an experimental phase for a long time. One of the major changes that we’ve seen in these last two years, maybe, is a real uptick in adoption for a couple of different use cases. The biggest one definitely in the last year, year and a half, is NFTs. When you have a digital asset and its metadata that you want to be able to live outside of a specific HTTP endpoint or server — something not tied to a specific DNS name and URL — you need to have some way to identify it and make sure that it can be available everywhere. IPFS fit that bill pretty well, being that you can address the content by its signature, not by a specific server location. So that’s definitely been a large draw on IPFS usage from that community. With blockchains generally, you want to write to an immutable ledger, something that isn’t going to change or get pulled out from under you if somebody forgets to renew their domain, update their certificates, or changes companies — or even just moves files in a directory. In HTTP, you’ll get a 404; in IPFS, you still have this ID where, if that data’s still available from someone on the network, it’s going to be findable, and therefore those addresses ended up being of high utility to things like blockchains.

Nikhil Krishna 00:04:37 So speaking of blockchains, there is this well-publicized blockchain called Filecoin, which is closely related to, or kind of leverages, IPFS. Can you speak a little bit about the relationship between Filecoin and IPFS?

Dietrich Ayala 00:04:52 Yeah, sure. So, Filecoin and IPFS use some similar components. They use content IDs — that’s what we call IPFS addresses, these content-addressable bits. They also use libp2p, which is a toolkit for building peer-to-peer applications. It’s a set of specifications that can be implemented in any programming language. Some of the big ones that we use a lot are the Go implementation and the JavaScript implementation. And Filecoin is a layer-one blockchain. So, it is its own standalone blockchain where mining is comprised of file storage operations. So, what on another blockchain like Ethereum or Bitcoin you would call miners, we call storage providers, and the activity on the network is comprised of a couple of different things: proving that they have storage capacity and proving that they’re still holding the data that you asked them to store. IPFS and Filecoin are not dependent on each other.

Dietrich Ayala 00:05:53 You can use IPFS and never use Filecoin or its blockchain for anything at all. You can use Filecoin to store data, very large data; it’s designed initially for very large data sets. The default storage deal size is 32 gigabytes, so definitely not just for storing a couple of images. And you can do that without ever publishing that data to IPFS. It could be that you just want to securely store that data with one or more different storage providers, possibly in different geographic locations for redundancy and safety purposes, and get it back maybe a few years later. But you never have to publish that to IPFS, and that does not happen by default. So the two can be used in complementary ways, but they are completely decoupled and do not require the use of each other.

Nikhil Krishna 00:06:37 Right. So as you pointed out, IPFS is not coupled with Filecoin, and you can use one or the other in complementary ways. So does that mean that I can take the IPFS technology and incorporate it into my own blockchain, for example? Or can I even just leverage IPFS to build a web application? Is that one of the things that is possible with IPFS?

Dietrich Ayala 00:07:07 Yeah, absolutely. You can do both of those things. Some blockchains or blockchain-based projects do bundle an IPFS node, which is a way of talking about kind of the fullest expression of the capabilities of IPFS. Aragon was a project based on making it easy to create a DAO, a decentralized autonomous organization, out of the box using their application. They bundled IPFS with it. One of the places where we’re seeing a lot of uptake of IPFS is, like you were saying, people wanting to serve webpages on the network. And this happens for a couple of interesting reasons. I think the regular web, the HTTP web, is something that, like you said, has been around for over 30 years now. All of our tools speak HTTP, our APIs speak HTTP, and we’re all used to the pain and the peculiarities of that stack: everything from having to understand the full stack itself, which is a really massive, complex set of technologies, to everything that’s required to be able to deploy to HTTP.

Dietrich Ayala 00:08:06 And we get used to the things that can go wrong: updating SSL certs, moving files around, changing your whole back end, even trying to collaboratively manage things like DNS at an organization. Those are challenges, but challenges where other companies have filled the gaps. And with IPFS, there’s an aspect of DIY to it right now. You can do it at production levels, and a bunch of companies do this today, but also the regular developer who wants to publish a webpage, or even a non-technical person who wants to publish a static HTML ‘zine or a bunch of images, can do that with IPFS in a way that doesn’t require them to go and set up a remote service somewhere, register a DNS name, and do all these things. They can add it to a local IPFS node. IPFS Desktop is a very easy one to install — an Electron-based app, generally for non-technical users, not too bad for uploading files — and then they can share those addresses with other people. Those addresses often point to an IPFS gateway. We do run IPFS gateways to the HTTP network; that allows people to easily access files on the IPFS network from web browsers. And a big goal for us is to really get native support for IPFS inside web browsers. With my background working at Mozilla for a long time before joining Protocol Labs, that’s something I’ve been focused on these last couple of years.

Nikhil Krishna 00:09:30 Okay, great. So the way I understand it from you is that it is possible to leverage IPFS, put files on IPFS, and use it for your web application. You can easily upload files to IPFS, but you still need a way to distribute it. And right now, basically, that’s over HTTP, and there’s an HTTP-to-IPFS converter that Protocol Labs is running that helps make that translation so you can run your web application.

Dietrich Ayala 00:10:00 For the most part, that’s right. But I think that last bit’s really important, which is anyone can run an IPFS gateway to HTTP. So the Go implementation of IPFS comes with that gateway feature. You can turn it on; that means that you can run a gateway. We actually have a website called the Public Gateway Checker that allows you to list your gateway if you’re running one. Protocol Labs has run IPFS.io and dweb.link — two different gateways — for some time. But one of the biggest out there is Cloudflare’s. Cloudflare runs an IPFS gateway and serves a lot of data through there. And we’re seeing more and more providers run IPFS gateways as part of their infrastructure for distributed web projects, to be able to serve things like NFTs and the needs of growing blockchain usage.

Nikhil Krishna 00:10:46 Right. So just to quickly double-click on that a little bit. Cloudflare is a well-known CDN — content delivery network — and they host files regularly for everybody. So when you said I can leverage Cloudflare to use IPFS, is that something I can consciously enable? Is there a setting in Cloudflare where I can just tell them, no, I want to use IPFS for my application? Or is this something that they’re doing internally, transparently, as a value-added service for all customers?

Dietrich Ayala 00:11:18 For right now, they’re running an HTTP gateway to the IPFS network that anybody can access and load data through. And this is one of the interesting things — kind of one of the paradigmatic differences between IPFS and HTTP. With HTTP, you can only access data from a publisher through their website, which is the intersection of that DNS name, the SSL certificate often these days, and the HTTP servers or CDNs that they are running. And if that company decides that they don’t want to serve that data anymore, that data is gone from the internet. I think the original research that Brewster Kahle at the Internet Archive did around the average lifetime of a webpage was in the late nineties, and even then it was somewhere between 60 and 90 days. And these days, with native apps and APIs and things like that, a lot of the information on the regular web doesn’t actually last that long.

Dietrich Ayala 00:12:15 And one of the things about IPFS is that, because you can address data on the network and get it from anyone on the network through the peer-to-peer part of it, you could ask for the same address from Cloudflare’s gateway, or the IPFS.io gateway, or the local node that you may be running. Maybe you even have a local HTTP gateway running on your computer. All of them can fetch that address from the network from whoever might be holding it. So with IPFS, as long as someone on the network out there somewhere is hosting that one file that you asked for, whether you ask Cloudflare, us, or your own node, they’ll all be able to fetch that file from that one person out there who’s hosting it. That makes for a level of resiliency that you can’t have today with HTTP.

Nikhil Krishna 00:13:05 Yeah. So basically what that implies is: say I set up my account with Cloudflare and my account for whatever reason is closed or shut down, or runs out of money, or whatever. I can still serve my website as long as one of the other gateways has it — I mean, I can submit that CID to one of the other gateways.

Dietrich Ayala 00:13:25 Yeah. I mean, right now you don’t even need a Cloudflare account to be able to do that. Let’s say you install an IPFS node on your desktop computer today, or a laptop, or whatever, and you add a file there. You get the address for that file. You can ask Cloudflare’s gateway for it, and it will connect to the IPFS public network, which is a distributed hash table. It will ask the other nodes on the network — or hopefully not all of them; it will find it much sooner than that, based on the algorithm that the public DHT uses. It will find, as efficiently as possible, the node that is holding that data — the one on your computer that you installed and are running — which will return that data to Cloudflare’s gateway, which will then return it to you. And that means that you can ask for the same image from different HTTP servers.

Dietrich Ayala 00:14:15 And because the address that you asked for is cryptographically verifiable — it is basically a SHA-256 hash with some added metadata — you can verify that the data you get in return is the data you asked for, and that means you have to care a little bit less about where it actually came from. So the side effect of that addressing mechanism leads to a type of resilience in that you can ask anybody for the data that you need; you can receive it from anybody that has it, and you can verify that it was not modified. Those are really interesting properties that the HTTP web kind of doesn’t have. By design, to some extent; dynamic data on the HTTP web is one of the reasons why we love it and use it, right? But that’s maybe a different set of use cases.
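
As a rough sketch of what that property lets a client do, the snippet below fetches one CID from two independent public gateways and checks that the bytes agree. The CID is a placeholder to fill in, and a full client would verify the response against the CID itself, block by block, rather than just cross-checking gateways.

```python
import hashlib
import urllib.request

cid = "bafy..."  # placeholder: substitute a real CID

gateways = [
    "https://ipfs.io/ipfs/",             # Protocol Labs' public gateway
    "https://cloudflare-ipfs.com/ipfs/",  # Cloudflare's public gateway
]

digests = set()
for gw in gateways:
    with urllib.request.urlopen(gw + cid, timeout=30) as resp:
        body = resp.read()
    digests.add(hashlib.sha256(body).hexdigest())

# Because the address commits to the content, honest gateways must return
# identical bytes, so the digests must agree no matter who served them.
assert len(digests) == 1, "gateways returned different bytes!"
print("same content from every gateway")
```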

Nikhil Krishna 00:15:02 True, true. In fact, you’ve brought up a few terms, and I think it’s now time to jump in a little bit and talk about some of them, right? So we mentioned things like CID, which is a content identifier. We talked about DHT, which is a distributed hash table. So maybe we can start from the basics. I have a file with me, say an image of my profile, and I want to upload it. So, when I submit it to your IPFS desktop application to upload it, what actually happens to that file? Can you describe a little bit about how a file is converted into something that can get uploaded onto the IPFS network?

Dietrich Ayala 00:15:51 Yeah, absolutely. So let’s say you install IPFS Desktop and you’re running an IPFS node, or maybe you did brew install, or you went through npm or Chocolatey, however you end up running software locally; the Go implementation of IPFS is available in most of them these days. You’re running IPFS locally and you add that image file to it. If that file is under the default block size — let’s say it’s under a meg — that file will be added to a local repository, kind of like your local Git repositories, where it’s a hidden directory with a set of files, and it breaks these files up into chunks with signatures and things like this. The IPFS repository has metadata about the file, but the file doesn’t go anywhere. It sits in that repository, and what your IPFS node does is maintain a connection to the public DHT of what we often call the IPFS network.

Dietrich Ayala 00:16:51 It’s a public network of computers, several hundred thousand I think at this point — maybe even more actually now; it’s been a while. I should have checked those numbers before coming and talking to you so I could say how big it is. But the last check was a couple hundred thousand computers, run by everyone from us to Cloudflare to thousands of hobbyists to lots of different companies that are running IPFS nodes. And your node will announce that it has your file. So it’ll generate that hash-based address, the CID. It will take that CID and announce it to the network. It’ll say, hey, I’ve got data that has this address. And now that will be cached for a temporary period of time across a number of those nodes: oh, now we know that Nikhil’s node has a file with that address.

Dietrich Ayala 00:17:41 So then when people ask for it later — let’s say you text that address to me and I’m on the other side of the planet, and I go to the Cloudflare HTTP gateway to the IPFS network, or I use my local IPFS node, and I say, get me this file — it will then go to the network and ask, hey, does anybody have this file? It maintains a connection, like yours does, to a number of nodes. And it will ask those nodes it’s connected to, hey, do you know anybody that has this file? And it will keep doing that until it finds the information about your node. It will then directly connect to your node and ask it for that data, and your node will return that data. So when you add something to the IPFS network — if you’re running your own node and it’s on your local machine — no data is necessarily moved immediately at all. It just announces to that network at large, the global one, that locally you do have something with that address.

Dietrich Ayala 00:18:37 And if somebody else asks for it, it will eventually respond to those requests because it will receive them through that public network. That’s, in the most basic sense, how a single file works: you add it to your IPFS node, publish it to the network — which is really more of an announce, not actually moving data anywhere — and then respond to requests. This, I hope, demystifies some of the magic that people imagine about IPFS, like: if I add data to IPFS, how do I take it down? Well, most often IPFS hosting actually works a lot like traditional web hosting. The only person hosting the data is you. For me, the maxim of IPFS is: if you want your data to stay available on the network, you need to ensure that it does.
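
Here is a toy model of that add, announce, and fetch flow, with plain Python dictionaries standing in for the real repository and DHT: adding a file stores it locally and announces the CID; a fetch looks up a provider and pulls the bytes from them.

```python
import hashlib

# Toy stand-ins for the real machinery: the "DHT" is just a shared dict
# mapping CID -> set of node names that have announced they hold it.
dht: dict[str, set[str]] = {}
local_repos: dict[str, dict[str, bytes]] = {"nikhil": {}, "dietrich": {}}

def add(node: str, data: bytes) -> str:
    """Store data in the node's local repo and announce the CID.
    The data itself does not move anywhere at add time."""
    cid = hashlib.sha256(data).hexdigest()
    local_repos[node][cid] = data
    dht.setdefault(cid, set()).add(node)  # the announce
    return cid

def fetch(cid: str) -> bytes:
    """Find any provider for the CID via the 'DHT', then fetch from them."""
    provider = next(iter(dht[cid]))  # real IPFS routes to a provider efficiently
    return local_repos[provider][cid]

cid = add("nikhil", b"my profile photo bytes")
assert fetch(cid) == b"my profile photo bytes"  # served from Nikhil's repo
```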

Dietrich Ayala 00:19:20 Often, ensuring that means using what we call a pinning service: a company that hosts your data on the IPFS network to make sure it stays available all the time, the same way that you would use a regular web host. So, in some respects, while IPFS is, as I say, paradigmatically different from HTTP in terms of how we address data, who you can get it from, and how you can verify that it was not modified, all things that HTTP can’t really do, it also works like HTTP in that if you want some data to stay available, you need to make sure that it does. And one of the major differences from a deployment and scaling aspect, and this is a really important characteristic of IPFS, is that with HTTP, if you upload a file to your HTTP web server and you serve that image from nikhilsblog.com, then aside from CDNs or caches you’re paying for, you’re basically the only verifiable place that that data can be retrieved from, forever. That’s the ceiling of availability with HTTP.

Dietrich Ayala 00:20:26 But with IPFS, anybody hosting that file can keep that file available on that network. And so, for IPFS you hosting it once is the ceiling of availability of data.

Nikhil Krishna 00:20:39 Is the floor you mean?

Dietrich Ayala 00:20:41 Yeah, yeah. Sorry, it’s the floor. I’m not in Australia. I’m not living in an upside-down world. Oh yeah. So really availability starts by uploading one file once with IPFS, but it ends there with HTTP, for the most part.

Nikhil Krishna 00:20:55 Okay. Right. Thank you for that; it’s a great description of how this kind of file gets published. And like you said, it’s the floor. Just a couple of quick follow-ups there. How do I actually set up copies? You’ve mentioned a pinning service. Is that something that I have to use, or can I just send you the file and say, hey, this is my file, I want you to also host it, put it in your node, and it would automatically work and do what I want?

Dietrich Ayala 00:21:25 Yes. Actually, when I first started working on IPFS-related things, I was making a browser extension that basically had some of the underlying common denominators, the primitives that you would need inside a browser to be able to build an IPFS client, or a Dat or Secure Scuttlebutt client (other decentralized web protocols), or even an Ethereum light client or something like that. And I discovered IPFS, and one of the first projects I found was something called “IPFS with friends.” It was the idea that friends were sharing data amongst themselves in a way that allowed them to have fun and collaborate. And if some service provider went away, well, you and all your friends still had your data; or if they closed down your channel, or if they got bought by a bigger fish, whatever, you and your friends still had your stuff, because your stuff is your stuff.

Dietrich Ayala 00:22:12 And I think, for me, that ends up being one of the most fun aspects of these networks and these alternative ways of thinking about collaborating with and sharing data: it becomes cooperative, and you can build communities around it. There’s a thing called IPFS Cluster, which uses a sync algorithm to sync data between different IPFS nodes. And what people have done with IPFS Cluster is come up with this idea of collaborative clusters, where you might want to contribute to the hosting of critical data, like a scientific dataset or Wikipedia data, that you want to donate serving from your IPFS node, and be a part of the community that keeps that data available and alive. So we’ve seen lots and lots of instances of collaborative data sharing in this way with IPFS. Whereas with HTTP, if you’re requesting it from that one website and it goes down, or you happen to live in a country that turns that DNS off, well, you’re out of luck.

Nikhil Krishna 00:23:10 Right. So just to get back to the other point: so far we were talking about one image file, a relatively small image file, and just now we talked about Wikipedia, which is obviously a lot more data. So how does IPFS actually work with large files? Does it just take a hash of that entire large file and distribute that? Or is there something more complex going on?

Dietrich Ayala 00:23:39 Oh yeah. More complex and more interesting. Big data, big problem. In the cloud-scaling world, maybe that looks like S3 egress bills, but here IPFS has ways of linking data that are really interesting. So let’s say that image file you have is a 10-megabyte image file. Well, IPFS operates on the idea of blocks. Up to one megabyte, it will just serve that data as one address for that one block of data. If you have that file and it’s a 10-megabyte file, when you add it to your IPFS node, what that local IPFS node will do — again, without publishing any data to the network yet — is chunk that file. It’ll break it up into smaller chunks, it will give each one of those chunks an address, and then it will encode that data into a Merkle DAG, which is basically a data construct that maps the IDs of all of those blocks into one walkable directed acyclic graph. So, it’s a way of linking all of those chunks together.
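
A rough sketch of that chunk-and-link step. The real implementation chunks at 256 KiB by default and encodes nodes with UnixFS and DAG-PB rather than JSON and hex digests; this toy version just shows the shape: hash each chunk, then build a root block that lists the chunk addresses in order.

```python
import hashlib
import json

CHUNK_SIZE = 256 * 1024  # go-ipfs chunks at 256 KiB by default

def chunk_and_link(data: bytes) -> tuple[str, dict[str, bytes]]:
    """Split data into blocks, address each block by its hash, and build a
    root block linking to them: a one-level Merkle DAG."""
    blocks: dict[str, bytes] = {}
    leaf_cids = []
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        cid = hashlib.sha256(chunk).hexdigest()
        blocks[cid] = chunk
        leaf_cids.append(cid)
    # The root block encodes the ordered links to its children, so the
    # child CIDs and their order travel inside the block itself.
    root = json.dumps({"links": leaf_cids}).encode()
    root_cid = hashlib.sha256(root).hexdigest()
    blocks[root_cid] = root
    return root_cid, blocks

root_cid, blocks = chunk_and_link(b"x" * (10 * 1024 * 1024))  # a 10 MB file
print(root_cid, len(blocks))  # 1 root block + 40 leaf blocks
```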

Nikhil Krishna 00:24:45 So, when you say ID over there, is that a hash ID of that block, or is that some other kind of ID?

Dietrich Ayala 00:24:53 It’s the same. It’s a CID, but there is metadata encoded into that block that lists the CIDs of the leaf nodes.

Nikhil Krishna 00:25:00 Ah, okay, cool.

Dietrich Ayala 00:25:01 Yeah, it still ends up being an immutable data structure, but you can reference those blocks individually. So now let’s say me, on the other side of the world, I ask for the root CID, the one at the tip of that tree. My node goes out, asks the network, finds your computer, and then asks for all of the blocks going down that graph, each individual block. And this is really efficient for a couple of different reasons. One might be immediately obvious: let’s say someone else has some of those blocks, but not all the blocks. I can now start receiving those blocks from both of you; you might serve me half the blocks, and they might serve me half the blocks. And when you think about very, very large data sets like Wikipedia (650 gigs or something like that for the base mirrorable image of it) or large operating system distros, distribution over IPFS becomes very efficient at that point.

Nikhil Krishna 00:25:57 This is similar to BitTorrent, isn’t it? Isn’t that what BitTorrent also does?

Dietrich Ayala 00:26:01 Yeah, at a high level that pattern is very similar. The way that data addressing and verification happen in BitTorrent is quite different, and the way that you advertise and publish on the network is also quite different. But it’s the same pattern, absolutely.

Nikhil Krishna 00:26:17 Okay. So, we’ve got this Merkle DAG of hashes, and that is actually what gets published. When you said the file stays with you and the CID gets published — now it’s a Merkle DAG that gets published?

Dietrich Ayala 00:26:32 It’s still a CID, though. So what it’ll actually do is publish the CIDs of each one of those blocks, from the root all the way down through the branch nodes to the leaf nodes. It will publish the CIDs of each block. And that’s how you get that network-level efficiency.

Nikhil Krishna 00:26:50 But doesn’t it also have to publish the relationship between the blocks, which block is first or which block is second?

Dietrich Ayala 00:26:56 And that’s encoded into the block.

Nikhil Krishna 00:26:58 Ah, it’s encoded into the block itself.

Dietrich Ayala 00:27:01 That does mean more round trips, and so trade-offs, right? You end up getting some resiliency in exchange for some trade-offs. Something you’ll notice with IPFS: in some cases it is not immediately as fast as a centralized network, where you’re just asking one party for one thing that you hope is not a 404, and they just return it to you if they have it. Big or small, nothing complex. So performance is definitely one of the challenges. Performance on a distributed network, well, that’s been an academic and practical challenge for quite a long time. We’ve made huge strides in making IPFS very, very performant in different applications and different contexts. But ultimately, the type of performance that end users need is relative to the trade-offs that they have in their given use case. So, for example, if you want to get some data from somebody on the local network but there’s no internet available, you could do that with IPFS, and it’s going to be really fast because it’s going over the local network even though you have no internet connectivity. In that use case, for example, it’s very useful. Whereas with HTTP you’d be like, well, it’s on a server on the internet somewhere, but neither of us has internet access, so we can’t do anything. So we love this pattern: local collaboration is something that you can do with IPFS that’s really difficult otherwise. You can’t even get an SSL cert for local network addresses yet; that’s been in process at the W3C for many, many years, and it’s not really going anywhere.
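
Continuing the toy model from the chunking sketch above, retrieval is a verified walk down that graph, which is where both the extra round trips and the resiliency come from:

```python
import hashlib
import json

def get_block(cid: str, network_blocks: dict[str, bytes]) -> bytes:
    """Fetch one block (from a dict here; from peers in real IPFS) and
    verify it by re-hashing the bytes and comparing against the address."""
    block = network_blocks[cid]
    assert hashlib.sha256(block).hexdigest() == cid, "corrupt or forged block"
    return block

def fetch_file(root_cid: str, network_blocks: dict[str, bytes]) -> bytes:
    """Walk the DAG from the root: one request per block in this naive
    version. Real clients pipeline requests and can pull different blocks
    from different peers at the same time."""
    root = json.loads(get_block(root_cid, network_blocks))
    return b"".join(get_block(c, network_blocks) for c in root["links"])

# Reusing root_cid and blocks from the chunking sketch above:
# assert fetch_file(root_cid, blocks) == b"x" * (10 * 1024 * 1024)
```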

Nikhil Krishna 00:28:25 Right. So, just to focus on the CID thing: you mentioned something interesting, which was that the CID has some metadata encoded in it about the hash, like what the position is of the block that this particular CID is addressing, right? So does that mean that it’s not a simple SHA hash of the file? It seems to imply that there is more to it than just a hash of the file content.

Dietrich Ayala 00:28:54 The relationships in that Merkle DAG, that structure, are not encoded in the CID. They’re in the data that you get back when you ask for the CID. The CID itself actually is that SHA-256 hash, by default. But I think this is actually a great entry point into: what is a CID? A CID is more than just a hash. It’s basically a super-address that is designed to be upgradable and configurable. HTTP URLs today are not necessarily versioned. You can say, hey, I support a given version of HTTP, and you can do that at the beginning of your HTTP request and response. But URLs themselves are a pretty static format. With CIDs, you can configure the chunking algorithm, and you can configure the hash that you want to use.

Dietrich Ayala 00:29:43 If you don’t want to use SHA-256 and you want to use something else, that’s something you configure. And there’s a set of specifications that comprise these technologies: one is multibase, one is multihash, and these are, again, future-proof, upgradable data-construction specifications for these different components of a CID. Multihash and multibase are both actually things we’re going to propose at the IETF to go into draft status there. So we really want to standardize these, what we see as very important ingredients in an internet that can be resilient for not just the last 30 to 50 years but the next 30 to 50 years, in a way that lets that data stay available and resilient and malleable and upgradable, instead of being held back by technologies that are location-based, that maybe aren’t as upgradable, where it can be very difficult to try to bolt on functionality for upgradeability, as with HTTP.

Dietrich Ayala 00:30:44 I think we saw this with the offline-first movement, right? It’s very difficult to get a protocol like HTTP, which is designed around remote request-response and exchange of data, to be thought of as offline. And I think that architectural mismatch is still problematic today. Or multi-party: things like CORS are a great example. As soon as you violate the trust boundaries of the origin security model of HTTP, things get really, really hard, and we have to be very, very careful, and things have to be very, very safe, and difficult, and then people end up just not doing it so much.
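
To make the CID anatomy concrete: per the multiformats specs, a binary CIDv1 is just a few self-describing prefixes in front of the digest. Here is a sketch of assembling one by hand, with byte values from the public multicodec table (real tooling handles varints and the base encoding):

```python
import hashlib

data = b"hello ipfs"

# multihash: <hash-function-code><digest-length><digest>
digest = hashlib.sha256(data).digest()
multihash = bytes([0x12, 0x20]) + digest  # 0x12 = sha2-256, 0x20 = 32 bytes

# CIDv1: <version><content-codec><multihash>
cid_v1 = bytes([0x01, 0x55]) + multihash  # 0x01 = CIDv1, 0x55 = raw bytes

# In text form this gets a multibase prefix, e.g. 'b' + base32(cid_v1).
# Every field is swappable: a different hash function or codec just changes
# a self-describing prefix, which is what makes CIDs upgradable.
print(cid_v1.hex())
```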

Nikhil Krishna 00:31:15 Okay, so speaking of CIDs again: one of the arguments that you could make against them is that the format I’ve seen is not the most user-friendly, right? It’s not as easy as www.example.com, which is something that rolls off the tongue, so to speak. So, is there a way for us to map these complex multihashes to a simpler naming system that we can remember and share with friends?

Dietrich Ayala 00:31:44 Yeah, there are a bunch of different ways that people do this. So, DNSLink is one of them, and that’s basically using DNS TXT records to point a traditional domain name to an IPFS CID. That’s something that is used by a lot of different web hosts that support IPFS today; Fleek.co is one. And then even things like ENS, the Ethereum Name Service, and other services like this use technologies like DNS. And there’s another one called IPNS, which is a way of having a key pair where you publish a public key that represents a mutable pointer to a given CID. So, this is a question that often comes up, given that a CID is immutable: let’s say I want to publish a new version of my profile image, an example you used earlier. You would use something like IPNS to say, here’s the public record on the IPFS network that is a mutable pointer to immutable data.

Dietrich Ayala 00:32:37 So from a publishing standpoint, web developers are really familiar with saying, all right, I’m going to set up my DNS name and it’s going to point to an IP address, and at that IP address I’ll have my web server, and that will serve whatever I want out of there, and I can change it all I want. With IPFS and public naming, the way you’re talking about, we flip that model a little bit. We push the mutability from your web server up to the DNS level. So, let’s say you publish a new version of your static website. You use whatever your Jamstack approach is, you generate the static HTML, and you publish it to your regular website. You can also then get the IPFS CID of that static content after you add it to IPFS, and update your DNSLink to point to that new CID. And that allows you to publish mutable, dynamic data on IPFS in a way that still allows people to navigate to it using the user agents they use today — typically a web browser.
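
The DNSLink record itself is just a DNS TXT record on a `_dnslink.` subdomain. Here is a sketch of looking one up, assuming the third-party dnspython package and a hypothetical domain:

```python
# pip install dnspython
import dns.resolver

# A site publishes a TXT record like:
#   _dnslink.example.com.  TXT  "dnslink=/ipfs/<CID>"
# and IPFS-aware resolvers map example.com to that CID.
domain = "example.com"  # hypothetical: substitute a DNSLink-enabled site

answers = dns.resolver.resolve(f"_dnslink.{domain}", "TXT")
for record in answers:
    txt = record.strings[0].decode()
    if txt.startswith("dnslink="):
        # Updating the site means re-publishing this one TXT record to
        # point at the new CID; the old CID stays valid and immutable.
        print("points to:", txt.removeprefix("dnslink="))  # e.g. /ipfs/<CID>
```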

Nikhil Krishna 00:33:33 Right. Okay, cool. So from what I understand, you have the concept of the DNSLink, which uses the regular HTTP DNS concepts, and that maps to this IPNS, which is essentially kind of a pointer to the actual CID. That allows you to then say, hey, okay, I made a mistake with my profile, I’ve got a better profile, I want to update it. So I can just change the pointer to point to that new CID, and then just share the DNSLink with my friend, and he’ll see the new profile.

Dietrich Ayala 00:34:09 Yeah, they have to reload the page. I mean, I wouldn’t call this haircut a mistake, it was an improvement.

Nikhil Krishna 00:34:15 Absolutely. Yeah, but we all keep adding gray hairs and experimenting with color, right?

Dietrich Ayala 00:34:25 ,

Nikhil Krishna 00:34:26 So moving on: we talked about how you can publish your file, and it’s not really publishing the file. You’re basically just putting a pointer out there, the CID, and we said that it’s up to you: unless you copy the file or give the file to somebody, there will still be only one copy. What about folks that are really invested in keeping copies of data for a long time, like the Wikipedia folks you pointed out, or the people who run the Internet Archive? They want to have this working for a long time. Is there any kind of limit or minimum in the IPFS protocol that says, if you want to keep this around for a long time, you should keep N copies across Y nodes? Or anything like that? Or is it just, at this point, the more copies you keep the better?

Dietrich Ayala 00:35:26 That’s a good question, and it’s something that I’ve thought about doing some projecting or modeling around, but it’s pretty use-case dependent. It kind of depends on what the level of demand for the file is and what the use cases are for it. So, if you have some cold-storage data archives that aren’t going to be requested very often, and where you think they’re generally pretty safe, run by a business or something, maybe you have a copy that’s your published copy and one that you keep in your own node or something like that. But I don’t think there’s a hard and fast rule there. And I think for some use cases you could come up with some kind of availability model, but it probably would actually start looking like global CDN availability, points of presence, right?

Dietrich Ayala 00:36:09 If you want to have data replicated and highly available for a geography where there’s a given demand, well, you might want to pick an IPFS pinning provider that has a presence in that geographic region, and then make sure that people using IPFS can get that initial data more easily. Maybe you even use something like IPFS Cluster to sync that data out to the other nodes where you want it available. So we definitely see patterns like that. As different IPFS providers, and IPFS providers as an industry, are growing, people are operating these nodes at scale. This is really something that is not nearly as formalized and as well-trodden ground as cloud distribution, publishing, and scaling are today with the HTTP web. So we have a community of people that operate IPFS nodes who are sharing information about how to do that.

Dietrich Ayala 00:37:02 Because IPFS has a local repository, though, that doesn’t really operate the way a database does today; it’s storage, and it’s relatively simple compared to an RDBMS or an object-oriented database or something like that. It’s a publishing and addressability layer and a peer-to-peer network in one, and scaling that sometimes still takes some magic and some dark arts: hanging out with other node operators and figuring out what’s worked. But I think that’s one of the biggest growth areas we’ve seen. There are just now a bunch of companies doing this, and they weren’t doing this a year ago, two years ago, at nearly the same scale and level. NFT drops in particular: you want to put 2,000 things up for sale, and the dependency on availability of that data at a given time is crucial to being able to do that drop.

Dietrich Ayala 00:37:56 So there we’re seeing a lot of interesting innovations happening around bulk uploads and availability times. There’s a group here, NFT.storage, a team within Protocol Labs that might actually be turning into a subsidiary at some point. They work together with Cloudflare to make IPFS data available in Cloudflare edge workers in really interesting ways, to be able to ensure the high availability of NFT assets and metadata. So some of these use cases are really pushing on the state of the art in high-availability performance and distribution of IPFS data.

Nikhil Krishna 00:38:30 Right. Okay. Cool. So it sounds like, from what you said, the IPFS specification per se doesn’t actually have an opinion about this; it’s kind of a layer above, right? So, when I put my distributed-systems cap on and come back and ask, hey, is this similar to your standard, I don’t know, Kafka or your standard database cluster? We’re not talking about that kind of system; IPFS works at a level lower than that. And from what you’re saying, it looks like these other pieces, like IPFS Cluster and the node providers working on top of it, would potentially be building the applications that then become concerned about things like the CAP theorem and availability and partitioning and stuff like that. Am I right?

Dietrich Ayala 00:39:19 Yeah, I think that’s probably a good way of describing it, right? One of the conversations that we’ve had a lot is about what the most minimal version of IPFS is. Do you need to participate in the public DHT? Do you need to actually have libp2p? Do you need to have a transport-agnostic protocol layer underneath you? And I think where we’ve ended up is that content addressing, using IPFS CIDs to address data, is really the minimum requirement for, air quotes, “using IPFS.” If you’re addressing data in that way, you get a lot of the benefits, and you kind of get to choose how much of the underlying infrastructure you want to implement and how. You get a way of addressing data that can live beyond that initial use case, or even beyond the initial publisher.

Dietrich Ayala 00:40:08 And that has its trade-offs and challenges too, but it ensures that the application itself doesn’t have that location-based complexity built into it. It can still address that data even if that data lives somewhere else: a different data center, a different domain name. That use of addressing also means that you can get that data from anywhere, because of the cryptographic verifiability. Because the address is generated from the data itself, if someone changes even one pixel in that image, it’s going to have a different address. So, you ask for something and you can verify that what you get in return is actually what you asked for. And that’s built into how we address data, from the root of the project. So I think that’s a really good way of thinking about it: the minimum viable IPFS is using CIDs, content identifiers that are based on the contents of the data. And outside of the stack from there, we’re seeing all kinds of permutations of IPFS: from highly centralized HTTP IPFS data networks, where anybody can still get that data but they get it from a single source without an underlying P2P network, to private networks.

Dietrich Ayala 00:41:14 So, two or more computers that have a private DHT between them and are sharing that data, not connected to the public network. Or even transient IPFS networks. Mobile is a really good use case, right? Let’s say you and I are in the same room and there’s no internet. Well, realistically, we’re not going to run a full IPFS node on our phones, because that’s going to open up a bunch of listening sockets, and for one, it’ll drain your battery real fast trying to run a server like that. It’s just not optimized for the architecture of mobile devices, or the radio architectures of their network connections either, right? But if you and I have an IPFS-based application that can communicate over, say, BLE or Nearby, or even the underlying network bits that iOS ships that power things like AirDrop, then if you’re addressing data by that CID, we can still have an app that communicates completely directly. I can share photos with you, and we can be typing into an app and chatting back and forth even though there’s no external network at all, right?

Nikhil Krishna 00:42:14 Yeah. And potentially it could be more efficient if you’re sharing files that are similar and that have similar blocks, right?

Dietrich Ayala 00:42:20 Absolutely. We’re actually seeing some groups apply this toward things like refugee camps, where they can’t get video and DNS resolution from outside of these places even though there’s great local network connectivity, and then other use cases like emergency situations, earthquakes or something like that, where municipal services might be down, but you can do things like store-and-forward messaging that is content-addressed over IPFS on devices, or through local Wi-Fi subnets that are set up, and things like this. So that resiliency, I think, is going to be an asset in the long term. But I think right now we’re still in relatively early days, under a decade into the life of this technology, in terms of developer tooling, high availability, cloud deployment, all this kind of stuff.

Nikhil Krishna 00:43:05 Great. Now I want to delve into a slightly different topic, and this is something that came up when I was looking into IPFS: libp2p, right? My understanding is that IPFS, from a code-organization perspective, is essentially a bundle of different components. You have libp2p, you have the multihash concepts, and then you have IPLD and UnixFS and stuff like that. Can you go into what libp2p is, what it has to do with IPFS, and what the relationship is between the two?

Dietrich Ayala 00:43:44 Yeah, so libp2p is a toolkit for building peer-to-peer applications. I think the best way to think about it is to think about how you would build applications that are transport-agnostic. And that’s less about P2P necessarily, and more about your application layer having a consistent API that it can use to communicate with a network — whatever network that is. The underlying network could be Bluetooth between two phones. The underlying network could be the internet itself, where you have TCP and UDP and all these protocols that can operate at high scale. The underlying network could be MQTT-only on an IoT sensor network. With libp2p, you have an abstraction layer where you can write application code that doesn’t have to care about those underlying network connectivity specifics, or network transport availability specifics, necessarily.

Dietrich Ayala 00:44:39 Maybe the initial author who deployed it onto the hardware had to figure that bit out, but at the application layer, you don’t have to do so as much. This has a lot of benefits in reducing complexity at the layer above, and in being able to have application code that is portable across some of these different runtimes and is not locked into things like checking for HTTP headers or anything like that, right? There are benefits whether you’re building web applications or systems code and tools. But it’s also not required for IPFS; we’re seeing more and more IPFS implementations that don’t necessarily bundle all of libp2p. libp2p itself, in order to provide that simplicity, can be a complex set of specifications that need to be implemented, and it does impose some constraints up into the application layer around those APIs as well.
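
As a sketch of what transport-agnostic means at the API level, here is a hypothetical interface for illustration; it is not libp2p’s actual API in any language:

```python
from abc import ABC, abstractmethod

class Connection(ABC):
    @abstractmethod
    def send(self, data: bytes) -> None: ...
    @abstractmethod
    def receive(self) -> bytes: ...

class Transport(ABC):
    """Hypothetical: anything that can move bytes to a peer -- TCP, WebRTC,
    Bluetooth, MQTT -- implements this one interface."""
    @abstractmethod
    def dial(self, peer_addr: str) -> Connection: ...

def request_block(transport: Transport, peer_addr: str, cid: str) -> bytes:
    """Application code: it asks for a block by CID and never learns
    whether the bytes traveled over TCP, Bluetooth, or anything else."""
    conn = transport.dial(peer_addr)
    conn.send(cid.encode())
    return conn.receive()
```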

Dietrich Ayala 00:45:28 So it’s one of those things that for us has been a key foundational piece in being able to build things like IPFS. Initially, it was bundled into IPFS and was then split out as a separate layer, this set of components, and now other projects like Ethereum 2 are using libp2p even though they don’t necessarily have IPFS or something like that built in. That underlying toolkit also has pub-sub, a publish-subscribe feature. So you can do messaging, where you can subscribe and publish messages and distribute information that is maybe short-lived and not immutably referenceable the way IPFS data is. And one of the key pairings we see a lot is publishing data to IPFS and then sharing those CIDs over a libp2p pub-sub channel. There you get this really nice feedback loop and application model around nodes that are participating in a given application: when changes happen, they get notified over that pub-sub channel: hey, here’s the new CID. Nikhil updated his profile photo; here’s the new CID for it. So you get that type of functionality, which has been pretty complementary and has led to some really interesting applications.
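
A sketch of that pairing, with a hypothetical in-process pub-sub standing in for libp2p’s gossip-based one. The pattern is what matters: immutable data goes to IPFS, and only the small mutable pointer, the new CID, travels over the channel.

```python
from collections import defaultdict
from typing import Callable

# Hypothetical in-process pub-sub; libp2p's pub-sub plays this role
# across machines.
subscribers: dict[str, list[Callable[[str], None]]] = defaultdict(list)

def subscribe(topic: str, handler: Callable[[str], None]) -> None:
    subscribers[topic].append(handler)

def publish(topic: str, message: str) -> None:
    for handler in subscribers[topic]:
        handler(message)

# The application pattern: store the new profile photo in IPFS (immutable,
# content-addressed), then announce only its CID on a topic.
subscribe("profile-updates/nikhil",
          lambda cid: print(f"fetch new photo from IPFS: {cid}"))
publish("profile-updates/nikhil", "bafy...newPhotoCid")  # placeholder CID
```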

Nikhil Krishna 00:46:34 So you mentioned that libp2p is being used by other projects. Does that mean it’s distributed separately from IPFS? Can I just go directly, load the libp2p library, and use it in my application?

Dietrich Ayala 00:46:50 You can go to libp2p.io, and libp2p has a JavaScript implementation that is available on npm. You can integrate it into your own libraries. It works all the way out into the web layer, but of course any web content code has constraints on whatever connectivity is available. So, you have to do things like set up a WebRTC or WebSocket connection to connect from your webpage out to the network.

Nikhil Krishna 00:47:12 So is libp2p primarily focused on the JavaScript and web community? It’s not something where I can take this and write a C application with it?

Dietrich Ayala 00:47:24 Oh, no, no, absolutely you can. The Rust implementation and the Go implementation are kind of the network heavy-lifters for the IPFS implementations in those languages. It’s a language-agnostic toolkit for building transport-agnostic applications. I just noted that one of the places where libp2p has to operate quite differently is when you actually ship it in web content, right? You’ve got fetch, WebRTC, WebSockets, and the browser won’t let you open up a listening TCP socket from a webpage. And that’s probably a good thing.

Nikhil Krishna 00:47:56 Right. So, the way I understood it is that libp2p is kind of an abstraction over the network stack, right? You don’t really care about how the message gets communicated to the other side; libp2p handles that, and you have a standard API for saying, hey, this is the message, send it somehow, right? Now coming back to the IPFS relationship: does libp2p actually contain the code, or the parts of IPFS, that relate to the distributed hash table, connecting to other nodes, and how that hash table is maintained?

Dietrich Ayala 00:48:36 Yeah, the IPFS implementation that has the DHT functionality basically constructs it using libp2p components.

Nikhil Krishna 00:48:44 Okay, cool. So this is a little bit of a tangent, but I wanted to go into it as well. We’ve been talking about DHT, and we’ve expanded it as “distributed hash table.” I’m sure some of our listeners would love to understand what a distributed hash table is, and why it’s the way we communicate with, or discover, nodes. Can you talk a little bit about that?

Dietrich Ayala 00:49:08 Yeah, I’m probably not the best person to define what a distributed hash table is, but basically: given a network of computers, they’re sharing information about state that allows you to understand what is where on that network. At a high level, that’s probably enough to understand how a set of IPFS nodes can share state such as, hey, I have these sets of addresses, or they have those sets of addresses. For IPFS, it serves a key purpose in that the ability to share that state across a broad number of nodes in the network allows us to route users to content quickly and efficiently. Using a Kademlia DHT algorithm, you can say, hey, who’s holding X? And you can get there in a very short period of time without having to do a full exhaustive search of the network.
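
The trick Kademlia uses is an XOR distance metric over IDs: node IDs and content keys live in the same space, each node keeps contacts spread across that space, and every hop moves roughly half the remaining distance, so lookups take about log2(N) steps rather than an exhaustive search. A small sketch of the metric:

```python
import hashlib
import secrets

def make_id(seed: bytes) -> int:
    # Node IDs and content keys share one 256-bit ID space.
    return int.from_bytes(hashlib.sha256(seed).digest(), "big")

def distance(a: int, b: int) -> int:
    return a ^ b  # Kademlia's XOR metric

peers = [make_id(secrets.token_bytes(8)) for _ in range(1000)]
key = make_id(b"some CID being looked up")

# One routing step: of everyone I know, who is closest to the key?
# Asking those peers for *their* closest contacts converges on the
# key's neighborhood in O(log n) hops instead of an exhaustive search.
closest = sorted(peers, key=lambda p: distance(p, key))[:3]
print([hex(p)[:12] for p in closest])
```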

Nikhil Krishna 00:50:00 Right. That actually brings up an interesting follow-up, which is: if I’ve written my own IPFS application using libp2p and I want to connect it to the IPFS network, where do I start? I mean, shouldn’t I get some part of the DHT? How do I actually figure out which node to start with?

Dietrich Ayala 00:50:23 So yes, we have what are called bootstrap nodes, and typically anybody who maintains an IPFS implementation will have a configuration file with a set of bootstrap nodes. These are publicly available nodes, run either by Protocol Labs or by other people, that over time we’ve learned have the level of resilience and availability to be there, and they will then connect you to more nodes. Also, once you are connected to the DHT, you’ll learn about more nodes. So the way the libp2p connectivity model for IPFS works is that it tries to keep the number of nodes it stays connected to within a range, a low watermark and a high watermark; it’s not permanently connected, and some might drop off. Let’s say you have a minimum connectivity of 200 nodes: I want to have connectivity to 200 nodes at all times, to increase the performance and probability that any requests I make are serviced performantly.
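
A sketch of that low and high watermark idea, with illustrative numbers and a stand-in liveness check; the real connection manager is configurable and scores peers before trimming:

```python
import random

LOW_WATER, HIGH_WATER = 200, 300  # illustrative; configurable in real nodes

def ping(peer: str) -> bool:
    """Stand-in for a real liveness check."""
    return random.random() > 0.05

def maintain_connections(connected: set[str], known_peers: list[str]) -> None:
    """Keep the connection count inside [LOW_WATER, HIGH_WATER]: drop dead
    peers, dial more when below the floor, trim when above the ceiling."""
    alive = {p for p in connected if ping(p)}
    while len(alive) < LOW_WATER and known_peers:
        alive.add(known_peers.pop())  # dial back up to the low watermark
    while len(alive) > HIGH_WATER:
        alive.pop()                   # real nodes trim lowest-scored peers
    connected.clear()
    connected.update(alive)
```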

Nikhil Krishna 00:51:24 How do I tell if I’ve got connectivity to 200 nodes? Is there a heartbeat or some way to tell whether the 200 nodes I’m supposed to be connected to are still alive?

Dietrich Ayala 00:51:36 Yeah, yeah. So, the IPFS node is basically a daemon that runs and maintains connectivity to those nodes. And if you’re using the IPFS CLI, there’s a whole list of commands that will give you the state of your current connection to the network. You can do everything from diagnosing the availability of a given CID (you can say, hey, IPFS, tell me how many nodes on the network are currently serving this CID), to connectivity and state management, to commands to introspect your local data store: what do I have, how big is it, things like that. So there’s a way to say, hey, tell me how many nodes I’m currently connected to. IPFS Companion is a browser extension that is a companion to your local IPFS node.

Dietrich Ayala 00:52:28 It does things like show an ambient display of the number of peer connections you currently have. But the way that connectivity is managed is: let’s say 10 peers drop off. IPFS will then ask the network for more peers until it gets back up to that range of healthy connectivity it wants to maintain. And that’s one of the reasons why, when we think about what IPFS is, often people think about running an IPFS node, which is like running a server that connects to a bunch of other servers, is available to them, and answers their requests. Running a server isn’t the ideal architecture for all use cases. It is ideal if you want to actually have a high-availability connection — maybe you want more decentralization, or maybe you’re not so worried about the centralization aspect and you’re like, no, I’m cool with some centralization. That’s fine.

Dietrich Ayala 00:53:15 That hybrid mode is totally legitimate. And so, designing a software architecture for a service like that has to be respectful of local use cases, of local computational and resource requirements. Things like mobile, like I mentioned: having 200 persistent connections from a mobile phone, it’s not going to last long, right? But it might be fine for a laptop that’s plugged in while you’re doing a bunch of stuff, where you want to pull down video or something like that. So, it really depends on the use case that you have. And honestly, just as libp2p is a toolkit for building peer-to-peer applications, IPFS itself is a data distribution and addressing toolkit. When you asked about the best practices for publishing an image (should I make sure six nodes have it, or 12, or 10?), it really depends on the use case, and IPFS is not necessarily idiomatic about the application layer. Like HTTP, it’s like, hey, here’s some basics on how to do addressing, here are some things you can do around level of connectivity or things that may be specific to the environment you’re running in, but it’s not going to tell the application layer up above how it should behave too much, as long as it meets those basic requirements around addressing of data. So that’s really where a lot of that value comes from.

Nikhil Krishna 00:54:39 Awesome. Cool. We’ve been discussing for a while now, and looking at the time, we’ve been chatting for over an hour, so let’s wrap things up a little bit. One of my last questions would be: as an application developer, how can I leverage or make use of IPFS and libp2p? Is there an easy way for me to start getting into this technology? Would it be better to build a website or a web application, or do you think maybe a lower-level CLI application or a desktop application is the way to go?

Dietrich Ayala 00:55:17 I think it depends on what your background, experience, and interests are. For webdevs who just want to publish their websites to IPFS, services like Fleek.co really make it easy: they hook into your GitHub CI, they’ll automatically publish, they’ll update your IPNS name, they’ll even update your ENS name. So making your website available on IPFS for static content and static webpages is totally easy going that route. That’s one way; then you can share it with people and send them to that address over IPFS, right? Another option is to install IPFS Desktop; that’s an easy way to install basically an Electron-based tray application. You can see how many peers you’re connected to, and you can upload and download files. It also installs the CLI, so then you can start playing around with the CLI and start introspecting your connection to the network, asking the network for data, publishing, and seeing how it works. That’s another way.

Dietrich Ayala 00:56:07 If you want to get a little bit of both worlds, the Brave web browser actually has IPFS built in. So you can download and install Brave, and if you load an IPFS address in Brave, it’ll just connect to an IPFS HTTP gateway by default, but it will ask you: would you like to install a full IPFS node? It actually downloads and runs the Go implementation, what we call the Kubo implementation of IPFS, and then manages that service for you. It’ll spin it up and shut it down; you can go to brave://ipfs and manage your node from there. You can see how much data it’s hosting, and it allows you to natively load and view IPFS data inside Brave. You can save data to your local IPFS node through Brave: right-click on an image and save it there, things like that.

Dietrich Ayala 00:56:55 So that’s a pretty fun and easy way to get started that doesn’t even really require any developer capabilities. But if you want to build apps, that’s a pretty good way. And then lastly, I think there are two more things. Rust IPFS is definitely something: there’s a new implementation of IPFS in Rust called Iroh (I-R-O-H) that a lot of people are really excited about. And then js-ipfs is the implementation in JavaScript, and that is an entire toolkit of different libraries that make it super easy to use npm and whatever your entire JavaScript build environment is to work with IPFS. And there are both server-side and client-side implementations there as well.

Nikhil Krishna 00:57:32 Okay, cool. Thank you, Dietrich, it was a great conversation. Is there anything in this episode that I missed that you would like to talk about? Or do you think we’ve done a good job of covering what IPFS is?

Dietrich Ayala 00:57:47 We covered some ground for sure; we covered a lot of it. One thing: if people want to learn more, we just had IPFS Camp, where over 500 people gathered to talk about IPFS, with loads and loads of tracks and talks. All those talks are on YouTube and available to watch if you want to learn more. Basically, the entire universe of IPFS is represented in the set of speakers and all the tracks for IPFS Camp. It’s an easy way in: they have a 101-level curriculum, a 201-level curriculum, and then all kinds of different sub-topics. There was an entire libp2p day as well, so there’s a whole lot there if you want to learn more about libp2p.

Nikhil Krishna 00:58:21 Nice. Okay. Thank you, Dietrich, for coming on Software Engineering Radio. I had a great time talking with you. Thanks.

Dietrich Ayala 00:58:28 Thanks for having me.

[End of Audio]


SE Radio theme: “Broken Reality” by Kevin MacLeod (incompetech.com — Licensed under Creative Commons: By Attribution 3.0)
