Episode 549: William Falcon on Optimizing Deep Learning Models

William Falcon of Lightning AI discusses how to use the Lightning platform to optimize deep learning models, noting that optimization is a necessary step towards creating a production application. Philip Winston spoke with Falcon about PyTorch, PyTorch Lightning, the Lightning platform, the needs of training vs. inferencing, scalability, machine learning operations, distributed training, the LightningModule, decoupling the model from the hardware, and the future direction of the Lightning platform.
This transcript was automatically generated. To suggest improvements in the text, please contact content@computer.org and include the episode number and URL.
Philip Winston 00:00:16 Welcome to Software Engineering Radio. My guest today is William Falcon. William is the CEO of Lightning AI whose platform we will discuss today. William is on leave from earning his PhD at NYU, New York University. His research there was at the intersection of AI and neuroscience. Previously, William co-founded and was the CTO of an AI startup called NextGenVest. And before that he was a software engineer at Goldman Sachs. Let’s start with an introduction. What is PyTorch, and then what is Lightning?
William Falcon 00:00:49 So, if you’re building neural networks, you need a framework to basically help you do automatic differentiation. So PyTorch is, I guess — it’s a lot of things, but for people working in deep learning, it’s really a way to build neural networks in a way that you can automatically differentiate and compute gradients. So, it’s a computational library, I guess. And Lightning is a way to organize PyTorch code so that you have flexibility, modularity, and the structure of the code allows teams to work together and scale the workloads. So, what we’ve discovered is that training things like foundation models, LLMs, diffusion, and so on, can’t really be done through plain scripts easily — and that is how people normally code. You actually need structure like Lightning to do it. This is why even companies like Facebook use Lightning internally even though they created PyTorch, right?
Philip Winston 00:01:35 Great. And just to kind of frame what we’re talking about here, can you think of some specific projects or products that are made with Lightning and PyTorch that just kind of give us a frame of reference throughout the conversation?
William Falcon 00:01:48 Yeah, so we have around 10,000 companies using Lightning today to train models. So, one big example is obviously Facebook itself. So Meta — a lot of the stuff that you touch on Meta today uses Lightning to train under the hood. Now, they use a combination of other things, of course, but it’s a big part of their stack. There are self-driving car companies, there are major banks, there are retailers I guess we cannot talk about publicly. So, the examples I’ll mention are all public repos. You can go to Nvidia — there’s a bunch of repos that use Lightning. So, NeMo is a library by Nvidia that is actually built on top of Lightning, which is cool. You can actually browse through our GitHub repo. I don’t know how much I can say about clients without their permission on this, but one major public example that came out recently is — so if you guys know Stable Diffusion, right? A really cool company called Stability AI built that, and all that code is PyTorch, right? So those models are trained using Lightning as well.
Philip Winston 00:02:39 Great. So that’s a pretty broad adoption right there. Before we talk about Lightning in detail, let’s discuss deep learning in general and the challenges of developing and productionizing deep learning applications today, focusing maybe some more on the operational challenges. For an in-depth introduction to deep learning, let me refer you to Episode 391, Jeremy Howard on Deep Learning and fast.ai. And here are three other past episodes related to machine learning: Episode 534, Andy Dang on AI/ML Observability; Episode 522, Noah Gift on MLOps; and Episode 473, Mike Del Balso on Feature Stores. So, first: today in 2022, how much of the machine learning industry and focus is on deep learning as opposed to other types of machine learning?
William Falcon 00:03:33 Well, I’ll start with the easy answer. If you have data already, and if you’ve already been doing basic machine learning, then people are starting to look into deep learning right now. What I’m seeing across industry is that companies in finance, healthcare, self-driving cars, they’re pretty deep learning already. They’ve already adopted a lot of this, even retailers. There are some industries that are lagging, but it’s mostly because they don’t have the data collection efforts in place. So, I think before you can do deep learning, you basically need the data. Now that may not be true now with foundation models where you can actually fine tune, but you know, that stuff changes every week. So, things are changing rapidly there. Now if you apply the deep learning definition more broadly, which is differentiable programs — well, logistic regression, linear regression — those can all be written in PyTorch, so you can actually distribute the code across GPUs and everything else. So, if you put those algorithms into that category, then it’s a pretty broad use case already.
Philip Winston 00:04:26 Okay. And to flip it around, are there application domains where you think deep learning should not be used? Just to kind of frame the scope?
William Falcon 00:04:35 A few years ago, I would’ve said NLP, because before Transformers — before that paper came out in 2018 — I would’ve said NLP, probably mostly because there was not a lot of ability to do transfer learning. So, you had to have a lot of data to train stuff, and at NextGenVest we used a lot of NLP back then, right? So, we were deploying NLP models and you needed a ton of conversations to do stuff. Today that’s not necessarily true anymore. So where do you not need deep learning? It’s maybe less that — I think you can apply deep learning now to everything, but it’s more about whether you’re going to get a massive boost, and that’s only in certain areas. So, like, random forests are still great. I think XGBoost is also still great, right? So sometimes a deep learning algorithm will be a bit overkill when you have those. So, no matter what we do — I’m an AI researcher and I do deep learning — I always start with the baseline. That’s always linear regression, a random forest, or something like that, just to see what we can do beyond that.
Philip Winston 00:05:30 Okay. You mentioned NextGenVest, I think that was a company you founded previously. Can you just mention what they did?
William Falcon 00:05:37 So we were a service for helping high school students figure out how to pay for college — low-income students. So, we would go into low-income high schools throughout the US, and we would basically give seminars and onboard the students there onto our platform, which was all text-based. You know, these are low-income students where a lot of them don’t have smartphones or laptops, so it would be regular Nokia phones, that kind of thing. They would text with us and we’d pair them with a money mentor, is what we called it — like a financial advisor. So, you know, people that we trained, and they would help you figure out how to apply for scholarships, how to get money. Like, we helped students who were so good that they didn’t even know they could apply to Harvard or Ivy League universities.
William Falcon 00:06:14 And so we were really trying to democratize education, and the way that we scaled that — we had at the time about 65,000 users and we only had about 60 money mentors, right? We were having a ton of conversations daily. So, we used a lot of NLP to basically do that. We didn’t do automated chats; we did a lot of human-in-the-loop to help them get more context about what they had to answer and things to say to them, right? Today, I still wouldn’t do it fully automated, because even these GPT models and LLMs, they hallucinate a lot, right? So, you don’t just want to drop these things without any human supervision.
Philip Winston 00:06:45 Yeah, that’s my impression that sort of human-assisted AI is going to be super common going forward for quite a while.
William Falcon 00:06:54 Yeah, absolutely. And it’s interesting because, like you know, the way I think about these foundation models — so you might have heard of diffusion and LLMs; like, they have a bug and a really nasty bug right now, which is if they don’t know they’re just going to make something up, right? So, we call that a “hallucination.” It’s not great in text because it’ll be like, yeah I don’t know someone did X and that that actually didn’t happen. And they say with so much confidence that it can trick you. But for diffusion where it’s generative art, it actually becomes a feature because like hallucinating actually makes an image look more interesting, right? So, it’s kind of fascinating in my view how those things are evolving.
Philip Winston 00:07:25 Yeah, I can see the difference there where precision is required and maybe where it isn’t as essential. So, talking about optimization, what does it mean to optimize a deep learning model? We know from regular code we’re generally talking about the speed that it takes, the memory that it uses. Is that the same with deep learning, or are there other, you know, specific types of optimization we’re talking about?
William Falcon 00:07:49 Yes, so it really depends how deep you want to go, I guess? I think today you don’t have to worry about a lot of this because of libraries like PyTorch and Lightning, right? Where we handle a lot of this under the hood. But if you really want to optimize deep learning, it depends on the scale — I’m talking scale of hundreds of GPUs or thousands, right? If you’re talking about a single GPU or a few, I think that’s very straightforward to optimize. But to optimize something like multi-node training where you have thousands of GPUs — and this is the stuff that Lightning was built on in 2019, right? So, today it’s a big deal, but back then we were really pushing the edge. You have to optimize the way that the data gets loaded, make sure that it doesn’t bottleneck the training, especially in the Cloud, right?
William Falcon 00:08:25 So a lot of like our Cloud offering is actually, that’s what it solves. Makes sure that like we do all that stuff under the hood for you. And then once you have the data on the machines, then you have to optimize literally like what actually the models are doing. And that’s going to be through the algorithm itself. Like, how are you computing the gradients? Like, what type of operations are there? Are you broadcasting or not? So, it gets really, really performance-heavy there. You can even optimize coda kernels if you want to fuse operations or something like that. And then the networking itself is a big bottleneck as well. So, it kind of goes all in. So, like if you need like millisecond training or millisecond latency for inference or something deep learning, it’s actually a lot of work and it takes, it has a lot of moving parts like I think regular software engineering generally doesn’t. Right? And then not to mention the math that we’re taking for granted because the model already gave you the math, right? So, you’re just kind of like black boxing, that thing.
Philip Winston 00:09:13 Yeah, so how about training versus inferencing? Is optimization strictly an inferencing sort of production issue, or maybe iterating during this training phase — is optimization also a factor there?
William Falcon 00:09:27 It really depends on your use case. So, let’s say you’re like CNN or something like that, right? Or Buzzfeed or something. You process news immediately, right? Wall Street Journal. And you get the news immediately, and now you need to train a model and then recommend things for that. So, you now have literally a few minutes to train that model and then deploy it because, like, by the time that you’re done that may be stale; other people may have picked it up, right? So, you want to drive traffic to your site. So, it just really depends on the use case. Like, for those guys real-time training — like as quickly as possible — is super, super important. Whereas, for someone like maybe, yeah even fraud detection I would say — like, let’s say you’re a bank and you’re doing fraud detection. You may only need to train that model once a day and it’s fine, right? So, you don’t actually need it to be super fast. Inference also, it kind of matters, depends on the use case, right? Like if you’re in a self-driving car, well you really need to do like kind of real-time inference, like sub-millisecond inference, right? If you are a website powering something like stable diffusion where you put a prompt and you generate an image, like, users can wait a few seconds for that, it’s not really a problem, right? So, it just kind of comes down to the specific use case.
Philip Winston 00:10:30 Yeah. So, I wasn’t thinking of real-time training at all. My recent exposure to AI models is the AI art and ChatGPT, and in those cases I think they do a huge, expensive training run and then the model can be used for months or longer, maybe years. But I see what you’re saying about needing to train almost all the time. That’s a different use case.
William Falcon 00:10:54 Even with those use cases, you still want to fine tune the models if you can, right? Like, if you have no data, no money, then take the pre-trained model. If you have some data and some money, you still want to fine tune it to make it more appropriate for what you’re doing. Fine tuning is literally training; it’s just not training for months, it’s training for a few hours. But even training an LLM — fine tuning it — the way I explain it is: training something like an SVM or a regular kind of convolutional network is kind of like cooking a steak, right? It fits on a pan. Like, you can do that, and if you want to train many of those you can just get many pans, right? And then kind of have separate steaks for that. Training an LLM is kind of like you have the whole cow there; it just doesn’t fit on the thing.
William Falcon 00:11:35 Like, you just can’t do it, right? You need, like, a grill or something bigger. So that model is so big that it just cannot fit in one machine, so you actually have to distribute it across machines even if you’re only going to train for like five minutes, right? And it’s just the size of the models. Now, you have techniques like deep speed or FSTP where we’ll actually offload GPU memory into CPU, for example, to actually try to fit these models into it. This is like a very active area of research. But yeah, I mean I would even fine tune if I could, right? If I had the money and the data I would still recommend people fine tune.
Philip Winston 00:12:05 Yeah. So, I definitely see optimization being important to training and inferencing. But how about just sort of deep into early development where you’re still experimenting and then you get something you like and you’re going to make this transition to production. What sort of concerns come to mind, or do you have to kind of keep it in good shape sort of no matter what phase of development you’re in?
William Falcon 00:12:29 Yeah, I mean, I think my main concern is how you actually go through that process. Because if you use notebooks, like, yeah maybe it’s nice because it’s cool to like scratch paper, try some things out, but actually putting that into production or even training it at scale is impossible, right? So, I recommend people just always use scripts directly, right? Like write your own Python scripts; do it that way and we encourage everyone to do that. It just, I think that’s the biggest bottleneck I would say. It’s like okay, even though you went and POC’d and you did scratch paper for a while, you should still take the time to then write something, clean that code up and make it scalable so you can actually go do stuff with it. So probably main concern is people leaving the code in notebooks and not doing anything after that because the other stuff I think is a lot easier to handle.
Philip Winston 00:13:12 Yeah, I can see that. And you talked about distributed training with, you know, hundreds or more GPUs. Is that a big transition point where maybe you can train on one GPU for a while and you’re sort of happy with that and then you scale it up and you have to make that jump? Or is it more of a smooth use as many GPUs as needed, in terms of what’s been done historically?
William Falcon 00:13:35 Exactly. So normally the number of GPUs is a function of how big the model is and how much data you have — and to some extent, your willingness to pay, right? If the model fits on one GPU, fine, you can use one GPU. Now the problem is how much data do you have? If you have a few thousand examples, then that one-GPU model will train in a few hours maybe and you’ll be done. But if you have billions of examples, now suddenly that becomes six months of training. So, if you go to two GPUs, that becomes three months. If you go to four GPUs, that becomes a month and a half, right? So, it just depends. Do you want your results in a month and a half or in an hour? If you go to an hour you’re going to need like 2,000 GPUs, right? So, it just depends on your particular use case. I think this is where Lightning shines because to you it doesn’t actually matter; it’s a number that you change in Lightning — you say one or a thousand and we take care of the rest. So, it just comes down to your appetite for time and money, I guess.
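The arithmetic he’s describing, as a back-of-the-envelope sketch that assumes idealized linear scaling (real jobs lose some efficiency to communication):

```python
def estimated_days(single_gpu_days: float, num_gpus: int) -> float:
    """Idealized linear scaling; real jobs are somewhat slower due to communication overhead."""
    return single_gpu_days / num_gpus

# Roughly the numbers from the example: ~180 days (six months) on a single GPU.
for gpus in (1, 2, 4, 2000):
    print(f"{gpus:>4} GPUs -> ~{estimated_days(180, gpus):.2f} days")
# 1 GPU -> ~180 days, 2 -> ~90, 4 -> ~45, 2000 -> a couple of hours
```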
Philip Winston 00:14:29 Yeah, I’ve definitely seen that in non-machine learning cases where you have a batch process and there are certain bright lines — like, can you do two iterations a day, or four per day, or do you have to do one per day? It really does matter sometimes how many hours it takes.
William Falcon 00:14:47 Yeah, and I think even in the research phase, there is a POC phase where you’re just trying to see if this thing is going to work. I think deep learning is a little bit different than, I don’t know, traditional machine learning, I guess, where the hyperparameters matter a lot. Like, if you mess up — I don’t know, in an SVM the setting for the margin, how big that gap is, or the decision boundary — that’s not really going to change things that drastically for you. Whereas in deep learning, the wrong learning rate will just give you garbage the whole time, right? So, you have to do more hyperparameter sweeps usually. When you get these prebuilt models, they already have the settings in there, but they may not work for your data set. So, some people get like 70% and they’re like, okay, fine. But you know, you could have tuned the hyperparameters. Tuning the hyperparameters means you actually have to run parallel versions of that model. So, I’ll take that same model on one GPU and then change the hyperparameters and run it on a separate GPU. So, that’s going to limit your iteration speed. So, if you need to try a hundred hyperparameters, then if you have a hundred GPUs you can do it all at once; otherwise you’re going to have to sequence them, right? So that’s where I would start thinking about distributed as well.
Philip Winston 00:15:48 Okay. I mentioned before we had an episode on MLops, but for the purposes of the rest of this conversation, what are these operational tasks, whether during training or production? Can you give examples of some ML ops tasks that can soak up a lot of time or take a lot of effort?
William Falcon 00:16:06 Yes. So again, it really depends on your workflow. I’m going to pick a pretty complicated one. You have to move data around, you have to process it, you have to extract features. Maybe you store those features, maybe you store the output of that into an S3 bucket, and then from there you train a model. The model training part itself: how do you get the GPUs, how do you get the hardware? How do you link up the data? What permissions do you have? Like, do you have access to that data or not? Is that model even allowed? If you’re in finance, are you tracking everything that happened to that model, so when auditing happens, you can actually explain what’s going on with it? So, it’s not as simple as just train a model. And I think this is where a lot of people get caught up, because they’ll be like, oh I can train this model, and they go do this somewhere, and then suddenly they’re stitching together a billion tools to get all of these things done, and it’s just impossible to maintain, right?
William Falcon 00:16:55 And so, I think the complexity’s going to add up depending on the field that you’re in. Even if you’re in healthcare, you may not even have access to the machines because the data cannot leave the machines, right? So, there are a lot of nuances that go into this. A lot of what we do on the platform is basically abstract that stuff away so that, you know, we work with tons of customers to make sure that this is already embedded into the tools and like true to the Lightning way, all you have to think about is hey solving cancer, whatever you’re doing without all these other elements that go along with it, right? And I actually think MLops will get more complicated because, you know, we’re starting to apply regulations to these things, right? We’re starting to learn what should not be done. Remember this is like a brand-new field, it’s like the wild west. No one knows what’s going to change, right? So, if you’re not future proof, maybe a year from now, whatever you build no longer can pass whatever test some agency puts together for models now, right?
Philip Winston 00:17:42 Yeah. And I can imagine that part of the issue is segmenting, from a hiring point of view, what are the skillsets you need to hire for? And maybe back in the day having the PhD researcher be the same person as doing operations was possible. But I imagine in any company of a significant size today you have different roles for these different things.
William Falcon 00:18:05 Yeah, and you know, the thing is so complicated with deep learning that, if you look at the company structure that we have, we have people across the full stack: we have experts in infrastructure, we have experts at the PyTorch level. Like, we literally just started a PyTorch team with the old core lead of PyTorch. And that means that things can be optimized fully end-to-end. Most companies don’t have that, right? So, you’re not going to have people who are expert CUDA kernel people, or compiler people, or distributed training people, or any of that. So, you’re basically going to be offloading a lot of that to the tools, like, the open-source thing. So, we take care of that for you. And I actually think it’s great because it lowers the barrier to entry. Like, in my world you have an undergrad in CS who can go build the most complicated diffusion model in the world without really having to deal with all the things that happen under the hood. It’s kind of like today when you go drive to the store — you’re not learning fluid mechanics and physics to do that, right?
Philip Winston 00:18:55 Yeah. And some of that’s maturity and some of it maybe the demand for these skills is increasing. So, let’s get into Lightning, in particular. Let’s talk about where did this idea for PyTorch Lightning come from — when it was called PyTorch Lightning — and just kind of the story about how it evolved into Lightning, the platform.
William Falcon 00:19:17 Yeah, for sure. So, I was an undergrad at Columbia University working in computational neuroscience. This was around 2015, I want to say — 2014, 2015. And, you know, I was working on neural decoding, so we were measuring activity in the brain — in our case, the retina — and we were trying to figure out what produced those neural activities. They’re like spikes, right? So, the reverse process of going from spikes back to the input is called neural decoding. Well, for the process of decoding, we were experimenting with neural networks, right? Specifically, VAEs and GANs at the time. We ended up publishing a paper in NeurIPS about this. That was one of the first papers to do this with deep learning models. And you know, back then I started writing code so that I wouldn’t have to refactor anything when I moved to a new project, right?
William Falcon 00:20:00 And this became — I called it like research lit back then — and this was, like ,my internal project and actually was written first in Theano, so it was not PyTorch, right? Because that didn’t exist back then. So, it’s Theano and then eventually we wrote it in tensorflow because the models were too slow in Theano, so we wanted to move super, super fast, and that was really good. I mean, it did speed things up, but it was really hard to work with, especially on the debugging side. So then, when PyTorch came out we rewrote it in PyTorch and kind of kept that library around for a while; it was always my personal library. And then in 2018 when I started my PhD, you know I was working with Kyunghyun Cho and Yann LeCun at NYU, and Cho was like, hey, why don’t we maybe open source this and like let other people in the lab use it, right?
William Falcon 00:20:40 And so, we started working on that, and then basically, you know, kind of made it open source in early 2019, I want to say, for the lab — for CILVR, NYU CILVR. People there started using it, and then a few months later I joined Facebook AI research, and then kind of brought it there and continued using it. Then a few teams at FAIR started adopting it. And at the time what I was trying to do was train on YouTube — like, literally all of YouTube: the biggest model I could find. You know, there was a lot of complications of getting the data, so we ended up doing something different. But back then, the models that we were training were one model and about a thousand GPUs, 1024 GPUs. And I was doing sweeps of like five or six of those models, right? So, it was like a few thousand GPUs being used at any one time, and I was like a research intern, and I was like this one person blowing up the cluster.
William Falcon 00:21:25 The other three teams were like full-blown like five- six-person teams that were like just starting to reach that scale, as well. And I worked closely with a lot of teams like Fairseq, you know, PyTorch team itself. I was sitting next to them and most of the time there. So, a lot of the knowledge gets embedded into Lightning during that time for distributed training. And then it kind of takes off across other companies, finance, everything else. Facebook itself starts using it more. And then, you know, later that year, late 2019, basically there was a lot of demand from companies to say, hey, how do we do this on the Cloud? How do we scale? Like, cool, like yes, I can set this number to a thousand, but like how do I actually get the machines? And there’s so much going into that, right?
William Falcon 00:22:03 So, I decided to start the company. And back then we started basically helping people train models in the Cloud, right? So, this is called Grid, and that went well. And then, basically, my vision for what we wanted to do was not necessarily just train models, right? It was really, how do you work with every tool out there, every single open-source tool, every single private tool that you have, get it to work together, how do you build, like, an operating system basically, right? And we kind of introduce that idea like I think it’s common today and people take this for granted, but remember before Lightning you didn’t just get TensorBoard out of the box with things, like, we introduced that idea; we introduced the trainer as well, right? So, we started embedding all of these other secondary tools together to work into a cohesive experience.
William Falcon 00:22:41 And so, people started getting this notion of a platform at that point, right? But through the framework level, which is really interesting. So then, we generalized that idea and we said, well you know, it turns out that this paradigm that we implemented is not just good for training, it’s actually good for the whole thing. And so, we launched Lightning AI basically as a platform where you can train models and then build full-blown apps, AI apps, right? And so, what people know as PyTorch Lightning today has been renamed to Lightning. And in there, you now have three classes that you didn’t have before. Before you had LightningModule and then the trainer, right? To train models. Now you also have a LightningWork and a LightningFlow. Those two things introduce something called like a reactive paradigm. So, if you’ve worked on like reactJS, it’s basically like react for Cloud, where you can actually build full-blown apps.
William Falcon 00:23:24 And apps is a weird word, but what we mean is workflows, right? So even if you’re doing research, it means I can train models, fine tune them, I can train LLMs, monitor if they fail or not. So, it is kind of this abstraction layer that actually allows you to automate a lot of the operational aspects of this all through plain Python. So, all of this kind of grew out of that. And I think holistically the goal for me is that if you’re a scientist, you only have to be a scientist in the thing that you need to know, like biology or chemistry, and you don’t have to learn anything else and be an expert at that and then let the tools do the rest for you. Right? That’s on the science side. Yeah.
Philip Winston 00:23:57 Yeah. Going sort of way back, you were talking about the number of GPUs you were using and you mentioned something like five or six sweeps — what does a sweep mean in this case?
William Falcon 00:24:08 Yeah, so I took a model. That model was training on a thousand GPUs, maybe using a learning rate of like 0.02, and then I want to try a learning rate of 0.01, and 0.04, and something else. So, at the same time I run that model with different learning rates, in parallel. So now I have four models, each model on a thousand GPUs, but each one has a different learning rate, all at once. So, it’s an experiment — it’s four experiments, right? What I’m trying to see is how does the learning rate affect convergence? Is it going to slow it down? Sometimes it doesn’t even converge, sometimes it blows up. So, it’s a very R&D process.
Philip Winston 00:24:41 And am I right that the learning rate is a hyperparameter, where a hyperparameter means something that you are explicitly setting outside of the learning process? Is that the correct terminology?
William Falcon 00:24:52 Exactly, yeah. So, you give the learning process some parameters, like the number of layers, the size of the image, right? And then the model kind of gets created to fit those requirements.
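A toy, single-machine version of that kind of sweep — everything here is illustrative, and in a real sweep each run would be its own job on its own GPUs, running in parallel:

```python
import torch
from torch import nn

# Toy sweep: the same architecture instantiated once per learning rate, so you can
# compare how each setting converges. In practice each run gets its own GPU(s).
learning_rates = [0.01, 0.02, 0.04, 0.08]

data = torch.randn(512, 16)
target = torch.randn(512, 1)

for lr in learning_rates:
    model = nn.Linear(16, 1)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for step in range(100):
        loss = nn.functional.mse_loss(model(data), target)
        opt.zero_grad()
        loss.backward()
        opt.step()
    print(f"lr={lr}: final loss {loss.item():.4f}")  # compare convergence across runs
```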
Philip Winston 00:25:04 Okay. And you mentioned LightningModule, which I was going to ask about in a little bit. But to stick to an introductory level here, what are the steps to create a new Lightning project — or sort of, what are the prerequisites before you can set up a Lightning project?
William Falcon 00:25:20 Yeah, so first thing to keep in mind is the LightningModule is an organizational layer on PyTorch. It’s not an abstraction, right? So, it only organizes the code. By doing that, you’re getting rid of a lot of boilerplate. So, if you already have a PyTorch project, you’re basically going to go into the inner loop. So, the part right where you get the batch of data, from there until you call loss.backward() — that part goes into the training step of the LightningModule. It’s a method on this class, and you just put all that information in there, and then you get rid of everything that has to do with accelerators, like .cuda, .cpu, all of that, because Lightning will automatically handle that for you, which makes it hardware-agnostic. Then if you have a validation loop — so if you’re doing something like validation or even cross-validation — you can basically take that code and put it into a validation step.
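A minimal sketch of that refactor, assuming Lightning 2.x (the model and shapes are illustrative): the body of the hand-written loop moves into training_step, the validation loop into validation_step, and device calls like .cuda() disappear because the Trainer handles hardware placement.

```python
import torch
from torch import nn
import lightning.pytorch as pl


class LitClassifier(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 10))

    # The body of your old hand-written loop, from "get a batch" up to
    # (but not including) loss.backward(); note there is no .cuda()/.cpu() anywhere.
    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.cross_entropy(self.net(x.view(x.size(0), -1)), y)
        self.log("train_loss", loss)
        return loss  # the Trainer calls backward() and the optimizer step for you

    # Your old validation loop body goes here; the Trainer decides when to run it.
    def validation_step(self, batch, batch_idx):
        x, y = batch
        self.log("val_loss", nn.functional.cross_entropy(self.net(x.view(x.size(0), -1)), y))

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```

From there, something like `pl.Trainer(max_epochs=3).fit(LitClassifier(), train_loader, val_loader)` runs the loop against ordinary PyTorch DataLoaders, on whatever hardware the Trainer is configured for.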
William Falcon 00:26:06 And this is kind of what most people do. And then if you’re in research, you usually have a test set because you’re publishing a paper or something. You also have a test loop there. So that’s called a test step. So those are the ingredients that you need to create a LightingModule. So, it’s like a recipe for a model. That means that everything else does not the computational graph because that’s literally what you put in there is abstracted away in the trainer. The trainer will execute the training loop to do that, right? And so, it automatically will handle that. Now, here’s something that’s a little bit different from what we’ve done in the past. So, from basically like 2012 to 2019, all we did in research and deep learning was try to figure out better computational graphs. Meaning, like, does this network have one layer or two or three? Does it have skip connections? whatever.
William Falcon 00:26:48 So, all of that you could spend all your time on the LightningModule messing with, right? And so, if most people are doing deep learning, that’s kind of what they’re doing. In 2021 I think, or 2020, we introduced GPT. GPT is BERT with a ton of data, right? Now that’s not discrediting anything because that problem with scaling that model to that much data actually requires a lot of engineering. That engineering is usually handled by the trainer in Lightning. So, what happened at that time is if people started working on that kind of research where you’re trying to figure out how do I do gradient distribution, how do I offload gradients or activations, right? Kind of Lightning gets in the way because it wasn’t designed for that, right? The trainer was meant to abstract that away from you.
William Falcon 00:27:33 But sometimes people are like a little bit annoyed at that. You’ve run into this if you started playing with GAN, as well, or reinforcing learning, or something like that. It’s kind of like the trainer gets in their way a bit. So, what we’ve introduced is something called Fabric, which is basically a way for you to build your own trainer. So, LightningFabric is kind of like a gradient step between plain PyTorch and then LightningModule. So, you can basically start with plain PyTorch and add Fabric to basically get your LightningModules, callbacks, and all the accelerator stuff handled for you. But now you have full control over the Python loop. So now you can actually do the research for distributed training and all of that. So, you can basically end up writing your own trainer, and if you don’t care about that kind of work then you just use our trainer and then you’re good to go.
Philip Winston 00:28:13 Okay, thinking about best practices, or a new user possibly making a mistake, is there anything that you hear people doing and you’re thinking, oh boy, don’t do it that way — I mean, in terms of Lightning, like, ‘pitfalls’ I guess.
William Falcon 00:28:27 I think the main pitfall is when people start learning about deep learning, they kind of go through this process of learning the raw PyTorch first, which I think is fine, but then they, you know, you should do that to understand what you’re doing because you’re going to write, like, a training loop, and it’s going to be very basic. Now, the problem is you’re going to think that that’s enough to actually do what you want to do, and it feels simple, but the second you try to actually do something meaningful, like for a company, that’s not going to work, right? And so, the problem that people end up doing is that they think that that simple example will generalize to something meaningful that can actually train at scale. Unless you’re, like, literally a PhD in this or you’ve been doing deep learning for a long time, it’s going to be extremely difficult to get everything right.
William Falcon 00:29:05 Like, when do you checkpoint? How do you do backwards? What happens in distributed training? How do you aggregate gradients? How do you do early stopping in distributed training? There’s a lot of nuances that go into this, right? So, this is where we would basically recommend people, once you have the working knowledge, just switch to the LightningModule because now the area for error for you is like dramatically decreased. All you’re going to mess up on is literally the model definition and the training process, but you’re not going to mess up on any of the other hard things that just take a long time to learn — like, in what order do you call things, and it’s really easy to mess things up and it’s just super subtle. You just won’t see that that it’s actually wrong.
Philip Winston 00:29:41 Yeah, that actually leads to a question I think I had elsewhere, which is, like, what is the nature of bugs with deep learning? Like, how does it differ from a regular software application, which you might think of crashing or not returning any output, or things like that. What do you actually see in deep learning that you’d consider a bug that you have to track down and fix?
William Falcon 00:30:04 Sure. So, software engineering, you basically have two bugs: you have logic, which a compiler can’t catch for you — like if you added wrong or you did a for loop wrong or recursions wrong — and then like syntax errors, which a compiler will catch for you, right? Even with Python you can do these kind of things now. Okay, so that’s kind of it. Now, the other errors on deep learning are: okay, even though let’s say you wrote your code and the logic is good and the compiler says everything checks out, your math could be wrong. So, like, you’re not going to know, like, did you multiply or add, who knows, right? Did you do a matrix multiply instead and you should have done an inverse or something? So that’s a first bug that’s going to come out. If you’re writing your models, you’re going to run into this.
William Falcon 00:30:45 If you’re not, you’re not going to because we’ve already done that for you, right? The models are going to be written by people who know exactly what they’re doing there, so hopefully that part of it is gone. Now, the last function and everything else that you’re attaching to it, you can still mess up. So that’s the first element of it. The second element is the data, right? So, you might be pushing data through the model, but if you didn’t transform it, then it’s not going to work, right? So, if you didn’t normalize the data, if you didn’t put it into the right structure — even, like, sometimes it’ll crash because the dimensions are wrong. That’s not what I’m talking about. I’m talking, like, if it’s an image, for example, did you remove the colors? Did you jitter things? Those kind of transforms. If you do those wrong, that’s also not going to work. So it’s super, super subtle.
William Falcon 00:31:23 And then after that, you have the kind of like unseen bugs when you go to the Cloud, right? So, let’s say you’re training on multi-node and you’re like “cool, nothing’s breaking, everything’s working exactly, except that it’s taking four hours. But what you don’t know is that your networking is wrong and actually you have a bug in there, and now your training could actually be taking one hour but you just didn’t set up the networking correctly, right? So, there’s just so many nuances here that just depends kind of what you’re going through. It’s a lot more complicated to get these things running.
Philip Winston 00:31:53 Okay, and let’s dive into the LightningModule in a second, but I wanted to ask first, what other frameworks or libraries does Lightning integrate with — or more to the point, you know, what would you expect to see people using in concert with Lightning?
William Falcon 00:32:09 Yeah, so the traditional PyTorch Lightning — the LightningModule and Trainer — integrates with any experiment manager that’s out there. So, all of them are supported, TensorBoard and so on — any model. So, you can fine tune models from any library you want — OpenAI and so on — you can bring them in. It’s like three lines of code to fine tune or train these things — torchvision, et cetera. What else? So, we integrate all the accelerators, so GPUs, HPUs, TPUs; we have really tight partnerships with all the people who build the accelerators, so that they’re actually pretty much sorted out before they launch, so that by the time they launch, you already have access to them in Lightning, which is really cool. And then as I mentioned, now we’ve evolved the project — it’s called Lightning now. So, on that side of it, with the LightningWork and the LightningFlow, you can actually integrate any framework.
William Falcon 00:32:53 Even, like, TensorFlow, you can actually train TensorFlow models with Lightning now, right? You can train TensorFlow in the Cloud, you can do JAX, you can do anything really, as long as it’s machine learning you can do NumPy, SK Learn — all of it, right? So that was the original vision is we wanted Lighting to be able to lead this kind of, like, central nervous system that everything can work through. I mean that’s kind of what it’s become today. So no matter what tool you’re using, it’s definitely going to be able to work together. You’ll get the modularity and you get the abstraction, you get the fun of having the Lightning handle all the hard things for you.
Philip Winston 00:33:23 You mentioned TensorBoard. Can you explain what that is?
William Falcon 00:33:27 Sure. So TensorBoard is a way to plot your experiment outputs, I guess. So, the way that you know that deep learning models have finished training, I guess, is to monitor convergence. So, you usually have a metric you’re optimizing towards — which I guess is another sense of optimizing, which you asked about earlier; there’s literally the math way of optimizing. So, if you’re training a model, it means you’re reducing the error between an expected output and the actual output of the model. When that error goes to zero, you’ve converged, right? Now, that’s a very crude, high-level version of this. You can monitor other metrics, but whatever metric you’re monitoring here, you can actually plot that and see it go to zero and see the slope of it, and you need libraries to do that, right? So TensorBoard is one such library.
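For anyone who hasn’t used it, a tiny sketch of what that looks like with PyTorch’s built-in TensorBoard writer — the loss values here are fake, just to produce a curve:

```python
from torch.utils.tensorboard import SummaryWriter

# SummaryWriter writes event files that the TensorBoard UI plots in the browser.
writer = SummaryWriter("runs/demo")
for step in range(1000):
    fake_loss = 1.0 / (1 + step)  # stand-in for a real training loss heading toward zero
    writer.add_scalar("train/loss", fake_loss, step)
writer.close()
# Then run `tensorboard --logdir runs` and watch the curve converge.
```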
Philip Winston 00:34:11 So when you say see it going to zero, this is presumably as you get deeper, deeper into the training process, the model is getting more refined or better and then you’re maybe testing it or evaluating it.
William Falcon 00:34:23 Exactly. And like I said, you may not be monitoring something that goes to zero. You may be saying, hey, my accuracy needs to go up, so maybe it goes up to a hundred, right? So, it just depends on what’s important to you. But yes, it’s about a min or a max of some quantifiable metric that you care about, and you watch it change over time.
Philip Winston 00:34:41 Okay. Let’s talk about some details. Feel free to steer me to what you think is more relevant, but can you explain what the torch.nn module is? I guess this is what Lightning extends or improves on, but just from the PyTorch point of view — like, what is that class or module?
William Falcon 00:34:58 Yeah, so it allows you to define arbitrary operations on a computational graph, and it lets you chain them together, right? So, if you think about what’s happening under the hood, it’s backpropagation, right? So, you compute some activations on the forward pass and then you figure out how to do assignment of those errors on the backward pass. So, you figure out what knobs do I have to tweak to minimize the error. So, the nn module basically gives you the nodes on that graph that interconnect, and it allows the automatic differentiation to go between those nodes. So, it’s — I don’t know — I guess, the highway for the math under the hood.
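For reference, a bare-bones nn.Module — a sketch with arbitrary shapes — where you define the layers (the nodes on the graph) and a forward pass, and autograd figures out the backward pass:

```python
import torch
from torch import nn


class TwoLayerNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Each layer is itself an nn.Module; chaining them builds the graph.
        self.hidden = nn.Linear(8, 32)
        self.out = nn.Linear(32, 1)

    def forward(self, x):
        return self.out(torch.relu(self.hidden(x)))


net = TwoLayerNet()
x = torch.randn(4, 8)
loss = net(x).sum()
loss.backward()                       # autograd walks the graph backwards
print(net.hidden.weight.grad.shape)   # gradients now exist for every parameter
```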
Philip Winston 00:35:34 Okay. And then LightningModule extends that or wraps it, what’s the right terminology?
William Falcon 00:35:40 Yeah, so it extends it so that you get all these added benefits, right? So, you get the abstractions; it allows the training process to know what to call when. That structure of the program allows us to optimize it for you. There’s something important here, which I think is not obvious to people, but flexibility and performance are always at odds. If you want ultimate flexibility, you’re very, very likely to not have the highest performance unless you really, really know what you’re doing, right? Like, really — like you literally work at Nvidia and write CUDA kernels all day, right? So, flexibility is something that we want as researchers. I mean, I want that as an AI researcher; I want the most flexibility that I can get, but not at the expense of performance. Performance is something that you usually get through structuring your code. That’s how you’re going to get that. So that’s what the LightningModule is doing. It’s trying to give you that hybrid between all the flexibility that you need, but with enough structure that we can actually optimize the code for you.
Philip Winston 00:36:37 Okay. Looking at the LightningModule sections — I guess the central four are train, validate, test, and predict — can you kind of just talk about, are those temporal steps that a researcher would do one after another, or does the trainer do those four steps, or how does that work?
William Falcon 00:36:56 So the trainer is kind of running this workflow for you. It’s a training workflow where it takes a batch of training data and it feeds it to the model, and then it takes the outputs out of that, computes some loss for it, and then it backprops that loss through the model. So, that will happen for every batch of data. So, it trains, updates, trains, updates, and so on. Every one of those we’re calling a step. So, in the training process, we call that a training step. So, Lightning is going to give you the data, and then you just figure out what you need to do with that data, how you’re going to process it, and then you send it back to the training process, which is the trainer. And the trainer will do the backpropagation for you and automatically update all the other things.
William Falcon 00:37:37 With fabric, you have control over that; you don’t need to use a trainer. You kind of need to know what you’re doing, but you can call that stuff yourself now. Now, the same thing applies to validation and testing. It really depends on the workflow that you’re doing. For example, my research is mostly in self supervised learning, and in that workflow I tend to train an epoch over, like, ImageNet, for example, or like a big data set. And at the end of the epoch I validate. Sometimes, if the epoch is long enough — so, just to clarify, an epoch is when you go through all the data once; so, if I have a million examples, I go through all the million examples once. In certain, like, foundation models, you may only ever do like two epochs. And those epochs may take many, many months to the point where people don’t even call them epochs now; they just call them steps — like, I’m just going to train for a million steps. How do you know that things are working along that route? So maybe at step 1000 you want to validate, right? Or maybe at step 2000 you validate. So that frequency is really up to your research requirements, and that’s what validation step is, right? It’s calling the validation and the validation split of the data at a known interval that the training process is set up for you, and you configure that based on your needs. So, you could either train all at once and then validate at the end, or periodically validate as your training.
Philip Winston 00:38:53 Can you clarify a little bit on validate versus test? Is validate running, inferencing, and seeing how accurate it is? Or, what does validate mean?
William Falcon 00:39:02 Yeah, so if you have a data set, you basically want to do — if you’re in academia, you want to do three things. So, you have a dataset; let’s say that you have, I don’t know, 60 data points, right? You want to take 20 of those data points and put them aside and never look at them. That’s called the test set. That’s like the very last thing you ever touch. So, the test set is — just like right before you publish that paper, you run on it once, and that’s the thing you actually report. In industry, we tend not to do that, right? The test set is kind of like the production system; it’s what’s happening in real time. So, then you have the other 40 data points now; those 40 you need to break up into two sets, one for training and one for validating.
William Falcon 00:39:41 So, just depends on how much data you have, and so, the rule of thumb is something like 80/20 — basically, 80% training, 20% validation. And then throughout the training process you can use that validation pretty frequently to see how you’re doing. If you know statistics, it’s like out of sample testing is what you’re doing, right? You’re just saying, hey, on this unseen quantity of data, how does my model perform? You need to do that because if you don’t do that, you’re going to overfit, right? And overfitting means that your model’s going to, like, memorize the data and not going to be able to generalize. So, the validation is testing how well a model can generalize.
Philip Winston 00:40:15 Yeah, generalization, that seems like super important. So yeah, in these different cases you’re trying to improve your performance on data that you didn’t train on. So, you mentioned some integrations. We have a previous episode on feature stores. Can you explain what feature stores are just briefly and then kind of how Lightning works with them or interacts with them?
William Falcon 00:40:40 Yeah, so let’s say you build a Lightning workflow now that maybe trains a model and then deploys a model, right? So, you would use the LightningFlow to do that and the LightningWorks. So, you’d have one LightningWork for the train, one LightningWork for the deploy, and then the LightningFlow is the orchestrator that coordinates both, right? So, the feature store would basically be something that you plug into the training work, where what you do is you process the dataset ahead of time and then — let’s say it’s images, or maybe tabular data is a better example: you process the dataset ahead of time and transform it — like, you normalize time so that it’s in a clock or whatever people are doing nowadays; maybe you normalize by age, whatever the categorical values are. And that process can be pretty expensive, so people do it ahead of time.
William Falcon 00:41:22 So it’s kind of like caching the features ahead of time, right? So, you do that and then you know, it really depends on the features you use; I’m sure some of them store them in memory, some of them store them on the disc itself. So those optimizations will happen, but what it’ll actually do is it helps you basically remove all that pre-processing from your data. So, it can actually dramatically speed up training, which is really cool. So, I recommend it if you can do it, do it. For images it’s a little hard to do I think because you usually want random transforms, but maybe there’s like a way that people are doing it nowadays.
Philip Winston 00:41:51 So in that sense you might be calculating these features once, but then experimentally doing different training runs, but you don’t necessarily want to redo the features because you just trust that that was done adequately?
William Falcon 00:42:04 Exactly. So, certain features are deterministic — like, if I normalize everything by N, because N is the average age in the data set, and the data set’s not really changing, I don’t have to calculate that over and over again. So, I should just cache that and be done with it, and then I can iterate on that for the next few months. In certain use cases, though, like computer vision, the pre-processing steps are to randomize the image, and you can’t really cache that. But some people use it as a way to augment a dataset. So, if I have 10 images, for example, and I apply 10 random augments to each image, suddenly I have a dataset of a hundred images, which is cool, right? So, in that case you could keep generating infinite versions of those images, in which case you may not be able to cache them, right?
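A typical set of those random transforms, using torchvision as an example (the specific augmentations are illustrative); because they are re-sampled every time an image is loaded, the results generally can’t be cached the way deterministic features can:

```python
from torchvision import transforms

# Re-sampled on every access, so each epoch effectively sees new variants of each image.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
# Typically passed as the `transform` argument of an image dataset,
# e.g. torchvision.datasets.ImageFolder("data/train", transform=augment).
```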
Philip Winston 00:42:48 Okay. And how about data versioning? What are the challenges, operational or otherwise, related to data versioning? And again, does Lightning have any impact on that or connection to that?
William Falcon 00:42:59 Yeah, so again, in Lightning what we solve is a connection problem, right? So, you can actually plug in other tools that solve this. So, we don’t have a data store, we don’t have data versioning, none of that. But we allow you to plug in the other third-party tools that do that really well, right? So, for data versioning, I think people use DVC or something to do this nowadays, and tools like that. It’s really hard, and it just really depends on the industry that you’re in. The way that this works with Lightning is, at the end of the day, all we care about is where’s the data sitting and then what model do you want to train that data with, right? How you got there and what data version and all of that — that’s up to you and the other tools, right? So, with data versioning you would just have different S3 buckets, or you would have different indices on the same S3 bucket. So, it just depends on how that third-party tool operates, right? From the Lightning perspective, you’re just kind of giving it the end result, where you’re saying, hey, use this version of the data, and then off you go and do your thing.
Philip Winston 00:43:52 Okay. And we talked a little bit about TensorBoard, which maybe is an observability feature. How about logging or monitoring or debugging? Whether it’s related to Lightning or not, can you sort of explain any machine learning or deep learning-specific attributes of those? Or is it just similar to any piece of software?
William Falcon 00:44:13 No, so actually debugging is something that we spend a lot of time on, and logging — so, tracking everything that you’re doing. These are all implicit features on the platform. So if you go to lightning.ai and you go train a model there, what it’ll do is it’ll actually track the logs for you, like the output logs from the machines. You also have an interactive coding environment with the machines that you’re actually training on. So, you can actually train and debug in real time. So, you can pause the GPU and see what’s happening in that process and understand it. I think debugging is probably one of the hardest things to do in deep learning, especially when you’re doing distributed training or distributed inference. So, we’ve done a ton of work to make sure that the same environment that you build your code in is the same that goes into production.
William Falcon 00:44:53 So, you can take, you can take the stuff that you’ve built in kind of like your, at your workspace where you’re developing on the Cloud and then ship that as your production environments out of the box, right? So that, you know, there’s no real gap between how you developed and how you actually ship things. This is really important because, you know, we’ve done this in the past as well, we can simulate stuff on your laptop all day long, but I can’t simulate so many terabytes of that on your laptop, right? So it’s just really hard to actually like, you know, I like to the phrase it’s like train how you fight, right? So it’s like you have to really, really be in that same environment. So that’s a lot of what we try to do on the Cloud. And I would say it’s probably one of the big benefits of the platform.
Philip Winston 00:45:28 Can you mention a specific bug when you were sort of confounded or confused and then were able to figure it out? You or your users?
William Falcon 00:45:36 Yeah, so I mean, luckily, the tools are pretty robust nowadays, so most of the bugs that we’re debugging are distributed training bugs. There was definitely a period where it was like, oh man, is this what’s going on here — did we save the right checkpoint? Did the right metrics go in there? I think the most subtle things to debug are — I personally hate these kinds of bugs — when you’re doing distributed training and you’re calculating something on two separate processes on two different GPUs, and then you average them or something, and then your gradients go to zero and you’re like, ah, what happened, right? So, when the gradients go to zero, you’re done — or sorry, not zero, they go to NaN. So basically the model just can’t learn anymore. That’s the worst. So, debugging NaNs — and usually that process means you have to pause the machines while they’re doing the distributed training together and then go inspect each GPU independently at the same time, see what the values are in that process, see what X is right there, print it out, and then do the computation by hand.
William Falcon 00:46:36 So you do the all-reduce or the average yourself, and then probably what you realize is that your precision is wrong, right? Maybe your precision was set to 32 bits and it turns out the output of that operation is actually 64 bits, so it blows up the precision. It usually comes down to some setting like that. Also, even if you get past that and you're like, okay, I solved why that's NaN: if you then calculate a gradient while your weights are either too high or too low and you actually apply the update rule when you backprop, you're going to end up exploding the weights, which means the network also doesn't learn anything. So hopefully you don't run into these bugs, because if you're using models that are already pre-built for you, these things are already solved. But I would say those are the hardest bugs to debug, mostly because of how you have to interact with the machines at the same time.
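For a flavor of what that hands-on inspection can look like, here is a minimal PyTorch sketch, not Lightning's built-in tooling: a hook that flags the first non-finite gradient on each rank, and a hand-rolled version of the cross-GPU average with the dtype made explicit. The helper names are made up for illustration and assume the process group is already initialized.

```python
# Minimal sketch (not Lightning's built-in tooling) of debugging NaNs in
# distributed training: flag non-finite gradients per rank, and redo the
# cross-GPU average by hand with an explicit dtype.
import torch
import torch.distributed as dist

def register_nan_checks(model: torch.nn.Module) -> None:
    """Attach hooks that report any parameter whose gradient goes NaN/Inf."""
    for name, param in model.named_parameters():
        def check(grad: torch.Tensor, name: str = name) -> torch.Tensor:
            if not torch.isfinite(grad).all():
                # In a real session you'd pause here (e.g. breakpoint()) and
                # inspect the inputs on each rank independently, as described above.
                print(f"[rank {dist.get_rank()}] non-finite grad in {name}, dtype={grad.dtype}")
            return grad
        param.register_hook(check)

def averaged_by_hand(local_value: torch.Tensor) -> torch.Tensor:
    """Recompute the 'average across GPUs' step manually to see where it blows up."""
    buf = local_value.detach().to(torch.float32).clone()  # make the dtype explicit
    dist.all_reduce(buf, op=dist.ReduceOp.SUM)
    return buf / dist.get_world_size()
```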
Philip Winston 00:47:23 What does it mean to decouple data from the model, from the hardware? I can sort of imagine the hardware in terms of you want to be able to train with different vendors or different types of hardware, but how about those three layers decoupling the data from the model, from the hardware?
William Falcon 00:47:40 So remember that most of what we've been doing, not just us but the AI community as a whole, has been figuring out how to structure code for deep learning. Like, what are the right abstractions, right? We didn't know any of this. Software engineering has known this for 20, 30, 40 years at this point. In AI we don't know: should the data be coupled to the model? There was a point in time, 2014 through 2017-18, maybe even '19, where when you saw code written, it was models specifically written for a dataset, right? And it was like, well, is that really the right way to go? Should this model really be written for MNIST, so that when I try a different dataset I have to rewrite the model for ImageNet or whatever it is? So we've learned as a community that we need to decouple these things, because it means I can take a model and swap the dataset so that I can iterate quickly. I can try things locally and then on the Cloud, right? So I think it's a necessary evil, and it also makes models more agnostic as they've evolved into LLMs and foundation models. It turns out you can use the same architecture for different datasets, which is really cool. You no longer have to write a specialized model for a specialized dataset. So all in all, I think what's happened there on the data separation side is a net positive.
Philip Winston 00:48:54 Something I saw recently: they were able to use, I think it was Stable Diffusion, which is normally about generating images, to generate spectrograms, which could be turned into audio files. I don't know if you saw that. That sounds pretty generalized.
William Falcon 00:49:09 Yeah, exactly. It's interesting. I saw it a few days ago and haven't looked into it, so I'm not actually sure of the viability of that. We did a lot of work on text-to-speech, right? And a lot of how you do that work is you have to generate spectrograms and then you use a CNN, a convolutional network, to actually output stuff. So spectrograms can be viewed as images at the end of the day, which is why this kind of works. Then my question is, will it actually encode correct waveforms the way they're supposed to be? Maybe it'll output a spectrogram that, yes, if you play it, will sound like music, but is it actually cohesively put together? Is there actually thought behind it? I'm not sure. But I think it's a really interesting avenue, and I think it opens the door to get there. To get that to work, you'll probably have to take that diffusion model and attach it to another model that can guide how the decoding happens under the hood, so that it's not just randomly put into the spectrogram.
Philip Winston 00:50:08 Lightning is built on PyTorch and we've been focusing on PyTorch, but you did mention that you can train TensorFlow models with Lightning. Are there any projects or types of projects where you think TensorFlow is in fact the right approach these days? Or how would you guide someone toward PyTorch or TensorFlow?
William Falcon 00:50:24 Yeah, I mean, we're fully a PyTorch shop in general, right? So I don't personally use TensorFlow and haven't in a long time. We mostly enable this because we have clients with legacy code who do. I think frameworks have converged largely at this point, so PyTorch versus TensorFlow is like a 2019-20 debate. Today I don't think it's really a debate. It's more like, do you like red cars or blue cars? It's kind of preference at this point. Both have a lot of similar features nowadays and a lot of the same design mechanisms under the hood. So it comes down to the experience and which feels better to you, right? To me, PyTorch feels better, but that's personal preference. I can't really say because I'm not an expert TensorFlow user, but we see our clients still using it, and I don't know their use cases. I think you could probably do the same use cases with both frameworks, right? I'm not sure.
Philip Winston 00:51:13 Yeah, I've definitely seen that with other open-source projects where they have quote-unquote competitors, but they end up both being so featureful that they can clearly both do the job, and it comes down to preference, or other constraints you have, or staffing and who knows how to use it.
William Falcon 00:51:31 Yeah, I mean, I tend to look at things beyond that. I tend to look at how open source it is and what the motivations of the people behind it are, right? One thing that I've noticed is that a lot of these open-source libraries are built by big tech companies where, today, their incentives are largely aligned with the open source, but that might change when a recession comes. So what happens to those projects, right? I'm not talking about PyTorch or TensorFlow specifically. I'm saying there have been times where projects come out of big tech companies as open source, and then the people behind them get promoted and leave, and they kind of leave you hanging with the project, and you're like, well, is anyone maintaining this anymore? And that's just kind of how it works at Facebook: you're motivated internally to build things and ship them, but you're not incentivized to maintain them, right? So that's my main concern generally: if you're going to use open source, are the companies behind it open source themselves? Lightning is fully open source; our job is to provide open-source software, right? It's not like we have a different agenda. So those are considerations that I would look into as well when using any open-source code.
Philip Winston 00:52:33 You touched on this, but can you contrast the needs of academic researchers compared to commercial users? Like I said, I think you kind of alluded to some of these differences, but just to be clear, what differences have you seen?
William Falcon 00:52:47 I would say they overlap a lot more than people think, but academics are more interested in the understanding and experimentation phase of things. Industry is more interested in running things once or twice, maybe, and then doing something with the output. They care less about what happens under the hood and more about the outputs. Whereas in academia, they really care about every part of that and want to understand every single detail, as they should, because that's what their PhDs are usually about. And sometimes they'll create novel ways of doing it, right? So I guess it's more expected that researchers would care about trying to push something novel at this point, but there's not really a lot of novel stuff you can push on the tooling side anymore. It's really on the algorithm and the data side, right?
Philip Winston 00:53:33 Okay. We’re going to start wrapping up. Are there any resources or documentation that you would like to recommend for someone to get started with PyTorch and Lightning?
William Falcon 00:53:43 Yeah, so we actually just launched a class, Deep Learning Fundamentals, and it takes people from knowing nothing about deep learning to learning about it fully. So it's Lightning's class for teaching deep learning, right? That's on our website at lightning.ai; search for Deep Learning Fundamentals there. It's taught by our lead educator, Sebastian Raschka. He's an associate professor at the University of Wisconsin-Madison as well, and I think he's one of the best educators in the world. So we partnered to bring masterclass-level teaching to this, and we envision Lightning as being a platform where you'll be able to learn about any machine learning topic in a very beautiful and intuitive way as well.
Philip Winston 00:54:23 How about the user community? A thriving user community is an important feature of a platform these days. Aside from the course you just mentioned, how are you nurturing a healthy user community, and what type of community do you feel you have around Lightning?
William Falcon 00:54:38 Lightning is all about community. That's kind of our bread and butter; it's everything that we've built and done. Everything we do is for the community. Every decision we make on the tools, every decision we take with classes, it's all about how we empower the community more. Our community lives a lot on Slack, and we have forums as well on the website, so definitely if you're not on the Slack, join it. But the way I think about it is, you know, sometimes it can be lonely in machine learning, and we want people to work together, make friends with people who have the same interests, and work on really cool things. Everyone's in the online community, from experts at this stuff to newer people who are coming in and learning about it.
William Falcon 00:55:16 So, I would say, like, if you’re kind of starting out in deep learning and you’re just kind of getting into it, I would join the community. There’s a lot of stuff that people will help you, they’ll teach you things, and also if you’ve never contributed to open source, it’s a really cool way to do it, contribute to any repo. We specifically have issues that are tagged for you to learn, but what’s cool is that we’ll help you land those pull requests, and we’ll teach you software engineering along the process. So don’t be shy. I mean my first pull request ever was like documentation and, like, that was amazing so go for it. .
Philip Winston 00:55:45 Yeah, that's an interesting point: you could have people who are really interested in deep learning and AI but don't necessarily have software engineering skills to start, and learning to program becomes something they can do to extend their reach and have more impact on these projects.
William Falcon 00:56:06 Exactly. And we even have a class for that too. It's called Software Engineering for Researchers. We teach you all of that: how to use Git, what notebooks are, how to think about scripts. You know, when you show up to a PhD and you've never really done software engineering, there's a whole year where you're like, how did you all learn these things? That's the stuff we teach in that class.
Philip Winston 00:56:27 What upcoming Lightning features or releases are you most looking forward to? I guess this would be anytime in 2023 at this point?
William Falcon 00:56:36 Yeah, so the main thing that we just released is Fabric, right? Fabric lets you build your own training loops, build your own trainer; it gives you full control when you want it, and when you don't want it, you can use the Trainer as well. So that's really cool. In my opinion, this is the last deep learning framework you'll ever need, because I'm not sure how much more flexibility we could give you at this point. And then the other main thing for us is that we're empowering people to build their ML startups and AI products with Lightning now, right? So if you want to launch the next Stable Diffusion company or the next OpenAI, we have a platform for you to do that very, very easily. You no longer have to go raise $10 million and hire 50 people. One person can do it in a few weeks, which is really cool.
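To show roughly what that "own your loop" style looks like, here is a minimal sketch based on Lightning's public Fabric API; the toy model and data are made up, and the exact API surface may differ by version.

```python
# Rough sketch of a custom training loop with Fabric: you write the loop,
# Fabric handles device placement, precision, and distribution.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from lightning.fabric import Fabric

fabric = Fabric(accelerator="auto", devices="auto")  # hardware stays a flag
fabric.launch()

model = nn.Linear(32, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
model, optimizer = fabric.setup(model, optimizer)

dataset = TensorDataset(torch.randn(256, 32), torch.randn(256, 1))
dataloader = fabric.setup_dataloaders(DataLoader(dataset, batch_size=16))

for epoch in range(2):
    for x, y in dataloader:
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        fabric.backward(loss)  # instead of loss.backward()
        optimizer.step()
```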
Philip Winston 00:57:22 You’ve talked about creating an operating system for artificial intelligence. I think this is going to be the last question. Where do you think Lightning is today relative to being an operating system, and what additional features do you think are necessary to realize that vision?
William Falcon 00:57:37 So, an operating system has a few parts to it, right? It has a scheduler, it has memory, it has the execution of programs, it has the hardware it's running on, the machine, and so on. I'm not an OS expert, but I think there are one or two missing in there. I think we have a lot of the pieces today. We're still thinking about what the equivalent of memory is for our platform. The hardware, the equivalent of the laptop, is the Cloud, right? That's what we have today with Lightning AI: you can sign up and create your own clusters and all of that. So you have the hardware there. You have the orchestration mechanism, like what runs apps and how, and so on. That's the Lightning framework, what used to be called PyTorch Lightning; that's there under the hood already.
William Falcon 00:58:19 And then the ability to connect data, then you have the ability for memory as well. So, the ability to share between apps, and I guess what’s interesting to me is that like, is an operating system for AI really going to be the same as an operating system for a laptop? It may not be because an operating system for AI may have to account for the fact that you have multiple machines, huge data sets. These are things that you didn’t have to account for in a laptop, for example. So, what are other considerations that are going to happen? I don’t know. And that’s really exciting to me because it means that as the world evolves and we learn what we actually need to do deep learning, I will be able to bake that in.
Philip Winston 00:58:54 You mentioned the Lightning.ai website and there was a course there, I think maybe two courses you mentioned. Where else would you like to point people to find out more about Lightning or yourself?
William Falcon 00:59:06 Yeah, it's all on our website, right? So if you go to lightning.ai, there are examples there that you can play with. There's a ton of documentation. Create an account and you'll get free credits every month. You can code on the Cloud, do your research there, train on GPUs, deploy things, and all of it is centralized. We have the forums and the community there. If you have any questions, please, please ask. Don't be shy. We're super excited to help. We're super excited to teach you how to code, how to do open source, deep learning, data science, all of that. And I think you're stepping in at the right time, because I think this is the next big thing. In general, I think the first wave was web, the second was mobile, and the third is AI, right? So the next billion-dollar companies are going to be AI companies, and I hope that all of you will build them.
Philip Winston 00:59:49 How about social media? Is there a place people can follow you?
William Falcon 00:59:52 Yeah, so we have a Twitter account you can look up; it's @LightningAI. And we have a LinkedIn account as well. We do have a newsletter, so if you want to keep up to date with what's happening in AI, go to LinkedIn and sign up for the newsletter there. We'll let you know about the latest interesting things in research, the latest things in engineering for deep learning and data science, and what startups are working on. So you can keep a pulse on the community that way.
Philip Winston 01:00:16 Okay. I’ll put links to those and to the four episodes I mentioned in the show notes. I really enjoyed talking with you today, William.
William Falcon 01:00:24 Yeah, thank you Philip. This was really fun. I hope everyone gets a lot of value out of this.
Philip Winston 01:00:28 Great. This is Philip Winston for Software Engineering Radio. Thanks for listening. [End of Audio]
Related links
- Episode 391: Jeremy Howard on Deep Learning and fast.ai
- Episode 534: Andy Dang on AI / ML Observability
- Episode 522: Noah Gift on MLOps
- Episode 473: Mike Del Balso on Feature Stores
- Lightning AI
- AI Education
- PyTorch Lightning
- Guide to Distributed Training
- LightningModule
SE Radio theme: “Broken Reality” by Kevin MacLeod (incompetech.com — Licensed under Creative Commons: By Attribution 3.0)