
SE Radio 479: Luis Ceze on the Apache TVM Machine Learning Compiler

Luis Ceze, CEO and co-founder of OctoML, discusses Apache TVM, an open source machine learning model compiler for a variety of target architectures. Luis talks about the complexity of writing assembly code for different hardware targets for machine learning workloads, which consist predominantly of numerical operations that benefit from specialized vector/tensor instructions and special memory layouts.

Host Akshay Manchale spoke with Luis about the motivations for a model compiler, the internals of how TVM compiles machine learning models, the intermediate representations that TVM uses to optimize them, and the types of optimizations that can be performed on the computations involved in machine learning models. Luis talks about how TVM can use machine learning to search the space of possible optimizations for a particular target architecture and ways to tune for efficiency and performance.

This episode is sponsored by NetApp.


Show Notes

Related Links

Transcript

Transcript brought to you by IEEE Software
This transcript was automatically generated. To suggest improvements in the text, please contact [email protected].

SE Radio 00:00:00 This is Software Engineering Radio, the podcast for professional developers, on the web at se-radio.net. SE Radio is brought to you by the IEEE Computer Society and IEEE Software magazine, online at computer.org/software. SE Radio listeners, we want to hear from you. Please visit se-radio.net/survey to share a little information about your professional interests and listening habits. It takes less than two minutes to help us continue to make SE Radio even better. Your responses to the survey are completely confidential. That's se-radio.net/survey. Thanks for your support of the show. We look forward to hearing from you soon.

Akshay Manchale 00:00:46 Welcome to Software Engineering Radio. I'm your host, Akshay Manchale. My guest today is Luis Ceze, and we talk about Apache TVM. Luis is the CEO and co-founder of OctoML, a machine learning acceleration platform designed to help developers deploy machine learning models on a variety of hardware, cloud, and edge devices. OctoML was spun out of the University of Washington, where Luis is also a professor. Luis and a number of his co-founders created Apache TVM, a deep learning compiler, which OctoML is built on. Luis, welcome to the show. Really great to be here. So before we dig into TVM, I'm going to cover some fundamentals. So just to set the context, can you start off by telling us what Apache TVM is?

Luis Ceze 00:01:28 It's a machine learning, deep learning model optimization and compilation package that takes models written in all of the major frameworks, like TensorFlow, PyTorch, MXNet, Keras, and so on. It ingests them into its internal representation and does a bunch of optimizations like operator fusion and data layout changes, it can do quantization and so on, and then it produces a highly tuned binary for the specific hardware target. It essentially subsumes a lot of the manual engineering required today to get your model to be performant and ready to be deployed on your platform of choice. And it supports a variety of hardware targets, including mobile CPUs and GPUs, server CPUs and GPUs, accelerators, FPGAs, and so on.
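
To make the flow Luis describes concrete, here is a minimal sketch of compiling a model with TVM's Python API, assuming a recent TVM build with the ONNX frontend installed; the model file name, input name, shape, and target string are illustrative placeholders rather than anything specific to this episode.

```python
import numpy as np
import onnx
import tvm
from tvm import relay
from tvm.contrib import graph_executor

onnx_model = onnx.load("model.onnx")                 # hypothetical model file
shape_dict = {"data": (1, 3, 224, 224)}              # assumed input name and shape
mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)

target = "llvm"                                      # pick your hardware target, e.g. "cuda" for an Nvidia GPU
with tvm.transform.PassContext(opt_level=3):         # graph-level optimizations (fusion, layout, ...)
    lib = relay.build(mod, target=target, params=params)
lib.export_library("model_tuned.so")                 # the deployable binary artifact

# Run the compiled module locally as a sanity check.
dev = tvm.cpu()
module = graph_executor.GraphModule(lib["default"](dev))
module.set_input("data", np.random.rand(1, 3, 224, 224).astype("float32"))
module.run()
print(module.get_output(0).numpy().shape)
```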

Akshay Manchale 00:02:11 Let's talk a little bit about the models themselves. In the entire life cycle of, say, machine learning, from when a data scientist looks at a particular problem, can you describe how they go about building a model and what the model really entails? What does it contain or represent?

Luis Ceze 00:02:26 Absolutely. And it's also great to make it clear where Apache TVM fits in this end-to-end path from data to deployed model, right? So once upon a time we get data. Data scientists get a bunch of data and curate data sets to train the models. There's a lot of data management to make sure you have the right data to build your model, to test your model, and so on. And then they find the general model architectures that work for the problem they are interested in solving. Frequently it could be a computer-vision-focused architecture, it could be natural language processing, it could be time-series prediction. There's a bunch of different categories of architectures that data scientists can consider. And then they use, you know, various tools to help identify what is the best initial model. And you would represent that model in a framework like PyTorch or TensorFlow, which are declarative frameworks where you can declare what a model is supposed to do.

Luis Ceze 00:03:17 And then you run a training process. The training process takes this model architecture that you specified in the framework, uses the training data, and trains the parameters of your model to make sure it performs well on the training data, and also evaluates it against your test data set to make sure that the accuracy translates to data that was not included in the training set. Okay. So the result of all of that is essentially code that specifies what your model does and a ton of data, the parameters that represent your model. And as you might have seen in the technical news, the number of parameters and the complexity of the models keeps growing a lot. You're talking about models with tens of billions of parameters; it's not uncommon, right? Especially the language models, they're very, very large, right?

Luis Ceze 00:04:00 So this model, a collection of code and a bunch of data, now needs to run on your deployment hardware, right? You might want to run it on a mobile phone, run it on a smart camera, on a self-driving car, or you might run it in the cloud as part of a cloud service or a software-as-a-service application, right? That process tends to be pretty labor-intensive, because typically models are effectively what we call interpreted. For each part of your model there's a bit of code that hardware vendors have written in the form of libraries. For example, Nvidia has the cuDNN library that has bits of code for each part of the model, say matrix multiplication, convolution, dot products. And the execution engine stitches them together as you're evaluating the model. What Apache TVM does, instead of actually interpreting the model, is produce fresh code, highly optimized for the model and the hardware target, right? So you have a model, which is a bunch of code and data, and it translates that into a binary that runs on your target hardware.

Akshay Manchale 00:05:07 Let's talk about different kinds of models. A very simple machine learning model that people get introduced to is a linear or a logistic regression model, and then the more sophisticated ones that you are talking about, say deep learning, which is involved in computer vision and a lot of emerging applications, are more complicated models. So can you tell us what a simple one is, what's computationally involved in it, and what's computationally involved in the more sophisticated ones?

Luis Ceze 00:05:31 It's a great question. And I'd say that, yes, there's linear regression, there are decision trees, which is what I would call classical machine learning, right? And then there is deep learning and, you know, the more sophisticated models. For logistic regression, support vector machines, and deep learning, in the end it all boils down to a bunch of linear algebra: you're multiplying matrices and multiplying vectors. The difference between a simple model and a complex model is really just the amount of linear algebra computation that you do. I want to be clear that my core background is not in core machine learning, it's really in machine learning systems, right? So I'll tell you that a thing people say often is that if there's a simple machine learning model that works well for what you do, that's what you should use, right? You should only pay more complexity and pay more computational cost and more engineering

Luis Ceze 00:06:18 If it actually brings you value in the form of better predictive power, better accuracy, for the problems that you care about, right? The big difference, fundamentally, is just how much computation you're doing underneath. With deep learning you have multiple layers, and you have a lot of what we call operators that are stitched together that do convolutions, matrix multiplications, activation functions, and so on, in a very complex data flow, let's put it this way. Whereas, you know, a simple logistic regression is just a very, very simple linear algebra computation, right? But then there's another big one here that I want to mention, which is decision trees, and one example of a package that's pretty common there is XGBoost, which happens to have been created by our CTO, Tianqi Chen. XGBoost is essentially boosted trees models, essentially a collection of decision trees that can be trained to find what the decision points are from training data, which defines the boundaries of the decision points in your boosted tree models.

Luis Ceze 00:07:14 XGBoost is a package that evaluates these models, right? So that's also part of what people call classical machine learning. It's very important and very widely used. And interestingly, from a computation point of view, going deep down, getting closer to how your processor sees it, how you execute that at a first approximation would be a bunch of if-then conditions, right? You can evaluate it, see it as a bunch of if statements. That's one way of doing it. Another way of doing it is as linear algebra, now sparse linear algebra. You can actually convert this decision tree evaluation into a sparse linear algebra evaluation, express it as a linear algebra expression. And once you do that, you can actually also run it through a package like TVM that just treats it as if it were regular linear algebra code and still applies all the optimizations. I know it was a long answer to your question, but I hope it was useful.
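
As a toy illustration of this point, here is a small NumPy sketch (not XGBoost or TVM itself) that evaluates the same hand-made tree both as if/else branches and as matrices in the spirit of GEMM-style tree compilation; the tree structure and thresholds are made up, and the matrices are mostly zeros, which is what a real engine would store as sparse data.

```python
import numpy as np

def tree_if_else(x):
    # Depth-2 tree: node 0 tests x[0] < 0.5, then node 1 or node 2 tests x[1].
    if x[0] < 0.5:
        return 0 if x[1] < 0.3 else 1   # leaves L0, L1
    else:
        return 2 if x[1] < 0.7 else 3   # leaves L2, L3

# The same tree as matrices: A picks the feature each internal node tests,
# B holds the thresholds, C encodes which side of each node every leaf lies on,
# and D counts how many "true" comparisons each leaf's path requires.
A = np.array([[1, 0, 0],
              [0, 1, 1]], dtype=np.float32)          # features x internal nodes
B = np.array([0.5, 0.3, 0.7], dtype=np.float32)      # per-node thresholds
C = np.array([[ 1,  1, -1, -1],
              [ 1, -1,  0,  0],
              [ 0,  0,  1, -1]], dtype=np.float32)   # internal nodes x leaves
D = np.array([2, 1, 1, 0], dtype=np.float32)         # required true-count per leaf

def tree_gemm(X):
    T = (X @ A < B).astype(np.float32)    # evaluate every node condition at once
    return np.argmax(T @ C == D, axis=1)  # the unique leaf whose path conditions all hold

X = np.random.rand(8, 2).astype(np.float32)
assert all(tree_if_else(x) == leaf for x, leaf in zip(X, tree_gemm(X)))
```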

Akshay Manchale 00:08:02 Oh, I see. Yeah, it's a good overview. Interesting to know about XGBoost and how it works with decision trees. So one aspect of TVM, I think, is the portability of your models, where you can run them on different target architectures. But I think there's another aspect, which is the performance and the efficiency side of it. If I have a simple deep learning model that solves some tangible problem, where does that efficiency really come from? Does it come from specialized hardware, and why is specialized hardware actually useful?

Luis Ceze 00:08:31 Let me answer your question in two parts. First, let's ignore specialized hardware; let's just talk about specialized code. Where do the optimizations come from, and how do they bring performance to a deep learning model on existing, typical CPUs and GPUs, right? Where TVM shines is really in doing optimizations like the following. Once you have an operator, or you fuse two operators while optimizing your model, they're all, in the end, as I said, a bunch of linear algebra code that gets represented as loops over data that perform operations. So when you look at that, there are many things you can change to get performance. For example, you can lay out your multidimensional data structures in memory in a way that makes better use of your caches, makes better use of memory, just by organizing them in memory in a more convenient way for the hardware.

Luis Ceze 00:09:20 The second thing is picking the right instructions, you know, pick the right vector instructions, the right scalar instructions. You want to make sure that you keep the data in registers as much as possible. There's a bunch of scheduling decisions you can make. Other things are the order in which you evaluate the loops, right? For a given deeply nested loop, you can change the order of your loops and have the same computation, but a given loop ordering is much better than another one. So if you take the cross-product of all the decisions on data layouts, loop ordering, how you tile the traversal of your data structures, and the kinds of instructions you use, you generate a large number of candidates for a piece of code, right? And picking the right one is one of the things that TVM does as well. Okay, so great.
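
The kinds of choices Luis lists here, data layout, loop ordering, tiling, and vector instructions, are exactly what TVM's tensor expression language exposes as a schedule. A minimal sketch on a plain matrix multiply, with illustrative (untuned) tiling and split factors:

```python
import tvm
from tvm import te

N = 1024
A = te.placeholder((N, N), name="A")
B = te.placeholder((N, N), name="B")
k = te.reduce_axis((0, N), name="k")
C = te.compute((N, N), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")

s = te.create_schedule(C.op)
io, jo, ii, ji = s[C].tile(C.op.axis[0], C.op.axis[1], x_factor=32, y_factor=32)  # tile the traversal
ko, ki = s[C].split(k, factor=4)                                                  # split the reduction loop
s[C].reorder(io, jo, ko, ki, ii, ji)                                              # choose a loop order
s[C].vectorize(ji)                                                                # use vector instructions

print(tvm.lower(s, [A, B, C], simple_mode=True))   # inspect the loop nest this schedule produces
func = tvm.build(s, [A, B, C], target="llvm")      # compile for a CPU target
```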

Luis Ceze 00:10:00 And then what you get at the end is really code that's highly specialized to that model, because that's what it is: specialized to the data layouts, to the parameters of your code. You specialize the code. Great, so that's one thing. Now, how do we go further? We go further by making use of specialized hardware, right? And the trend today is pretty clear. As you can see, you keep hearing about more and more AI chips or AI features in popular hardware engines: CPUs, GPUs, and new accelerators that are dedicated to deep learning, to machine learning in general. They actually have hard-wired circuits that perform an operation in a highly specialized way, for example a 2D convolution, or an eight-by-eight matrix multiplication, or a dot product, or an activation function like a sigmoid function. So you really hardwire that into your circuits.

Luis Ceze 00:10:50 And when you do that, now I can put on my computer architecture hat, since my core background is actually in computer architecture, right? The efficiency there comes from the following: in a general-purpose processor, or any general-purpose circuit, you're going to have a lot of circuitry that's making decisions about what you should be doing at any given time. Should it be doing this or that? And that depends on the code. Well, if you build a hardwired circuit, you don't have to make those decisions. You know exactly what you're going to do; you're doing only one thing. So you don't spend any circuitry, you don't spend any energy or any time figuring out what you should be doing at that given time, right? That's where the specialization fundamentally comes from: you remove these unnecessary, you know, condition tests in the hardware, and then you can optimize the circuit.

Luis Ceze 00:11:30 You're going to have shorter wires, you're going to have less memory. And, you know, if you have shorter wires, it's faster, right? So more of your circuits are dedicated to actual computation. That's from the hardware point of view. But then that specialized unit in your hardware needs to be invoked by your software somehow. Whenever you have a specialized unit, there will be a specialized instruction, and typically it looks very strange: a really complicated instruction that you have to invoke somehow to make use of it. You can either be, you know, a super-low-level programmer who's going to go and use that assembly instruction right there and knows exactly how to use it. Those are rare, and we don't want to have to rely on them too much, right, and for too long. Unfortunately, a lot of low-level libraries rely on low-level assembly code tuning, which is where you don't want to be. But the way this works in TVM and other machine learning compilers is that there are ways of actually matching what you're trying to do in your model with a primitive available in the hardware. This problem sounds easy, but it's actually fairly complex, because the more complex the operation in hardware, the more difficult the pattern matching is to make sure you're using the right instruction, right? So does that give you an overview? Does that sound good? Yeah. Yeah.

Akshay Manchale 00:12:38 I guess something to understand here is the complexity involved in actually writing that sort of specialized instruction in assembly, right? How complicated is that for something very simple like a linear regression model, versus something more sophisticated that's composed of many, many operations?

Luis Ceze 00:12:56 Let's walk through a simple one, right? Let's say you're just doing a simple eight-by-eight matrix multiplication in hardware that you want to invoke. Let's keep it super simple, okay? So what do you need to do? Well, that instruction, that single primitive in the hardware, takes a bunch of inputs. You have to know how the hardware expects to see the data in memory, and where it expects it to be when you say go, okay? So you actually invoke the instruction, and then the output is put in another buffer, and you have to go get it from there and put it where it needs to go for you to give the inputs to the next layer. So for all of these things you need to understand: what is the data format, where should the data be, you know, where's the output. And then also, how long does it take to execute?

Luis Ceze 00:13:35 Because remember that it's all about parallelism here. Whenever you invoke this instruction, there's some other stuff going on. So you're scheduling this, right? You know, when I kick this off, I know it's going to take so long for the results to be ready, and I can do a bunch of other stuff in between, right? So this gives you an idea of how complex this can get: this matching of the inputs and outputs, but also understanding the performance implications of using that instruction and what else could be done in parallel, right?

Akshay Manchale 00:14:01 And you would do this with a single target architecture in mind. And I suppose you'd have to do this all over again if you were to change your implementation, change your target architecture; you're redoing all of this. Is that right?

Luis Ceze 00:14:13 Bingo. That's exactly right. So that's one of the reasons why there's major, let's say, vendor lock-in and major reluctance in changing, you know, what architecture you're going to deploy your model to, because making the most out of a hardware target for something that's as performance-sensitive as the kind of code involved in machine learning is very, very labor-intensive. You definitely want to stick with an architecture for as long as possible, unless you have an automated way of tuning your code to different architectures so you can move more easily, right? And performance portability is one of the things that we really believe in. The fact is, if I may, we care about three Ps here in machine learning: performance, portability, and productivity. Performance, to make the most out of whatever hardware you deploy to; portability, to be able to change between different hardware; and productivity, where you want to rely as little as possible on hard engineering labor, on having to tune the low-level aspects of your code.

SE Radio 00:15:08 The sponsor for this episode is Spot by NetApp. Spot provides a comprehensive suite of cloud ops tools that make it easy to deliver continuously optimized and reliable infrastructure at the lowest possible cost. Imagine automating your infrastructure to proactively meet the needs of your applications. Imagine leveraging the latest in machine learning and automation to scale your infrastructure, using the most efficient mix of instances and pricing models, eliminating the risks of over-provisioning and expensive lock-in. From cost management to infrastructure automation and CI/CD, to running serverless Spark on Kubernetes, Spot ensures you maximize your cloud investment. The end result is simply more cloud at less cost. Discover how the most innovative companies, from cloud-native growth machines to forward-thinking enterprises, are automating, simplifying, and optimizing their cloud infrastructures with Spot by NetApp. Check them out at spot.io/seradio, where you can find more information, request a demo, or give it a try by starting a free trial.

Akshay Manchale 00:16:06 Let's dig into TVM itself. TVM, in some sense, is a compiler that takes a model and then gives you target architecture code. Traditionally, compilers have this whole concept of a front end and a back end, where the front end parses and then the back end does the optimization and code generation. So can you talk about what TVM does in that context, as a classical compiler?

Luis Ceze 00:16:28 Let's say that you have a model in TensorFlow. You can also have a model in a generic format called ONNX, which has been around for some time and is now getting better and better. So you represent your model in a high-level framework or in something like ONNX, and then TVM ingests it into its first-level intermediate representation, called Relay. Relay is a typed data-flow graph, okay? It's a representation of your model as a graph, where each node is an operator, say a matrix multiplication or a convolution or an activation function and so on, and the edges, the data transfers, are typed. So you know what the shape of your tensor is, what the dimensions are, the rank, and also the data type.
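
For readers who want to see what a Relay module looks like, here is a small sketch that builds a convolution plus ReLU graph by hand and prints the typed data-flow graph; the shapes are arbitrary.

```python
import tvm
from tvm import relay

x = relay.var("x", shape=(1, 3, 224, 224), dtype="float32")
w = relay.var("w", shape=(16, 3, 3, 3), dtype="float32")
conv = relay.nn.conv2d(x, w, kernel_size=(3, 3), padding=(1, 1))
act = relay.nn.relu(conv)

mod = tvm.IRModule.from_expr(relay.Function([x, w], act))
mod = relay.transform.InferType()(mod)   # fill in tensor types on every edge

# Printing the module shows the typed data-flow graph that TVM optimizes.
print(mod)
```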

Ashkay Manchale 00:17:12 The tensor? Can you

Luis Ceze 00:17:13 Oh yeah, absolutely. Thank you. Sometimes we take it for granted; thank you for bringing it up. A tensor is a multidimensional data structure. Think of it as a generalization of a matrix, right? You can have three dimensions; a matrix is two dimensions, and a tensor can have an arbitrary number of dimensions. That's useful for machine learning, because then you can represent many things in each dimension, right? So it serves as a general abstraction for linear algebra data items. Okay? Great. So then Relay is a data-flow graph that's typed, and at that level TVM can do a bunch of optimizations, right? You can decide to fuse operators. Let's say you have a convolution followed by a matrix multiplication; you can fuse that into a single, more complex operator that does both of those together.

Luis Ceze 00:17:58 Okay. So those decisions can be made at that level. You can also make decisions like: for this group of operators, it's better if I run on the CPU; for this group of operators, it's better if I run on the GPU; or for this other group of operators, it's better if I run on that specialized accelerator. It's a great place to do what we call device placement, where these models can get pretty complex and different execution engines are better for different things, so you can partition where those run, all right? So TVM represents the model at that level and does these high-level optimizations. It can do things like quantization as well. By that I really mean changing the data type. Say you have a 32-bit floating point number; you might be able to say, no, this might be okay if I represent it as an integer with eight bits. You lose dynamic range, but you can still represent what you need with fairly good accuracy.

Luis Ceze 00:18:45 And then you reduce the computational cost dramatically when you do that, right? Not only do you go from floating point to fixed point, you also reduce the number of bits per data item. And then what you do is keep lowering it. This was at the data-flow graph level. As you lower it down, you're going to lower it to another intermediate representation, the tensor expression level. That's pure linear algebra expressions, okay? So you're going to lower from the data-flow graph to these linear algebra expressions, and that's where you can actually take bits and pieces of an expression and start figuring out how to better map them to the available primitives in the target hardware. At that level, TVM can do a bunch of more hardware-specific optimizations. That's where you do data layout optimizations, that's where you make selections of what kind of instructions you're going to use, that's where you make selections of what order you do the loops in, and so on.

Luis Ceze 00:19:31 And that's actually done via a process called auto-tuning, which maybe we can talk about later. But basically the tuning is about generating a large number of candidates and picking which is the best one for that specific hardware, by doing a bunch of empirical experiments. And then once you do that, you keep lowering. We say lowering: we lower from the model to the data-flow graph level, from the data-flow graph to tensor expressions, and from these expressions to low-level machine code, or sometimes a low-level intermediate representation like, for example, LLVM IR or an even lower-level IR.
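
Here is a sketch of what applying graph-level passes looks like in code, assuming the Relay module `mod` and `params` from the earlier import sketch; the pass names are real TVM passes, while the fusion level and the global-scale quantization setting are illustrative choices rather than a recommended configuration.

```python
import tvm
from tvm import relay

graph_opts = tvm.transform.Sequential([
    relay.transform.FoldConstant(),             # pre-compute constant subgraphs
    relay.transform.FuseOps(fuse_opt_level=2),  # fuse e.g. conv2d + add + relu into one operator
])
with tvm.transform.PassContext(opt_level=3):
    fused_mod = graph_opts(mod)
print(fused_mod)                                # fused groups show up as primitive functions

# Quantization is applied to the original (unfused) module; relay.build fuses later anyway.
with relay.quantize.qconfig(calibrate_mode="global_scale", global_scale=8.0):
    quantized_mod = relay.quantize.quantize(mod, params)
```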

Akshay Manchale 00:20:00 And what happens with target hardware architectures that don't necessarily expose an instruction set but give you a library? Is the lowering translating into library calls?

Luis Ceze 00:20:09 That's an excellent question. Yeah. So we use the lowest-level API available for your hardware target, right? For example, Nvidia GPUs do not expose the raw instruction set, so we don't know what it is, but we can generate CUDA code, which is essentially the lowest-level programming interface for that hardware target. It's not quite the library level; it sits between you and expressing the actual task that you want the hardware to do anyway. And this is, I think, a good moment to mention that the reason TVM can do what it does for machine learning is because its optimization passes and its intermediate representations preserve enough semantic information about what it is you're trying to do. I don't want to get, you know, annoying and remind you of your compiler classes here, but it's all about how much information is available to your optimization pass. The more you know about what the program actually wanted to do, the better transformations you can do. And the great thing about specializing a compiler stack to machine learning is that you preserve this intention, the intent the programmer had, and that enables optimizations that wouldn't be possible if you had to rediscover that from the low level.
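
Here is a brief sketch of what "we generate CUDA code" means in practice: bind a scheduled matrix multiply to GPU threads, build for the CUDA target, and dump the generated kernel source. It assumes a CUDA-enabled TVM build, and the split factor is arbitrary.

```python
import tvm
from tvm import te

n = 1024
A = te.placeholder((n, n), name="A")
B = te.placeholder((n, n), name="B")
k = te.reduce_axis((0, n), name="k")
C = te.compute((n, n), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")

s = te.create_schedule(C.op)
i, j = C.op.axis
bx, tx = s[C].split(i, factor=64)
s[C].bind(bx, te.thread_axis("blockIdx.x"))   # map loops onto the GPU's grid and threads
s[C].bind(tx, te.thread_axis("threadIdx.x"))

func = tvm.build(s, [A, B, C], target="cuda")
print(func.imported_modules[0].get_source())  # the CUDA C source emitted for this kernel
```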

Akshay Manchale 00:21:12 Put another way, I can't just write machine learning code in, say, C, code that's effectively doing machine learning in a binary, and then feed that into TVM, because you don't really have that semantic information.

Luis Ceze 00:21:23 Exactly right. If I give you a bunch of assembly code, it would be hard to discover that it looks like you're doing a matrix multiplication, or it looks like you're doing a logistic regression, right? That's hard. Preserving that at a high level enables us to do powerful optimizations.

Akshay Manchale 00:21:37 So what sort of libraries or model descriptions do you support? Can you describe that landscape of what's available? I know PyTorch is one, ONNX is one. So what do these things describe the models as, and how do you consume them?

Luis Ceze 00:21:52 I think it's fair to say they are declarative ways of describing your model. You declare what the layers look like: you say, I have a convolutional layer, now I have an activation function, I have a matrix multiplication. You describe what the layers look like and where the inputs go, so you're effectively describing this data-flow graph that Relay reconstructs in an abstract way, right? That's effectively what all of them do. In fact TensorFlow, the name really means tensor flow: you flow tensors between operators. It's a fairly general concept of declaring a data-flow graph and expressing very clearly what the data types are.

Akshay Manchale 00:22:24 Let's talk about the optimizations that you mentioned TVM can actually perform on your input model. One thing you said is operator optimizations, where you can fuse operators. Do you have a simple example, maybe, that people who are not deep into machine learning can follow, of what sort of optimizations are actually possible?

Luis Ceze 00:22:43 The highest-level one is really what we call hyperparameter tuning, right? Essentially you have parameters in your model, and hyperparameters describe aspects of your architecture that you can tune and optimize for the use case and for the target deployment. But these are outside TVM's domain, because the way we think about this, the model that TVM takes as input is essentially a model that has already had its hyperparameters tuned and its parameters trained. That's not part of what TVM does. TVM really treats a model as a program to be compiled to a specific hardware target. It doesn't change the model; except for quantization, where you change the data type, by and large TVM does not change your model. And the type of optimizations it does is what I described before, right? It really finds the best way to lay out the data structures in memory for each one of your tensors, it figures out what the right instructions to use on the hardware target are, it figures out what order you want to execute the loops in, and all of that leads to a large space of possibilities that TVM navigates to produce the right code. Did I answer your question? Is that what you were asking?

Akshay Manchale 00:23:44 In a way. Regarding the kinds of instructions that you can use: people who are writing normal systems programs for a general-purpose CPU might be familiar with x86 SIMD instructions, which operate on a vector of data. So are there more specialized ones coming out of specialized hardware that you can exploit?

Luis Ceze 00:23:53 Absolutely. Some of them are activation functions that are hardwired, right? You have a sigmoid function that you would otherwise implement in software; a specialized hardware unit might have one that does it in a single go. Another one is convolutions, 2D convolutions, some popular convolutions that you would otherwise express as code with multiple hardware primitives in a typical processor, with a series of SIMD instructions; you can execute them as a single operator in hardware. That's another one. And then you keep specializing, right? Basically the game these AI chip companies are playing is essentially looking at the model architectures, looking at what the popular models want to do, and then hardwiring as much of that as possible as a single big blob that you can call. And this is why it's becoming quite a zoo out there, right? Because the number of species, the number of different primitives available across all the hardware targets, is so overwhelmingly large, and navigating that is hard. That's one of the big motivations for why we built TVM: dealing with this Cambrian explosion of hardware targets and hardware options and offering a clean abstraction.

Akshay Manchale 00:25:05 The auto-tuning, and searching through that space of possible instructions that you can benefit from for your model: is that where it comes in?

Luis Ceze 00:25:13 In part, yes. But the auto-tuning is general; it applies even to a single operator. Suppose you have a convolution, right? You have code for it, and as I said, there are going to be nested loops; it's going to call some existing hardware primitives, the vector primitives that might be available, and it takes as input some specific data types and specific tensors. So now, when you look at the cross-product of data layouts, loop ordering, tiling, and then hardware instruction choices, it's easy to get to, you know, hundreds of millions, if not billions, of possibilities, right? For the same piece of code, all of them are semantically valid; all of them are likely to do, or should do, exactly the same thing. So the question is, how do you pick the fastest one? There might be one that's a hundred times faster than another one, right?

Luis Ceze 00:25:57 One brute-force way of doing that is trying them all. But now you might be thinking, wait a second, let me think: this is going to take a while, right? Imagine this: you're generating your code variants, you're compiling them down, you're running and measuring them, and you're making a decision whether or not each one is faster than your previous version, and then you pick the fastest one. If the space is on the order of billions and it takes a second to do each one, that's a lot of time; it's a lot of computational cost. A billion seconds is a long time, right? So what we do in TVM, and others are doing this too, and a similar general technique has been popular in high-performance computing for some time, is essentially create a very fast and fairly accurate way of predicting whether a given alternative is actually likely to be good, or better than what you had before.

Luis Ceze 00:26:46 Right? So given a set of alternatives, can you rank them without having to compile, run, and measure? The way we do that is by building a machine learning model that predicts performance properties of pieces of code. Given this template that I told you about, where you have a set of parameters you would fill in to go and compile, run, and measure, we extract those as features, we run a few times, and we build a predictive model that says: based on these features, based on these decisions making up the code, can you predict whether or not something is faster? You could try to predict the actual run time, but that's hard; you can do it, and people have done that to some extent, but I think it's much more interesting to say: given a set of alternatives, tell me which one is likely to be the fastest without running any of them, just based on their features.

Luis Ceze 00:27:31 And that's what we do. It's somewhat dependent on the specific code template and the hardware target. We had a paper about this, and it's a good chunk of Tianqi Chen's PhD thesis; he's a former student I co-advised, co-founder and CTO of our company, and also a professor at Carnegie Mellon University now. Really brilliant work that showed you can actually build these predictors fairly accurately and use them to speed up tuning. So how do you use them? Well, these models don't take seconds to evaluate; they can take milliseconds or less, and in some cases you can push it down to microseconds. That means the evaluation can be something like a million times faster. So now, for all these alternatives, you can pick which ones to actually run, and instead of running all of them you run just a handful, maybe 50 or a hundred, and then you pick the fastest one. This is fairly accurate, and that's how we do it: we use that to predict what the fastest version is. And in case you're interested, it's a super interesting question how well these learned predictors, built for pieces of code on a specific hardware target, translate to other hardware architectures that are similar enough, or to other pieces of code that are similar. So there are transfer learning opportunities there too.
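
Here is a sketch of what this tuning loop looks like with TVM's AutoTVM and its XGBoost-based cost model, assuming `mod`, `params`, and `target` from the earlier compilation sketch; the trial count and measurement settings are illustrative.

```python
import tvm
from tvm import autotvm, relay

tasks = autotvm.task.extract_from_program(mod["main"], target=target, params=params)

measure_option = autotvm.measure_option(
    builder=autotvm.LocalBuilder(),
    runner=autotvm.LocalRunner(number=10, repeat=1, timeout=10),  # the "hardware harness"
)

for task in tasks:
    tuner = autotvm.tuner.XGBTuner(task)   # learned cost model ranks candidates before running them
    tuner.tune(
        n_trial=200,                       # only a small sample of the huge search space is measured
        measure_option=measure_option,
        callbacks=[autotvm.callback.log_to_file("tuning.log")],
    )

# Re-build the model using the best schedules found during tuning.
with autotvm.apply_history_best("tuning.log"):
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target=target, params=params)
```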

Akshay Manchale 00:28:39 I think I'm missing one aspect. Let's say I use TVM: I download the library, I feed my model in, and I have my target architecture, which might be a GPU or something, and I want to benefit from this auto-tuning capability, but I've never used it before. You have a model that predicts which candidates might be good, right? The way I understand it, you need some training data in order to say, oh yeah, this one is actually faster, or this decision is correct. So how do you bootstrap that right out of the box?

Luis Ceze 00:29:10 When you install TVM and you want to run it and use auto-tuning, you have to have a hardware harness; you have to have a benchmarking test bed, right? That could be one GPU or one CPU, whatever you're targeting, or a few of them to parallelize it. And it runs little experiments to bootstrap that. So it actually starts by running a few experiments, extracts the training data, trains the model, and then keeps improving the model along the way. As a quick plug here, if I may: if you use, say, OctoML's SaaS platform, we have that training data just ready to go. Typically we have the models ready to go, and we also have the hardware test bed ready for you to use, both for CPUs and GPUs that are available in the cloud, and also edge CPUs and GPUs as well.

Akshay Manchale 00:29:55 But otherwise you need the target architecture to be able to kind of get better at it?

Luis Ceze 00:29:59 Yes, and you have to set it up. Part of the value-add of using the SaaS platform is that you don't have to do any of this; it's turnkey, right? You can even invoke the service with two lines of code via our API.

SE Radio 00:30:13 SE Radio listeners, we want to hear from you. Please visit se-radio.net/survey to share a little information about your professional interests and listening habits. It takes less than two minutes to help us continue to make SE Radio even better. Your responses to the survey are completely confidential. That's se-radio.net/survey. Thanks for your support of the show. We look forward to hearing from you soon.

Akshay Manchale 00:30:38 With all of these optimizations: one thing you mentioned earlier was quantization, where you might convert, say, floating point numbers into integers because they might fit better. So does that change the end result, the accuracy of the model?

Luis Ceze 00:30:52 Yes, it does. So the question here is: how do you do that in a way that changes it as little as possible, right? There are many ways of dealing with that. One way is to quantize and then retrain a bit, refine the training a little, so you can refine the parameters. That has a cost; it's not ideal, but it's something you can do. And we had a paper; Josh Fromm, our head of machine learning systems at OctoML and also a UW PhD, wrote a paper at MLSys, published last year in 2020, that talked about ways of actually estimating accuracy in a local way, right? Essentially you can make decisions just by evaluating portions of your model, not end to end, and still make good decisions about how much you can quantize without affecting the accuracy.

Luis Ceze 00:31:36 We have versions of that, and we can evaluate and bound the level of accuracy degradation without having to retrain. I don't want to go too deep on the paper; we could talk more about that, it could be a whole other conversation, but basically you can make very local estimations of accuracy degradation with synthetic inputs. You don't have real data; we don't want to be exposed to real data, because that's complicated, right? So we can actually estimate the accuracy degradation with synthetic data for parts of your model. But then what we want to offer the user in the end, for their data, is a Pareto curve of performance gain versus accuracy degradation, so you can pick what your level of tolerance is, right? Some people might say, okay, I'll take, you know, a 0.1% accuracy degradation if I get at least a 2x speedup from this scheme, for example. You can make those decisions based on a curve of accuracy degradation versus performance gain.
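
As a rough illustration of the idea of estimating degradation with synthetic inputs (not the method from the MLSys paper itself), one can compare the float32 and quantized builds of the same model on random data; `mod`, `params`, and `quantized_mod` refer to the earlier sketches, and the input name and shape are assumptions.

```python
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_executor

def run(built_lib, data):
    m = graph_executor.GraphModule(built_lib["default"](tvm.cpu()))
    m.set_input("data", data)                   # assumes the model input is named "data"
    m.run()
    return m.get_output(0).numpy()

with tvm.transform.PassContext(opt_level=3):
    fp32_lib = relay.build(mod, target="llvm", params=params)
    int8_lib = relay.build(quantized_mod, target="llvm")

synthetic = np.random.rand(1, 3, 224, 224).astype("float32")   # synthetic input, no real data needed
drift = np.abs(run(fp32_lib, synthetic) - run(int8_lib, synthetic)).mean()
print(f"mean output drift on a synthetic input: {drift:.4f}")
```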

Akshay Manchale 00:32:24 Is that something you can feed into TVM as an input, or is that just an experimental kind of thing, where you try it out and see where it falls on that curve?

Luis Ceze 00:32:31 You can actually build an outer loop around TVM to do that, and some people have done that. In fact, we've had papers that showed how to do it; that's how the Riptide work we just talked about did it. And I know others have built, you know, this outer loop around TVM to make those decisions too.

Akshay Manchale 00:32:45 That's very interesting. In terms of actually deploying to that hardware target architecture, that's kind of toward the end of the machine learning life cycle, so to speak, right? Are practitioners, you know, data scientists, somehow aware of where they're going to deploy it in the end? Is that going to change their decision making about whether they use a certain type of parameter?

Luis Ceze 00:33:05 We want data scientists to not have to worry about that. The unfortunate reality today is that data scientists and folks building machine learning models tend to worry about the cost of their models and where they're going to deploy them too early. If they know they're going to deploy on a GPU, because that's what they have available, that's what's cost-effective, they start letting that affect their model design decisions too early. And I think that hurts them, because they might be making decisions that prevent them from going in a direction where the model would be better, more accurate. If they had gone that way, at the end they would have a more accurate model, and then they could let things like TVM or TensorRT or other optimizers do their job and recoup a lot of performance later. So our vision here is to free data scientists from having to worry about these deployment aspects early on in model development. We want them to develop models with their primary goal in mind, which is to build the best model they can with the data they have. Because today the reality is that they build the best model with the data they have in mind and some deployment constraints. We want to say: remove the deployment constraints, build your best model, and abstract the deployment aspects away, and leave those to folks like us who do that for a living, right?

Ashkay Manchale 00:34:15 Yeah, no, that makes a lot of sense because, you know, if you’re a data scientist, you might not have a computer science background, a deep systems background to be able to make those decisions early on. So it makes sense for them to not worry about that. That’s very useful. Yeah.

Luis Ceze 00:34:28 Yeah. You know, this reminds me of that quote: premature optimization is the root of all evil in programming. Have you heard that before? Premature optimizations are bad because you increase code complexity and you make it harder to maintain, and it's not clear it pays off, because if you do that, it might preclude other optimizations later. We say that when writing regular code, and I feel like what we're talking about is a similar thing here in machine learning: trying to optimize for deployment early on, in the process of developing your model, is just premature optimization. It should not be part of the story, right?

Akshay Manchale 00:34:59 There's a lot of what seems like magic happening underneath. If I'm a data scientist and I say, okay, this is the model, and I hand it over to run on some target architecture, there's a lot of what appears to be just magic before it lands on the target architecture. How do they go about, say, debugging, understanding whether the model is performing the way they actually expect?

Luis Ceze 00:35:20 Observability and debugging are a big topic in what is now being called MLOps, you know, parallel to DevOps but for machine learning models. Let's start with debugging, right? Part of debugging within model development is, say, your model is misbehaving on some specific class of inputs. A lot of that should be caught earlier: if the model was done right, you should have had a pretty representative data set to test with, but sometimes it just wasn't exposed to that. So what you need to do is add instrumentation to your model, and then, if there are surprises in deployment, you collect those inputs, feed them back, and go refine your model. That's one way of doing it. You put in some safeguards to know, hey, your model is misbehaving, because we have ways of doing some control tests.

Luis Ceze 00:36:02 And then whenever it fails, or there are some inputs where, you know, some guardrails are tripped, you collect the data and then you go back and do some model refinement. That's one way of doing it. The other way is what we call model explainability, which is essentially trying to describe what it is that a machine learning model is doing to a human being who is a domain expert. And then if you eyeball that, you can say, okay, you know what; let me give you an example, a classic one. Say you're going to build a machine learning model for medical diagnostics, right? Typically this is a black box, and it can be very accurate against the test set, but it can lead to surprises that can be very costly when you're making decisions that directly affect human lives, in terms of healthcare decisions.

Luis Ceze 00:36:42 Right. So what you want is to be able to abstract that model in a way that you can expose to a panel of medical experts who can look at the model and say, oh, this decision here doesn't seem right. So this model explainability is also a hot topic in AI/ML research, and it's turning into practice today with several emerging companies, like WhyLabs, for example, which works on model observability and model explainability. That helps you understand not only what your model is doing, but also how it is performing in deployment, by looking for anomalies and so on.

Akshay Manchale 00:37:12 You would have to do that early on rather than at the end, but you can observe it at the end, I guess?

Luis Ceze 00:37:17 You do that along the way, right? Let's say that once you have a model ready to be deployed, what you want to do is add instrumentation to monitor how the model is doing and collect inputs when some unexpected conditions are matched. That's one way. The other one relates directly to model explainability: before you deploy it, you explain the model to make sure, are there any potential bad scenarios here that aren't being covered, or scenarios that are unlikely but aren't covered well by the model? You do that with model explainability.

Akshay Manchale 00:37:45 With TVM, in a way, you know, you can run this on multiple target architectures. What are the reasons for that? One, like we've discussed, is performance, to get the best out of your model, but are there other reasons why you would want to run it on different target architectures? Are there other motivations?

Luis Ceze 00:38:01 Many. One is you might have no choice, right? Let me give you an example. Say you sold a bunch of cameras that have local compute in them, and you want to do computer vision in them. The cameras are already installed; you have a large installed base. You don't have a choice; you have to live with whatever architecture is there. Your model has to run there. So that's one. That's likely to be true for self-driving cars, and autonomous vehicles in general: once you deploy a fleet of those, it actually takes work to retrofit and upgrade the car, so you have no choice. If you want a better model to run, to make your self-driving car safer and better and so on, you've got to live with the architecture. That's one case, right? And also mobile phones, right?

Luis Ceze 00:38:38 You want to be able to cover as many mobile phones as possible. Say you are deploying your model to Android phones: there's a large number of chips that differ across Android phones, and you have to live with the collection of users that you have, right? So in that case, one, you have to live with that diversity, and second, you want to make sure that your model is fast enough and has resource requirements compatible with how you're going to deploy it. An optimization engine like TVM is an enabler there, right? In some use cases we've had here, we deployed on a pretty unique but popular architecture that is used in cameras, and we offered 30x better performance. That was the difference between actually being able to deploy something and not being able to deploy, right? So that's one. The second one is when you deploy models in the cloud: you have throughput-per-dollar constraints, right?

Luis Ceze 00:39:27 If you're running a model at scale, and you're doing, say, tens of millions or maybe hundreds of millions of inferences a day, it adds up pretty quickly, and that can be a very, very significant cloud bill. What's great about that use case is that if you make your model 2x faster, you fundamentally make it 2x cheaper, because you use less of your cloud resources, right? So in that case, what you want is, for a given model and the wealth of instances you can choose from, to be able to pick the one that gives you the highest throughput per dollar. But today, if you have to do manual engineering for each one of those, first of all, you delay time to market too much, because now you have to go and evaluate all these architectures. And second, sometimes it's not cost-effective, because you're going to pay engineering time to do that.

Luis Ceze 00:40:09 And it might not be a win in the end, because what you're going to save in operational cost might not be worth it, right? So being able to move around and deploy to the most cost-effective cloud architecture is another use case. And the third one is, even for cloud-based deployments, sometimes latency is a big factor. Let me give an example. If you're doing document analysis, you upload a document and you want to know immediately whether or not there's something you should look at; there's human interactivity, right? Chatbots are another one: you want to be able to evaluate your language model fast enough, right? That means that if you have a model that has better accuracy, and leads to a better product experience, but it's too slow to run on the available hardware today, you can't ship it, right? So it's an enabler when you hit a certain performance target, and then it's cost savings, because you save costs in the cloud or even save costs at the edge. If you're building a purpose-built device to monitor something in the field using machine learning, you want to find the cheapest hardware that can do that, right?

Akshay Manchale 00:41:06 Yeah, that makes a lot of sense. I mean, like you said, it's an enabler, and, you know, as consumers we see machine learning being used in a variety of different ways on a daily basis when we interact with computers, or even outside of that, with voice assistants and things like that. So that's pretty cool. I want to switch gears a little bit and talk about the engineering behind TVM itself. Traditionally, compilers are difficult. If you're studying them in school, it's probably the hardest thing you can study; if you're actually working on one, it's probably the hardest thing you can work on, because there's a certain expectation of accuracy and correctness that isn't normally expected of other software. I'm sure most people will go their entire software engineering careers without encountering a compiler bug. But you're building something that is slightly different, where you have some amount of fuzziness in what your programs themselves are doing. So how do you go about engineering it, building it, and testing for correctness?

Luis Ceze 00:42:02 First of all, on the engineering side, I have to say that TVM now has a big open source developer community. We have more than 570 lifetime contributors, so it's really a huge effort that we're extremely grateful for. And I think that, given the complexity of what the compiler does, but also the diversity of models, frameworks, and hardware targets, the cross-product is wide. So the only way we can make it future-proof, and make it even tractable, is by having folks contributing from multiple angles: machine learning engineers supporting new models, hardware vendors supporting new hardware targets, and folks in the middle improving and enriching the infrastructure, right? That's just to say that a lot of really talented developers have contributed code to it. And then, how do we deal with correctness?

Luis Ceze 00:42:48 Well, one of the great things about compilers is that, despite being complex, you have very well-defined rules on how you do transformations in your code, right? So you can take a formal verification approach. Folks have talked about formally verifying parts of what TVM does, and that might have happened; I apologize that I don't know exactly what parts might have had that. But by and large we just do very, very thorough testing, right? We have test cases that test the validity of transformations in isolation, and then ways of showing that they compose, and so on. We could go and do more and more formal verification, but as you know, formal verification does not scale very well, so you have to do it in a very, very limited way. So, I know it sounds like a boring answer, but it boils down to very, very thorough testing and having really good test coverage for correctness and performance regressions, right?

Akshay Manchale 00:43:41 What about hardware targets? Are there hardware designers who actually write code for TVM, in terms of actual hardware support?

Luis Ceze 00:43:48 One of our co-founders, my former PhD student Thierry Moreau, in fact did part of his thesis on an open source deep learning accelerator. The design is open, by the way, and it's also part of the overall TVM package; it's called VTA, the Versatile Tensor Accelerator. That's one example of hardware-software co-design: designing the architecture with an ISA that gets exposed to TVM. And in that case Thierry, who was the main hardware designer there (there were others involved, but he was the lead designer), got involved in essentially finding ways of exposing the ISA such that TVM can do its optimizations, right? So that's one example that happened internally. But to answer your question directly: yes, we do have hardware vendors that contribute code directly, and they tend to know their architecture really well, and that helps, of course, to have the right hooks to enable different types of optimizations in TVM.

Luis Ceze 00:44:37 For example, it's no secret that Arm is a big participant in the TVM community, AMD is a big participant, Qualcomm is a big participant in the TVM community. And we're starting to see other hardware vendors contribute too, like Xilinx, for example, and so on. We see some folks from Nvidia attend our conferences as well, and we saw a little bit of commits from them. Of course, the hardware vendors are sensitive about not exposing sensitive IP, but what they're doing is not hardwiring anything to their hardware; they're essentially writing the right primitives to be able to do the auto-tuning and the code transformations, to be able to support their primitives well.

Akshay Manchale 00:45:14 Let's talk about how all of this comes together at a solution level. I'm trying to understand: say TVM does its thing, and I deploy a model to a target architecture, but the rest of my program is in x86, running on a server-class machine, and then you have these pipelines of data coming in. You know, maybe I'm getting data from users, I'm getting something from a database, and you have your specialized hardware possibly just waiting for something to arrive at it for computation. So at a solution level, how do you see companies actually optimizing performance end to end?

Luis Ceze 00:45:51 TVM produces, as I said, code as output that you have to package, right? There are many ways you can package the runtime with a well-defined API, but that goes a little bit outside TVM. One of the things that we do at OctoML as part of our solution is to essentially wrap up your model into a variety of packages that make it convenient to deploy. You can have a Python wheel with Python bindings that you can call with a well-defined interface, or a shared library, an .so; we can support Windows packaging, for example (we go where the customers want to go), or a Docker container, or a gRPC function that you can deploy and call as a microservice. There are many ways in which you can package it. At a solution level, what we provide is these various packaging methods and ways of calling it with just a couple of lines of code, right? You're absolutely right that for this to be a viable solution and actually integrate with the rest of the application, it needs to have the right packaging, the right API, and that's why we support a variety of those.
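
On the packaging point, here is a minimal sketch of the consuming side: load the shared library TVM exported earlier and wrap it in a small prediction function that application code can call; the file name, input name, and shape are assumptions carried over from the earlier sketch.

```python
import numpy as np
import tvm
from tvm.contrib import graph_executor

lib = tvm.runtime.load_module("model_tuned.so")       # the artifact exported earlier
module = graph_executor.GraphModule(lib["default"](tvm.cpu()))

def predict(batch: np.ndarray) -> np.ndarray:
    """Run one inference; `batch` must match the compiled input shape."""
    module.set_input("data", batch.astype("float32"))  # assumed input name
    module.run()
    return module.get_output(0).numpy()

# Application code (a web handler, a gRPC servicer, ...) just calls predict().
print(predict(np.random.rand(1, 3, 224, 224)).shape)
```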

Akshay Manchale 00:46:48 Maybe this is something I should have asked earlier, when we were talking about the front end and the back end of the compiler, but can someone go in and look at the intermediate representation TVM produces from a model and then actually hand-tune things at that level?

Luis Ceze 00:47:00 Yes, absolutely. In fact, you can print out the TVM IR, eyeball it, and go and make transformations. You can write code directly at that level if you want to. In fact, one example that we've done, not super recently but I think in the past year or so, was a sparse operator for language models that was written directly in the TVM IR. And it has a bunch of benefits, because you can express things in a more efficient way. We love to see scenarios like that. Yeah.

Akshay Manchale 00:47:32 Yeah. I guess it's kind of like not writing machine code, but writing something at a slightly higher level, but still low level.

Luis Ceze 00:47:38 Yeah. Yeah. That’s exactly right. So it’s sort of like a slightly higher level, but low level enough for it to express efficient code. Right.

Akshay Manchale 00:47:45 Yeah. And I guess you could also use the IR to understand what TVM's interpretation of your model is.

Luis Ceze 00:47:52 Exactly, and to make sure you have more and more understanding of what it is that runs your model, right? The IR level helps you understand where the performance is coming from; it gives you a ground truth of what it's doing, and so on. Yeah.
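
A short sketch of inspecting the IRs directly, assuming the Relay module `mod` from the import sketch and the scheduled matmul (`s`, `A`, `B`, `C`) from the tensor expression sketch earlier.

```python
import tvm

print(mod)                                         # graph-level (Relay) view: TVM's interpretation of the model
print(tvm.lower(s, [A, B, C], simple_mode=True))   # loop-level view: where hand-tuning at the IR level happens
```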

Akshay Manchale 00:48:04 That's very interesting. So, one, where do you see this huge transformation in the machine learning space going, and two, where do you see TVM going in the future?

Luis Ceze 00:48:14 First of all, I think it's undeniable that machine learning models, and ensembles of models, are becoming part of almost every single interesting application that we interact with, right? From language models applied to every text that you type, to computer vision being done in video chats, to intelligent user interfaces, to code synthesis; machine learning models are just going everywhere. And bringing machine learning development to the state where regular software development is today, with the right integration methodology, CI/CD, optimization, testing: all of this is something that needs to happen. I see this maturity starting to happen. I think as an industry we're still relatively early in this MLOps flow, but I firmly believe that as AI and machine learning in general continue to deliver value, and new experiences and applications make them integral parts, there will be more maturity there, and there will be more and more users who just start integrating this as if it were any other module.

Luis Ceze 00:49:10 Right? So that's where I see it going: becoming just a natural part of how you build new applications. Where I see TVM going in the future is more and more automation, and more and more support for this plethora of specialized hardware. If you think about it, even though we keep hearing about AI chips, there are very few that are truly mainstream yet. Nvidia GPUs arguably have a lot of specialized stuff there, and AMD GPUs are starting to become more and more interesting for machine learning, in part because TVM is starting to support them pretty well, but you haven't seen many of the other ones. You haven't seen Graphcore, Cerebras, you know, all of these other ones that are about to hit the market, and many others, Groq and many others, that are about to get there. So we definitely want to see TVM continue to be the abstraction that you use so you don't have to worry about how you get your models deployed. For that, we need to continue making it easier to onboard new hardware targets, we have to continue building more automation, and we have to accommodate the front-end extensions and so on. And I see TVM continuing to be established as sort of this open de facto standard that supports all the model frameworks and all the hardware targets that people care about.

Akshay Manchale 00:50:19 That's actually very interesting, and for the systems community in general it's exciting to see the sort of rapid change that's happening through machine learning and these motivating examples. Luis, thank you so much for coming on the show. Thank you very much.

Luis Ceze 00:50:32 It was a really fantastic chat. Thanks again, I really enjoyed this. You're very talented, you know.

Akshay Manchale 00:50:37 Thanks, and thanks for coming on. This is Akshay Manchale for Software Engineering Radio. Thank you for listening.

SE Radio 00:50:44 Thanks for listening to SE Radio, an educational program brought to you by IEEE Software magazine. For more about the podcast, including other episodes, visit our website at se-radio.net. To provide feedback, you can comment on each episode on the website, or reach us on LinkedIn, Facebook, Twitter, or through our Slack channel. You can also email us. This and all other episodes of SE Radio are licensed under the Creative Commons 2.5 license. Thanks for listening.

[End of Audio]


SE Radio theme: “Broken Reality” by Kevin MacLeod (incompetech.com — Licensed under Creative Commons: By Attribution 3.0)
