Episode 473: Mike Del Balso on Feature Stores

Filed in Episodes on August 19, 2021

Mike Del Balso, co-founder of Tecton, discusses feature stores and how they help operationalize machine learning. Akshay spoke with Mike about the data engineering challenges of connecting the signals that make up a feature, drawn from various data sources, to a machine learning model during training, and of serving that data in production. Mike talks about the challenges engineering teams face in building custom data pipelines to connect and transform data from various sources in order to deploy a model. He discusses feature stores, an emerging area in operational machine learning that provides an end-to-end data platform to automate common data engineering work, and explains how feature stores can automate connecting, transforming, storing, and serving the data required across the entire machine learning lifecycle.

Transcript brought to you by IEEE Software magazine.
This transcript was automatically generated. To suggest improvements in the text, please contact content@computer.org and include the episode number and URL.

Akshay Manchale 00:00:48 Welcome to Software Engineering Radio. I’m your host, Akshay Manchale. My guest today is Mike Del Balso, and we’ll be talking about feature stores. Mike is the co-founder of Tecton, where he’s focused on building next-generation data infrastructure for operational machine learning. Before Tecton, Mike was the PM lead for Uber’s Michelangelo machine learning platform. Prior to Uber, he was a Product Manager at Google, where he managed the ML systems that power their search ads business. Mike, welcome to the show.

Mike Del Balso 00:01:15 Hey, thank you for having me. Awesome.

Akshay Manchale 00:01:16 So let’s start with some context for the rest of the episode. Can you briefly describe what a feature store is?

Mike Del Balso 00:01:22 Yeah, so we think of a feature store as a data platform that supports the data flows needed in a production machine learning application. Machine learning applications have two components: there are models, and there are data pipelines that feed the right input signals to those models so the models can use that data to make a prediction. And there are a lot of operational challenges around both the models and the data. We need to organize these models and data pipelines; we need to productionize, serve, and monitor them. Those signals, the data inputs to a model carried by those pipelines, are called features. So the term feature store refers to a piece of infrastructure that organizes the data pipelines for a machine learning application and provides some of the implementation for productionizing those pipelines and using them in a customer-facing ML application, say a live recommendation service or a live fraud detection service.

Akshay Manchale 00:02:33 Let’s take that example of, say, a live recommendation service. Can you explain the terminology, what makes a feature, what is a model, in the context of that live recommendation service?

Mike Del Balso 00:02:47 Great. So, if we’re an e-commerce company and we want to show a recommendation to a user who’s browsing the website, we want to recommend an item that they may also want to buy. Intuitively, we may have a variety of signals that influence what we think this user might be interested in, right? Maybe this user was previously looking at women’s fashion, so we know they’re probably interested in items related to women’s fashion. But maybe in today’s session on the website they were clicking around a lot, specifically in pants, and that gives us another clue that they’re looking for women’s pants. And they may have also just typed “capris” or something like that into a search box.

Mike Del Balso 00:03:45 And so that’s, that’s another signal, that’s actually a very high intent user signal from there, which is kind of something that’s delivered in real time. That’s not about looking at the history of the user. So, point is like all of, all of these signals that all this information we can collect about a user for, a given kind of decision we want to make is, we can all use this to kind of feed into a decision making process. And we can do that intuitively as humans, you know, it’s kind of obvious like how to think about this stuff, but it’s not always clear how to develop a software system to generate that recommendation and an accurate recommendation and not just to develop such a system, but to also like run your business on that. You know, if you have, you have a website that’s making money, how do you make this a reliable software application?

Mike Del Balso 00:04:34 How do you have reliable predictions that are tuned to the high quality predictions that lead to more sales, for example, and so high quality recommendations. So, the parts there are, you know, we need a machine learning model, which is something that is the system that takes these data inputs in and makes the decision of, hey, this is the item we should recommend in this instance. And that machine learning model is trained. So, it learns from a variety of historical examples. So, we would like learn this machine learning model from a variety of historical examples of impressions that the user has seen from previous recommendations and, also data about whether they clicked those, those recommendations or not. And that’s called the training dataset. And that’s what a machine learning model is built on. It looks at all these historical examples. We run those examples through what’s called a training algorithm, and that all put basically a decision-making system that can act on, it’s almost like a transformation that acts on the input, some input data and can output which item we want to recommend.

Mike Del Balso 00:05:51 And as you build that model, you can configure the training algorithm to optimize for different things. And so we may configure it to optimize for, I likely like the highest click-through rate. Like we want to recommend the object, maybe the piece of, like the fashion article clothing, that is the most likely thing that we think someone will click on, and that will correlate quite well to, increasing our revenue if we have higher quality recommendations. So that model depends on having these signals that we just thought about, and we just talked about intuitively, right? Like what does this person typically search for? And what did they type in the search box right now? The model needs that data in a very structured format, in a very, in a very specific format for it to be able to act on it and for it to be able to interpret that data and use for its decision-making process.

Mike Del Balso 00:06:52 And so there’s special data pipelines that are set up that one has to build while they’re building a machine learning application, to both generate that historical training dataset for a model to be trained on, and also to generate the latest versions of all of these signals when a model is making predictions. So, when we’re building a model, we need to look through it, process our logs data to come up with here’s all the times someone saw a recommendation before this is what we knew about them. These are the signals that we knew at that, at that point in time, in the past. So, this is kind of like historical feature values. At this time, in the past, someone had just searched for men’s pants. Someone had just searched for a hat, something like that. And then this is what we recommended.

Mike Del Balso 00:07:41 And then this is a, whether or not that user clicked on the recommendation or not. And we compile all of that. And that becomes the training data set. And it’s, it’s a real data process. There’s a lot of people put a lot of work into, into the data process for compiling that data set. And that’s what we, you know, you often hear ML people say like, you know, 90% of doing machine learning is data cleaning data preparation. That’s what that work is to prepare that dataset. And then when you want to use that model, and now we’re getting into the deployment, we’ve built, we’ve built the prototype at least, we’ve come up with the model we want. We want to productionize that there’s a similar set of data pipelines that are used to deliver those data signals to the model at production time. And those have to be real-time pipelines that represent the current state of those features for the current session, the current user, their current context. Does that make sense?

Akshay Manchale 00:08:37 Yeah. The second part of that is what you mean by operationalizing your machine learning model. Can you walk through what the general challenges are in taking this recommendation system that I may have built with training data, test data, and a set of features, and then operationalizing it in an actual production system? What are the challenges there?

Mike Del Balso 00:08:59 Yeah, I think the biggest challenge is actually that there’s just not one way to do this. It’s kind of a new thing for a lot of companies, even software companies that have been building software and depending on it as their product for a long time; adopting machine learning is a new challenge for them. And that brings with it some different development processes. There are different ways to productionize, which we’ll talk about, but there are also different ways to monitor and maintain these systems, different things one has to track to identify issues, to debug these systems, and to address outages. So, what is unique about the challenges of building and deploying ML systems compared to typical software?

Mike Del Balso 00:10:01 There’s a really interesting kind of difference here is that we’re not just compiling a single binary and deploying a binary. Or compiling something that has a very specific external dependencies on data pipelines, on data resources that are always being updated. So, when people talk about like him deploying my machine learning system, it’s your machine learning system is both the model artifact, but it’s also, these, these data dependencies that need to be maintained. And they are always changing. They’re very dynamic. They, you know, that’s, that’s data, that’s always being processed and, you know, deploying a simple machine learning system is not as big of a challenge as, deploying some of the more complicated or sophisticated and all systems that are used for a higher, for more like higher value ML use cases. So, they’re very important to that business. So, think about like a pricing system for an insurance company, or think about like surge pricing for Uber or the estimated time of arrival for Uber, or even estimated click through rate for Google, where Google is trying to, you know, make sure they show you ads that you would click on, right?

Mike Del Balso 00:11:21 They don’t want to show you ads you’re not going to click on. And a lot of these use cases have they’re very valuable in a high accuracy is very important for these use cases and that motivates and finding as many ways as possible to bring in additional signal, to make a model more accurate. So, the more fresh signals, the more context we can get, the better kind of information and data we can get about what’s going on right now that that is kind of obvious. Like it’s intuitive for us as humans that we can feed into a model, more likely the better that that model will be, and that will lead to some business impact. So, in practice, what happens is these ML applications that are quite meaningful that are, that have real economic impact. Like some of the ones I just mentioned, there’s a really complicated whole nest of, data pipelines, different services that are built to support these, to process all the different types of data that the, that the businesses seeing that the business has and, and translate that data, extract the signal, aggregate that data, you know, mix and matches data and extract the signal and convert it to a format that’s consumable by the model and deliver it in real time to a model.

Mike Del Balso 00:12:46 So some of these signals are computed in real time. Some of them are, you know, batch signals, stuff like that like we just talked about. Some of them are based on streaming infrastructure on kafka, and some of the challenges, some of the biggest challenges for these teams are just keeping all of these tools up and running. And if you’re a data scientist, you’re not going to be an expert in all of these different technologies. And so that’s why, like a lot of these more sophisticated ML, more valuable ML use cases, they have huge teams of engineers supporting them. And that makes, that makes ML kind of like, not accessible, like meaningful ML, not accessible for teams that cannot staff up giant engineering team.

Akshay Manchale 00:13:27 Yeah. So, you do have like data engineers who can actually bring the idea that might be conceived by a data scientist into reality by building these pipelines. And if I’m hearing you right, that you’re, you’re saying the quality of signal also depends on various sources and various characteristics that are in different places. You mentioned kafka so you could be, have streaming data. You could have a data warehouse. So how is all of these things, how are all these things usually connected? Is there some example that you can share, which drives the point about the complexity involved in building these data pipelines from different sources to actually operationalize a machine learning model?

Mike Del Balso 00:14:04 Yeah, totally. Well, so there’s kind of the, like, how are they usually connected for, you know, a lot of companies that are practically getting started with this today? And then there’s, how are these things connected in the most efficient way or in the way that companies that are successful are doing it.

Mike Del Balso 00:14:26 Let’s just take, let’s just take a concrete example, right? Like, we have this machine learning model, like Uber, which would estimate how long it would take your, your Uber eats order to be ready for you, right? And so, there’s a variety of signals, but let’s just talk about two of them, that could become predictive features for your machine learning model. One is how long does it normally take for them to prepare a dish? Super simple that you can just calculate once a week, once a month, you can calculate it even once and, and just make sure that that’s always available to pass into the model. That’s good. But sometimes, you know, sometimes it’s a, busy time of day or sometimes, you know, a restaurant is uncharacteristically busy. Like it just got some spike of orders. And so a second type of signal, which is slightly more complicated as one that’s based on that’s based on some recent data.

Mike Del Balso 00:15:21 So it’s like, you can think of how many orders did this restaurant get in the past 30 minutes and how does that number compare to what they normally get in a 30-minute period? And that can give us a sign of like, Hey, is this like really busy right now? And if so, then they’re probably going to take longer. That’s hard to compute in a data warehouse, right? A data warehouse,it may not have all of that real time, you know all the data from the past 30 minutes being available in it. And there’s a, it’s not meant for kind of like faster operations like that. So, at a company like Uber, what we would do is we have all of these order events have, be populated onto a kafka stream. And then we have some streaming aggregation jobs that are reading these kafka streams, aggregating these data to the number of orders per restaurant, and a time in an aggregation window. That’s like the last 30 minutes always updated in real time so that when we need to make a prediction for, Hey, this restaurant, how long is it going to take to be, to have that have that meal prepared for you? That signal is just is available immediately for that model to make that prediction. And so, but your question is like, how do people like practically, like if you’re just like a small company or a data scientist, how do you do this?

Mike Del Balso 00:16:42 The answer is like, realistically, you don’t like that’s, that’s what was happening at Uber. When we started is just like a lot of this stuff is, is hard. And so at Uber, and we see this in a lot of other companies right now, they’re kind of in like one of three situations. So, situation one is the, it’s a super important use case for the business. And like the business needs this thing to needs this ML model to work no matter what. And so what they do in those cases is they put a bunch of engineers on it. So, the data scientist comes up with some prototype and then they hand it off completely to an engineering team. And the engineering team takes maybe like a notebook of Python code that they may not understand fully from the data scientists. And then they, they productionize it, you know, quote unquote productionize it.

Mike Del Balso 00:17:27 And that means rewriting some data pipelines and in a variety of different, more engineering centric technologies. It means building production services, building the right monitoring, a lot of custom work for a specific, it’s basically building like a custom engineering, microservice to make these predictions. These systems are, tend to be quite brutal, very expensive because you need to staff a whole engineering team on them and hard to update, right? If you’re a data scientist, you’ve got to go through the engineer every single time. So, that’s a challenge with those. And when we started, when we started the ML team at Uber, a couple of years ago, there was, there was only a few instances of teams having gone through that flow successfully, just because it’s so expensive on the engineering side. The second situation is, the data scientist comes up with the prototype and, then they think, okay, this is great.

Mike Del Balso 00:18:23 Like I came up with, I found the model I want to use. I brought together some of the data I wanted to use. And it kind of, simulated what the signals I want to have in that model, in my Python environment. Now, how do I productionize it? And then there’s crickets. And then nobody knows how to productionize it. There’s not a path to do that. Like, there’s basically, there’s not the engineers there to rebuild this whole thing in the production or the operational environment. And then the, and then the data science project kind of just gets blocked there. And then the data scientist has other things people are asking them for. And then, you know, she’ll go off and work on some other projects and that project will kind of die. The third situation, which is very, very common is people don’t start on these projects in the first place.

Mike Del Balso 00:19:05 Data scientists will know, Hey, I have actually no path to production. There’s no way I can, I can build a real-time ML system that’s productionized anyway. So, I’m just going to work on this other problem that we have in the company BI problem. And it makes some dashboards or something like that. That’s very common, both from a data science decision perspective where they think I don’t have a path to production. So why am I going to even bother? And also from the engineering leadership perspective where they think, wow, that, you know, if I put my engineers on this, that’s so expensive for us from an engineering cost perspective, per ML use case that I don’t even want to support another ML use case. It’s just going to incur more technical debt for us. And it’s just going to be more stuff than I have to support. It’s not worth it for me. So, I think the goal right now for the industry is how do we, one, build that path to production for data scientists, like make it possible for them to get their stuff into production and two, lower the cost, the engineering costs so the engineering teams don’t see supporting ML as a burden. How do we help? How do we help a given set of engineers scale to support more and more data scientists over time?

Akshay Manchale 00:20:19 And just to be clear, one of the things that you mentioned, the data scientist just goes off and builds a dashboard that is kind of like the classic analytical machine learning sort of a thing, right. That’s very different from what you’re, what you’re trying to say with respect to actually having models in production. So, can you just briefly describe what that is? And then we’ll jump into, you know, how to solve this whole problem and this challenge of agility.

Mike Del Balso 00:20:43 Yeah. Great question. So, we started talking about, and this kind of started a couple of years ago, but we started really recognizing that the there’s a big difference in the success and the cost of an ML use case, depending on who is the consumer of that use case. So, when you think about like, Hey, this is a machine learning project in my company, it tends to be that if the machine learning project, if its purpose is for some internal consumption, something like some analytics purpose or some let’s extract some patterns from our data so we can look at it and make a decision, or let’s do a forecasting of our sales. Those tend to be a lot less costly than the other types of machine learning projects, which will you call operational ML? So, the first we call analytic ML, second, we call operational ML. The operational ML is use cases that are just in production.

Mike Del Balso 00:21:40 They power your product, right? They’re powering your customer experience. These are things that some of the use cases we just talked about, right? Like the ETA system at Uber or a pricing system that powers an insurance company’s website, right? Stuff, where if it goes down, your product is going to be broken, you’re your business is going to be broken, your customers are going to notice. That requires a different tier of operationalization, a different tier of productionization, and when a machine learning project, when a data scientist switches from moving from analytical machine learning, where kind of they’re in control, and they are the end-to-end owners of the systems and the stakes, frankly, are, are a lot lower to operational machine learning projects where things have SLS. There’s a lot more engineering requirements. There’s real business impact. If something doesn’t perform or something’s not behaving correctly, that’s a real challenge for data scientists. And that’s where they began really needing, engineering support. And so those types of operational use cases, operational ML is actually like what we focus on at Tecton. How do we help people put machine learning at the production and build their products on a different machine learning applications that they’re building.

Akshay Manchale 00:22:54 Okay, great. So now let’s talk about feature stores, right? So, you mentioned they’re used to operationalize your machine learning models. So, can you go deeper into how that actually solves this whole problem of operationalizing a machine learning model?

Mike Del Balso 00:23:08 Great. Yeah. So, the fundamental, like the big costs that we just talked about, about putting a machine learning an operational machine learning model into production is that engineers, a lot of engineering resources are needed to help a data scientist rebuild their model and production rebuild their data pipelines in production. And so at a high level, what a feature store does is it resolves that duplication. So, it provides a way for a data scientist to define the data signals, the features that they need and their model. And when they define that at the training stage at this stage, when they’re building their model and they’re just exploring, you know, what models they want there. And they’re in that exploration phase, when they define those features, those features are defined in the feature store and the feature storing handles the productionization of them. So, an engineering team is not needed for the productionization of these features.

Mike Del Balso 00:24:06 So what that does is it reduces the need for data science teams and engineering teams to collaborate within the core iteration loop for data science. And it enables data scientists to contribute to the production model. There’s no longer this, like throw it over the wall to a data engineer. The data scientist gets to own their work. They can deploy to production on their own, and they become the owners of the data science app, the whole data science model, all the way through to production. This actually kind of changes like what their responsibilities are too, right? Like they’re, they’re no longer the person who just doesn’t really know exactly what’s going on with the production model. They’re the owners of the production model. And so, the feature store is the, management platform and the platform that runs these pipelines in production. So, the workflow that we enable for data scientists, is the ability to define features declaratively, and then register those features centrally in the feature store. And then have the feature store essentially be this kind of this catalog of trusted signals that are usable by machine learning applications during the development process and production. And so, then those features can be, can be used to generate training datasets when a data scientist is developing to build models, and those features are, can also be queried by model services, prediction services, and from the production environment to get the latest values of these features. So, it’s kind of automating a lot of the data engineering work in productionizing machine learning applications.

Akshay Manchale 00:26:08 This sounds like, how databases kind of change way back or how you look at your data and organize, and actually query by having a declarative way to query and look at your data. Right? So, if I would a data scientist, maybe in the past, I had a CSV file that I would just run through a Jupiter notebook or something to look at the model that works best CSV file is having the data, export it from various different sources. So how does a feature store kind of fit into this, where it’s taking you closer to the data directly by bringing that pipeline to you in some way by saying, oh yeah, this is what I cared about, or this is what I want. How does a feature store solve that?

Mike Del Balso 00:26:49 Let’s do like the juxtaposition of the before and after, right? So just for training a model, let’s do training models and then serving models, right? Like development and productionization. So just for training models, what let’s imagine you’re making the second model in your company. So, they’re actually already is you’re making the second fraud detection model. There’s already a fraud detection model in your company. So today what you do, or maybe like in the past, in your example, what you do is connect to various databases, figure out which data you might need, download them to maybe your local machine, open up those data exports in a Python environment, clean up the data, try to join them together. And you got to bring all this data together in one data frame and then run transformations on this data to aggregate this data, to extract out it’s called featurizng and feature engineering, to extract out the relevant signals for the machine learning model that you want to build and train a model, and then see if your model is accurate and then repeat.

Mike Del Balso 00:27:49 And one really tough thing here also is what you need to do is when you’re building your training data set, you have to be very careful about something that’s called a data leakage. And machine learning where you don’t want to have information about the future, will leak into your historical data. And so a really hard challenge for data scientists during this process also is how do I construct my training datasets when I have all this exported data in such a way where yes, I’m, I’m trans I’m transforming this data into feature values with different types of data transformations, but how do I do it in a way that tells me what that data, what that feature value was at that point in time and not just what that feature value is now, because the model needs to be able to train on, on what the model knew would have known at that time when we was making a prediction in the past.

Mike Del Balso 00:28:43 So there’s a lot of steps there, right? Now, let’s talk about if, for some of our data scientists who are using feature stores for their model training phase. So, the first thing to know is if this is your, you know, your second model in the company, that means a lot of the features that are already available that have already been built in your organization are already available in a standardized way for training purposes. So, if I’m building a second fraud model, I can go and say, Hey, give me all of the user features. This could be hundreds or thousands of features. Give me all of the, all of the transaction features. And so maybe I’m already hot starting my data science project with a thousand really high-quality features, which is, which is really nice. But I also tend to, for every new use case, I have to build a bunch of new features as well.

Mike Del Balso 00:29:31 And so I, I will be defining some, pretty straightforward feature definitions in a, which is basically like, like writing some data transformations into Tecton framework, a feature store framework that, that essentially defines a data pipeline. So, I’m going to define a data pipeline. And then when I’m trying to extract, so I’m trying to generate that training dataset, right? So, I register the data pipeline into the feature store, and now I want to generate a training dataset, what Tecton and feature storage generally you should do for the data scientists is that they handle all of that historical back-filling of feature values. They handle joining all of the features, the feature values together as of specific points in time. So, they really like protect against a lot of these really nasty time issues with data, that data scientists always grapple with. And sometimes don’t realize really easy to make some errors here and generate that training dataset for the data scientists.

Mike Del Balso 00:30:29 So the key points is that it allows data scientists to reuse a lot of their work and hot start new efforts. And then also like simplify a lot of the process of bringing this data together. But the, some of the larger benefits are really felt at times when now we’re productionizing things, because in the status quo today, you know, you go to the data, scientist goes to their engineer and says, all right, cool. I have found my model, let’s productionize this, and what are the, what does the engineers say? Okay. So, like, what did you build and what should I, what should I build? And you say, okay, so I have this CSV that I, that I kind of downloaded from a couple of databases, and I wrote this Python script to generate my model. So, figure it out, you know, and the data, the data engineer then has to understand, okay, how was that CSV created?

Mike Del Balso 00:31:18 Because that data engineer has to not just read from that CSV, but read from those original data sources so they can provide up-to-date data. They can always use the freshest data to have updated predictions, right? So, they have to rebuild these data pipelines and connect to the original data sources and reimplement all of the data transformations that you had defined in your, in your Python notebook. This is, this is a painful process, and it obviously sounds painful, but it’s like, it’s, the problem is pretty tricky for larger organizations, where if we talk to some of our kind of larger enterprise clients, they were having challenges where these challenges plus governance and compliance workflows that they deal with in their large organizations means this stage can take 6 or 12 months. And what they really like about the concept of a feature store is all of that is just automated for them. Actually, this problem was solved during the development time when I defined my feature and registered my feature into the feature store back when I was training a model, that production, that set up a production data pipeline automatically. So now when it’s time to productionize the model, all the data pipeline productionization is already done, and the data engineer is not needed in that loop. So that’s a real, that’s like the real area for simplification here.

Akshay Manchale 00:32:39 Yeah, that sounds great. Let's take Tecton itself as an example. Can you explain how it connects to various data sources and then prepares this feature that the data scientist actually cares about?

Mike Del Balso 00:32:50 So Tecton has a couple of components to it. At a high level, Tecton has a few different elements. It has transformations, which you can basically think of as the data pipelines that it runs. It has a storage layer, and it has an online serving layer. And each of these elements is configured by a feature definition. So, the whole goal is for a data scientist to be able to self-serve, to autonomously define: this is the signal that my model needs. They should be able to describe that feature, and Tecton can translate that into a configuration for the data pipelines, for the storage and data management layer, and then for the online serving. So that configures both the development pipeline and the production pipeline. And so, if in Tecton you define a signal, let's just choose one of the examples we just talked about, like how many orders did this restaurant get in the past 30 minutes, right?

Mike Del Balso 00:33:53 That’s a very common type of some type of feature pipeline that someone would want in their feature store. What Tecton does is it takes that definition that you define. So, you define a SQL query, or you may define, you may configure something in our library through our DSL to specify like a special type of aggregation, but basically like the definition of that transformation. And then you also write some of the metadata, like some specific configuration code that would say, Hey, I need this available online. This is how I want this signal backfield. Like how far back in time might, I want to train a model using this signal and a lot of metadata, like who’s the owner for this, feature I’ll trust it as this feature, is this an experimental feature or is this a production feature? And register that in Tecton. What Tecton does is it takes that, that configuration and we’ll do a couple of things.

Mike Del Balso 00:34:46 One, it'll orchestrate compute jobs based on that. So, for example, with this streaming aggregation feature type, it will talk to your Spark cluster and create a streaming Spark job that will connect to a Kafka pipeline with your events and maintain these aggregations. It'll run these aggregations and keep up to date, you know, the number of orders in the past 30 minutes. Those values will also be fed into Tecton's storage layer, which is kind of what people sometimes think of as the feature store. And the storage layer is twofold. There's an offline storage layer, which is used to store all of the historical feature values for a feature. So, every time there's a new version of an updated feature value, it stores it in the offline store. And that's used for model training and model retraining later on.
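The streaming aggregation Mike describes would in practice run as a Spark job over a Kafka topic; as a toy illustration of the logic it maintains, here is an in-memory sliding-window count. All names here are made up, and a real implementation would be distributed and fault-tolerant.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 30 * 60  # the 30-minute window from the example

class SlidingOrderCount:
    """Toy stand-in for the streaming aggregation a feature store
    might run: orders per restaurant over the last 30 minutes."""
    def __init__(self):
        self._events = defaultdict(deque)  # restaurant_id -> timestamps

    def record_order(self, restaurant_id, ts):
        self._events[restaurant_id].append(ts)

    def count(self, restaurant_id, now):
        q = self._events[restaurant_id]
        # Evict events that have fallen out of the 30-minute window.
        while q and q[0] <= now - WINDOW_SECONDS:
            q.popleft()
        return len(q)

agg = SlidingOrderCount()
agg.record_order("r1", ts=0)
agg.record_order("r1", ts=1000)
agg.record_order("r1", ts=2000)
print(agg.count("r1", now=2000))  # 2 (the ts=0 order is outside the window)
```

The up-to-date counts this produces are what would be fed continuously into the storage and serving layers.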

Mike Del Balso 00:35:45 But that data also goes to the online store. We're pluggable, so different technologies can implement our online store, but it tends to be a key-value store that's optimized for fast retrieval, and it will only hold the latest values of these signals. And then we have a serving layer, which allows that production model to query Tecton through a REST API and say, hey, I'm fraud model V2. This user is trying to log in right now. I need all of the signals that the feature store has prepared and maintains fresh for fraud model V2. What are all the features for fraud model V2? And I need them within 50 milliseconds. And Tecton will deliver that to the prediction system, to the model system, which may live outside of Tecton.
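The serving path Mike sketches, a model asking for all its registered features for one entity, can be illustrated with a minimal lookup. The store layout, model registry, and function name below are assumptions for illustration; in production the store would be a low-latency key-value system behind a REST API.

```python
# Hypothetical online store: latest feature values keyed by (entity, feature).
online_store = {
    ("user:42", "login_count_7d"): 13,
    ("user:42", "failed_logins_1h"): 2,
}

# Which features each registered model depends on.
model_features = {
    "fraud_model_v2": ["login_count_7d", "failed_logins_1h"],
}

def get_feature_vector(model_name, entity_key):
    """Return the latest values of every feature this model needs
    for one entity, as the serving layer would at prediction time."""
    return {
        f: online_store.get((entity_key, f))
        for f in model_features[model_name]
    }

print(get_feature_vector("fraud_model_v2", "user:42"))
# {'login_count_7d': 13, 'failed_logins_1h': 2}
```

The prediction system, which may live outside the feature store, would call an endpoint like this for each incoming request.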

Mike Del Balso 00:36:35 It can be any kind of model prediction system you have. And that enables these real-time predictions. So, the value proposition that we deliver to organizations that are trying to build these complicated real-time machine learning applications is that Tecton is a data platform that simplifies all of the data pipelines that need to be built and deployed for real-time ML. And it really simplifies the development and productionization workflows there. So, a data scientist can not only build a prototype, but they can run a model in production on their own as well.

Akshay Manchale 00:37:10 So let's say you have multiple data scientists who are publishing these features in Tecton. How do they go about discovering that someone else has actually prepared this feature for me, or that I can work off of someone else's pipeline and the feature they built, for my model, because I think it's useful?

Mike Del Balso 00:37:28 Yeah. That's a really good question. And it's a pretty important capability of the feature store to eliminate duplication. So, there's a couple of reasons why you might want to do this. There's obviously the cost component, where you just don't want the same compute running twice. These pipelines can be pretty expensive, right? They're heavyweight data processing jobs. Often, you just don't want the same things to be running again and again for no reason. But secondly, it can be confusing to people if one signal is implemented twice, but slightly differently, and they have different implementations. That can lead to people not recognizing that they're actually implemented differently and not being careful about which one they're using. And you may train a model on one version of a signal, but then if you're trying to make a prediction with another implementation of that same signal, and it's slightly off, your model could misbehave.

Mike Del Balso 00:38:25 And that may be hard to detect, but it will lead to model performance issues. So, I'm bringing that up because there's a lot of reasons why transparency is really important, discoverability is really important, for these signals for machine learning. And so, what the feature store does is provide a catalog of these production-ready signals. This is largely a knowledge management issue: I'm a data scientist, how do I find which features are already productionized that I can use today? There's literally a search interface in a web UI where I can see a catalog. I can browse, search by user, search by feature type, search by entity. So, I can say, what are all the user features? What are all the item features? At Uber we had, what are the driver features and the rider features, stuff like that. And I can also see a variety of pieces of metadata.

Mike Del Balso 00:39:10 So, Hey, this is an experimental feature. You may not want to build your model based on this. Or this is a solid production feature that 10 other models depend on. So, you’re going to be fine if you depend your model on that. That kind of like visibility and discoverability is critical for allowing machine learning to scale in a company. Because once you have that, then you can kind of enable the reuse and sharing that we were talking about before. So now, and a half of my second machine learning model, I don’t have to start from zero. I can go to the feature store and say, Hey, what are the list of features that are kind of relevant for me that I might want to try out instead of me rebuilding a thousand signals, let me just reuse the top 1000 signals that my company has already put together. And maybe that’s all I need to do. And maybe I can finish this machine learning project in one day instead of a couple of months.

Akshay Manchale 00:39:59 So this is where the value is: having your machine learning idea go from a year, waiting for data engineers to put the pipelines in place, down to, you know, three days into production. And that's really nice. One thing you mentioned was the storage aspect of it. Can you elaborate on what the storage looks like? You said there's batch, there's real time. So, is it going back to the source system? Or is it a separate database or something? Can you talk about that?

Mike Del Balso 00:40:27 Yeah. Good point. So, the way this is implemented: there are two types of storage in a feature store. There's an offline storage layer, which holds the historical values, and the online storage layer, which holds the latest values but is optimized for fast retrieval. And so, Tecton is a cloud-native solution that reuses a lot of cloud managed services, but it's built in a very pluggable way. So Tecton doesn't have a proprietary offline storage layer or a proprietary online key-value store or anything like that. What we do is plug into your infrastructure. So, if you use Snowflake, if you use a data warehouse, we will use that data warehouse as the offline storage layer. So, all the feature data will just look like normal tables in your Snowflake or your Redshift, for example.

Mike Del Balso 00:41:16 And if you have a Redis cluster that you want to use for that, you already are using your applications already connecting to for online serving, we’ll use that as the online storage layer. Or, we come out of the box with, we will configure like a Dynamo and use Dynamo and AWS to serve this stuff. So, you can think of the feature store as like a coordinating layer that brings together a lot of pieces of infrastructure that is already available on the cloud or in your, in your account to allow you to allow your data scientists, to configure these data flows that become coherent production, data pipelines for machine learning applications and coordinate all these different pieces of infrastructure to enable that. Does that makes sense?

Akshay Manchale 00:42:00 Yeah, that makes a lot of sense. I think that is where a lot of the complexity would be for a data engineer trying to deliver a model to production: can I actually have a real-time key-value store or something that gives me online data? How do I configure that? How do I maintain that? What are the needs? So, I can see the simplification with having a feature store that's able to plug into your infrastructure and then serve that. And that's great. So, let's talk about the quality aspects of it, right? You train your model and then, you know, you want to generate V2 of your model. How do you monitor how your V1 is doing, and how do you transition into saying, I'm going to actually do V2, because V1 is deteriorating in terms of what it's serving?

Mike Del Balso 00:42:41 Yeah, I think there's a couple of questions there. The important thing for everyone who's running machine learning operationally, running machine learning in a way that has real business impact, and I think this is very important to do, is to measure the performance of your model. And that can be harder or easier depending on the use case. So, here's an example of where it's really easy. I'm building a recommendation system, and I'm going to predict if you're going to click on this recommendation, and immediately after, you know, in five seconds, I'm going to know if you clicked it or you didn't click it. It's very easy for me to then calculate my accuracy, right? But if I'm predicting fraud, for example, you know, maybe I only find out that this was a purchase made with a stolen credit card.

Mike Del Balso 00:43:28 90 days later, when a credit card company comes back to me and does a charge back or something like that. So that’s a whole other data engineering challenge to like, how do we associate that data? We collect 90 days from now back with this prediction log that we may, you know, from, from the past. There’s not just my point is that it’s not just like one way to measure accuracy in a super simple way, but once you can bring that data together, then there’s a couple of things that you want to do. You want to have a good sense of what the steady state performance is and track the performance of that model, but you also want to have a good sense of like how I could improve this model. And so, you asked like, Hey, how do I know when to retrain it?

Mike Del Balso 00:44:06 There’s, there’s a couple of, you know, there there’s two kinds of questions there. It’s like, how do I know when my system stops operating correctly? And how do I know when I have an opportunity to an opportunity to improve this system? And I think practically right now, the opportunity to improve this system is driven by intuition by people. So, it’s, Hey, we have, Hey, we just realize we have this data source in the company that we’re not using in this model, but it’s kind of like, it kind of makes sense that it could help us make better recommendations. So maybe we want to test it out and let’s run a, let’s build a prototype where we joined in that kind of data and train a model on that kind of data. And, you know, sometimes, and a lot of use cases, you can only really test these things out when you put them in production. So, that’s kind of like, how do we know how to improve the system when we’re thinking about how to, how to detect issues, that’s where you have to monitor that accuracy signal over time. And if that accuracy signal decreases in some way, that might mean you need to retrain your model, or that might mean there’s a challenge, there’s a breakage somewhere in one of your data pipelines or a data quality issue.

Akshay Manchale 00:45:12 So when you say you want to retrain your model: you also mentioned earlier that you store historical data in offline storage. Is that offline storage used predominantly for retraining, or is it also there because you can run batch jobs on it to serve your model?

Mike Del Balso 00:45:28 The offline storage is used mainly for model training. So, it can be for training new models or for retraining a model, right? The feature store doesn't care; it's just generating a training data set. But that's why we have these historical values in the offline store, because it allows the feature store to easily create a training data set that says: for each time someone logged in in the past, for example, if that's what we're predicting, that's the granularity of our predictions, what was the feature's value at that point in time in the past? And that allows you to easily generate a training data set that doesn't have any label leakage. It doesn't have any bad feature leakage. And so, you can easily create high-quality training data sets without a lot of the data science complexity that usually comes along with that.
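The leakage-free training set Mike describes relies on a point-in-time ("as of") join: for each past prediction event, you look up the feature value that was current at that moment, never a later one. Here is a minimal sketch of that lookup; the timestamps and values are invented.

```python
from bisect import bisect_right

# Feature history from the offline store: sorted (timestamp, value) pairs.
feature_history = [(100, 1), (200, 5), (300, 9)]

def value_as_of(history, ts):
    """Latest feature value at or before ts, or None if none existed yet.
    Using only values at or before ts is what prevents feature leakage."""
    times = [t for t, _ in history]
    i = bisect_right(times, ts)
    return history[i - 1][1] if i > 0 else None

# One training row per past login event, each with point-in-time features.
login_events = [90, 150, 250, 400]
rows = [(ts, value_as_of(feature_history, ts)) for ts in login_events]
print(rows)  # [(90, None), (150, 1), (250, 5), (400, 9)]
```

Note the event at time 90 gets `None`: the feature did not exist yet, and a naive join that grabbed the latest value would have silently leaked future information into training.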

Akshay Manchale 00:46:15 Yeah, that's great. So, the feature store concept itself is fairly new, right? There's a couple of players in the field. Tecton is one, and I think Amazon has SageMaker, Databricks has one. What are the common components of a feature store that you see, that everyone agrees on? I'm sure everyone has their own value proposition, but what makes a feature store a feature store, that everyone agrees on?

Mike Del Balso 00:46:38 The core thing that everyone agrees on is that there's an online and offline serving layer. And so, you know, this is a new category. Actually, most things in the ML ops world and the ML stack are very fresh. And so, a lot of different teams are figuring out exactly what is the scope, what are the boundaries of a specific product. We developed our feature store at Uber, and we coined the term feature store, and we've been working on feature stores for a long time. So, we're pretty confident about Tecton's kind of geometry. But if you look across every product that uses the term feature store, what they have in common is they have a way to retrieve features quickly and at low latency for an online prediction, and they have ways to get historical data from an offline storage layer. And that's kind of the crux of the feature store: managing that kind of data duplication across these different types of data infrastructure.

Akshay Manchale 00:47:39 So a feature store kind of connects various systems across your company, right? Your model might be really critical for your business. You might have an SLA that says, you know, we never want this model to be down. But it also has dependencies on upstream systems. Maybe there's a database that does not have the same guarantees that it's going to be up all the time. Maybe it's going to be down every Sunday for a batch job or an upgrade or something. So how do you tie the availability of your models to upstream systems that actually have the data somewhere?

Mike Del Balso 00:48:10 Yeah, so that's part of the feature store's goal: to provide an abstraction for the data scientist, to hide the complexity of the upstream data infrastructure. So, when a feature is defined in a feature store, that's when you address some of that complexity, and you say, you know, how often should this feature be recomputed, where's this feature coming from, and some of the other details about how you can access this feature depending on the context. But then after that feature is registered in the feature store, to the person who's using the feature store it should look just like any other feature. So, it doesn't matter if this is a real-time computed feature or a streaming feature or a batch feature. They're all just productionized features available within this common machine learning focused catalog that I can use to build an offline dataset, or that I can depend a production model on. And that kind of standardization, that abstraction, enables the standardized consumption of these data. And that hides a lot of the complexity and speeds up the data scientists in a pretty significant way.

Akshay Manchale 00:49:28 Let's talk a little bit about the compliance and data governance aspects of it. A lot of companies, especially large ones, face huge implications if they don't actually follow some of the data governance standards. Maybe a user comes back and says, hey, under GDPR or California privacy laws, you know, I want this data about me to be deleted. How does that flow into this entire pipeline of systems that you have, where it may impact the predictions that you can do? It may impact, you know, my experience. How does this tie together?

Mike Del Balso 00:49:56 Yeah, there are two approaches, right? One approach is to build specific capabilities for every single type of regulation. And you know, what that means practically is to build specific capabilities for very common, prominent pieces of regulation. And the second is to make knowledge management much easier here. And the second approach is the easier one to take on, and that's where we are, and frankly, where most of the feature store industry is right now. And it's about tracking metadata. So, we give you the capability to tag different data resources with, hey, this contains PII, this is sensitive data, this is how this data should be treated, this is the policy that is associated with this upstream data set. And the feature store ensures that downstream data sets are similarly tagged according to those policies. And it leaves it up to the consumer or the data team to make sure that those data are being handled appropriately, right?

Mike Del Balso 00:50:57 It’s not, it’s not like, the actions are not being taken within the feature store, but at least the knowledge is being conveyed accurately. But, but there’s more that a feature store can do, like for example, providing a GDPR deletion API. So, there’s a variety of GDPR, deletion requests, can the data scientists build a way to automatically, this is part of like the maintenance of, a data pipeline for ML automatically handle the GDPR requests for their ML pipelines as well. What does that mean? Okay. User X needs to be deleted. Well, we need to wipe them from all historical training data. We need to wipe them their features from the online feature store. We’ve just got to basically purge them from the system. And that introduces a lot of yet unsolved machine learning challenges. Because if I train my model on this data set, and now we’ve deleted an item from there, can my model be retrained now? Right. And what is the implication on my model? If it cannot be exactly retrained because we’re not allowed to have that data anymore. How do I go to retrain my model? How do I generate that training data set? A lot of these questions are handled on an ad hoc basis right now. And we’re working with our customers to build in the right kind of patterns out of the box to solve this stuff for them.

Akshay Manchale 00:52:12 That's great to hear. I wanted to start wrapping up. Can you talk about what's in the future for feature stores? This is such a new thing right now, so I'm sure there's going to be a lot of change. What do you think is going to be the future of feature stores?

Mike Del Balso 00:52:28 Yeah, I think the immediate future for feature stores is to continue growing. Frankly, a lot of people are just learning about feature stores today, but feature stores have been in use for years at some of the top ML companies in the world, you know, all of the top tech companies that you can imagine. And so what's next, for us and for feature stores generally, is education: just making sure people understand what feature stores are and making it as simple as possible for people to adopt. There are some really cool things that our customers have asked us for that are likely coming next. Some interesting things are, for example, one: how do I automatically create great features from my data? Can we auto-featurize upstream data? Secondly, can I have features in my models that do not come from my data, but come from someone else's data?

Mike Del Balso 00:53:21 So maybe there can be some third-party features that provide some context like, you know, some really important things for the Uber Eats models that we were talking about before is like, what’s the weather and what is the traffic outside? Maybe that would be really great for me to use those features and my model out of the box. Right? And that’s something that kind of like a notion of a feature marketplace could be really useful in a feature store. What’s next for us is just keep on, keep on working with our different customers, still a lot to build there’s a lot to do, to support additional use cases. There’s a huge variety of requirements in the across ML use cases. But then there’s some of these really interesting kind of next generation things that can expand the value propositions of feature stores. I just mentioned a few of them, but some other announcements will be coming out soon.

Akshay Manchale 00:54:07 Well, it’s been great talking to you, Mike, thank you for coming on the show.

Mike Del Balso 00:54:09 Akshay, thank you so much.

Akshay Manchale 00:54:12 This is Akshay Manchale for Software Engineering Radio. Thank you for listening.

[End of Audio]
SE Radio theme music: “Broken Reality” by Kevin MacLeod (incompetech.com)
Licensed under Creative Commons: By Attribution 3.0

