Episode 433: Jay Kreps on ksqlDB

Filed in Episodes by on November 6, 2020 1 Comment

Jay Kreps, CEO and Co-Founder of Confluent discusses ksqlDB which is a SQL engine for data in Apache Kafka. Jay talks about stream processing, Kafka and how the data can now be queried with push/pull queries with ksqlDB, similar to a relational database. Jay discusses some of the similarities and differences between SQL databases and ksqlDB and outlines how ksqlDB simplifies building real time applications over streaming data with a simple query model. Jay talks about its capabilities with examples of how it can be used to build real time applications, its internals and how it fits into the Kafka ecosystem.

Related Links

SE Radio theme music: “Broken Reality” by Kevin MacLeod (incompetech.com)
Licensed under Creative Commons: By Attribution 3.0

 View Transcript

Transcript brought to you by IEEE Software

Akshay Manchale 00:00:55 For software engineering radio. This is Akshay Manchale. Today. We have cheek reps and we’re going to talk about ksqlDB. Jay Kreps is the CEO and co-founder of confluent. Prior to confluent, he was the lead architect for data and infrastructure at LinkedIn. He’s the initial developer of several open source projects, including Apache Kafka. Jay was previously a guest on se radio on episode one 62 and more recently on episode three 93, Jay, welcome back to the show. Thanks so much for having me. So before we get into ksqlDB, let’s talk about a few basics. Let’s start with Kafka. What is Kafka?

Jay Kreps 00:01:28 Yeah, that’s, that’s a good question. So we, we, we call it, uh, an event streaming platform, but that kind of begs the question of what the heck is a event streaming. And, and so, you know, it really is kind of a new category of thing. That’s come around. Some people compare it to being a bit like a kind of message, cute that, that you send messages to, and it kind of cues them up and sends them out. But it’s a bit, it’s a bit more than that. So the idea is, you know, Kafka acts as a kind of distributed cluster and you can read and write streams of events and by events, I mean, you say something happened and it gets recorded as a kind of linear stream, almost like a array of these events that just keeps growing and growing. And the rights all go to the end of the array and readers can, can read along in that stream.

Jay Kreps 00:02:15 And so, you know, those, those events could be anything. They could be things happening in the business. It could be, you know, sales that are occurring, uh, you know, messages of all kinds, you know, so this is a very new thing. Uh, Kafka itself has become quite popular now. So it’s, you know, it’s out in the world, you know, really at large scale, we’re, we’re doing our Kafka summit, uh, this year. And, you know, it looks like we’ll have over 25,000 people who are signed up and get to attend that, uh, virtually. So, so, you know, it’s really become a popular, um, layer for building these kind of asynchronous event driven microservices that communicate through these events, it’s become a popular for building, you know, real-time low latency pipelines for data that, that, you know, stream things into, uh, different data systems or SAS layers or API APIs.

Jay Kreps 00:03:07 And it’s become, you know, kind of part of the modern stack in, in a lot of ways. And so, so that, that’s kind of the basic concept. You know, you can interact with Kafka through a couple of different APIs, there’s these lower level APIs where you would produce or write data into one of these streams. There’s a consumer API where you would, you know, read or subscribe to data that’s being written. And then, then there’s API APIs, which allow you to use prebuilt connectors that, uh, hookup and maybe capture changes occurring in a database, or publish changes out into some data store or SAS application. And there’s hundreds of these pre-built connectors for different types of system and applications. So you can kind of hook them up and turn their data into streams or take streams and Kafka and, you know, send it off to other other systems.

Jay Kreps 00:03:57 And then there’s a stream processing API APIs in cockpit, which allow you to take these streams of data and react or respond or process them as they occur. So you could imagine, you know, maybe you have a stream of the sales that are occurring in some e-commerce application, and you can imagine, you know, uh, triggering order fulfillment center, counting how many sales you had by geographical region as, as the sales occurred. And so those are the different ways of interacting with these streams of data. And so that’s, you know, that’s kind of the, the 32nd in a nutshell explanation of, of what kinds.

Akshay Manchale 00:04:32 Great. So I keep hearing things like Kafka topics. So what’s a Kafka topic. Yeah.

Jay Kreps 00:04:37 Yeah. So, you know, topic is kind of a terminology we borrowed from the messaging world. It’s really just the name for one of these data streams. So, you know, maybe I would have in this e-commerce application, I might have a sales topic and there might be many ways that a sale occurs, right. Maybe it occurs on an iPhone app or it occurs, you know, through some backend system or it occurs, I don’t know. And you can imagine each of those sales being published out to this topic, which means really just appended to that kind of infinitely growing array that I described, and that might trigger all kinds of activity on the backend. So there might be a system that does order fulfillment and every time a new sale is published, it would react and whatever, you know, is needed to send the information off to a warehouse and get that thing into a box and get it set off that it would be responsible for that. Maybe there’s something that updates the information, uh, for customers, maybe there’s something that updates, uh, you know, the kind of coupons and whatever, and, you know, sets up the returns, whatever, whatever it is that you would do, maybe there’s analytical systems that provide reporting that would need to update. They would all subscribe to that feed of sales and react to it. And so the name for that feed or stream is a topic in Kafka.

Akshay Manchale 00:05:54 Interesting. So what’s different Kafka compared to a message broker that moves data from a publisher to a subscriber.

Jay Kreps 00:06:02 Yeah, that’s, that’s a great question. So, you know, w when we were building in this area that there were a whole set of different kind of cues and messaging, layers, you know, the, the kind of foundational technology differences in Kafka that make it a little different. Uh, I’ll describe, and then I’ll describe kind of like from the user’s point of view, what do you get that it’s different? So as a technology layer, what’s really different is, you know, Kafka is storing persistent, replicated, uh, events or messages, right. And it can store them really for any amount of time and you can hold lots of them and you, that means it, everything is kind of multi subscriber. Like any of these topics, it can have zero subscribers or consumers. It could have dozens, hundreds of subscribers or consumers. And so, you know, this, this idea of really orienting around events being, uh, always multi subscriber, being persistent, and then building around kind of a modern distributed system that allows this to scale out horizontally and do so, you know, more or less tickly and dynamically, all of that, that’s fairly different from a traditional message queue, which is more oriented around, you know, individual servers that would hold messages, where they would put them on disc only, uh, if needed.

Jay Kreps 00:07:19 And, you know, beyond that, you know, Kafka doesn’t really provide a queue abstraction when you read a message, it doesn’t, you know, pop it off the queue instead it has this kind of stream. And, uh, it turns out that this is, you know, those qualities end up really important. If you’re trying to build a reliable data pipeline that guarantees the delivery, it turns out is really important. If you’re trying to hook up multiple applications or systems within a company and guarantee that they all get it, it’s operationally very important, just because you don’t want something where if the memory fills up and it goes to disk, it suddenly gets 10 times slower and tips over if for all those reasons, I think, you know, Kafka has ended up being a much better platform for a large scale usage within a company. And then it also turns out, and this was part of our motivation that this abstraction is a much better foundation for stream processing for working with this data in real time.

Jay Kreps 00:08:13 So if you think about it, the message queues, it’s a pretty low level abstraction. You can kind of put a message in the queue. You can take it out, but everything beyond that is kind of up to you to go figure out in your code. And a lot of the things we do with data, um, there actually are common patterns and that’s why databases, you know, they don’t just have put and get, they have, uh, also a bunch of functionality that allows you to compute things off of the data. And that’s, you know, I’m sure we’ll get into that when we talk about ksqlDB and stream processing, you know, it’s very, very hard to build that on top of a traditional queue. And in Kafka was really designed with that in mind. Actually, we started with the idea of, Hey, it would make sense to do a lot more real-time processing on data. How would we do that? And then we realized, well, the first thing you need, if you want, you know, stream processing is you need streams and you need them at scale across the company. And that, that was actually what motivated a lot of the early Kafka

Akshay Manchale 00:09:05 With that information. Let’s jump into ksqlDB, which is our main topic. So what is ksqlDB?

Jay Kreps 00:09:12 Yeah, that’s, that’s a great question. So I started to touch on stream processing and, you know, the, this, you know, it’s probably worth just describing what I mean by that. And you know, what the difference is from batch processing or more traditional databases. So, you know, I’ll explain it by analogy, you know, the way databases typically work is they have kind of a pile of data. Uh, whenever you run a query, they actually need the data to be static and unchanging. And the query is, is just kind of at a point in time. And so database would usually lock that table or have some kind of concurrency mechanism to get a view on that, that doesn’t change. It would scan over it and compute the results of your query and it would give it back. And so, you know, the assumption is that query is just a very transient thing, and it gives you an answer at a point in time.

Jay Kreps 00:09:59 Of course, that that answer is potentially immediately out of date. It doesn’t get updated, but, but that actually makes sense. If you’re, if you’re building a UI driven application, you kind of want to know the answer. As of right now, you’re not really, you don’t really care what the answer is in the future in a lot of cases, although, uh, even UI is changing a lot as we have these more dynamic UI is that, you know, get push updates sent to them. And so, but that that’s the traditional model for databases. And so the analogy, I think that makes it easiest to understand stream processing is, you know, in the U S they have a census that they do every 10 years and the way the census works is they send these people out with, uh, binders and they, they go door to door and they record everybody’s name and the address.

Jay Kreps 00:10:43 And that’s their way of counting how many people there are. And you can think of that as a kind of batch process, right? It’s scanning all the data and it’s saying, okay, here’s a person, here’s another person, here’s another person. And then when you want to get an update on that, you have to kind of go back and you gotta scan all the data again. And, um, so if you, you know, if you explain that to a software engineer, they’re like, Hey, that’s kind of funny. You know, I don’t know if I would build it that way. You know, probably what I would do is I would record all the births and the deaths. And if you have that stream of births and deaths, you could keep kind of a running count of how many people there were. Right. And, um, that’s exactly what stream processing is.

Jay Kreps 00:11:21 So the births and deaths, those are vets, right? Those are big events in people’s lives. You could imagine having attached to that, all the information about the person who was born or who died, right? So everything you would want to know in a census there, uh, gender, ethnicity, whatever. And then you could imagine keeping a kind of running count on top of that. That would say, okay, you know, now in mountain view, California, we have 75,692 people, you know, a baby is born. Uh, now it’s, uh, 75,693, right? So, so that’s, that’s the fundamental idea behind stream processing. And you can think of it another way, which is just if you have this Kafka cluster, and it has all these real-time streams of data, you want to do things with that. You want to harness those calculate things, take one stream of data, translated into something else.

Jay Kreps 00:12:05 And stream processing is kind of how you do that. And so, you know, Kafka has a set of lower-level APIs that help to do this. They help you to do real time, uh, aggregation, kind of continuous running couch on data. They help you to do joints. So in our e-commerce example, maybe you have this stream of sales, but you also have information about the customer and you need to put those together for fulfillment person purposes. Okay. This sold, but where do I ship it? Well, I need the address. I got to join on the information about the customer. Maybe you’re just doing simple filtering. Maybe you’re looking for those sales that are above a certain amount, all the kinds of things you might do in a database. So there’s, there’s kind of low level capabilities for doing that in Kafka. And so what, what isksqlDB?

Jay Kreps 00:12:48 Well, that’s really our attempt to bring this out in a more database like format. So in a database, uh, you can run a SQL queries and, you know, a lot of us know SQL it’s, it’s, uh, it’s not a perfect language, but it’s an easy language. It gives a lot of the expressive, uh, activity of what people want to do with data. And, and, and you wouldn’t, you wouldn’t necessarily know if it was applicable to this kind of stream processing, but it turns out it, as it turns out, it’s actually a great way of doing real-time processing. And what we’re trying to do in ksqlDB is really in some sense, bring together what we think are the two sides of databases. One of these sides is really well-developed, which is if you have a pile of data, can you do lookups and quarries on top of it?

Jay Kreps 00:13:33 And in fact, that that’s kind of the only thing we know in databases, but the other side is if you have a flow of data, can you calculate and update things off of that? And that’s always been a little underdeveloped in databases. They have some features like triggers or materialized views that are, you know, kind of very simplistic, but it’s not actually, you know, expressed, uh, with the full power of SQL on those streams. And so an example of this would be in a data warehouse, you know, once you get all your data in the data warehouse, the data warehouse gives you all this wonderful query functionality to do aggregations and lookups and things on top of it, how you get the data in and how you get it into the right format is a little bit of a dark art, right? Like there’s usually some hacked systems that shove things in.

Jay Kreps 00:14:17 Then you kind of load, you know, run some big batch processes on top of it. But you can imagine a different world, which is, you know, the data and the data warehouse is always up to date. There’s quarries that run on that data as it flows in. And there’s quarries that do look ups to run reports and so on. And that kind of illustrates, I think the two sides of this quarry dynamic, the, you know, changing data as it flows and, and looking up data as it’s, as it sits. And in ksqlDB, those are represented with two concepts, um, streams and tables, and you can query both of them and, you know, stream is this kind of continual set of changes. As you know, Kafka topic and tables are aggregates, we’ve calculated, right. Or, you know, lookups, it’s something that’s keyed, right? So you might have a table, that’s all your customers by some primary ID, you know, ID, and you might have a stream of all the sales that occurred in a data warehouse, they would call these facts and dementia.

Jay Kreps 00:15:13 So they have this idea of facts, which is basically kind of events, although you can’t really subscribe to them. And they have dimensions, which is the kind of store data about customers and products and so on. So those are the concepts. And then, you know, we, we support this by having the idea of push quarries and an idea of pull quarries. So the kind of traditional query where you go do a quick lookup, that’s a Polk where you’re kind of pulling the data out. And the other kind of queries are push query that runs as the data kind of pushes in, uh, to the system. And by putting those together, it’s actually really powerful. You, you can imagine use cases where you’re kind of building some cash, maybe you’re taking feeds of data out of different data systems and putting together some, uh, aggregate view of your customer or some premature realized, uh, view of data that you need to, to serve some use case. And that that view gets updated every time the underlying data changes. And then you can do lookups against it, uh, to get a particular row or value. Does that make sense?

Akshay Manchale 00:16:07 Yeah. So when you see a pool query, when I do the same thing on a traditional database, is that the same as a pool query? When I say select star from table? Is that what a pool query is and ksqlDB?

Jay Kreps 00:16:18 Yeah, that’s right. That’s right. Yeah, that’s right. And so, so the difference now, um, with this kind of streaming query is that you can now have subscriptions that continue to update as the underlying data changes. So, so, uh, you know, traditional query is kind of a point in time thing, like, Hey, what’s the value right now, but you could imagine for systems that are actually subscribing, you need not just the point in time, but also the ability to get all the changes. And so you might say, Hey, give me the count of the number of people in every city, but instead of just getting it right now and having it be out of date, um, you would get it and have it update as people moved between cities or as, as people were born or, or died. And that’s, that’s basically the idea of extending databases for stream processing, for events and streams.

Akshay Manchale 00:17:04 So in that example, let’s say I have a dashboard that has displaying the running count of people. It’s just a single query that I’ve ate. And the changes are just automatically coming into the client with a push query. Is that the sort of the idea there

Jay Kreps 00:17:17 That’s the basic idea is that, is that you can both get a point in time, look up as well as continually continually subscribe and calculate changes.

Akshay Manchale 00:17:26 So what’s a table in ksqlDB. Is that traditionally the same as what a relational database looks like? Um, does it have a schema and types and all of that?

Jay Kreps 00:17:35 Yeah, that’s right. So, so the, the type system and the SQL is, is very similar to what you would expect to in a traditional database. It’s just that now, when you, in addition to saying, create table, you know, with the appropriate attributes, you can also say create stream, and you can think of a stream as being like a kind of table of immutable data that, that has a strong ordering that’s, that’s only upended to over time, you know, effectively, it’s like a fact table in a data warehouse and, you know, uh, a table in ksqlDB is very much like a table and Postgres, you know, it’s something that typically has some kind of primary key and you would insert into it or, or, or change it, or look up permit or modify it in a continuous way. And the key thing that KC equal DB allows is for you to go back and forth from streams to tables.

Jay Kreps 00:18:21 So in that example, I gave of, of e-commerce. You can imagine having this stream of all the sales that are occurring, you could compute off of that. You know, you could take that and you could take maybe also the stream of shipments of products that are happening and using those to you could compute what’s my inventory on hand. And so the stream of shipments that’s like an event, right? Uh, this, this was shipped, this arrived, a stream of sales was like also events, this sold, but inventory is actually a table it’s like for each product. How many of these do we have on hand in each location? And it’s going up and down as the shipments are occurring. And so, you know, the, the cool thing about this is it allows you to go back and forth between these, this idea of streams and this idea of tables. And I think it’s actually a very powerful generalization of data databases and how we work with data that really has a lot of applications. You know, we see the use cases being this kind of streaming data pipelines, like real time pipelines, where you’re capturing data from different systems and transforming it as well as like materializing, these caches new use cases around security, where you’re looking for the bad patterns, but, but it’s actually quite applicable across a lot of different use cases and systems.

Akshay Manchale 00:19:33 So what is the source for a ksqlDB data? Is it a consumer from a Kafka topic?

Jay Kreps 00:19:38 ksqlDB always works on top of Kafka, but it actually allows you to control the Kafka connectors as well. So I mentioned Kafka has this connect API and the connect API is a way of building these simple plugins that either capture a stream of data from a system. So there’s, um, plugins for, uh, my SQL or, or post grass or Oracle, um, there’s plugins for different messaging layers. There’s plugins for a lot of cloud data systems, there’s plugins for, um, different SAS APIs. So it can either capture a stream or it can take a stream that’s in Kafka and publish it out to some system. So you can imagine, you know, maybe you have a, my SQL database and you’re capturing all the changes coming in from that database. And you’re publishing it out to Kafka, or you could imagine taking a stream from Kafka and loading it up into the, my SQL database.

Jay Kreps 00:20:28 So, so there’s connectors, usually in both directions, we call them sources and sinks. And so, um, you know, to, to make these kinds of pipelines, easy ksqlDB to B also allows you to control those connectors. So you can actually have a pretty end to end flow for streaming data where you say, okay, you know, connect to this database, capture all the changes, connect to this database, to capture all the changes, take those streams, combine them in this way, you know, materialize this new table of data that is being computed off that, and now do lookups against that. And so it is actually kind of an end to end solution in this event, streaming space. Whereas, you know, really before this, I, I feel like people had to piece together a bunch of different things to get something, you know, there was some stream processing which would do the transformation and there was, you know, Kafka, there was integrations with different systems and then there’s some database you would do the lookups against. You had to kind of wire all that together and figure out how to secure it and so on. And so this, this makes it a little bit simpler and that you really just have Kafka in ksqlDB to work with. And they’re kind of built to work together and have a same, you know, a similar set of abstractions and security features and so on.

Akshay Manchale 00:21:38 So when you are ingesting data into ksqlDB, can you, uh, have some sort of transformation from the source before you ingest it? Or is it a one is to one data with respect to what’s in a Kafka topic, uh, that you can just create any differently? What’s the difference there?

Jay Kreps 00:21:53 You know, typically people don’t do a ton of transformation as part of that load, you know, you would typically do it afterwards and in a ksqlDB, but, but yeah, we, we do support, you know, we call them single message transforms that allow you to, you know, do simplistic, you know, munching on a single row as it comes in that that ends up being important to, in some cases there’s some sensitive data you don’t want to load, or you want to opt you scale, or you want to encrypt in a particular way kind of on the way in, without it ever being stored in the original form.

Akshay Manchale 00:22:22 So you can query quick ksqlDB, like sequel. Is there any difference between just regular sequel and ksqlDB?

Jay Kreps 00:22:29 Yeah. It’s, you know, the pull quarries are certainly more limited. So, you know, at the moment it’s limited to, you know, um, relatively simple look-ups, we’re, we’re working on broadening that further. Uh, over time you can imagine, you know, our goal isn’t to really replace all databases with this. It really is for use cases around event streams, where you would otherwise be gluing together, multiple technologies. So if you have a UI driven app that’s using Postgres, we wouldn’t come in and say, Oh, you know, stop that and use this ksqlDB thing. It really isn’t the case where you’re trying to do things with streams of data around Kafka, where this can suddenly add a lot of simplicity. What would have been a bunch of custom code and integration between multiple systems can do that? So, so yeah, we, we are broadening the query features over time that do pull quarries, but, but obviously if all you’re ever doing is pull quarries, there’s 101 databases to do that. The magic here really is the push query is that kind of stream processing side of the equation, which is a feature which really hasn’t existed elsewhere.

Akshay Manchale 00:24:10 Can you talk about the components that make up a ksqlDB system with respect to how it works in an existing Kafka deployment and how it fits in?

Jay Kreps 00:24:18 Yeah, absolutely. So, you know, in a traditional database architecture, you have kind of commit log and then you have these tables, uh, that have your data represented in different ways effectively. This looks a lot like that, except that the commit log is now Kafka and the tables are actually materialized in the ksqlDB cluster. So in ksqlDB, you would have a set of nodes, those nodes, you know, read and write streams of data from Kafka. They may materialize it into rock CB, which is not actually kind of a full database in the sense of having, you know, SQL and supporting remote access. It’s an embedded key value store that just maintains the data format on disk. And so that’s, that’s how the kind of indexes and tables are stored to allow look-ups. And so, so that, that’s how it works. And, you know, this makes ksqlDB to be kind of elastically scalable.

Jay Kreps 00:25:10 You can add a new notes and they will take over part of the work of query processing, uh, automatically. So you can kind of scale it out or scale it down elastically by adding or removing nodes from that cluster. And it, you know, it, it works similar to Kafka itself where there’s a kind of partitioned workload. So you can imagine taking one of these tables and chopping it up into little pieces and storing those pieces spread across the cluster. And you can do that with, uh, some replication so that, you know, if one of the nodes fails, it, it doesn’t go away. But the kind of long-term store for all data is, is in Kafka itself. The Rox VB instance, uh, is really just kind of a cache. And so even if it’s replicated, you know, it is just kind of a quick lookup index, not the kind of authoritative source, if that makes.

Akshay Manchale 00:26:00 So if you were to lose the instance of rocks DB, you could rebuild that data from your cafe topics. Is that right?

Jay Kreps 00:26:06 Yeah, that’s exactly right. So, so, and it’s common with stream processing. You may calculate a bunch of things and then want to recalculate them later. And you may, you may indeed want to throw away all your data and kind of start over and reprocess from the beginning as a simplistic way of getting there. So that’s, you know, that’s kind of the basic architecture, all the actual processing and computation happens in the ksqlDB notes. You know, Kafka is just being Kafka it’s it’s reading and writing the streams and replicating them and storing them. So it’s not, it’s not the case that these queries are kind of going off to the Kafka cluster and doing anything like that. One of the things felt was very important, was making sure that Kafka itself works really well as a multi-tenant system that can be shared across many teams and use cases without tipping over, whereas a ksqlDB, you can do all kinds of complicated munching and you can have manyksqlDB clusters that feed off the same Kafka cluster.

Jay Kreps 00:27:03 They can all share data, but they don’t interfere with each other. If you do some kind of very abusive query processing in your ksqlDB instances at won’t, it won’t hurt your, uh, coworkers. And, um, I think that’s an important pro um, property for any of these things that go across teams is to, to think through how that, you know, multi-tendency architecture is meant to work. And so for us, it’s, you know, Kafka is the shared part and it does very simple, predictable things, ksqlDB does all the complicated stuff, but is meant to be, you know, clusters that are not really shared across, uh, teams and meant to be multitenant. Instead, you would give each tenant their own capacity. And it’s actually a very flexible model. You know, typically in databases, if I give you your own database, you don’t have access to my data, but because of this shared commit log at the center, everybody has access to the same stuff. It’s just a question of how you index it and process it in different, in different clusters.

Akshay Manchale 00:27:56 Uh, speaking of indexes, relational databases tend to have certain indexes. Can you create additional indexes on top of your massage data that you’re consuming from Kafka topics and ingesting into case will Bebe, can you have indexes on top of that for faster lookups and all of that?

Jay Kreps 00:28:11 Yeah, so we, we, we don’t support kind of secondary nexus yet that you, you kind of course just materialize the data in different ways, which, which effectively serves the same purpose and is maybe more in the style of this kind of stream processing. We’ll probably add secondary indexes just to make the pull quarries simpler in certain cases and have less materialized data, but there hasn’t been a big push for it.

Akshay Manchale 00:28:32 Yeah. So ksqlDB sort of simplifies how you can access the data key streams or the stream API on top of CAFCA let you do the same thing. So what’s the correspondence between the two. Can you do everything that you can do with streams, with ksqlDB and vice versa?

Jay Kreps 00:28:49 Yeah, I would say you can do the vast majority, obviously with an open API that supports a code you can do more than, than you could do in SQL. So the relationship is this actually the streams API is a lower level set of kind of Java primitives for doing stream processing operations, joins aggregations, filtering, you know, all the things you, you might imagine doing with data. You know, it looks like one of these, you know, kind of, uh, fluid KPIs where you change together a bunch of operations, one after the other, if you, you know, if you’ve used Java, the kind of streams interface in Java is similar to this where you would, you know, chain maps and, uh, reduces and, and aggregates. And, and so it works like that, except it does this on, uh, these infinite streams, not just finite collections and, and does it in a persistent and partitioned in a distributed fault tolerant way.

Jay Kreps 00:29:42 So that’s what the streams, API and Kafka provides KC to be is actually built on top of that. So it uses those primitives under the covers, but it does do more than that. So it, it provides a SQL layer on top of that. It also, it provides remote access so that you can run these kind of pull quarries. You can, you know, send new queries to it. And then it also controls the connect API in Kafka. So you can run, uh, connectors and it brings both of those together. And so we feel like that really broadens the appeal. You know, maybe Java has whatever you want to call it 40% of the program or market, but like kinda everybody knows SQL. And it’s also just a lot less work to put together a SQL query than it is to kind of build and test a full Java application.

Jay Kreps 00:30:28 So it’s not really, you know, depending on the use case, it’s not that one exactly replaces the other, you know, for, for building kind of a very complicated custom set of business logic. And there’s, there’s a good chance that you’re still going to do that in Java, but for a lot of simple data transformations, materializing, caches of data, you know, looking for the bad patterns for security, that’s the kind of thing where you just want to write the query, just say what you mean, not, not write a ton of code and compile it and deploy it and all that kind of stuff. And that’s, that’s really where KCPL DP shines is those, those kind of simple streaming use cases.

Akshay Manchale 00:31:02 So there seems to be like a trade off between ease of use versus expressiveness of what you want to do for expressive, witty, just use like case streams and otherwise use ksqlDB.

Jay Kreps 00:31:13 I think that’s exactly right. You know, you could, you could think of there being kind of a hierarchy here where the lowest level producer and consumer APIs that right streams, they kind of have all the power, you know, that in Kafka, they have the transactional processing in theory, if you’re willing to write enough code, you can do everything with that, but that’s a lot of code in most cases and you would end up rewriting the same stuff over and over again. And that’s where these, the kind of connector API and streams API provide the next layer up in that stack. It’s a little bit higher of an abstraction. It makes it easier to get correct results and reason about what happens if the machine fails in the middle of processing something. Um, but you do trade off, you know, a little bit of flexibility in, in how you, how you write that versus the low-level read and write. And then one level up from that, uh, I think is, is ksqlDB. So the analogy you can use is, you know, uh, if you’ve ever used one of these key value interfaces like rocks DB itself, you know, it’s kind of very flexible and allowing you to work with data at a low level, um, probably more so than a SQL interface, but it’s actually a lot more work for kind of simple stuff, uh, that you might want to do that then using a SQL database.

Akshay Manchale 00:32:19 Is there a notion of transactions within ksqlDB? Can I ingest a particular set of events only if a certain condition is satisfied? Is there, um, you know, an all or nothing sort of a guarantee and a traditional, you know, what relational databases have?

Jay Kreps 00:32:34 Yeah. There, there is. Although it’s a little bit different. So since the domain is stream processing in and the pull cores are mostly for looking up results, um, it’s really about how can you guarantee the correctness of that processing. So, and you can imagine a lot of corner cases here. So, you know, what if, you know, a message was delivered and processed, and then the, you know, the, the stream processing application died and then it kind of came back and it gets that message again, you know what happens? So in my census example where I was kind of counting, uh, births and deaths, you want to, you don’t want to double count a birth, right? And end up with the wrong, the wrong number of people in a given city. And there’s a lot of corner cases like that that you could imagine. And, you know, it’s, it’s very similar actually to the domain of transactions in normal databases.

Jay Kreps 00:33:24 And the underlying Kafka API is support, uh, transactional concept. And so it works just like you would imagine a transaction. When you say begin, you do a bunch of rights, and then you say commit and, uh, then all of those rights either happen together, or they don’t, they don’t happen at all. And so the stream processing functionality and in ksqlDB to B uses that. So it says, okay, you know, I’m, I’m working across these different topics. And so on. I need to make sure that whatever happens, network glitches, the, you know, one of the nodes fails. Uh, I still get the, you know, the correct output in this case, correct. Output means the same output I would’ve gotten if nothing had gone wrong. Right. And, um, people often refer to this as exactly once semantics, meaning you get the same semantics as if the message had been delivered just one time, even though, of course, under the hood, you may send some message over network.

Jay Kreps 00:34:17 You don’t know if it got there, the thing fails, it comes back, it gets it again, you know, under the hood, many things are happening. You want to get the, the, uh, the same output you would as if everything worked perfectly. And so that’s, that’s, uh, supported within ksqlDB. And it’s, it’s an important thing to be able to have that, you know, there, there have been a whole history of, uh, stream processing layers over the last, you know, five plus years. And they’ve kind of built up functionality that makes it easier and easier to use. And so on a lot of the early ones didn’t really have any, um, real guarantees in this area and, or they had very weak guarantees. Of course, that makes it very, very hard to write any kind of important application on top of it. If you can’t guarantee that you get the right output, then it’s only really usable for things where the answer doesn’t matter. And so, you know, that kind of guarantee is very important. It is end to end because you have to reason about what’s in Kafka, you know, ksqlDB part, how it gets out into the other systems that, that it might ultimately end up in as a destination. And so that, that is another part of the simplicity. I think you can bring in this area by having a bunch of components that are built to work

Akshay Manchale 00:35:24 Together on the quality side of things. Can you join different tables, uh, different CAFCA topics to get a result that you want? Yeah,

Jay Kreps 00:35:32 Yeah, absolutely. And the use case for that in, uh, in stream processing, it always obvious to people, but, um, there’s actually quite a lot of that. So, so one of the examples I gave was, you know, you have this stream of sales, but the sale in it probably just has your like customer ID, right. But a lot of what you would need to do to process a sale in different ways, you probably need a bunch of other information about the customer. Maybe you need their primary shipping address or whatever the case may be, or you probably need to join on that information. And so, so you could imagine the sale as being a, you know, stream of events. You could imagine the customer information as being a kind of table you’re joining that stream and table together. And so, you know, it supports all the combinations here.

Jay Kreps 00:36:17 You can join a stream to a stream. So in advertising you have clicks and you have impressions. And, uh, one of the problems is often trying to say, Hey, which, uh, impression which, which viewing of the ad led to the click on the, on that. So our kind of stream join the stream to a stream, right? In normal databases. Of course you join tables. Uh, there’s a lot of that. You might have streams, you know, you might have streams coming out of different, uh, databases where you’re kind of capturing the feed of changes and you might want to aggregate all that. So a common example of this is where you have bits of data about your customer, uh, spread over many different systems in a company. Any big company ends up with this problem where you’re like, Hey, we, we know a lot about our customers, but you have to go to 27 systems to figure out the answer.

Jay Kreps 00:37:00 And, you know, that’s an example where if you treat the changes coming into each of those records in each of those systems, as a, as a kind of change log, then with Kafka connect, it can kind of extract that from those systems and you have this speed and then using ksqlDB, you can effectively produce this kind of streaming join that gives you the, you know, the end all be all, you know, record of for each customer that has all the information together. And so that’s, that’s an example of a normal kind of table join table, the table join. So all these combinations of streams and tables are actually pretty useful, uh, in this domain. Once you start to think about it,

Akshay Manchale 00:37:37 You mentioned earlier that it’s possible to have multiple ksqlDB instances maintaining their own table or convection that can come and go. So from a client perspective, can I join on data? That’s on two different ksqlDB nodes as if there’s a, is there a distributed query of sorts or is it isolated to a single source?

Jay Kreps 00:37:55 Yeah. Yeah. So that’s true both in two different ways. So, you know, a ksqlDB cluster is made out of multiple nodes, and then you can have many clusters working off of the same Kafka cluster. And so, so I was actually saying both things. So, so you could have, you know, I could have my cluster, you could have your cluster. And then within my cluster, it’s actually spread over multiple nodes. So the data itself is partitioned up. And of course you, um, you would need to, uh, be able to join across different partitions. The actual joint itself is performed locally. So it would reshuffle the data to get it into the same node, to do the joint. It’s not doing like a remote RPC call for each lookup. So, you know, under the, under the hood, it would be, uh, done locally. But, but yeah, you can join, uh, you know, different, different topics together in different ways.

Akshay Manchale 00:38:40 So since you’re using rocks DB as the backing storage layer, can you use rocks DB in conjunction with KC glue DB as a traditional relational database, create a classic table on rocks DB and then use the two things to sort of query ingest, et cetera.

Jay Kreps 00:38:56 Yeah. I mean, rocks DB itself, the DB and the name, maybe oversells what it is to people. You know, it is just kind of a put, get scan, uh, interface in C and Java. So it doesn’t support any SQL or anything like that. So that, you know, the table concept is of course, using that under the hood, just store these partitions of data in ksqlDB. And you can combine the tables and streams in ksqlDB, but the, um, there’s not really a SQL interface to, to, you know, to directly access the rocks DB part other than the ksqlDB interface itself. But I think probably the use case you would want to use that for you actually can do in ksqlDB itself.

Akshay Manchale 00:39:33 So you did mention that ksqlDB DB has a schema of sorts where you can declare, uh, you know, columns and a primary key and you’re consuming from CAFCA topics. So how do you deal with changes in screamer schema from your upstream subs, um, publishers of data?

Jay Kreps 00:39:51 That’s a great question. So, so yeah, I guess like any, uh, SQL airksqlDB, it needs, it needs a notion of what the data is and that lets it get into the records and munch on them in a smart way. So, you know, you can, you can kind of express that just yourself by saying, Hey, I assert this is what’s in this data feed. Maybe, you know, maybe have Jason records. And you’re saying, I’m, I’m saying it’s going to have these four fields. And that’s the scheme of my table. Obviously, if, uh, somebody upstream, publishes mangled, Jason, then you got a problem. Uh, so, so it does also support usage with what we would call a schema registry. So, so confluent, uh, produces something that maintains these Schemos along with Kafka. And that’s a way it’s actually a very important component for the usage of Kafka.

Jay Kreps 00:40:34 I talked about the usage of Kafka across different teams, obviously that only works if we all understand each other’s bytes, which means we have to agree on the format. And so this, this allows you to maintain the format of data in common format. So, um, uh, protocol buffers or Avro or Jason, uh, it’ll store a schema for that. It’ll check the data against that. And then you can express different compatibility rules. So you may have some Kafka topics where you say, Hey, it’s wild West, put whatever you want in and good luck to whoever’s downstream. The reality is for important data. If you have a lot of applications building around it, you don’t want to be in a situation where the person putting data in, or the many teams, putting people data and have to go talk to all the teams downstream. And it just gets really hard to organizationally maintain correctness in that world.

Jay Kreps 00:41:22 So, so typically in those situations you would enforce, uh, some notion of forwards or backwards compatibility, you know, which, which control what you’re allowed to do, right? So, uh, the most restrictive would be, you can’t change it at all. The reality is in, in most businesses, the problem changes over time, you know, the, the software needs to evolve. So that’s usually too restrictive. So, but you usually do want to have some notion of compatibility around, you know, okay, you can add fields, but you can’t, you know, change the type on an existing field because that’s going to break downstream code. It’s going to break ksqlDB, you know, DB, whatever, whatever is relying on the format of that. So you would want to use that schema registry to enforce these requirements. And that’s actually quite common. This is a very popular, uh, open component that we produce it, you know, it’s very commonly used with, with Kafka itself.

Akshay Manchale 00:42:13 Since you have a cluster of KC equal DB nodes, I can presumably write a query in different ways. If I go to a relational database and write a query there’s ways to sort of understand what the cost of the query is, plan and kind of get some feedback about it could be written in a different way for facet access. So the presence of this cluster of KC equal DB nodes that are maintaining their own tables and views, is there some assisting for query planning, optimization and all of that?

Jay Kreps 00:42:41 Yeah. Um, you know, we have a little bit of functionality to let you understand what’s going on under the hood. It’s, it’s not as mature, I think, as the more mature databases and what they provide. You know, some of that is because the, the pull queries themselves are actually simpler, uh, in terms of what we support. So there’s less to debug there. What you actually need in addition to that is, is, you know, the ability to really change and evolve quarries in a way you wouldn’t in a relational database, right? Because if you have a query that just happens for a point in time, and then it’s done, then, uh, you don’t really need to evolve it in place, but with stream processing, you have these quarries that might run forever, right? So if I’m computing, you know, how many sales happened in each city at some point, my logic for how I count that or how I define cities might change. And so, so in addition to the kind of explain plan stuff, w what you also need is the ability to go back and reprocess data. And that’s something we’ve put some thought into, uh, there’s kind of ongoing work to help with that kind of query evolution, make it easier to do that in place without having to go back and reprocess data. So that’s, that’s kind of an interesting nuance as well.

Akshay Manchale 00:43:48 And when it comes to actually accessing your data, since you have a cluster, is there a way to find the table, or do you need to know exactly where that no days it’s IP address or a DNS lookup name or whatever?

Jay Kreps 00:44:00 Yeah. The, um, the interaction with ksqlDB DB is very, very similar to most of the relational databases. So if you’ve used by SQL or Postgres or Oracle, you know, you, there’s a little command line shell that you can hook up with, uh, there’s a rest API, but on the command line shell, you can do exactly what you would expect, which is, you know, show me the table, show me the streams. Uh, I want to do this. You can kind of interactively develop these query to see what the results of different things would be, you know, run these quarries. You can manage some of the background quarries, like kill the, you know, kill the long running ones. So it’s very similar to those systems,

Akshay Manchale 00:44:34 Right. Do you have any comments on performance ofksqlDB, and how quickly you can ingest things, um, in comparison with Sierra relational database what’s stopping, or why you skip key sequel DB or see a relational database that could be ingesting the same amount of data, every event that comes in. Yeah.

Jay Kreps 00:44:52 Yeah. That’s a good question. So, so, so yeah, you know, if, if what you’re building is ultimately, you know, a UI driven app where there’s a pile of data that kind of sits there and you just do look ups to show the result. Then I think that w you know, that’s, that’s effectively what our relational database is built to do, and they do it pretty well. And there’s different, you know, there’s 101 databases. There’s probably one that’s a good fit for whatever the application you’re building is. Um, and I don’t know that there’s really a motivation to adopt anything new, this other side of things around stream processing. You know, that’s something that databases don’t do at all. Uh, and so you, uh, you end up pushed into this world where you’re, you know, like in the data warehouse while you dump a bunch of data, and, and then at the end of the day, you run some big batch computation to get in, into shape.

Jay Kreps 00:45:36 So everything’s like a day old and, you know, that’s, that’s nowhere right in the modern world. That’s, that’s where a stream processing chain. So I would say that, you know, the goal of adopting this is not so much that, uh, Postgres wasn’t fast enough, it’s actually that, you know, the, the application you’re building is actually a stream processing application. You’re reacting to things as they occur. And, you know, there’s a lot of use cases that just require that, uh, that kind of access pattern. So then, um, you know, H how does the performance compare? It’s hard to compare because they don’t, you know, these other systems don’t do stream processing. So, you know, theksqlDB just does relatively simple look-ups so, so the typical, like, you know, if you, if you ran it through one of these data, data warehouse, uh, query benchmarks, like it couldn’t do most of the query.

Jay Kreps 00:46:19 Isn’t that, it’s just as simple as simple look-ups right on the pull query side. So the pull queries are fast, but not nearly as expressive yet, and that’s an area we’re adding. And then of course the stream processing doesn’t exist at all in this other system. So, so it’s probably hard to do apples to apples comparison in terms of raw performance. You know, how, how, what can you use this on? You know, there’s, there’s kind of a per node number, which is actually quite good. You can, you can process, you know, hundreds of thousands of, uh, records per second on one node, but, but like a lot of these modern systems, it’s a proper distributed system and it scales horizontally. So you could add, you know, if you want more than you just add more machines, uh, and you can do that, uh, dynamically as it runs. So as you know, as the query is processing, you can add more, uh, capacity to make it go faster. So, so we, you know, we’ve run on workloads that they’re actually massive in scale where it’s, you know, millions of records.

Akshay Manchale 00:47:12 That’s great. So in Kafka, you could have all the events sort of expiring or falling off of the topic. So how does that flow back into ksqlDB? Let’s say I have some exploration that says, um, just forget about events from 30 days ago for compliance or whatever. How does that flow into your ksqlDB instance in terms of the aggregates, your computing, or the transformations that you’ve done on top of the underlying data?

Jay Kreps 00:47:38 Yeah, you’re saying maybe we can start that one over. I’m not sure if I understood the question. So you’re saying like the compliance of the data, the retention,

Akshay Manchale 00:47:45 Uh, so retention, let’s, let’s say intervention of data and topics. You can configure that trait. So, um, since you have these push queries or pull queries, and do you actually see data twice when things sort of are deleted because of retention policies?

Jay Kreps 00:47:59 Yeah, that’s a, that’s a great question. So, you know, in a stream you would have this idea of maintaining, you know, data just for some period of time. Uh, in theory, that period of time could be forever, but if that’s the case, you’re going to need more storage over time as more data accumulates in a table. Uh, it works, uh, just like a database does. So, you know, the table data is stored forever. So if you have your customer counts is the assumption is you, you wouldn’t ever, you know, time out, you know, your customer count that would stay forever, uh, under the hood. Uh, we do actually support kind of a, a time bottled table. Um, that’s important for some of the windowing concepts, you know, as you compute aggregates over a window, like how many people were born in the last, however many days, but the, at least in the common case, you could think of, uh, tables as being just like a database table where it persists until you delete it.

Akshay Manchale 00:48:50 So internally, was there any reason you chose rocks DB? What was, uh, what was the motivation behind that?

Jay Kreps 00:48:57 Yeah. Um, you know, uh, effectively we’re, we’re using, you know, just kind of, uh, uh, some key persistent key value interface. And so it doesn’t particularly matter what that key value interface is. And so there’s, there’s a set of these different embeddable key value interfaces that run as libraries. Um, it would be overkill to have something that had like a SQL layer of its own or a bunch of advanced database features. It literally is just a library that we use for maintaining data on disc and, uh, all the distributed systems, replication of data, sequel processing, that’s all done, uh, byksqlDB itself. So, so yeah. Then when you look at different libraries, they all have pluses and minuses. Um, technically it’s actually pluggable w within Kafka streams, which is used for these underlying primitives. And so you can actually plug in anything you want. There’s a, there’s an all-in memory version of it, um, that could have some advantages. Um, there’s other systems out there you could adopt, but most people use rocks to be it’s it’s, um, you know, there’s pros and cons it’s I think extremely, uh featureful and, um, very high performance, you know, the con is, it has about a million tuning knobs. So, uh, you know, there there’s, uh, if you don’t like the performance, you can almost certainly fix it. If you can discover the right, the right tuning know.

Akshay Manchale 00:50:16 So operationally, how is this packaged along with CAFCA? Do I need to manage a separate rocks DB instance tonight and all of that, or is it just out of the box it’s available? So, um, you know,

Jay Kreps 00:50:28 I think you guys did a podcast on cockroach DB. They use it as well, right? So for, for both of us that the word GB and rock CB is, is maybe a misnomer. It is actually just a library and it accesses the local disk. And so it’s, uh, you know, it’s, it’s really just an, a library that, that maintains a certain file format on disk. It, it, uh, it’s not something you would install separately. It’s not something that’s accessed remotely over a network. It’s not something that has a SQL interface. So, you know, anything which accesses data on disc, you have to pay a little more attention to, but, but yeah, there’s no operational component of it effectively. You start this, uh, ksqlDB process, and it has everything it needs. You know, what you have to tell ksqlDB is where’s Kafka. And, um, you know, that’s, that’s really the, the, the things you have to have is, is Kafka in ksqlDB.

Akshay Manchale 00:51:21 Are there any anti-patterns of using ksqlDB for stream processing, such as building a push-based application, for example, using Kafka streaming API versusksqlDB, are there anti-patterns of using one or the other where it’s not compatible?

Jay Kreps 00:51:37 Yeah, probably the pro the biggest anti-pattern in stream processing, I think is, you know, trying to do a remote lookups for every record. And so this is kind of a very natural thing in software, uh, where, you know, if you have one thing and you want to look up the corresponding record, you do some remote call to get that corresponding records. So, in the example I gave, I had the sale, I want to look up the customer record. Um, there’s a tendency for engineers to want to, you know, take each thing and do a remote call on the other thing. And it usually just isn’t the easiest, most correct, or most performant way to do that because you have to reason about, well, what happens if that call fails and how do I retry it and all this stuff. So the more natural way of doing that in ksqlDB would be to capture the stream of changes on the customer data and capture the stream of sales, and then do that join within ksqlDB rather than try to do it in the application space itself.

Jay Kreps 00:52:38 Um, that will, that will kind of have the right properties out reasoning, route time, it’ll have the right fault tolerance properties, and it’ll be a lot faster, uh, because you’re not kind of doing a remote call on each item, uh, with all the colleagues and seeing whatever that, that implies. So that’s probably the biggest anticodon that people miss, which is just a little different from how people use, uh, traditional databases, where you’re often joining, you’re doing joins kind of in your code over the network between different things. And it works better because you’re, you’re typically just doing that for one or two rows, whereas in stream processing, there’s a whole feed of these. And especially when you go to rerun your processing, if you change your logic, you want it to work quickly, so you can get back caught up to the grandchild.

Akshay Manchale 00:53:19 So what’s the future of KC equally be? Where is it going? Yeah,

Jay Kreps 00:53:22 Yeah. Um, well, well, you know, our, our goal is to really make working with streams of events as easy as possible. So, and that’s really our vision at confluent kind of end to end. And so there’s, there’s a lot of work to try and make that easy. Like we’re, you know, we’re building all this as a cloud service to try and make the operation side of it, go away. Um, it’s, you know, making all the features available that just make it really easy to work with us and make it easy to work across a large company. Um, so in ksqlDB, the parts of that that are important. Um, I mentioned just completing the set of functionality that people want in pull quarries. That’s a ton of work cause it’s like all the stuff databases do. Uh, so we got a long roadmap of things to fill out that side of things.

Jay Kreps 00:54:05 You know, we’re still not trying to replace traditional databases for, you know, what they’re good at, like building kind of end-to-end UI driven apps, but we want to make these, you know, event stream processing architectures really simple and make the access to that data really easy in an integrated system. Um, there there’s work going on around, making it really easy to test and evolve these quarries to that kind of full life cycle, uh, of development for it. And then a ton of work on just the full completeness of all the processing capabilities, performance, all operational side of it. So, you know, lots going on, uh, I guess whenever you’re trying to build the database, it’s, it’s a lot of work. So, so we’re, we’re not running out of things to do

Akshay Manchale 00:54:44 So is ksqlDB opensource? Yeah, it’s available.

Jay Kreps 00:54:46 You do the community license that confluent uses. And so, um, it’s not an OSI open-source license, but you can take it, you can modify the code, you don’t have to pass anything, you know, you can make changes to it and publish those. The, the only, uh, major restriction is around, um, how that can be run as part of a kind of managed SAS cloud service, um, which you know, that right is reserved. All, all the details are in the, the license itself, but that’s, that’s the same license that we produce our schema registry and other components under. And it’s, it’s very popular and, you know, um, gets a lot of free usage. And of course you can do exactly what you’d expect, which is go look at it all and get up and make your own fork with that.

Akshay Manchale 00:55:25 Awesome. So is there anything else that you want to add about ksqlDB?

Jay Kreps 00:55:30 No, I, I think it’s, you know, if you’re interested in, in databases and data systems, I think this world of events and event streaming is, is really becoming a big deal. You know, it’s, it’s kind of becoming part of the modern stack and I think it’s a great tool to make working with this really easy and productive. So I’d urge people to go check it out and give it a spin. You know, you can try it out pretty easily, and there’s a pretty active community around it. So, you know, if you kick the tires and there there’s something that does make sense, or it doesn’t work the way you expect, let us now, um, there, there’s a set of people contributing patches and, and in a bunch of people at confluent working on it actively. So we’d love to hear feedback from people.

Akshay Manchale 00:56:05 I’ll include some notes in the show notes on how people can cry. ksqlDB with that, Jay, thanks for being here to talk about ksqlDB. This is our chairman Charlie for software engineering radio. Thank you for listening.

[End of Audio]

This transcript was automatically generated. To suggest improvements in the text, please contact content@computer.org.


Tags: , , , , , , , , , , , , , , , ,