SE Radio 456: Tomer Shiran on Data Lakes

Tomer Shiran, co-founder of Dremio, talks about managing data inside a data lake and the ecosystem of products available for storage and analytics. Shiran and SE Radio host Akshay Manchale briefly explore the historical change in data organization from databases to data warehouses and now toward data lakes, as well as the motivations and use cases that have led to those transformations. They also discuss different storage formats, the power of public cloud for cheap data storage, ways to move data into data lakes, and methods for managing that data for compliance.

This episode sponsored by O’Reilly.

Show Notes

Transcript

Transcript brought to you by IEEE Software
This transcript was automatically generated. To suggest improvements in the text, please contact [email protected].
Intro 00:00:00 This is Software Engineering Radio, the podcast for professional developers on the [email protected] SE Radio is brought to you by the IEEE Computer Society and IEEE Software magazine, online at computer.org/software. Keeping your teams on top of the latest tech developments is a monumental challenge. Helping them get answers to urgent problems they face daily is even harder. That’s why 66% of all Fortune 100 companies count on O’Reilly online learning. At O’Reilly, your teams will get live courses, tons of resources, interactive scenarios and sandboxes, and fast answers to their pressing questions. See what O’Reilly online learning can do for your teams. Visit oreilly.com for a demo.

Akshay Manchale 00:00:46 Welcome to Software Engineering Radio. I’m Akshay Manchale, and today I’ll be talking to Tomer Shiran. Tomer is the chief product officer and co-founder of Dremio. Previously, Tomer led the product management team at MapR, where he was responsible for product strategy, roadmap, and requirements. Prior to MapR, Tomer held numerous product management and engineering roles at IBM and Microsoft. Tomer, welcome to the show.

Tomer Shiran 00:01:09 Thanks for having me here.

Akshay Manchale 00:01:10 Awesome. So before getting into data lakes, let’s talk a little bit about the history and how we got here. Everyone is familiar with databases, and they’ve existed for many decades, but at some point there was a transition to having data warehouses in an organization. Can you talk about what the motivations were and what led to the need for a data warehouse?

Tomer Shiran 00:01:30 Sure. Back in the day, it all started with people having databases where they were running their applications, storing the data for those applications, and performing transactions. There was a need to analyze that data, and for a couple of reasons it became hard to do that on the operational database. First of all, you didn’t want to impact the performance of the application that was running, which was potentially a mission-critical OLTP application. Also, in some cases you wanted to centralize data from multiple sources or multiple databases and be able to analyze the data coming from many sources. So the data was moved into one place, and ultimately that became the data warehouse. In many cases, the data warehouse was technologically just a database. It was the same type of technology that people were using for their OLTP databases, whether that was an Oracle database, SQL Server, or something else. So that was the rise of the data warehouse.

Akshay Manchale 00:02:32 Yeah. And from data warehouses, more recently we’ve had a transition to having data stored in different forms. You had structured data for the most part, and then unstructured, NoSQL-like databases started coming about. What was really the motivation there, and how does that contrast with the data warehouse evolution before it?

Tomer Shiran 00:02:50 Well, I think there are two aspects here. The NoSQL database movement, and technologies like MongoDB and Elasticsearch, which you could put in that bucket, were all about making the developer more productive. A lot of applications were being built with JSON and nested structures, and it was easier to represent objects in the real world using those structures as opposed to flattened-out tables. That’s the reason companies came up with NoSQL databases like MongoDB, which were much nicer, I think, for a developer building an application. In parallel, for analytical workloads, there were also a lot of data sources, such as log files, whose data wasn’t necessarily your traditional star schema, the very structured data that people were putting in data warehouses and databases. So I think that was another thing that drove the rise of these semi-structured data sets.

Akshay Manchale 00:03:51 Right. So agility, I guess developer agility, was a huge part of that change. What led to the emergence of data lakes, and of looking at your data as a data lake?

Tomer Shiran 00:04:05 I think the early days of the data lake were all about the ability to store larger volumes of data and ingest them more easily into a centralized repository. If you think about the data warehouse, a ton of effort goes into creating the models, the tables, and the schemas, and into making sure that the data being ingested meets them, including having to change the schema before you actually ingest any data. So they’re pretty rigid, and it takes a lot of effort. Once systems like Hadoop were created, back in 2009 or 2010, you had the ability to just have a repository where you can dump files, whether they’re log files or more structured data, into a single system and pay a very low cost per terabyte or gigabyte.

Tomer Shiran 00:04:53 That was, I think, a big motivation there. Early on in the world of data lakes, a lot of the processing was batch processing. It was really designed for MapReduce jobs originally. This was pre-Spark, and before being able to run SQL on data lakes. But one of the really interesting things that came out of this is the creation of many different types of computational engines. You can use one engine for a developer who wants to do batch processing, something like Spark, and a different engine for workloads that are more like interactive SQL and BI, which is what we do at Dremio. So all sorts of different engines, including different machine learning engines, all interact with that same data. That, I think, has become the defining feature of the data lake, and it’s something you just can’t get in a data warehouse.

Akshay Manchale 00:05:45 Right. I guess the separation of compute and storage really led to the emergence of various compute options. But on the storage side, is it still Hadoop and HDFS, or how do people look at a data lake? Is there a classic data lake product, so to speak?

Tomer Shiran 00:06:01 What we see now is that the vast majority of data lake projects are actually in the cloud. Once companies started moving to the cloud, to things like AWS and Azure, you have storage services like S3, which is infinitely scalable and very inexpensive, $20 per terabyte or less. You don’t have to manage much, you only pay for what you use, and it’s available in many different regions around the world. So you now have a much better solution for the storage of a data lake in the cloud. You have the same thing with the other cloud providers, for example ADLS on Azure. That’s created a much more powerful concept, and it is now being taken advantage of by both cloud data warehouses and cloud data lakes.

Tomer Shiran 00:06:49 The cloud data warehouses take advantage of the fact that you have this separate, infinitely scalable storage system, so they separate the compute from the storage. But the data itself is still stored in a proprietary format that is tied to that one data warehouse, and it can’t be accessed by other computational engines. You have to pay that data warehouse vendor to get the data in, to get it out, and any time you want to do something with it. Whereas with cloud data lakes, the data inside storage systems like S3 is stored in open formats, open source formats like Parquet, a high-performance columnar format, and increasingly also open table formats like Apache Iceberg and Delta Lake. The data is stored in open formats that can be accessed and transacted on from different systems. Dremio is one of these systems, but you can use other engines that are out there: Spark, Databricks, Presto, Athena. There are a lot of different engines and tools that can interact with the same data, so you’re not locked into one specific data warehouse vendor.

Akshay Manchale 00:07:54 Is there a particular defining characteristic that you see for a data lake? Is it the variety of data? Is it the volume of data? Is there a single defining characteristic that you would normally see?

Tomer Shiran 00:08:04 Well, I think data lakes are typically used in situations where companies have a relatively large volume of data. You typically see them used by enterprises, larger companies that have a lot of data, and you often see them even at smaller companies that are tech companies: venture-backed technology companies that may have only a couple of hundred people but are serving millions of users, because that’s the nature of their business. They’re generating a lot of data and need to analyze a lot of data. That’s typically the defining element of a data lake. Historically, back in the days of Hadoop, people said you should use a data lake if you have unstructured or semi-structured data. These days, most of the data that people use is structured data. It’s not that people with structured data put it in a database and people with unstructured data put it in a data lake. In the vast majority of use cases where we see people using data lakes, the data is structured. It’s in tables that have rows, columns, and schemas, and they’re doing processing on that data. So that’s really no longer a big distinction, I think.

Akshay Manchale 00:09:14 So before you actually have data stored in S3 or something similar, you may have data in different databases, or maybe in a data warehouse. What’s really the ingestion path into, say, S3? What sort of ETL tools do people normally use to get the data in?

Tomer Shiran 00:09:30 I think you can look at the data that’s out there as two different categories. One is data that’s being generated in operational databases. These are the OLTP relational databases, Oracle and SQL Server, and the NoSQL databases like MongoDB. That data needs to be ingested into S3. Then, especially for companies that were born in the cloud or for new applications being built in the cloud, a lot of the data exhaust from those systems natively starts its life in S3. Fortunately, thanks to the benefits of systems like S3 that are so inexpensive and infinitely scalable, they’ve become the de facto system of record in the cloud.

Tomer Shiran 00:10:15 This is where everybody puts their data. Even if you look at companies that use data warehouses in the cloud, for the vast majority of them, probably over 90%, the data is first moved into S3 and from there loaded into a data warehouse. That’s because it’s so inexpensive and so easy to put the data into S3. It’s very simple to upload data into S3, so a lot of companies just do it themselves. They write some code in Python or use some Amazon tools. Others use ETL tools to do that. Basically every ETL tool now supports getting data into S3, whether it’s the more traditional ETL tools, like Informatica and Talend, or the newer data pipeline tools, like Fivetran, Matillion, and Stitch. So there are a lot of different options for people to get data into S3 these days.

Akshay Manchale 00:11:02 Well, S3 itself is kind of like an object store, right? So how do you catalog and organize your data? If you have a database, it’s fairly straightforward: you have a schema that defines what sort of tables you have, and the columns within them are also well defined, whereas in S3 you’re just talking about a single file. So what’s the cataloging and organization there?

Tomer Shiran 00:11:21 You’re right. S3 is, at its core, an object or file store, so you’re storing files. There are really two ways, or maybe let’s call it three ways, to define a table in this kind of system. The simplest, but maybe least flexible, way is to store the files in a self-describing format. You just drop a bunch of files in a directory, all in a format like Apache Parquet, or maybe even just JSON files. Because those are structured, many engines, like Dremio, can just point at that directory, read it, and query it. And if you structure it well, you can even support partitions by having nested directory paths.
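As an illustration of partitions expressed through nested directory paths, here is a minimal Python sketch. It assumes the widely used `key=value` directory naming convention, which the conversation does not spell out, and the object keys and function names are hypothetical. It shows how an engine could skip files by looking only at their paths.

```python
from pathlib import PurePosixPath


def partition_values(path: str) -> dict[str, str]:
    """Extract key=value partition segments from the directories in a path."""
    values = {}
    for part in PurePosixPath(path).parts[:-1]:  # skip the file name itself
        if "=" in part:
            key, _, value = part.partition("=")
            values[key] = value
    return values


def prune(paths, **wanted):
    """Keep only the files whose partition values match every filter."""
    return [
        p for p in paths
        if all(partition_values(p).get(k) == v for k, v in wanted.items())
    ]


# Hypothetical object keys under a single table directory (assumption).
files = [
    "sales/year=2020/month=01/part-0.parquet",
    "sales/year=2020/month=02/part-0.parquet",
    "sales/year=2021/month=01/part-0.parquet",
]

# Only one of the three files has to be opened for this filter.
january_2020 = prune(files, year="2020", month="01")
```

The point of the sketch is that the filter is answered from the path alone, before any file contents are read.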

Tomer Shiran 00:12:05 So that’s one way. The second generation, which was created on top of that, is systems like Hive Metastore and Amazon’s Glue catalog. These systems provide a metadata layer, a description layer, on top of those files. Rather than just pointing at a specific path and calling that your table, the metastore or catalog, Hive Metastore or Glue, stores the definitions of all these tables. For each table, it stores the schema and the list of files that make up that table. It’s a separate system, and that’s great: it adds a lot of structure and makes it easier to have consistent definitions of tables across different systems. One of the challenges there was just the scalability of those systems.
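The metadata layer described here can be pictured with a small in-memory model. This is only a sketch of the idea (table name, schema, list of files); it is not the Hive Metastore or Glue API, and every class and method name below is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TableDefinition:
    name: str
    schema: dict[str, str]                      # column name -> type name
    files: list[str] = field(default_factory=list)


class ToyCatalog:
    """In-memory stand-in for a metastore (illustrative only)."""

    def __init__(self):
        self._tables = {}

    def register(self, table: TableDefinition) -> None:
        self._tables[table.name] = table

    def add_file(self, table_name: str, path: str) -> None:
        # The catalog lives apart from the data files, so it has to be
        # updated separately whenever a file is written to storage.
        self._tables[table_name].files.append(path)

    def lookup(self, table_name: str) -> TableDefinition:
        return self._tables[table_name]


catalog = ToyCatalog()
catalog.register(TableDefinition("sales", {"id": "bigint", "amount": "double"}))
catalog.add_file("sales", "s3://example-bucket/sales/part-0.parquet")  # made-up path

sales = catalog.lookup("sales")
```

Any engine that consults the same catalog sees the same schema and file list, which is the consistency benefit mentioned above.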

Tomer Shiran 00:12:51 And also the fact that they’re disconnected from the data itself. You have the files being stored on S3, but you have the metadata being stored in a separate system that is transactional. So it was hard to do anything transactional that way. What you’ve seen in the last year is the rise of native table formats that are transactional and support data mutations and time travel directly on systems like S3. There are, I’d say, two main competing technologies. One project is called Delta Lake. It was created by Databricks and is primarily supported by their system, although that data can be queried from other systems, like Dremio and others. And then, I think more relevant, is Apache Iceberg, a project created by Netflix and now backed by Apple, Airbnb, Stripe, Adobe, and Alibaba.

Tomer Shiran 00:13:46 Basically, many of the largest tech companies are now behind Apache Iceberg. The idea is to have an open source table format that supports transactions, record updates and deletions, and time travel, and that allows these types of transactions to happen from basically any engine. Whether it’s Dremio, Spark, or any other engine out there, they can all perform transactions and updates on the data and do that in a consistent way without harming each other.

Akshay Manchale: So Iceberg uses S3 as a backing storage system into which you can push data. Is that right?

Tomer Shiran: Yeah. You can think of Iceberg as having manifest files, files that describe the data and are stored alongside the data. So in S3 you have a file that points to the data files and also includes information about the schema. It does that in a way that, when an update is made, it doesn’t lose the previous version. It keeps track of the deltas. So there’s a new manifest file, and then the root pointer points to the new manifest file. That allows different engines to support transactions and version control on this data.
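The manifest-and-root-pointer idea can be sketched as a toy model in Python. This is a deliberately simplified illustration, not Iceberg’s actual file layout or API: all names are hypothetical, everything lives in memory, and the atomic commit behavior that a real table format needs is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Manifest:
    version: int
    schema: tuple[str, ...]
    data_files: tuple[str, ...]


class ToyTable:
    """Every commit writes a new manifest; old manifests are never modified."""

    def __init__(self, schema):
        self._manifests = [Manifest(0, tuple(schema), ())]
        self.root = 0  # "root pointer": version of the current manifest

    def current(self) -> Manifest:
        return self._manifests[self.root]

    def commit(self, add=(), remove=()) -> int:
        base = self.current()
        files = tuple(f for f in base.data_files if f not in remove) + tuple(add)
        new = Manifest(len(self._manifests), base.schema, files)
        self._manifests.append(new)
        self.root = new.version  # readers see the new version from here on
        return new.version

    def files_as_of(self, version: int) -> tuple[str, ...]:
        """Time travel: list the data files as they were at an older version."""
        return self._manifests[version].data_files


table = ToyTable(["id", "name"])
v1 = table.commit(add=["a.parquet", "b.parquet"])
v2 = table.commit(add=["c.parquet"], remove=["a.parquet"])

latest_files = table.current().data_files
files_at_v1 = table.files_as_of(v1)
```

Because the earlier manifest is kept, a reader can still ask for the table as of `v1` after the second commit has replaced `a.parquet`.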

Akshay Manchale 00:14:54 Interesting. I want to dig into some of the storage formats a little more. We started with XML and JSON, which were super popular early in the decade with the whole NoSQL storage systems. Now you have Parquet, Avro, and a couple of other formats. Can you walk us through some of those newer emerging formats and what they’re really used for?

Tomer Shiran 00:15:17 Yeah. If you want to call XML and then JSON that first generation, I think the focus there was on interoperability, on having a standard that different systems can talk to. And of course JSON is much simpler and much more elegant, I think, than XML. Back in the day, when I was starting my career as an engineer at IBM, I was totally immersed in this world of XML. I was using the XMLSpy tools, and I could create XSDs while I was asleep. That was the standard back then, and that’s improved. But when you think about analytics, it’s not just about storing the data in a way that’s interoperable and can be understood by different technologies.

Tomer Shiran 00:15:58 It’s also about performance. You can’t just store the data in a text format like JSON or XML, because that’s not going to be very efficient for querying. So then you saw the rise of newer formats. Avro and Parquet are examples of that, and ORC is another example. The world separates into two different kinds of formats. There are the row-based, or row-oriented, formats where, from a binary standpoint, the first record is stored before the second record, which is stored before the third record. You have all the data for the first row before all the data of the second row, and so forth. And then you have the columnar formats, where the data is structured column by column.
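The difference between the two layouts can be shown with a few made-up records. This is a minimal sketch of the ordering of values only; real formats such as Avro and Parquet add encoding, metadata, and chunking that are not modeled here, and the sample data is invented.

```python
# Made-up sample records, for illustration only.
records = [
    {"name": "Ada", "age": 36, "city": "London"},
    {"name": "Linus", "age": 28, "city": "Helsinki"},
    {"name": "Grace", "age": 45, "city": "New York"},
]
columns = ["name", "age", "city"]


def row_layout(records, columns):
    """Row-oriented: all values of record 1, then all values of record 2, ..."""
    return [rec[c] for rec in records for c in columns]


def column_layout(records, columns):
    """Columnar: all names, then all ages, then all cities."""
    return [rec[c] for c in columns for rec in records]


rows_first = row_layout(records, columns)
# ['Ada', 36, 'London', 'Linus', 28, 'Helsinki', 'Grace', 45, 'New York']

columns_first = column_layout(records, columns)
# ['Ada', 'Linus', 'Grace', 36, 28, 45, 'London', 'Helsinki', 'New York']
```

In the columnar ordering, values of the same type and meaning sit next to each other, which is what the compression and column-skipping benefits rely on.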

Tomer Shiran 00:16:36 So maybe you take the first 1,000 records of the dataset and store the name column for each of those 1,000 records, followed by the ages of all 1,000 records, followed by the addresses of those 1,000 records. That’s what’s called a columnar format. Parquet is a columnar format, and that has a lot of advantages in terms of things like compression. You can be much more efficient in how well you compress the data if you’re storing, for example, all the dates one after the other. You can use things like delta encoding and so forth. A columnar format also has another benefit when you think about query processing. We once did a study of some of the largest Dremio customers, some of the largest banks in the world and tech companies.

Tomer Shiran 00:17:26 What we saw is that, on average, 5% of the columns of a given data set are actually being queried on a regular basis. You have data sets that have a thousand columns in them, or 500 columns, and out of those 500 columns maybe 20 or 25 are actually being queried by the users. The rest are stored there for reasons of wanting to maintain that data, but nobody’s accessing them. So if you think about query processing, the ability to read from disk, or from S3, only the data that’s applicable to that query can save you a lot of time in terms of how fast you can process that data and run the query.
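Both columnar benefits mentioned in this answer, delta encoding of neighboring values and reading only the queried columns, can be sketched in a few lines of Python. The sketch assumes dates are stored as integer day numbers and uses a made-up table of 500 columns with 1,000 values each. It illustrates the ideas only and is not how Parquet implements them.

```python
def delta_encode(values):
    """Keep the first value, then each difference from the previous value."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Rebuild the original values by summing the deltas."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


# Dates stored one after another as day numbers (assumption for the sketch).
days = [18262, 18263, 18263, 18264, 18266]
encoded = delta_encode(days)  # small numbers after the first value


# Column projection: a made-up table with 500 columns of 1,000 values each.
table = {f"col_{i}": list(range(1000)) for i in range(500)}


def scan(table, wanted):
    """Read only the requested columns; the others are never touched."""
    return {name: table[name] for name in wanted}


wanted = [f"col_{i}" for i in range(25)]  # 25 of 500 columns
result = scan(table, wanted)

values_read = sum(len(v) for v in result.values())
values_total = sum(len(v) for v in table.values())
fraction_read = values_read / values_total  # 25 / 500 of the data
```

With 25 of 500 equally sized columns queried, the scan touches 5% of the stored values, which matches the proportion quoted above.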

Akshay Manchale 00:18:06 So let’s say you’re a business with a standard, traditional database, and you want to push that data into your S3-like data lake storage system. How do you manage ingestion in terms of data quality? You could have multiple databases within the organization, so data quality is really important with respect to how you’re pushing that data into your data lake. What does the admission control look like?

Tomer Shiran 00:18:30 There’s not one magic solution or ingredient for achieving or ensuring data quality, because at the end of the day it’s a business problem. It’s a problem that’s custom to your situation as a company, in terms of how clean the data sets are. And it’s really no different between data lakes and data warehouses. You have the same challenges regardless. But one big trend that has happened in the last few years is a shift from what was called ETL, or extract, transform, and load, which was how you got data into data warehouses, to ELT, which is extract, load, and transform, and which everybody’s doing now. So the order changed: the T, the transform, went from being in the middle to being at the end.

Tomer Shiran 00:19:21 The reason for that is that with ETL you had tools like Informatica, or even just scripts, that were doing the transformation as the data was being moved from the source system into the data warehouse or the data lake. That has all sorts of challenges. First of all, these ETL tools can’t scale anywhere near the same level as your data warehouses or data lakes, so as soon as data volumes got big, that typically didn’t work. Also, being able to reproduce things, go back to the raw data, change the transformation, or figure out what went wrong becomes harder. So for most companies now, the standard is ELT for any new use case.

Tomer Shiran 00:20:06 What you do there is take the data from the source system and pretty much dump it into something like S3. The great thing is that S3, ADLS, and all these cloud storage systems are so inexpensive and so easy that you don’t have to worry as much about the cost when you’re paying that little. You put the raw data that you took from the source system inside S3, and then you do the transformation and clean the data. You don’t necessarily give everybody access to that raw dump of your database. You do some transformations, you clean the data, maybe you apply some security policies, and then you expose that to the users.
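The load-then-transform order can be sketched with Python’s built-in `sqlite3` module standing in for the lake and its query engine. That substitution is an assumption made purely so the example runs anywhere, and the table, view, and column names are hypothetical. The raw rows land untouched first; the cleaning happens afterwards as a view over them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Extract + Load: land the source rows exactly as they came, as text.
conn.execute("CREATE TABLE raw_orders (id TEXT, amount TEXT, email TEXT)")
raw_rows = [
    ("1", " 19.99", "A@Example.com"),
    ("2", "n/a", "b@example.com"),   # bad amount, kept in the raw copy
    ("3", "5.00 ", "c@example.com"),
]
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_rows)

# Transform: done after loading. Fix types, drop unusable rows, and leave
# the email column out of what users are allowed to see.
conn.execute(
    """
    CREATE VIEW clean_orders AS
    SELECT CAST(id AS INTEGER) AS id,
           CAST(TRIM(amount) AS REAL) AS amount
    FROM raw_orders
    WHERE TRIM(amount) GLOB '[0-9]*'
    """
)

clean = conn.execute("SELECT id, amount FROM clean_orders ORDER BY id").fetchall()
raw_count = conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
```

Because the raw table is still there, the view can be dropped and redefined if the cleaning rule turns out to be wrong, which is the reproducibility advantage of ELT described above.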

Akshay Manchale 00:20:47 Right. So since you’re saying the transformation is actually going to happen towards the end, I assume the transactional nature of the data is not directly visible when you’re loading data into a data lake. You just dump it there, and later you figure out what has happened. Is that typically how you go about transactional data?

Tomer Shiran 00:20:47 Well, I think what you’re hitting on is data freshness: how fresh can the data be? If I’m taking data from a source system and putting it in a data lake, then no matter what, there’s going to be some latency when I take data from one place and put it in another place to be queried. But you do want to shrink that latency as much as possible. So there are newer approaches that a lot of companies are using now, including data pipeline technologies like Kafka, for use cases that require more real-time data.

Tomer Shiran 00:21:33 A lot of times the data is being ingested that way, as opposed to a batch process where batch jobs take chunks of data from a database and put them into the data lake. So it really depends on the need. What we’ve found, though, is that for the majority of use cases it’s less about having the last one minute or last five minutes of the data available for queries, especially if it’s a human use case where somebody’s going to look at the data for ad hoc analysis or maybe a business intelligence dashboard. What’s more important is that it’s not taking you weeks or months to get this to work and for any change to happen in the system. That’s probably the bigger problem. If everything were working and the only problem was that I can’t see the last five minutes of data, for most use cases that’s okay. Not always, obviously. If you’re trying to do fraud analysis or fraud detection, you may need the live, real-time data feed.

Akshay Manchale 00:22:32 Right. This is great for analytics. One thing is, now that you’re actually moving a lot of data from different sources into a data lake, what is compliance like? You might have GDPR requests that say, I want all of my data deleted, and now you have it in various forms and various places. So how do you conform to some of those policies?

Speaker 1 00:22:52 At O’Reilly, we know your tech teams need quick answers to their most urgent questions. They need to stay on top of new tech developments. They need a safe place to learn the technologies your company adopts, and they need it all 24/7. Well, they can get it all at oreilly.com. With O’Reilly online learning, your team gets live online courses, tons of resources, safe interactive scenarios and sandboxes, and fast answers to their most pressing questions. Visit oreilly.com and request a demo.

Tomer Shiran 00:23:22 From my perspective, a key to data governance and security is to minimize the amount of data copying that’s happening. If you think about the traditional architecture, even in the cloud, companies have their data either being generated directly on S3 or being moved into S3. Then they take a subset of that data and load it into a data warehouse for querying. Within the data warehouse, they create copies of that data, because different users and groups have different needs around how to view the data, and also for performance reasons: the raw data in the data warehouse is too big, so the queries are slow. So then you want to pre-aggregate or summarize it and create more filtered or aggregated views or tables of that data.

Tomer Shiran 00:24:03 And then the access often is still too slow, so additional copies get created through BI extracts. If you’re using tools like Tableau or Microsoft Power BI, you’re creating what are called extracts or imports into the BI tool. And if you’re a data scientist, you’re exporting data from the warehouse into a local file on your laptop so you can do data science. So you can see that in this whole pipeline there are many copies of data, and that makes it very hard to deal with things like GDPR. Somebody submits a request under the right to be forgotten, and they want you to delete all the records related to that person. Well, if that data has been copied all over the place by different users, it becomes basically impossible to really satisfy that. In many ways, the centralized data team doesn’t even know where the data is anymore.
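Why copies make a deletion request hard can be shown with a small Python sketch. Everything here is hypothetical: the dataset names, the `user_id` field, and the idea of a registry of known copies are assumptions made for illustration, not a description of any product.

```python
def erase_subject(datasets, subject_id):
    """Remove one person's rows from every dataset the data team knows about."""
    removed = {}
    for name, rows in datasets.items():
        kept = [r for r in rows if r["user_id"] != subject_id]
        removed[name] = len(rows) - len(kept)
        datasets[name] = kept
    return removed


source = [
    {"user_id": 1, "amount": 10},
    {"user_id": 2, "amount": 20},
    {"user_id": 1, "amount": 30},
]

# Copies the central team tracks (hypothetical names).
tracked = {
    "lake_raw": list(source),
    "warehouse_table": list(source),
    "bi_extract": list(source),
}

# A copy nobody registered: an export sitting on someone's laptop.
untracked_laptop_export = list(source)

removed = erase_subject(tracked, subject_id=1)
leftover = [r for r in untracked_laptop_export if r["user_id"] == 1]
```

The deletion has to be repeated once per copy, and the unregistered export still holds the person’s records afterwards, which is the situation described above.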

Tomer Shiran 00:24:47 So that’s been one of our focuses at Dremio. We call it the no-copy architecture, and the idea is to really minimize the need for data copies. You want one centralized system that can support data democratization and give people the performance they need, all the way from the ad hoc, experimental data exploration use cases to the sub-second BI workloads, where in the past you’d have to go through all that complexity to achieve those BI performance requirements.

Akshay Manchale 00:25:17 Yeah. I want to come back to Dremio and how you’re doing those no-copy analytics, but before we get there, can you talk about some of the common use cases for a data lake that you see in the industry today?

Tomer Shiran 00:25:31 Sure. I think data lakes are used horizontally, in really every industry. When I think about the companies that we deal with in the world of data lakes, it’s everything from the world’s largest banks to large insurance companies to tech startups. It’s really all over the place in terms of the industries and what they’re doing with the data. But the common thing is that the companies have a lot of data and are trying to get value out of it. They’re trying to process that data, query it and present it in dashboards, and do machine learning on it. So they have a variety of different workloads that they want to bring to that same data. That’s one of the advantages of the data lake: you have that same set of data.

Tomer Shiran 00:26:16 And this is why it’s so beneficial to have this architecture where the data is stored in open formats. The same set of data can be processed by all these different engines for different use cases. To give you an example, one of our customers is a tech company, and they’re optimizing their supply chain. They have a lot of data from all parts of the supply chain, and they need to be able to visualize it and present it to everyone, all the way from the folks responsible for understanding what’s happening in the supply chain and where the bottlenecks are, to the executives who are looking at the bigger picture: how are we doing, how many units have we moved, and so forth. Doing that requires you to aggregate the data and provide that big-picture view very quickly, because an executive who logs in to look at a dashboard needs a response in less than a second. It also requires the ability to drill down into the detail-level data, down to the individual records and what happened to this box, and so forth. So it goes all the way down to that. In many cases we’re talking about a variety of different use cases, all being executed on that same data.

Akshay Manchale 00:27:26 You know, there’s a lot of value to be extracted, and you could be in an organization that has a lot of data from various sources inside of your data lake. How do you go about exploring what’s actually available? Are there any exploratory tools to discover the kinds of data that you have that might lead to richer analytics?

Tomer Shiran 00:27:45 There actually are. And in fact, there are vendors for whom this is what they do. They’ve developed these data catalogs. See companies like Collibra, Alation, and others; this is exactly what they do. They try to surface the meta information, right? The descriptions of the datasets. They provide maybe a search bar that allows you to search if you know what you’re looking for: I’m looking for sales data, I’m looking for things like that. Then you put it in the search bar and it will surface tables that maybe have that column, or the columns can be described. And also, you know, you can have tags on these different things. So it gives you a level of data management and a catalog where you can explore what you have.

Tomer Shiran 00:28:27 And then I think for most companies, it comes down to a wiki, to be honest. I think at the end of the day, there’s a lot of tribal knowledge. This is a problem that has never been truly solved to the end. There’s a lot of tribal knowledge and a lot that you can learn from other people within the organization. And you see a lot of that obviously happening as well, because yeah, there are descriptions of data, but they get out of date as the data’s changing, and, you know, you don’t really understand maybe what exactly is stored in that table. So this is where it’s important for the company to have really good data management processes as well. It’s not just the tools.

Akshay Manchale 00:29:03 Okay. Now I want to switch gears a little bit and talk about Dremio, which is obviously the company that you’ve started. So what is Dremio and how does that fit into the ecosystem?

Tomer Shiran 00:29:13 Yeah, sure. So we describe Dremio as a data lake service, or a next-generation data lake service. And the idea really is to bring the capabilities that you would traditionally have in a data warehouse directly to the data lake and allow people to run SQL queries directly on their data in systems like S3 and ADLS. And so really what we do is we help companies take advantage of all that data they have in their data lake, not just for the data scientists that are doing ad hoc and exploratory work, but for the full spectrum of use cases that you can accomplish with SQL. So we’re the SQL engine for the data lake, and that ranges from people that are using Power BI and Tableau, who want that sub-second response time on their dashboards, all the way to people that are doing ad hoc queries, running one query at a time, you know, for many minutes, and exploring the results of those queries. So we’re the SQL layer for many of the world’s largest data lakes: many of the largest banks in the country, the largest tech companies, retailers, insurance companies. It really spans every vertical.

Akshay Manchale 00:30:18 So something that I read about when I was going through some of the talks was the virtual datasets that you can create. Can you tell us more about what virtual datasets are?

Tomer Shiran 00:30:28 Sure. So you could think of the value proposition of Dremio as having two key components. More than two, but let’s talk about two. One is that performance element, the fact that we can run queries and achieve the performance that typically you couldn’t get on a data lake historically. And then the second thing is what we call a semantic layer. And so the idea here is that in a company that has lots of users that want to consume data, and increasingly that’s really every company, because everybody wants to democratize data and put it in the hands of the business, you want to be able to make the data accessible and consumable and secure, right? And you don’t want to create lots of different copies for every single user in the company, right? So we have these virtual datasets where you can very easily create different views of the data.

Tomer Shiran 00:31:17 You know, it could be as simple as renaming a column of a dataset, all the way to, you know, masking a specific column that has PII, or cleaning the data, or joining it with something else. But you create these virtual datasets and that becomes that semantic layer. And then regardless of what tool you’re using to access the data, if it’s Tableau or Power BI, or it’s a Jupyter notebook if you’re a Python user, you’re basically seeing the same definitions, the same virtual datasets, regardless of how you’re accessing that data. And that makes it very, very easy for the data team, you know, the data engineers, to provision datasets for different users, right? They don’t have to go and create a separate copy, because creating the copy is easy.

Tomer Shiran 00:31:57 It’s keeping it in sync later that is the hard thing, right? As data’s constantly coming in 24 by seven, you know, once you start creating copies, you have to keep those data pipelines running and make sure you have many nines of availability on them, and so forth. It becomes very difficult. So we create these virtual datasets, and just creating virtual datasets alone is not enough, because sure, you can create views in a database, but then, you know, you query them and it’s really slow, right? Because a lot of compute is happening on the fly. And so we complement that with something we call data reflections, which is a way in which we materialize different views of the data. So you can imagine a dataset that maybe has billions or hundreds of billions of records, and you want to have different views of that data.

Tomer Shiran 00:32:43 Some are maybe aggregated already, maybe sorted in different ways, but you don’t want the users to have to know and guess which version of the data they should use. Should they use a pre-aggregated version because they only need aggregate-level data? Should they use this sorted version because they’re doing a filter on a specific column, right? The users of the data, the consumers of the data, they’re never going to understand things at that level, right? As we talked about earlier, just knowing which datasets you have in the company, that alone is hard enough, let alone, you know, if you had like 50 different representations of that dataset. So what we do at Dremio is, you know, the users operate in the logical world. They think about the datasets and the virtual datasets. And then our query optimizer basically gets the query, compiles it into a query plan, and rewrites that query plan internally to leverage these different data reflections automatically and transparently to the user. So you get that performance that you could get from highly optimized data structures without having to understand any of that.
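The virtual datasets described in this answer behave much like database views: a stored query that renames or masks columns without copying rows. The following is a minimal sketch of that idea using Python's built-in `sqlite3` module, not Dremio itself; the table name `customers_raw`, the view name `customers`, and the masking rule are all assumptions made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical raw table standing in for a dataset in the lake.
conn.execute("CREATE TABLE customers_raw (cust_nm TEXT, email TEXT, region TEXT)")
conn.executemany(
    "INSERT INTO customers_raw VALUES (?, ?, ?)",
    [("Ada", "ada@example.com", "EU"), ("Bob", "bob@example.com", "US")],
)

# The view stores only the query: it renames one column and masks the
# local part of the email address, and it holds no copy of the rows.
conn.execute(
    """
    CREATE VIEW customers AS
    SELECT cust_nm AS customer_name,
           '***' || substr(email, instr(email, '@')) AS email,
           region
    FROM customers_raw
    """
)


def read_view():
    """Return every row of the view, ordered by customer name."""
    return conn.execute(
        "SELECT customer_name, email, region FROM customers ORDER BY customer_name"
    ).fetchall()


before = read_view()
# New raw data shows up in the view immediately; there is no copy to sync.
conn.execute("INSERT INTO customers_raw VALUES ('Cy', 'cy@example.com', 'EU')")
after = read_view()

print(before)
print(after)
```

Because the view is evaluated at query time, the row inserted after the view was created appears in `after` without any pipeline having to refresh a copy, which is the point made above about copies being easy to create and hard to keep in sync.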

Akshay Manchale 00:33:45 When you say data reflections, is that something that’s actually materialized, or is that just a representation of sorts?

Tomer Shiran 00:33:52 So you have two different things. You have the virtual datasets, and those are kind of like views in a database, but, you know, we provide a much nicer interface for people to view them and interact with them. And then you have the data reflections, which are various materializations. And so you could think of a reflection as being like an index in a database, right? You know, an index, say, if you went to an Oracle database or a Postgres database and you create a table and you create an index on it. The index is a different data structure, right? Maybe it’s a B-tree that has, you know, a copy of that column, but sorted in a specific way and with pointers to the different records. Right? So an index is an example of a materialization. A cube in OLAP is an example of another kind of materialization, right?

Tomer Shiran 00:34:34 There are different kinds of materializations that you can have, and they’re all good for different workloads. But the idea is that if you can take that complexity away from the users and you can automatically take advantage of the existing materializations when somebody runs a query, well, then you can make that a reality, really, right? You don’t have all these problems that you had back in the day with all these OLAP cubes that people would have to build. And then the users would have to point their BI tool at that cube. And the user wanted something that wasn’t in the cube, and it took two months for the IT team to refresh that cube. All those problems basically go away.
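To make the idea of automatically picking an existing materialization concrete, here is a toy sketch in Python. It is not Dremio's optimizer: the `Reflection` class, the `pick_reflection` function, and the sample reflections are hypothetical, and the matching rule (a materialization qualifies if it contains every requested grouping column and aggregate) is a deliberate simplification that assumes additive measures such as sums.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Optional


@dataclass(frozen=True)
class Reflection:
    """A hypothetical pre-computed materialization of a dataset."""
    name: str
    dimensions: FrozenSet[str]  # columns the materialization is grouped by
    measures: FrozenSet[str]    # additive aggregates it already carries
    row_count: int              # rough size, used here as a cost proxy


def pick_reflection(
    group_by: Iterable[str],
    aggregates: Iterable[str],
    reflections: List[Reflection],
) -> Optional[Reflection]:
    """Return the smallest reflection that can answer the query, if any."""
    wanted_dims = set(group_by)
    wanted_measures = set(aggregates)
    candidates = [
        r for r in reflections
        if wanted_dims <= r.dimensions and wanted_measures <= r.measures
    ]
    # None means no materialization covers the query: scan the raw data.
    return min(candidates, key=lambda r: r.row_count, default=None)


REFLECTIONS = [
    Reflection("by_region_day", frozenset({"region", "day"}), frozenset({"sum_units"}), 10_000),
    Reflection("by_region", frozenset({"region"}), frozenset({"sum_units"}), 50),
]

coarse = pick_reflection(["region"], ["sum_units"], REFLECTIONS)
fine = pick_reflection(["region", "day"], ["sum_units"], REFLECTIONS)
uncovered = pick_reflection(["sku"], ["sum_units"], REFLECTIONS)

print(coarse.name if coarse else "raw scan")
print(fine.name if fine else "raw scan")
print(uncovered.name if uncovered else "raw scan")
```

The user only ever names the logical dataset and the columns they want; the choice between `by_region`, `by_region_day`, and a raw scan happens behind the scenes, which is the contrast drawn above with pointing a BI tool at one specific OLAP cube.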

Akshay Manchale 00:35:06 So it’s kind of like a self-service way of getting to what you want. And on the compute side of things, what other common tools or common processes exist? You know, you have your data lake, you have data coming in from various sources that you are storing. How else is it used? What sort of tools are used for what kind of applications?

Tomer Shiran 00:35:26 When you say tools, you mean in the data lake itself?

Akshay Manchale 00:35:30 I think things that make use of the data in our data lake. Yeah.

Tomer Shiran 00:35:34 Yeah. So most of the time, if you look at the companies that are using Dremio, they’re not using Dremio alone, right? I mean, part of the value of the data lake is that ability to use different compute engines, different tools on the same data, right? The data is stored in open-source formats, and you can use different tools on that. And in fact, Werner Vogels, the CTO of Amazon, wrote a nice blog post about why Amazon has standardized internally on data lakes, and he talks about the fact that you have this flexibility, but also that you have flexibility for the future, right? Because, you know, as we know, things evolve fast. There are new open source projects, there are new technologies that come up, and, you know, this allows you, tomorrow, to use a new type of engine on that data, right?

Tomer Shiran 00:36:15 Without having to do some kind of migration of your data. But today you have systems like Spark that people use on the data when it’s more about batch processing and machine learning. You have dedicated machine learning libraries like Dask. You have things like Flink for stream processing. You have Kafka, more for the ingestion pipeline. So there’s actually a large variety of tools that people use. As well, you have, for example, other SQL engines. You know, Dremio, which is the company that I founded five years ago, but you have other SQL engines. You have Hive or Presto, which some companies use. You know, Amazon has a serverless version of Presto. But those are more, I’d say, for the ad hoc and exploratory use cases only, right? They can’t provide the performance that you would need for lower-latency workloads and things like BI as well.

Akshay Manchale 00:37:11 Well, what about security? You know, you have different kinds of data now. How do these compute engines respect the access control or other security-related aspects of your underlying data?

Tomer Shiran 00:37:23 This is where it helps to have a metastore or a catalog for the data, which is respected by the different engines, right? So you have things like the Glue catalog that can provide permissions. And then actually there’s an open source project that we’ve recently created called Project Nessie, which you can think of as the next-generation metastore for data lakes, which not just provides the schema and the permissions, but also provides a Git-like experience for the data, where you can have version control and branching and cross-table transactions. And it really makes your data lake look like it’s a Git repository. And so all of these types of metastores are accessible by different engines, and for the most part they’re all open source. You can define the permissions in these systems, right? And also integrate with your existing identity provider for the user identities, right? Whether it’s Azure Active Directory or Okta or something like that.
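The Git-like experience mentioned here can be pictured as a catalog that maps each branch to a set of table pointers, so that several tables change together in one commit. The sketch below is a toy model in Python and does not use or imitate the actual Project Nessie API; the class `ToyCatalog`, its methods, and the snapshot identifiers are hypothetical, and the merge simply overwrites the target's pointers with no conflict detection.

```python
from typing import Dict


class ToyCatalog:
    """Toy model: branch name -> {table name -> snapshot id}."""

    def __init__(self) -> None:
        self.branches: Dict[str, Dict[str, str]] = {"main": {}}

    def create_branch(self, name: str, source: str = "main") -> None:
        # A branch starts as a cheap copy of pointers, not of data files.
        self.branches[name] = dict(self.branches[source])

    def commit(self, branch: str, changes: Dict[str, str]) -> None:
        # All table pointers in `changes` become visible together,
        # which is what a cross-table transaction amounts to here.
        updated = dict(self.branches[branch])
        updated.update(changes)
        self.branches[branch] = updated

    def merge(self, source: str, target: str = "main") -> None:
        # Simplification: source pointers win; real systems detect conflicts.
        self.commit(target, self.branches[source])


catalog = ToyCatalog()
catalog.commit("main", {"orders": "snap-1", "customers": "snap-1"})

# Work happens on an isolated branch; readers of main see nothing yet.
catalog.create_branch("etl")
catalog.commit("etl", {"orders": "snap-2", "customers": "snap-2"})
main_during_etl = dict(catalog.branches["main"])

# Publishing the branch moves both tables forward at once.
catalog.merge("etl")
main_after_merge = dict(catalog.branches["main"])

print(main_during_etl)
print(main_after_merge)
```

The design point is that readers of `main` never observe `orders` at the new snapshot while `customers` is still at the old one.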

Akshay Manchale 00:38:20 So you described the metastore aspect of it, right? So one of the things that I’m wondering is that you have data that’s coming in from various sources in different formats. Now, is enforcing some sort of meta schema an anti-pattern with respect to data lakes? Because you want this variety to be present, but you also want it to be present in a way that’s consumable.

Tomer Shiran 00:38:42 So at the end of the day, the fundamental building block for a data lake, for any system for analytics, really, is a table, right? And so that’s the level where you want to define the permissions, right? You want to say, well, this table, or perhaps these columns in this table, are accessible to these users, or these columns aren’t accessible to those users. And so I think that’s really important, right? Especially these days, where security is such an important thing for companies, especially when it comes to sensitive data and PII, you really have to have that, right? And with data lakes, maybe in their early days, 10 years ago, there were questions: oh, is this secure enough, and so forth? But you know, these days, all the Fortune 500, everybody’s running data lakes, and in many cases in the cloud. So all of these problems have been solved.
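Table-level and column-level permissions of the kind described above can be sketched as a lookup from role to table to allowed columns. This is a minimal Python illustration under assumed names: the `GRANTS` mapping, the roles, and the `allowed_columns` function are hypothetical and do not reflect any particular metastore's permission model.

```python
from typing import Dict, List, Set

# Hypothetical grants: role -> table -> columns that role may read.
GRANTS: Dict[str, Dict[str, Set[str]]] = {
    "analyst": {"customers": {"customer_id", "region"}},
    "support": {"customers": {"customer_id", "region", "email"}},
}


def allowed_columns(role: str, table: str, requested: List[str]) -> List[str]:
    """Return the requested columns, or raise if any of them is not granted."""
    granted = GRANTS.get(role, {}).get(table, set())
    denied = [column for column in requested if column not in granted]
    if denied:
        raise PermissionError(f"{role} may not read {table}: {', '.join(denied)}")
    return requested


print(allowed_columns("support", "customers", ["customer_id", "email"]))

try:
    allowed_columns("analyst", "customers", ["email"])
except PermissionError as error:
    print(error)
```

Because the check lives in one shared place rather than in each engine, every tool that goes through it sees the same rule for the PII column.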

Akshay Manchale 00:39:32 So one of the things that I’ve seen about data lakes is people refer to a data swamp of sorts, like, you know, your data lake turns into a data swamp. So how do you prevent that? Well, first of all, what’s a data swamp? At what point do you call it a data swamp? And how do you prevent that?

Tomer Shiran 00:39:46 I think this is a problem that you have with data lakes and data warehouses, actually, right? Because at the end of the day, if you don’t manage what’s in that environment, then people start creating all sorts of derived datasets. They don’t keep them up to date. Nobody knows what those datasets are. And that’s what I think people refer to when they say a data swamp. Right. It also sounds nice, right? Lake and a swamp. I think it’s fun to say that. But yeah, fundamentally I think that you have this problem regardless of what system you’re going to use. This is where I think we go back to that idea that if you can reduce the number of copies in the system, right? For example, with that semantic layer that we have with virtual datasets and reflections, you know, by not requiring you to create a different copy for every user that needs something a little bit different, and not requiring you to export data from your lake into the warehouse and into BI extracts and so forth. If you don’t have to create all these copies, then you’re going to have less of that problem. Right. You’re going to have an organized repository of datasets, really, that is up to date and well documented, that, you know, people can access and get value out of.

Akshay Manchale 00:40:56 I want to talk about the operational aspects. How do you see organizations running operations around data lakes? Normally, you would have, say, DBAs managing your database clusters maybe, and you have another set of operational engineers who are managing your analytics side of things, right? So how is the operational transformation happening at companies in order to manage their data as a data lake?

Tomer Shiran 00:41:20 I think there is typically, you know, a data team in most companies that’s responsible for the data lake. Often it’s the same team that’s responsible for the data warehouses the company has. So, you know, by and large, they’re the ones doing that. What is happening, I think, and this is especially driven by the rise of the public cloud and the ease of use that now comes with, you know, systems like, uh, and other SaaS services in this space, is that it’s just becoming easier and easier to utilize these different services on the data, right? So you don’t have that same complexity that you used to have with, you know, big monolithic Hadoop clusters, where you have 30 different open source projects running on the same set of hardware. And, you know, you’re constantly having to tweak and deal with memory settings and upgrades of different software components, and keeping them compatible with each other. You know, those types of problems go away when you’re running in the public cloud, because you have this separation of data and compute, right?

Tomer Shiran 00:42:16 The data is stored on S3 in open formats. So yeah, you’ve got to manage the data, of course; that never goes away. But you have to worry less about the compute, right? So to give you an example, in our AWS edition, we automatically spin up the engines. So you can define different, what we call engines, for different groups within the company: you know, the marketing group, and maybe the executive dashboards have their own engine, and maybe that’s a medium-sized engine and the other one’s an extra-large engine, but they’re all running independently. And they all spin up and spin down dynamically based on the workload. And so they’re not interfering with each other. You’re not having to deal with all this fine-grained tuning of resources between these different workloads and worrying about, well, did the data science intern run a big, massive job that’s going to impact, you know, the CEO’s dashboard, right? So it’s becoming a lot easier. And I’d say the amount of data engineering work, or operational work, that has to go into managing these things is on the decline.

Akshay Manchale 00:43:17 Is that because you have a higher throughput from the storage side, so that you can just spin up compute resources as necessary, especially in the public cloud?

Tomer Shiran 00:43:25 Well, I think it starts with the fact that, you know, you have somebody who’s willing to rent you servers by the second, right? And that’s, you know, Amazon and Microsoft. Whereas on-prem, you basically bought servers and it took you three months to get the boxes from, you know, somebody like Dell or HP, and you had to rack and stack them. And so you didn’t have much elasticity there, right? You couldn’t just say, hey, I need, you know, 10 servers for the next 10 minutes for this workload. And then, you know, here, take it back. I don’t want to pay for the other 23 hours and 50 minutes of the day. Right. So it starts there, just having the ability to rent compute infrastructure, but then it also comes with the fact that you have that separate storage service that is accessible to all these different engines.

Tomer Shiran 00:44:06 Right? So, you know, even if you look at a Dremio environment that has multiple engines for different workloads, they all have access to that same S3 storage system and the ability to get decent throughput there. But also, you know, when you look at some of the more sophisticated systems like Dremio, we have built caching technologies, right? So a Dremio engine doesn’t have to go to S3 to get the same data again and again and again. We actually leverage the fact that Amazon EC2 instances have, you know, very fast SSDs or NVMes on them. And we can leverage these NVMes in a distributed way to cache the data locally, close to the compute, without the user even knowing that’s happening, right? Because the NVMes, they just come with the box. And so from the user’s standpoint, they’re just getting faster I/O than they’d get if we had to go to S3 every time. They’re getting lower latency and also lower costs, because you actually end up paying less for, like, S3 GETs and PUTs.
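The effect of a local cache on object-store requests can be shown with a small least-recently-used cache in Python. This is only a sketch of the general caching idea, not Dremio's implementation: `CountingStore` is a hypothetical in-memory stand-in for S3 that counts GET requests, and `LocalCache` stands in for data kept on a node's local NVMe drive.

```python
from collections import OrderedDict
from typing import Dict


class CountingStore:
    """Hypothetical stand-in for an object store; counts GET requests."""

    def __init__(self, objects: Dict[str, bytes]) -> None:
        self.objects = objects
        self.get_requests = 0

    def get(self, key: str) -> bytes:
        self.get_requests += 1
        return self.objects[key]


class LocalCache:
    """Least-recently-used cache placed in front of the store."""

    def __init__(self, store: CountingStore, capacity: int) -> None:
        self.store = store
        self.capacity = capacity
        self.entries: "OrderedDict[str, bytes]" = OrderedDict()

    def read(self, key: str) -> bytes:
        if key in self.entries:
            # Cache hit: served locally, no request to the store.
            self.entries.move_to_end(key)
            return self.entries[key]
        data = self.store.get(key)
        self.entries[key] = data
        if len(self.entries) > self.capacity:
            # Evict the least recently used entry.
            self.entries.popitem(last=False)
        return data


store = CountingStore({"part-0.parquet": b"aaaa", "part-1.parquet": b"bbbb"})
cache = LocalCache(store, capacity=2)

# Four reads, but only two distinct objects.
for key in ["part-0.parquet", "part-0.parquet", "part-1.parquet", "part-0.parquet"]:
    cache.read(key)

print(store.get_requests)
```

Only the first read of each object reaches the store, which is where both the latency and the per-request cost savings mentioned above would come from.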

Akshay Manchale 00:45:02 So I want to start wrapping things up. And what are some big obstacles and challenges that you see in the whole data management space going forward?

Tomer Shiran 00:45:12 So the challenge and opportunity for data lakes is really to continue to push the envelope in terms of what’s possible from a data management standpoint. And so historically you had a trade-off between data lakes and data warehouses. You know, as a company that was building your data infrastructure, you could get the scalability, the low cost, the flexibility, you know, the openness of a data lake, or you could get the all-in-one functionality, the transactions, the data mutations, time travel, that you get with a data warehouse. And that came at a very high cost and lack of flexibility and so forth. What’s happening now in the world of data lakes is that companies like Dremio are basically taking that trade-off and eliminating it, right? Basically bringing the functionality of the data warehouse to the data lake.

Tomer Shiran 00:46:05 So that ability to have, you know, transactions and record-level mutations, you know, DML, the ability to have time travel, to query the data as it was yesterday at 9:00 AM, all of that within the data lake, so that you no longer have to move your data out of the data lake into a data warehouse for any kind of use case. And so I think if you fast forward a year or two from now, basically data lakes will be solving all these use cases, and there will be less and less of a need for a data warehouse out there. I think that’s the big opportunity. And then you see things that are beyond that, of course, with some of the newer innovations, things like Project Nessie, which are providing that Git-like experience for the data lake. I think that takes it to the next level and enables things that really weren’t possible before across any system.
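Time travel, in the sense of querying the data as it was at some earlier moment, can be modeled as keeping timestamped snapshots and looking up the latest one at or before the requested time. The Python sketch below illustrates only that lookup; the `VersionedTable` class and its methods are hypothetical, it assumes commits arrive in chronological order, and real table formats track snapshots of data files rather than full copies of rows.

```python
from bisect import bisect_right
from datetime import datetime
from typing import List


class VersionedTable:
    """Toy table that keeps every committed snapshot with its timestamp."""

    def __init__(self) -> None:
        self._times: List[datetime] = []
        self._snapshots: List[List[dict]] = []

    def commit(self, when: datetime, rows: List[dict]) -> None:
        # Assumption: commits are appended in chronological order.
        self._times.append(when)
        self._snapshots.append(list(rows))

    def as_of(self, when: datetime) -> List[dict]:
        """Return the rows as they were at the given moment."""
        index = bisect_right(self._times, when)
        if index == 0:
            raise LookupError("no snapshot at or before that time")
        return self._snapshots[index - 1]


orders = VersionedTable()
orders.commit(datetime(2021, 1, 1, 8, 0), [{"id": 1}])
orders.commit(datetime(2021, 1, 1, 17, 0), [{"id": 1}, {"id": 2}])

# Query the table as it was at 9:00 AM, before the second commit.
morning = orders.as_of(datetime(2021, 1, 1, 9, 0))
evening = orders.as_of(datetime(2021, 1, 1, 18, 0))

print(morning)
print(evening)
```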

Akshay Manchale 00:46:55 Yeah. That’s definitely way exciting. All right. So thank you so much for coming on the show. This is Akshay Manchale for Software Engineering Radio.

Tomer Shiran 00:47:03 Yeah. Thank you. It’s been great to be here.
Outro 00:47:36 Thanks for listening to SE Radio, an educational program brought to you by IEEE Software magazine. For more about the podcast, including other episodes, visit our [email protected]. To provide feedback, you can comment on each episode on the website or reach us on LinkedIn, Facebook, Twitter, or through our Slack [email protected]. You can also email [email protected]. This and all other episodes of SE Radio are licensed under Creative Commons license 2.5. Thanks for listening.

[End of Audio]

SE Radio theme: “Broken Reality” by Kevin MacLeod (incompetech.com — Licensed under Creative Commons: By Attribution 3.0)
