SE Radio 454: Thomas Richter on Postgres as an OLAP database

Thomas Richter, founder of Swarm64, discusses the internals of the Postgres open source database project. Host Adam Conrad spoke with Richter about an introduction to relational databases, comparisons of OLAP and OLTP structures, performance considerations, and advanced features of the Postgres architecture.

This episode sponsored by O’Reilly.

Show Notes

Transcript

Transcript brought to you by IEEE Software
This transcript was automatically generated. To suggest improvements in the text, please contact [email protected].

Intro 00:00:00 This is software engineering radio, the podcast for professional developers on the [email protected] se radio is brought to you by the computer society. I is your belief software magazine online at computer.org/software.

Adam Conradr 00:00:16 Keeping your teams on top of the latest tech developments is a monumental challenge. Helping them get answers to urgent problems they face daily is even harder. That’s why 66% of all fortune 100 companies count on O’Reilly online. Learning at O’Reilly. Your teams will get live courses, tons of resources, interactive scenarios, and sandboxes, and fast answers to their pressing questions. See what O’Reilly online learning can do for your teams. Visit O’Riley dot com for a demo. Thomas Richter is the founder and CEO of swarm 64, a company focused around Postgres performance extensions prior to swarm 64. Thomas held executive roles at multiple companies and startups. Thomas, thanks for coming on to software engineering radio today. Glad to be here. So I’m actually really excited about this topic because I studied databases as my specialty in college and in my master’s degree. So I’m really excited to talk today about Postgres OLAP versus LLTP databases. Um, and just really get to know more about the internals about Postgres database software. So let’s start all the way back with the basics. What is Postgres?

Thomas Richter 00:01:22 It’s a project out of Berkeley. Mike Stonebraker, you know, big figure in database design named arrives from post ingress. So I generally found it always a tough to pronounce name. That’s probably its origins. And then at the point where sequel was added as a Curry language, it got put into this, this portmanteau, this, this word assembly that was Posterous SQL. But to be honest, I personally never pronounced the SQL part because that’s very difficult to say great open source database. It was always built with the intention of being multi version concurrent. We’ll probably go into that a little more in a later on, but it’s really this idea that you have true transaction safety and you have many, many individual writers and readers making modifications or retrieving data, but they all get a consistent view. And that’s a big thing and that’s very valuable. It’s gone through many iterations and recently it has really been given a lot of performance features, a lot of enterprise great features. So it is really trending very fast. And yeah, that’s, I think an unexpected outcome because it has so many features. It is, as people argue the most fully featured open source database, extremely friendly license combined with a great feature set means people use it everywhere. And for all sorts of applications.

Adam Conradr 00:02:48 Yeah, we use it at work. It’s a very popular database that we use, basically all the companies I’ve worked at. And if you’re interested in learning more about relational databases, you can listen to episode 1 0 2. And we actually had Mike Stonebraker for an episode himself episode 1 99, which check out as well. One of the things I hinted at at the beginning was this is an example of an OLAP database versus an OLTC database. Can you just give a brief introduction on what do those two things mean and what are examples of each?

Thomas Richter 00:03:15 Yeah, so online transaction processing or OTP January 1st to many, many small modifications. So to give a practical example, let’s say you have an app and the app is running on front end servers that take care of all the things that you want to interact with it, with your customers. And at the same time you want all the customers state, as in the current, for example, screen, they’re on the current interaction they’re doing with your application. You want all that to be kept in a central consistent repository. And that would be where your database sits. So all your customers will do small interactions. They will read as an retrieve data, find out certain things, or they will write as, and maybe modify values and all that is kept in the database. So your front ends that are actually powering. The app can actually be running independently and can be stainless.

Thomas Richter 00:04:09 And your database is where all the data sits. So for that kind of workload, very classical workloads systems of engagement type of workload, it is necessary that you can take many, many small modifications all the time. Another example is, think about an accounting system where you have bookings coming in, bookings, flowing out. You always want things to kind of transact in a very consistent way. As in the booking comes in and the money flows out that should be locked in the same transaction. So that’s your old TP, your online transaction processing. And then on the other side, you have online analytics processing. And that is when you’re looking into, okay, what is happening on a grand scale? So let’s again, take the, the application example. So as I said to an app, maybe it’s in a browser, maybe it is in a mobile phone app, but you, as the business owner running this application are actually interested in what’s happening globally.

Thomas Richter 00:05:04 Are people engaging with me? Are they doing things? Are they interacting? What are they doing? This is really where your analytics part comes in same example with the accounting system. So you have all your transactions and you have locked everything, and you’ve made sure that you paid all your bills, but no, you want to see the state of the business. Again, it’s an analytics query where looking at, okay, which customers did I earn most with? Which customers have paid their bills late, these kinds of questions you’re asking. So it’s really the difference between the day-to-day operational part and the insights that you need to run the operations better. And so the first is the old TP and the second is more in the OLAP. Me.
Adam Conradr 00:05:43 Got it. So for folks listening today, I mean, they’re going to be doing both reads and writes whether they’re executing transactions or they’re running query plans on something. A Postgres is an example of an old AP database. Correct? Well,
Thomas Richter 00:05:57 Traditionally it is very strong in OTP that’s, that’s post-Christmas claim to fame, however, and this is something that’s warm, 64 is, is, is claiming. And I think you see this more and more in the market. It is starting to become a really strong choice for oil AP as well. And it wasn’t traditionally. And some of the things why this is so interesting to move posters into that domain is part of our talk today.

Adam Conradr 00:06:21 Got it. So what, you know, just at a very high level, what facets of Postgres make it a good choice as an OAP database?

Thomas Richter 00:06:30 Yeah, so I, I mean, again, the OTP part is kind of obvious. That’s all the ability of Postgres to run many, many tiny transactions in parallel, et cetera, but what is increasingly making posters, a powerful choice for all AP is parallelism that was added as a version 9.6. It is the version 9.6 is a few years back. So now Postgres has had parallelism within the same, query it since a few years and it’s getting stronger and stronger. So you can ask really complex things and use many, many hardware resources in your server simultaneously in order to answer these complex questions, then of course the, the full feature set is really important when you’re trying to ask complex questions. That is good. If you have the full sequel language specification available. I mean, apart from some exotic corner cases, Postgres is fairly complete when it comes to the SQL implementation.

Thomas Richter 00:07:25 And finally, the other thing that makes posters really powerful is that it is, has such a permissive license and it’s open source support by big community. So it can actually fit into a lot of different business models and a lot of different environments. So that’s also really, really powerful. So when you’re thinking, for example, about including it into your microservices, or if you’re thinking about running this as a decentralized analytics server and a lot of places, all these things are possible thanks to licensing and free open source abilities. So you can also compile it for a range of platforms, which is also thanks to the open nature of the code makes it extremely versatile. So with

Adam Conradr 00:08:10 The utility, let’s actually go back to the basics. So starting with the very basics of the Postgres internals, how are records stored in the database?

Thomas Richter 00:08:18 So Postgris is a classic example of a row store. And again, just want to connect that to the baseline we’ve talked about regarding LTP and all AP. So row store, if you think of a database as a collection of Excel sheets, which is an oversimplification, but it may be useful here. So you have a collection of tables that are really just rows and columns. And usually what happens when you update something is let’s say you update a customer’s address. You would not just update the street name. You would likely update the streets, that number of the house, probably the state, the zip code, all these things would be changed. You have many more modifications and they tend to be all in the same row. So this spread the, the row stall concept, uh, can play twist strength. Any updates usually tends to modify multiple fields in the same record that is all stored in the same role.

Thomas Richter 00:09:16 Now pollsters has a row store and has these benefits. And that’s why it is so suitable for things like systems of engagements. With, as I said, an app where many people change things, or they have the example of a accounting system, typically what you would do in analytical databases to see what store the data and columns and are going to go back to my address example. Now you may not be so interested in the individual address, but you may want to aggregate your, for example, the products or services you sold by sip code or by zip code area. And this is where then when you have a column store, as in every column is stored by itself. So you’ve got one column with all the street names, one called them with all the house numbers, but most importantly, one column with all the zip codes, then of course analyzing it this way, round is much more efficient for asking for insights.

Thomas Richter 00:10:08 So you’re only analyzing for example, the zip codes, and then you’re correlating them probably to how much we’ll spend and the customer numbers, et cetera. So a lot of your data you don’t actually need. And then it is extremely efficient to just put those different columns together and thereby crunch a lot less data in the first place. And this is why most analytical databases, our column store, you may have perceived already the dichotomy here, right? Do you all got them the one side row store perfect for transactions. So all TP and the column store part perfect for analytics or AP and they don’t quite mix. And this is one of the things that’s these four has, for example addressed and how you can get all the row store benefits and add the column store benefits on top without destroying the first,

Adam Conradr 00:11:01 Usually different implementations. Like I know for example, Stonebraker also has done Vertica, which was a, which was a column store database. Monet’s another example of a column store. So yeah, I’m very curious to hear how you can take a row store database like Postgres. My SQL is another example which we’ll get into later and still gather the performance benefits of a column store database when doing analytical work. So just briefly, before we get into more of the advanced features, what are you able to achieve there to get that additional performance

Thomas Richter 00:11:30 Challenge starts? You want to retain rose store and you still want to have a column based view in order to be able to run an analytics faster, and the way we’ve decided to do this. And there’s a few databases in the world that are as advanced and have kind of established similar concepts is we call it a column store index. So we have a secondary view of the data that is columnar. Now people may say a wow, but that means I’m storing my data twice and it’s going to be very expensive, do your modifications, et cetera. But these were really the challenges we were from day one, attempting to solve and have solved successfully after quite a bit of engineering time running into this. So what’s happening basically is you have your, your rows stored by post-stress and then you instantiate a second theory storage view, and that’s basically your column store index, but that one is actually updated a bit smartly.

Thomas Richter 00:12:23 So it is basically not always needing to be really up to date. There’s a very clever algorithm to actually allow to retrieve data from two sources. So the column store index can actually just tail the main system a little bit, and that’s, that’s part of the feature set. And also it is extremely heavily compressed. So one of the compression algorithms that we give users, the choice is, is that standard, which is by Yann collards, Facebook, for example, one of his employers. And it is a very, very compute efficient algorithm that still manages to squeeze data together very efficiently. And those kinds of compression algorithms were playing on top. So in other words, you can have your entire table stored in a column store index as well, but that entire index is not going to be much larger than for example, the one of the three or four indexes you already have on your table. For example, you may have a index on your primary key in your table in order to be able to find that quickly. Well, if you add a column store index for the entire table, the additional storage is likely to be just the same as that one index. So in other words, you are gaining all the benefits of analytics processing at a very negligible, additional quantity of storage, maybe 10%, maybe 20% in extreme cases,

Adam Conradr 00:13:46 Right? Because in the case of the databases time is everything really space is that cheap commodity trade off that you can make. Do you have to make recommendations why I’ve got a couple of questions here? Cause you, you said some really interesting points, but the first one I’m just to kind of close out this thread is do you have to make recommendations for additional space allocations for folks when they’re partitioning database storage, given how much extra they might need for these sort of columnar indexes,

Thomas Richter 00:14:11 We are normally below the threshold that people anywhere use to make sure that the database operates well. So you will always, you always want to have some headroom in order to make sure your operational workings of your database are fine. And basically we’re usually way below the threshold. So normally, like we have not seen it yet that we need to, we need to give anyone kind of specific troubleshooting guides or something to address that, that, that has never been the case. But of course, people, especially people using our software longer, they’re getting more and more proficient using it. And of course they are very aware of what they’re provisioning and then how they’re setting up the system. So it’s basically you’re paying, if you to see the kind of trade-off you’re paying about 10, 20% in storage at most versus a 20 X improvement on average on your reporting queries.

Adam Conradr 00:15:03 Yeah. That’s a very easy ROI, right? A very

Thomas Richter 00:15:05 Easy calculation to make. Yeah. So

Adam Conradr 00:15:08 Are really interesting point. Now I’m going to ask two questions here they’re related and you’ll understand why I’m asking the first one. So in common nomenclature for databases, one of the standards that we look for besides hearing to the SQL standard is the acid standard acid compliance is really important, especially for transactional databases. So first I’d like you to find what acid is and why we care about it in databases. The second is, I’m very curious when you talked about the column indexes and then not needing to be updated as frequently. And it got me thinking about, oh, is this eventual consistency? And consistency is the sea and acids sort of as a hint. Um, so I’m curious, how are you able to keep Postgres acid compliance and still have these column indexes that may not necessarily be updated as persistently as other parts of Postgres?

Thomas Richter 00:16:01 Yeah. So let’s first talk a bit about, uh, acid as in the backronym so to say, I’m really sure they reordered the individual components to make sure it’s spelled something. So it’s about ethnicity, consistency, isolation, and durability. And it refers to a few concepts that are especially relevant. When you think back to the beginning of the show where we talked about transaction processing. So atomicity means many, many individual readers or writers are doing things, but they’re all doing things only affecting individually their own environment until certain things happen. For example, the transaction is committed and then that co the end transaction is committed as a whole, as in, there’s never a state in between there’s either it’s not there or it’s there. So it’s really a set of small atomic operations. Consistency means of course, that there’s a defined view into the data that has basically a tiers to certain rules.

Thomas Richter 00:16:59 The isolation means that things are basically done in these transactions. And if you, for example, want to take them out, want to take, for example, roll back a transaction. You can do that. So all the individual pieces are in themselves, isolated from the others, so they don’t need to. And then finally the durability part is of course, that everything that happens and has been properly committed transactions and flight may get lost. But the moment you’ve got your confirmation, yes, it has worked. It has worked and it hasn’t been made doable. Yeah. So that is the base, this asset standard. And that has a few benefits. I mean, on the one side it’s of course, absolutely critical for systems that are like, have any mission critical component, like you would never want in your accounting system, let’s say $400,000 bill to be marked as paid, but the money hasn’t actually flown, right?

Thomas Richter 00:17:50 That that would be a big problem. Because then one day you actually find out your business’s either $400,000 short, or your businesses as actually has $400,000 too much of which you then have to search where the hell that came from. So in those kinds of systems is absolutely relevant, but it is also extremely relevant in systems that are gathering insights. And there, you could say, well, you know, I’m a, I don’t know, I’m a hundred million dollar company. It doesn’t make a fundamental difference if there’s $400,000 more or less, whereas that may be true. It is a nightmare to debug. So you have people like analysts or developers that are building tools for insight gathering, trying to make sense of the world. And suddenly the result is off. And they don’t know is the result of because they made a mistake or it’s the result of because your database has not reached a consistent state as in your database has been eventually consistent.

Thomas Richter 00:18:42 As in, at some point it’s consistent, but you’re not guaranteed. That’s kind of the flip side of acid. That’s on the one side you’ve got acid, which is a very, very strict consistency guarantee. On the other side, you’ve got things like eventual consistency, where your database is basically, it will eventually come to a consistent view, but there may be a transitional period, usually of a fairly undefined length of time during which you may still get answers that are partially old or fully old. That’s the bigger context. Now, when we designed the column store index, it was an objective that this column store index was strictly adhere to the assets principles, because that is one of the key strengths of Postgres. And I’ll come to later on to the point of why asset consistency can be very interesting also in analytics processing, apart from the debugging I’ve already mentioned.

Thomas Richter 00:19:31 So the challenge is, however, if you start to keep your index updated in every small transaction, you will really tail behind what your system is capable of doing. So, in other words, you will be slowing down your day-to-day operations because you are doing more work than you used to do in the past. So you’re putting a huge burden on the system. So the second design constraints, so to say the first design constraint was attached to be asset. The second same constraint has, has to be really clever in terms of how it updates without going too much into the logic, which is quite advanced. Basically what it does is it knows what is up to date and it knows what is not up to date. And then it knows where to get the non up-to-date stuff through a second channel. So it’s a, basically a cash and validation strategy.

Thomas Richter 00:20:18 It’s exactly cash and validation strategy and that differential buffer type logic. So you see this in a couple of databases, to be honest, all database of that kind have these type of things where you basically have buffers that are giving you information about data that is not yet stored in your primary column story index. Yeah. So you can basically ask a second source for the things that have been validated since the performance is really in how well that is implemented. And this is why we are so proud about this, because this is something that very, very few proprietary databases have and now swamp 64 is bringing it to open-source world.

Adam Conradr 00:20:59 And now you can just basically bring that on as an extension. That’s essentially what 64 is, right? It’s an extension on Postgres.

Thomas Richter 00:21:05 Exactly. Yes. That’s it got a little too excited there. Swamp 64, we are seeing ourselves as the performance innovators in the Postgres space. Our primary goal is really to make posters more performance and easier to use in the sense that, you know, sometimes if you ask your data a certain Curry pattern, the database would really struggle. And then traditionally you would have to just rewrite the query manually and you would have to then afterwards, make sure that it’s actually really the same question you’re asking and that you didn’t accidentally introduce a problem or a mistake. And we are basically trying to do away with all of that. And we are bringing performance features into posters that aren’t there yet. We are strictly adhering to all the mechanisms that post was already offers for extensibility. So we are not forking. Also some of the things that we’re doing are contributed as open source.

Thomas Richter 00:21:56 We have got some patches waiting for the Postgres community to accept. They have also accepted our first patch about a year ago. So this is one area where we’re doing it. But of course also the swamp 64 extension is available as a proprietary extension that you can load into your Postgres directly. And then you will get all these features. And one of those features is now suddenly you have the column store index available. You can then define it. And we do these things that I’ve described as extensions to what Postgres already doing no need to rewrite any query, no need to make any changes. The answers you get are exactly equal. It’s hearing to all the principals, asset principles, certain expectations you have around data consistency, et cetera, that’s all served, but the answers would come a lot faster, especially with the reporting queries where you would see on average, something like 20 X faster reporting queries got.

Adam Conradr 00:22:55 So you can keep the exact same SQL that you’ve written before. It just, I’m assuming you now have an extended grammar. So there are additional keywords that you’ve provided for folks who want to Nope.

Thomas Richter 00:23:04 Nope. We actually made sure. Yeah, we actually made sure that this is exactly the same grammar that we don’t need any hinting to be introduced into the queries, et cetera, where everything we do is we are exposing additional features and we’ll, we’ll come to that in a moment. We’ll expose additional features to the Postgres planner. So the planner can choose, oh, it is smarter to do this. It’s a little bit like, do you want to go? Let’s say you have a really long drive to work and you’ve got multiple cars to choose from. Right. And you’ve got your model T there, you know, from the 19 1900 odds. Yeah. And you’ve got your, your, I don’t know, your Volkswagen beetle and you’ve got your Lamborghini and then basically your career planner choose the respective car based on what you’re trying to do. Yes.

Adam Conradr 00:23:48 It’s gonna be tough for me not to want to choose the Lamborghini, but I guess if you want to flex that 19 hundreds style, you could go with the model T sometimes which so getting back to the point here, I think you’re raising a really good point, which is something I wanted to talk about, which is the query planner pulling back to the basics of a database, especially when doing analytics, you need a query planner in order to figure out, okay, how do I actually go and retrieve those records? So it sort of my, my first two basic questions here are what is a query planner. And also what is an index? We talked about index a few times and I realized we didn’t actually define it. So if you want to just give some basic definitions of these two things, I think that’ll help our listeners understand a bit more about what we’re talking about here with, with query plans and endemic indexes. So,

Thomas Richter 00:24:32 Uh, Postgres is using a so-called cost-based career planning approach. That means it is trying to make an estimate of every step and how much that is going to cost Postgres to execute. So for example, let’s say let’s make it very simple example. I’m trying to find all the customers that are living in the UK. So then Postgres would look at it and we’ll say, okay, you’re actually only asking two tables. You asking the customer table and you’re asking the table with the addresses. So that’s a starting point. And then in the address table, I know for my statistics, that may be around 20% of the addresses on the UK. And then I make the estimate and that in this case, it’s probably fairly close to correct that about as a result of that, about 20% of the customers are going to be in the UK.

Thomas Richter 00:25:25 So then Postgres has a pretty good idea that it needs to join about 20% of the customer table with 20% of the address table, in order to then tell you all your customers that are living in the UK. And that is basically a very small example of a career plan. And what the career planner then does is it says, okay, how could I find that out? Well, I could use a look-up structure as an index, for example, an index to find all my addresses that are in the UK and an index is really quite similar to an index in the book, right? It’s you’re looking up a specific value and then it will point to where you can find that value. So if your food, for example, look up UK in your index and you have made an index on the customer’s country, it would then tell you, well, you can find UK in this row, in this row, in Switzerland, this, from this one, same with the customers.

Thomas Richter 00:26:20 Yeah. You could have an index for example, on the customer number. So when you’re trying to find exactly customer number 1,477, it would point you exactly where that data would be stored now, indexes and career planning play together in the sense that we’ve got our little query saying, okay, how many percent of your customers are in the UK? Tell me all the customers that are from the UK, we already found out that’s about 20% of the customers. So then Postgres coast to coast based career planner will make an assessment. It will say, do I want to look up 20% of all my data from an index? That means I’m going to follow the index and I’m going to look up every single value individually. Well, that sounds like a lot of work. And in reality, that’s actually true because Postgres would go via this indirect structure.

Thomas Richter 00:27:11 Let’s say your table was just to make it simple. In terms of math, your table was a hundred million rows. Now it will have to go 180. It will have to go 20 million times because 20% of your customers are from the UK. So we’ll have to go 20 million times to your index. Find a single value follow. The index reversely Postgres were determined with is cost-based career planning. No, it’s actually much, much cheaper to just read the entire table and forget while I’m reading everything. That’s not UK. So here we’ve got a good example of cost-based career planning that on the one side you say, okay, how much would it cost me to look up 20 million values out of a hundred million values with 20 million individual look-ups or how much does it cost me to scan a hundred million row table once? And this is really what the career planner does.

Thomas Richter 00:28:00 And as you can probably appreciate is it can get so complicated so quickly. And the strength of this is actually that there’s no guessing involved. It is basically a execution model. You’re building a model of the query you’re spending sometimes even milliseconds on figuring out what’s the best way to do it. And only then will you do it? That’s really coming back to that old paradigm rights people are saying, well, now rough planning can save eight hours of work and Postgres spending a few milliseconds. Sometimes in career planning can often save minutes or hours of CRE execution time

Adam Conradr 00:28:35 At O’Reilly. We know your tech teams need quick answers to their most urgent questions. They need to stay on top of new tech developments. They need a safe place to learn the technologies, your company adopts, and they need it all 24 7. Well, they can get it all at O’Reilly dot com with O’Reilly online learning, your team gets live online courses, tons of resources, safe, interactive scenarios, and sandboxes and fast answers to their most pressing questions, visit O’Reilly dot com and request a demo.

Adam Conradr 00:29:06 So the really interesting point you brought up, and I’m just trying to understand if this was part of the example or if this is actually happening under the hood, you had mentioned in the example that 20% of the customers were in the UK. So is Postgres analyzing that say in between transactions, when you, when you write a new record, it’s now saying, okay, let me do a little bit of math to figure out what, where is this slotting into the grand scheme of this table? Or is that something you just, you know, from your own knowledge of your customer base, that it’s about 20%? Like, how is that math working under the hood?

Thomas Richter 00:29:35 Yeah. Postgres keeps statistics and it’s pretty good at that. It’s not perfect. I mean, there’s a lot of community movement in making that even better, but yeah, statistics are absolutely essential for doing good play career planning because otherwise you would just be guessing. So great question. These 20% would be something that posters would know. And if you look at what table statistics actually stored, it’s typically things like, what are the most common values? How often do they occur? They could be ranked. And of course it’s all probabilistic, right? So it does not like Postgres keeps statistics always up to date to the last minute. But if it cast statistics that are only a few hours old or a few days old, it will be able to do a pretty good estimate. And that’s all it needs to make sure or to make a good estimate between when the way better scan the entire table.

Thomas Richter 00:30:24 Or when do I just look up a few values and index. And by the way, this is a very typical problem that sometimes it’s this cheaper together, new statistics than to execute the query wrong. And that’s one of the features that swamps these forests also brought into the mix. We’ve got a little function that basically determines the cost of running the query, potentially wrongly, as opposed to updating the statistics. And if it finds out that it’s better to just quickly update the statistics, it will actually gather and use statistics during the career planning stage. That is a parameter that can be toggled on that. Some customers find tremendously useful

Adam Conradr 00:30:59 When our statistics normally run. That’s interesting that it’s not normally run as part of the, as part of the plan strategy. So when are they normally it’s part

Thomas Richter 00:31:07 Of the cleanup strategy. So your, your vacuum analyze would, for example, be the keyword that’s when you’re so, so as you’re maintaining your database day to day, and there’s some, some reasons why Postgres actually needs also a vacuum. Every so-and-so often that don’t want to go into that detail here it’s it’s about has to do with transaction ideas, et cetera. But when you’re doing these kinds of data cleanup or weekly cleanup operations, it would also update the statistics.

Adam Conradr 00:31:33 Got it. Now you’re bringing up a good point here. The analyze part, now that we’ve understood what query plans are and indexes aren’t actually, I really loved that analogy about the books, because the key there is that you do have an index just like that analogy, but it’s important that you can’t find that particular keyword unless you write it. Right. So unless you have made an effort to say, put an index on the column of the country where UK would be stored Postgres, isn’t going to know that it needs to actually look at one of those indexes, right? So you have to actually create your own index in order to gather that functionality. Yep. That is correct.

Adam Conradr 00:32:10 Okay. That’s really helpful. And so one of the things I’m actually curious about with the query planner is a couple of keywords here is explain and analyze, you know, when you’re talking about creating query plans and indexes, which I really liked your analogy about the book, because you actually, you know, if you want to look up something about the UK, you actually have to write a record for an index around the UK. So for example, creating an index on the country for the address table. So, but in order for you to know, to do that, you have to actually analyze the plan and understand where there might be gaps in performance. So explain and analyze can help with that. So I’m actually curious for our listeners, what are explained and analyze and how do they help you in terms of optimizing your query plans?

Thomas Richter 00:32:50 Yeah, that’s a really good question because that’s actually, I think the example lends itself really, really well for this specific example that we’ve been discussing. So as you actually run the query, and you’re wondering why it took so long to retrieve all the different data points from the UK, you’re saying, well, it’s a hundred million rows table. It shouldn’t take so long. It is extremely useful to look under the bonnet, what is actually happening in the query. And this is where these keywords like explain and explain analyze comes in. So just a quick backgrounder. So explain, put in front of a query will actually let you know how a query is planned. So it will not tell you anything about how the query was actually executed as in how long individual steps took, but it will tell you exactly what the execution plan was that was followed.

Thomas Richter 00:33:37 And it will basically just give you that plan. It will not actually actually execute the query itself. Whereas explain analyze will not only give you that plan. It will also execute the query and it will compare the actual execution versus the planned execution. So it will be able to tell you also big differences in estimates and reality. Let’s go back to our example, just to show how useful that is. Let’s assume that for a reason of let’s say outdated statistics or, well, it’s usually always outdated statistics, but it could also be things like correlations. For example, that may be certain customers are always in the UK and therefore by chance, your query is returning many more rows than you would normally expect for those reasons. It could be that Postgres has, for example, vastly underestimating the number of customers in the UK. So let’s say it would assume it’s only 40 customers in the UK, instead of the other example, where we set it as 20 million.

Thomas Richter 00:34:37 So the difference between 40 and 20 million is of course massive. And if you try to do 40 index look-ups, that is much more efficient than doing a whole table scan of a hundred million rows. But if you’re doing 20 million look-ups, that is a lot less efficient than doing a table scan with a hundred million rows. So if you’re wondering why it took so long to retrieve the data on your records for the UK customers, for example, to come back, you would just put, explain, and you would see immediately, wait a second. It is planning to do an index scan, but I already know that so many customers are in the UK that has gone to execute that index scan so many times, wow, this is going to be, this is going to be slow, but you, yourself as a user may not actually know all that.

Thomas Richter 00:35:26 So you may actually instead go and say, I’ve now seen the plan. It looks like an index scan. That looks to be fine. Don’t understand what the problem is. So then I run explain, analyze. And at that point it will become black and white where it tells you here, I estimated to do 40 index look-ups and instead of the 20 million. So my planning, my estimate, and my reality went way apart from each other. And that’s where you can go in and say, okay, wow, well, step one, let me update the statistics. Step two, let me see if I can optimize my Curry, or if you don’t want to start kind of being your own consultant. Can I look at other solutions? Like for example, changing the table design or using is tension such as swamp 64, but you don’t get there if you don’t analyze a bit of what is happening, because of course every single case is different and explain and explain, analyze are tremendously useful to see what your database is doing under the bonnet.

Adam Conradr 00:36:26 That’s a really good point. Actually. I was thinking about your example and thinking, oh yeah. Well, if I see an index joint versus a sequential joint, that would be, that would be a better solution, but you’ve just clearly illustrated that sometimes that’s not always the case. So if you are running, explain and analyze for the first time, what things should you actually be looking for in a query plan? And how do you use those things to help you make decisions about performance? Trade-offs like you see, you know, a common example is you see a sequential join where there could be an index scan. So you know that that’s the column you have to create an index on. Are there other examples that you can think of in terms of just sort of the basics that people should be looking for in analyzing their query plans?

Thomas Richter 00:37:02 So the first thing I would actually recommend is there’s a very nice open source tool called Posterous explain, analyze, explain visualizer or called PEV. And why I recommend that tool is because if you’re running, explain or explain, analyze what your, we will actually be seeing is your whole Postgres console. For example, being printed full of, uh, statements that are a little bit unintuitive to read. So visualization tools, like for example, the Postgres explain a visualizer can really help. So that’s one X run recommendation to, uh, that it would just give, makes it just so much easier. We’ll then show you a Curry as a kind of tree and you can look at, okay, where is most of the time in the tree spent? What I would also recommend is for those beginning, with this switch parallelization off, there is a post plus 1 64. So if you check our blog that we have many interesting posts, one of them is on Postgres parallelism.

Thomas Richter 00:38:03 I would recommend for analyzing a query to switch parallelism off first, because when parallelism works or doesn’t work, it can introduce a lot of difficult to understand paradigms and complexity to your query. So if we really first trying to understand what exactly happened, I would recommend to switch parallelism off. For starters, you can still switch it back on as you getting more or deeper understanding of the kind of output you’re looking at. So those are two things like just, uh, two basics to start with as you’re analyzing the query, what you should really do, you should really try to understand, okay, what kind of choices did the database make? And finally, I think it is important to, to appreciate that SQL is a, it’s quite a nice language in the sense that it is quite expressive and somewhat close to English. One could, could claim.

Thomas Richter 00:38:52 Yeah, it has a certain very repetitive structure on how things are laid out, but it actually can compile into some incredibly complicated things. So when the Curry is actually made ready for execution and the career planner is actually building a kind of execution tree, it gets very complicated. So again, I think what you’re really looking for is really those telltale signs and those visualizers will actually help you by, you know, placing kind of reds indicators at things that take far too long or, or were really part of your career running slowly. Or you should really rather try to look at the key bottlenecks. Don’t try. I mean, you can of course try to understand everything in the deepest steps, but that takes a lot of time and a lot of practice

Adam Conradr 00:39:36 That makes sense. And we’ll actually link to those in the show notes, the, uh, Postgres explain a visualizer as well as that blog post about parallelism. So we, we spoken about parallelism a few times now I’m actually curious to start talking more about some of the advanced features in Postgres, since, you know, for example, this was out in since version nine in, uh, in Postgres. So let’s, let’s go over a few of these advanced features in Postgres. And let’s start with parallelism since we’ve spoken about it a few times, what is parallelism and how is it an advanced feature for Postgres?

Thomas Richter 00:40:03 Yeah. So post was always been very parallel in between curries, as in, you have many, many concurrent users and they would all run in parallel, but they would all only do a small interactions individually. Like the example I gave with the people interacting with the application, everybody does something small. Now, if you have analysts in your company and they’re trying to extract some insights from your business data, those questions tend to be very, very complex. And every individual analyst will ask things that are quite big. So they may ask for all the customers to be analyzed a little bit like the simple example we gave earlier, but of course with a lot more complexity. And if you’re starting to analyze all customer orders and things like that, you will basically really want to apply all the available cores in your, in your machine in order to solve that problem quicker.

Thomas Richter 00:40:56 And they are Postgres goes a certain way. It is able to scan and parallel. It is able to join in parallel to a certain degree, up to certain points where it goes back to Cyril execution. Now swarm 64 has basically taken these features and brought them further. So we’re paralyzing much more. So instead of just starting, let’s say you have a 64, a core server, so server, maybe 32, physical 64 virtual course available to you. And you only have one analyst running on it at the moment. And you really want that query to return as quickly as possible Postgres for typically use maybe 10, 12, 14 of these course swamps 64 can use all of them. So you can use all of the course at the same time. We keep the query parallel for longer. So it will probably be paralleled through the entire execution as opposed to Postgres that may at some point go back to serializing, because for example, there has been certain aggregates and things like that, and they will then go back to serial executions policies of all states parallel the entire time.

Thomas Richter 00:41:56 And then essentially you have maybe big joins where you need to join millions of rows or billions of rows. And Postgres is now capable of executing that in parallel, which is good, but we have found ways to execute it in parallel with a lower memory footprint. So you can actually join in memory billions of rows with billions of rows. And that just works. Yeah, I think this is really where Postgres has come from a strictly serial execution where it would just take a very, very long time to answer any complex query to a parallel execution that goes quite some way, but we are dialing that up to 11 in order to really solve people’s problem, give them a lot, lot faster response time and also to reduce the time they have to spend analyzing curries and finding out what’s wrong with.

Adam Conradr 00:42:40 Yeah. So I’ll actually want to dial it up to 12 here. Cause the thing I was thinking about was does that mean you could actually utilize a GPU in order to just get extreme amounts of parallelism given the, the bra CPU or GPU compute power,

Thomas Richter 00:42:55 The usual issue there then becomes to actually feed the data into these devices and actually retrieve it at the time that you actually have a net gain overall. So, whereas we’ve done so far has a background in hardware acceleration. We’ve used to lean very heavily on hardware acceleration we’ve since found ways in order to do it in CPU we’re at the moment, using everything in parallelism that the CPU can offer us. We aren’t yet in the GPU domain and we aren’t yet back in the hardware accelerator domain, quite simply because right now we see that the bottleneck is moving the data there and fi and back fast enough, we are generally seeing that the CPU can give us enough parallelism with all these special functions that it already has, but we are using like, we are designing at the forefront of CPU architecture. We are using a lot of different hardware acceleration methods and of course, coding everything and see, which is not only Postgres as native language, but also an extremely efficient compiled code once it’s running.

Adam Conradr 00:43:56 That makes sense. And I think that’s a reasonable trade-off, I’m fairly aware that the IO limitations with GPS are fairly common when it comes to performance issues for folks. So moving on to other advanced features, I actually there’s, I mean, there’s a lot of them that are available, but I think some of the more basic ones like non atomic values are interesting. Tell me more about sort of the, the newer values that you can use with Postgres. Ah,

Thomas Richter 00:44:19 Right. Yeah. I mean, some of the things that we really love is this whole concept of you basically leave the domain of a strictly relational database and the thing, especially, I mean, our topic today is really Postgres also as, uh, an analytical database. And I think it becomes really powerful if you look into things, for example, like binary Jason, so you can have an arbitrarily complex object that could also, for example, be a tree in itself and you can store it into a field, into a database. So you have a table. So to say, and the table has fields. And then these fields, you could build Jason objects that are just stored in there. So you have a table full of fields that contain trees. Yeah. And, and, and it’s, it’s wonderful because it basically gives you that entire flexibility. You no longer need dedicated documents stores.

Thomas Richter 00:45:06 If you have a document, put it in a binary, Jason storage into your table, it keeps flexibility. High-end it just makes things so much easier because who knows exactly, you know, databases tend to be in operation for a very long, very, very long time, because it’s always very, very hard to migrate from one database to another and to change your operations and all these things. So who knows that on day X, let’s say now in 2021, you’re implementing a database, do you know what’s going to be used like in eight years, probably you don’t and these kinds of features are wonderful because they allow you to do all these things. So, yeah. Uh, one big thing I’m a big fan of, is of course, these binary adjacent types, which are very flexible, they have their own operators. So you could do all sorts of things with them, but you can also do really fun things.
Thomas Richter 00:45:50 Like, for example, placing an index on a binary Jason function, which is really interesting. So let’s say you have a certain value in these binary Jason objects that you always care about. Let’s say it’s about cars. And one of the values of speed and speed tends to be more important to you than other values. Well, then you could actually put an index on it and get data out, even quicker, all with a very kind of Postgres standard relational concept behind it. Another example is of course are raised. You can do, you can build a rate types and, and store many values in a single field. So you have like a, a table full of fields and the fields each individually contain effectively. What’s like a small table. That’s cool. And then finally, which I think is very powerful as well as all the custom data types.

Thomas Richter 00:46:34 So you can define your own custom data type. And that is really powerful feature because it gets away with certain complexities. Like for example, company, Belinda called just got bought by app Levin. They are basically doing it at tracking. They wanted to have the most compact way to store a country. So they thought, okay, how many countries are there in the world? Well, there’s about 190 or so. Okay. We actually only need one bite to store a country and we make it a custom code. So we have basically the name of the country. We can retrieve that we can, of course retrieve the, the country code, all these things, but we actually just put it as a single value. It only takes a bite, but I’ve got this wealth of different functions. I can ask it so I can ask it, what’s your name, country, what’s your name, country, what’s your code? And so on it, river turn all these things, but it only takes a single byte of storage. So really cool things. So in summary binary, Jason array types, custom data types are very, very flexible and allow you to store much more in your database than you ever could do in the past.

Adam Conradr 00:47:42 So the, the, the last one I want to cover, cause it actually a plenty of advanced features in Postgres. And if you want to listen to about all of them, you can listen to episode 360 2, which goes over all of the bigger advanced features for Postgres. The last one I want to cover, just because it’s so common is views. So tell me a little bit more about views and Postgres.

Thomas Richter 00:47:59 Yeah. I mean, one, one way you can actually make your life easier is actually to define certain views. And there’s some very nice things you can do. Like for example, sometimes views can be an ideal way to make a Curry, the syntax, a lot easier and a lot more clear and transparent. So as an, as opposed to having a table, that duplicates data, you could define a view and you can then have the ability to actually Curry that as if it were a table, but it isn’t, it’s just a view into a or multiple tables. That’s powerful because what you really want to avoid is duplicating your data. And especially with this mindset that we have, right about why can pollsters be such a good analytical tool? Those things like views and so on. They’re so helpful for analysts where you can put things together without storing them a second time.

Thomas Richter 00:48:50 And very similarly and sorry, I kind of use that a bit as a segue into another area where you don’t want to store data twice. The whole concept of doing translate, tickle world work, or H tap as they call it hybrid analytics, transactions, processing those areas. You don’t want to store your data again, you have your operational data source and you don’t want to store yet another source of truth. You want to be able to go directly back to your operational data source. And with pollsters, you can do that because Postgres is a fantastic operational database able to do, do all these transactions we’ve talked about in the beginning, but you can now analyze your data with complex queries as well. Yeah. In the spirit of avoiding to duplicate data and as a result, getting things that are out of date, building a bigger and bigger backlog of maintenance duties, it is fantastic to be able to on the one side, do things like views and, and just work with views as opposed to the underlying tables. And on the other side, really work with a data source directly that is used in the operational work, but also for analytics, you not only get more up to date information about what’s happening in your company, or for example, with your app, but you also save yourself all the work of moving it from one place to another.

Adam Conradr 00:50:10 So we’ve done a real deep dive into Postgres today. And now that we’re rounding the corner, I just want to touch quickly on other OAP database examples. You know, obviously we’ve gone very deep into Postgres, but I’d also like to compare it against other choices. So if someone was looking for an OAP database, why would they choose Postgres over these other examples? Like my SQL or DB two, or even, you know, data warehouses like Oracle or

Thomas Richter 00:50:34 SQL server. Yeah. In general, the question number one is why would you choose Postgres over things like, for example, an Oracle or a DB two or SQL server. And that comes very much from, it’s got the combination of a fully featured, but at the same time, open source and very, very attractive licensing terms. So I think that’s this really the strongest assets there for Postgres. Also, you could absolutely argue momentum. Postgres has again made DB engines database of the year. It’s been now for three out of the last four years, been the database engine of the year. And also the fastest trending since about five or six years ago, posters were really building the fastest trends. So you’re really investing into something that is having a strong following and only the best commercial databases can, can compete with that. And they of course have a completely different licensing scheme.

Thomas Richter 00:51:29 Now, why would you use Postgres as opposed to let’s say something like a big Hadoop cluster or something in an all AP setup and the background there is that that consistency we talked about in the middle of the show is really something that helps you in so many ways. We’ve talked about the debug ability. That is a big thing, but it also helps you so much in doing certain things like let’s say, for example, you want to make sure that you don’t accidentally insert data twice because that would completely falsify your analysis. Well, in Postgres with this relational features, you can just put on a unique key and that will make sure that data never accidentally put in twice problem solved. Another example is let’s say you want to keep interim updated statistics on what is happening. For example, you want to have a daily total or something of all the customers per country.

Thomas Richter 00:52:22 That’s something you can do because you just set a few triggers and you update once a day, some value, and that is fully transactional consistently. That’s just happening. So those kinds of features are really the wealth that a relational database in the analytics processing can provide you that those other solutions that are much more batch oriented, much more at scale, but at the same time, without that consistency, they can’t do that for you. And finally, of course, as I mentioned, that combination of analytics on the one side and operational data on the other side saves you all the what’s called the ETL, the extract transform and load that moves data from one place to another. You can save that. You can replace the transform with views as we discussed. You don’t need any extracting and loading if your data’s already in the same database. So these are some really powerful reasons why people should consider relational databases in general and especially Postgres and especially Postgres with the swamp 64 extension as a powerful tool for analytics.

Adam Conradr 00:53:27 That’s a great way to wrap up. So I guess in closing now that we know, we won’t use Postgres any, any last minute tips or things that people may mess up that they might want to consider once they first dive into Postgres.

Thomas Richter 00:53:39 Yeah, that’s a good question. I think this, it is not the easiest tool to use. Let’s be, let’s be really honest about that. So there’s many, many things I can think of. A couple of points is number one, spend a little bit of time on your schema design, like how everything is supposed to hang together. And number two, avoid the urge of over-indexing, especially with these kinds of lookup indexes, we’ve talked about the so-called B3 indexes and so on too many as, also not a good solution to a few is not a good solution, but too many either. So resist a little bit the urge to put an index on everything. I would say those are two good advice for beginners and then read up on parallelism. If your intended use case is analytics, the parallelism and posters is absolutely essential to get the analytics performance you’re looking for. So reading up a bit and parallelism or reaching out to us and ask us is a great way to get started there with Postgres as an analytical database.

Adam Conradr 00:54:35 Perfect. And with that, we’ll wrap up again. This is Thomas from swarm 64, talking about Postgres and Noelle AP databases. Thank you again, Thomas. Thank you very much.

Adam Conradr 00:54:44 Keeping your teams on top of the latest tech developments is a monumental challenge. Helping them get answers to urgent problems they face daily is even harder. That’s why 66% of all fortune 100 companies count on O’Reilly online. Learning at O’Reilly. Your teams will get live courses, tons of resources, interactive scenarios, and sandboxes and fast answers to their pressing questions. See what O’Reilly online learning can do for your teams. Visit O’Riley dot com for a demo.
Outro 00:55:15 Thanks for listening to se radio and educational program brought to you by either police software magazine or more about the podcast, including other episodes, visit our [email protected] to provide feedback. You can comment on each episode on the website or reach us on LinkedIn, Facebook, Twitter, or through our slack [email protected]. You can also email [email protected], this and all other episodes of se radio is licensed under creative commons license 2.5. Thanks for listening.

[End of Audio]

SE Radio theme: “Broken Reality” by Kevin MacLeod (incompetech.com — Licensed under Creative Commons: By Attribution 3.0)

SE Radio 454: Thomas Richter on Postgres as an OLAP database

Show Notes

Related Links

Transcript

Join the discussion

More from this show

SE Radio 717: Eric Tschetter on Decoupling Observability

SE Radio 716: Martin Kleppmann Local-First Software

SE Radio 715: Sahaj Garg on Designing for Ambiguity in Human Input

Menu

Recent posts

Search

Search

SE Radio 454: Thomas Richter on Postgres as an OLAP database

Show Notes

Related Links

Transcript

Join the discussion

More from this show

SE Radio 717: Eric Tschetter on Decoupling Observability

SE Radio 716: Martin Kleppmann Local-First Software

SE Radio 715: Sahaj Garg on Designing for Ambiguity in Human Input

Menu

Recent posts