Episode 484: Audrey Lawrence on Timeseries Databases

Filed in Episodes on November 3, 2021

Audrey Lawrence of Amazon discusses Timeseries Databases and their new database offering Amazon Timestream. Philip Winston spoke with Lawrence about data modeling, ingestion, queries, performance, life-cycle management, hot data vs. cold data, operating at scale, and the advantages of a serverless architecture.

 

Transcript brought to you by IEEE Software magazine.
This transcript was automatically generated. To suggest improvements in the text, please contact content@computer.org and include the episode number and URL.

Philip Winston 00:00:50 This is Philip Winston for Software Engineering Radio. My guest today is Audrey Lawrence. Audrey is a software development manager at Amazon. Her focus there is on their new database service, which is called Amazon Timestream. Audrey has worked in the software industry for more than 10 years at companies such as Microsoft, Uber, Cruise Automation, and Carmera. As a disclosure, I first met Audrey when we were both working at Carmera a few years ago. Let's start at the top: what is a timeseries database?

Audrey Lawrence 00:01:22 A timeseries database is a database that is purpose-built for timeseries data. And when we're talking about timeseries data, what we're talking about really is data points that are measured over time. So, this is useful when you're measuring events that change over time, when you're not just interested in the current measurement for some metric, but you really want to look at the measurements over a period of time. You see this commonly in a number of industries, and this is also a really fast-growing data storage type that we see across industries. What timeseries databases do is allow for really efficient collection, storage, and analysis on top of this data: everything from doing streaming analytics in real time to training machine learning models off of historical timeseries data. These databases really allow for some complex analysis on top of timeseries data as well. And Amazon Timestream is one of these purpose-built databases for timeseries data.

Philip Winston 00:02:29 Okay. So, I know the financial industry has long produced timeseries data. They have data going back over a hundred years on stock prices. So, to get specific, what other applications or industries tend to produce timeseries data?

Audrey Lawrence 00:02:46 Yeah, finance is definitely one very common example that you hear for timeseries data: looking at stock prices over time or different prices of commodities. Another big area where we see a lot of timeseries data is in the systems monitoring use case. A lot of developers and engineers now want observability into their applications and their infrastructure, to have really granular metrics about performance and to ensure that things are running well and systems are healthy. So, you see a lot of times developers will add metrics to monitor different latencies of components, different response codes, that type of thing. And actually, one of Timestream's biggest customers is a video streaming service, and they use Timestream for this use case to monitor their video streaming content: collecting metrics like frame-rate drops, latencies, that type of thing, so that they can ensure really high-quality content streaming to their customers.

Audrey Lawrence 00:03:47 And in this use case you also get a lot of data; at times we can see up to gigabytes per second of ingestion and really up to millions of queries per hour. Another big area where we see timeseries data is with the internet of things, the IoT space, both in industrial settings and also commercial settings. There's a move to instrument everything and create smart devices, and these collect a lot of telemetry of the world around them. So, think smart home sensors: your smart thermostat that's measuring temperature, air quality, those types of metrics over time. One other that I've gotten really used to is a smartwatch that I initially got as an upgrade from my $10 stopwatch for running, to measure pace while running. But since I've gotten it, I've become really accustomed to all of the metrics it's collecting at all times: things like heart rate, heart rate variability, air pressure, that type of thing. And we'll see this area grow as we start to make everything smarter.

Philip Winston 00:04:51 Yeah, that makes a lot of sense. And for monitoring a data center, I can imagine there are thousands of machines and many more metrics. And for consumer devices, it could be millions of devices. So, I can see how the scale is going to add up. I think anytime there's a new type of database, one question that comes to mind is why can't developers store timeseries data in a SQL database or a key-value store or some type of database that has already existed for a long time. What limitations or problems might they experience if they did use a SQL database?

Audrey Lawrence 00:05:30 Yeah. So, developers certainly can use more conventional databases to store timeseries data. I've done this before in the past as well. But the problems that you tend to run into are around scaling those systems out for performance and cost. Given that timeseries databases are purpose-built for this type of data, they can do a more efficient job of ingesting, storing, and querying the data. Another area that gets teams to adopt timeseries databases is the tooling around these data stores. There are a lot of different plugins you can use to really easily collect and ingest these metrics, and also visualization tools that help later with querying and that type of thing. And then another piece that moves teams to look into timeseries databases is that the solutions they may need to come up with in order to store timeseries data can get rather complex, and can take a lot of developer time to build and then to maintain and scale out.

Audrey Lawrence 00:06:40 There is a recent All Things Distributed blog post by Amazon's CTO that discussed this a bit and compared these systems to Rube Goldberg machines. We have seen teams manage their timeseries data with really complex architectures, where they need some system that can essentially queue the data coming in, store it in some sort of durable store, and have some massively parallel processing framework that processes that data and stores it in an optimized format for querying. These solutions are pretty difficult to build, maintain, and really scale. And a lot of times the metrics that you're storing in these stores are also your really critical business or operational metrics, which need to be timely and correct. So, moving to a solution that can solve a lot of these problems for you, and do it in a more performant and cost-effective manner, may well be the right choice for your team.

Philip Winston 00:07:44 Yeah. That makes sense that you could start out with a SQL database for a smaller implementation, but you're going to run into trouble with scale and other factors. Let me take a minute to list some relevant past episodes of Software Engineering Radio. This is the first episode we've done on timeseries databases, but many past episodes have dealt with other types of databases. These are just a few: Episode 102, Relational Databases; Episode 165, NoSQL and MongoDB with Dwight Merriman; Episode 194, Michael Hunger on Graph Databases; Episode 417, Alex Petrov on Database Storage Engines. I will put links to these episodes and the blog post you mentioned in the show notes. So, let's dive in, starting with data modeling. How does someone create a data model for a timeseries database, and how is this different from modeling for a SQL database?

Audrey Lawrence 00:08:45 When you're using a timeseries database, the core unit that you really care about is what we call records, and this is just a single measurement of a metric at a point in time. A series of these measurements over time is what makes up the timeseries. And for each of these timeseries, you have a group of attributes or dimensions that describe that timeseries. Those you can store in databases and tables to organize the data and control access, retention, and that type of thing. So, an example of this would be: if we are monitoring a fleet of hosts in our system, we could have a dev-ops database and maybe a host-metrics table. And then the dimensions that we would want to store with our timeseries, to describe the different timeseries, would be things like the region where that fleet is located,

Audrey Lawrence 00:09:38 The hostname for the different hosts, OS version, different things like that. And then there are the measurements, the actual metrics we're collecting: things like CPU utilization, memory utilization, and traffic IO, those types of measurements. So, those are the main concepts that you would think about when ingesting into a timeseries database. Based on your use case, you model what business metrics you're collecting into that type of format, and most timeseries can really fit well into it. Timeseries databases also tend to have really flexible schemas. So, for Timestream, outside of creating a database and a table, you don't declare your measurements and dimensions upfront; you declare them when you're ingesting data, and Timestream will figure that out and store the data appropriately.
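A minimal sketch of this record model in Python: each record carries the dimensions that identify its timeseries, a measure name, a value, and a timestamp. The class and field names here are illustrative, not the Timestream API.

```python
from dataclasses import dataclass

# Illustrative sketch of the data model described above: a record is one
# measurement of a metric at a point in time, tagged with dimensions.
@dataclass(frozen=True)
class Record:
    dimensions: tuple      # e.g. (("region", "us-east-1"), ("hostname", "host-24"))
    measure_name: str      # e.g. "cpu_utilization"
    measure_value: float
    time_ns: int           # timestamp; Timestream supports nanosecond precision

def series_key(record: Record) -> tuple:
    """A timeseries is identified by its dimensions plus the measure name."""
    return (record.dimensions, record.measure_name)

# Two CPU samples from the same host belong to the same timeseries.
dims = (("region", "us-east-1"), ("hostname", "host-24"))
r1 = Record(dims, "cpu_utilization", 41.5, 1_000)
r2 = Record(dims, "cpu_utilization", 43.0, 2_000)
```

Grouping records by `series_key` is what lets a store co-locate all points of one series, which matters later for partitioning.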

Audrey Lawrence 00:10:36 Now suppose you were using a SQL store to store your timeseries data. You do see this a lot; we've done this a lot in my past as well. It's pretty convenient, especially if you have other data already in your SQL data store that you would want to join with; then it makes it a lot easier to have your timeseries data also in that store. What you typically have is tables with one row per measurement, so each of these attributes would also be in a kind of denormalized view in that row. And you can see that these dimension values would be repeated for each measurement, which can grow the cost considerably because you're storing a lot more data in that case. And then you also have to read through that data when you're doing queries.

Audrey Lawrence 00:11:25 Typically you see here that you'll add an index on the time column, and you'll probably also want to add other indices on some of the dimensions so that you can query based on those dimensions. This also gets a little bit hard to balance, because you don't want to add too many indices: you can end up with bloated indices, and then ingestion will cost a lot, but you do need to leverage these indices at query time or sometimes the queries can just get way too slow. And then one of the other problems that you can run into using a conventional store here is that the nature of timeseries data is to grow over time; as time moves on, you continuously collect more and more measurements. So, what you see here is people tend to partition by time.

Audrey Lawrence 00:12:19 So, partition their tables based on time, and that helps tables not grow too long; you can prune those tables over time. These solutions are still really hard to get right. You have to do a lot of tuning of the indices and how you're doing partitioning, so it's a bit simpler to use a purpose-built store that handles this type of data for you. You can also see people using a column-oriented format, where you'll store all the data points from the measurements in one column, and then you can query just for that column at a time. But you still run into a number of challenges there around how you partition the data in an efficient manner.
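As a rough illustration of the conventional-SQL approach just described, here is a sketch using Python's built-in sqlite3: one row per measurement, dimensions denormalized into every row, and indexes on time and on a dimension. Table and column names are invented for the example.

```python
import sqlite3

# One-row-per-measurement layout: dimensions repeat on every row,
# which is the storage overhead described above.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE host_metrics (
        region    TEXT,
        hostname  TEXT,
        measure   TEXT,
        value     REAL,
        time_ms   INTEGER
    )
""")
# The time index makes range scans cheap; each extra dimension index
# speeds those predicates but adds cost on every insert.
conn.execute("CREATE INDEX idx_time ON host_metrics (time_ms)")
conn.execute("CREATE INDEX idx_host ON host_metrics (hostname, time_ms)")

rows = [("us-east-1", "host-24", "cpu_utilization", v, t)
        for t, v in [(1000, 40.0), (2000, 55.0), (3000, 70.0)]]
conn.executemany("INSERT INTO host_metrics VALUES (?, ?, ?, ?, ?)", rows)

# A typical timeseries query: one host, recent window.
recent = conn.execute(
    "SELECT value FROM host_metrics "
    "WHERE hostname = ? AND time_ms >= ? ORDER BY time_ms",
    ("host-24", 2000)).fetchall()
```

Partitioning by time, in this model, would mean rotating `host_metrics` into per-period tables and dropping old ones, which is exactly the tuning burden described above.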

Philip Winston 00:13:03 Okay. So, one of the things I read said that Timestream partitions data by space as well as time. Was that physical, like geographic space, since you mentioned regions, or is it something more abstract?

Audrey Lawrence 00:13:16 It's a bit more abstract. So, to go back to why you would want to partition data: you partition data stores essentially in order to scale ingestion, storage, and query. Typically for timeseries data you would want to partition based on time, because this allows you to have more parallelized reads, and also the queries all tend to be around the time element. So, it makes sense to be able to quickly determine which data points are relevant to serve your query and just have the query engine read those partitions. But with most timeseries that we see (there are exceptions to this), the data that you're ingesting is at or around current time. And so indexing or partitioning just by time isn't really sufficient to distribute the ingestion traffic across your fleet.

Audrey Lawrence 00:14:19 That's especially true if you're seeing a lot of ingestion throughput to a table. And then at query time, you do want to be able to scale out queries, so multiple workers can read multiple partitions if you are serving a query that has a longer temporal window. So, you want to partition as well by some other attribute, and you tend to use the dimensions and the attributes on those timeseries to determine how to organize the data. How Timestream does this is what we call partitioning by space and time. The way to best understand this is to picture a 2D grid in which all of the data points for your table will live. Along the x-axis is your time axis, and the y-axis is what we call the spatial dimension.

Audrey Lawrence 00:15:19 All of the timeseries that you have in that table would consist of dots that go in horizontal lines. So, a horizontal line would represent one timeseries, and there would be a data point at the time point at which the measurement is taken. How we want to organize the different horizontal lines, or timeseries, is just by similarity: storing all of the data points similar in time for one timeseries together, and then also storing similar series close together, because those are more likely to be queried together. And then when we do our partitioning, we will cut this grid into 2D boxes and store those partitions so that we can really quickly look up, in our 2D index, where the data is located. Then when we're doing reads, we're pulling hopefully just the data that we need to serve that query, and we don't have to do a lot of pruning after we get the data points from that partition.
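The space-and-time grid can be sketched as follows. This is a toy model, not Timestream's actual tiling: the tile width and stripe count are arbitrary, and a hash stands in for the similarity-based placement described above. A series maps to a spatial stripe, a timestamp falls into a time slice, and together they name a tile.

```python
import hashlib

# Toy 2D partitioning: x-axis is time, y-axis is a "spatial" stripe
# derived from the series identity. Widths are illustrative.
TIME_WIDTH = 3_600        # seconds of time per tile (x cut)
SPACE_BUCKETS = 8         # number of spatial stripes (y cut)

def spatial_bucket(series_key: str) -> int:
    # Hash the series identity onto a stripe (a real system would place
    # similar series near each other; a hash just spreads load).
    digest = hashlib.sha256(series_key.encode()).digest()
    return digest[0] % SPACE_BUCKETS

def tile_for(series_key: str, timestamp: int) -> tuple:
    """Map a data point to the (space, time) tile that stores it."""
    return (spatial_bucket(series_key), timestamp // TIME_WIDTH)

# Points close in time for the same series land in the same tile, so a
# query over that series and window reads only the tiles it needs.
a = tile_for("us-east-1/host-24/cpu", 7_200)
b = tile_for("us-east-1/host-24/cpu", 7_500)
c = tile_for("us-east-1/host-24/cpu", 50_000)
```

A query with a time range and a series predicate can then enumerate exactly the tiles intersecting its rectangle of the grid, which is the pruning benefit described above.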

Philip Winston 00:16:31 Okay. I think we're straddling data modeling and ingestion, so let's dive into ingestion some more. At the beginning we mentioned consumer devices. I imagine ingestion is different depending on the number of sources being ingested. So, maybe it's one data center that's putting out a lot of metrics, or maybe it's a million consumer devices. How does the ingestion process differ, or what are the challenges of these two extremes?

Audrey Lawrence 00:17:01 So, yeah, there are definitely two types of cases that you see for timeseries data: where you can have millions of IoT devices out there, all sending metrics to the same table, and then also the case where you have a really high rate per device of timeseries data being produced. I'm familiar with the latter case from my time before Amazon working in the self-driving car industry, because you really see a lot of this in the robotics industry, with really high-frequency sensors collecting really high volumes of data. At one point, I think we were collecting hundreds of megabytes per second per car. So, ingesting all of that, you need to be able to handle large batches of data coming in. The former case is really tricky as well, because you have really high-throughput data, and there you need to have large ingestion fleets that can handle traffic from many devices and also ensure that you're not having a lot of fan-out afterwards into your storage layer.

Audrey Lawrence 00:18:18 So, how we handle this is with our ingestion auto-scaling. We use these 2D partitions that we just talked about, and we determine how we slice them further and do further partitioning based on the ingestion throughput that we're seeing at any given time to a table. First, I guess, for terminology: we refer to these partitions as tiles, so I may be using that word a bit. If we see a lot of ingestion pick up to one tile, and we see that we may be constrained in the resources that are able to accept that ingestion throughput, what we'll do is split that tile into two tiles that can serve essentially twice as much throughput. And we'll continue doing that over time to distribute our ingestion traffic so that we can still serve these really high-throughput use cases.
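The tile-splitting behavior described here might be sketched like this; the throughput threshold and the halving rule are invented for illustration, and a real system would cut along the space or time axis of the grid rather than just renaming tiles.

```python
# Toy auto-scaling sketch: when observed ingestion throughput to a tile
# exceeds what its resources can absorb, split it into two tiles, each
# taking roughly half the load. Threshold is an invented number.
SPLIT_THRESHOLD = 100.0   # e.g. MB/s one tile can absorb

def split_tiles(tiles):
    """tiles: dict of tile_id -> observed throughput (MB/s)."""
    result = {}
    for tile_id, load in tiles.items():
        if load > SPLIT_THRESHOLD:
            # Halve the hot tile so twice the resources serve its load.
            result[tile_id + "-a"] = load / 2
            result[tile_id + "-b"] = load / 2
        else:
            result[tile_id] = load
    return result

tiles = {"t1": 250.0, "t2": 40.0}
once = split_tiles(tiles)     # the hot tile t1 splits in two
twice = split_tiles(once)     # its halves are still hot, so they split again
```

The merge path Audrey mentions later is the inverse: when load drops, adjacent cool tiles are recombined so resources aren't parked on nearly idle partitions.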

Philip Winston 00:19:22 Okay. You mentioned the frequency of data coming from these self-driving cars. What sort of frequencies are common in timeseries databases? For example, high-frequency trading operates at the scale of milliseconds. What frequencies do you typically see?

Audrey Lawrence 00:19:40 It definitely varies by the use case. We see sometimes even the need for nanosecond precision, and that's what we support in Timestream: up to nanoseconds. It really depends on the use case. Sometimes we see use cases where maybe you only have a metric per hour or even per day, and that we certainly support. But there are some really high-frequency use cases where you need that sort of time precision, maybe the monitoring use case, where you have systems that are really low latency themselves and are producing metrics faster than at the millisecond level. So, we do support nanoseconds.

Philip Winston 00:20:22 Okay. This might be straddling the topics of ingestion and queries, but once data is ingested, how quickly does it need to be queryable, or how quickly can it be queryable? And what challenges are there in making the data available as fast as possible?

Audrey Lawrence 00:20:41 For many use cases, you want the recently ingested data to be available under a second. There are a number of different customer use cases out there that are very time sensitive, such as the monitoring use case. If you're monitoring your e-commerce site on Black Friday or Cyber Monday, you want to know if there are any issues immediately, because you could be losing hundreds of thousands of dollars, millions of dollars, per second that your site is down. Also other real-time analytics use cases: say if you're doing fraud detection on different purchases, you want the latest data to be serving these predictions. You don't want data from an hour ago; you don't even want data from five minutes ago. So, getting the data ingested and queryable, and doing that quickly, is really important.

Audrey Lawrence 00:21:34 And there are a lot of challenges with doing this, the same as with any distributed data system. The top priority of any data system is durability, so you need to ensure once you ingest that data that it is correct and durable. You'll want to have redundancy in your systems, so you're storing replicated copies of that data and ensuring it's consistent when customers go to read that data. How we do this for Timestream is we have all of our ingestion coming into a fault-tolerant memory store. A single writer node will accept writes, and then we will replicate that out across three different availability zones and have quorum consensus, so that we can handle a node going down or an entire availability zone going down. Serving the ingest traffic in memory allows us to be really fast in how we ingest the data and then make it almost instantaneously queryable.
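A toy sketch of the quorum write described here: replicate each write to three zones and call it durable once a majority acknowledges. The classes are purely illustrative; they ignore the read path, leader election, and everything else a real consensus protocol handles.

```python
# Illustrative quorum replication: 3 replicas, write succeeds on 2 acks,
# so the loss of one node or zone does not block ingestion.
class Replica:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.log = []              # durably accepted records

    def accept(self, record) -> bool:
        if self.healthy:
            self.log.append(record)
            return True
        return False

def quorum_write(replicas, record, quorum=2) -> bool:
    """Acknowledge the write once a majority of replicas accept it."""
    acks = sum(1 for r in replicas if r.accept(record))
    return acks >= quorum

# One availability zone is down, but 2 of 3 acks still make the write durable.
zones = [Replica("az-1"), Replica("az-2"), Replica("az-3", healthy=False)]
ok = quorum_write(zones, ("cpu_utilization", 1_000, 41.5))
```

With two of three zones down, the same call would return False and the writer would have to retry or fail the ingest, which is the availability/durability trade quorum systems make.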
Speaker 0 00:22:45 Support for this episode comes from SingleStore, the single database for all data-intensive applications. Why make the switch? Scaling modern SaaS is already hard. It's even harder when you're using databases that were built back when all your friends had pet rocks and the most stressful part of your day was connecting to the dial-up modem. SingleStore is built for the smart SaaS generation and all the data-intensive applications that come with it. The SingleStore cloud powers the new wave of SaaS technologies, displacing legacy providers, with insights for apps, effortless operation of models at scale, and easy shifts to the cloud. Introduce simplicity and ease to your data structure and watch what happens to your speed, scale, and SQL. Find out why industry leaders trust SingleStore to modernize their data platforms. See for yourself at singlestore.com/try-free.

Philip Winston 00:23:36 So, scale is obviously an important topic here. Can you describe what difficulties scale presents to ingestion? Let's suppose we have many data sources and they are all high volume, so the total amount of data being ingested is huge. How can you cope with that?

Audrey Lawrence 00:23:55 Yeah, so we have seen some use cases with Timestream where we see really massive amounts of data coming in at one time, workloads where we see gigabytes per second per table coming in, and you can't really handle that volume of data without distributing the load in some fashion. How we do this is we distribute the load by cutting more into our two-dimensional space of tiles. If we're seeing a lot of increased traffic to these tiles, we will further split them so that we can have more resources working on ingestion for that table. And then similarly, if we see workload decrease to the table, then we can merge tiles back together so that we don't have resources sitting on not a lot of incoming data. And by nature, timeseries data is also somewhat interesting versus other data that you see in online cases, because most of the time timeseries data is immutable and coming in order, at or around current time. But there are a good number of times when this is not the case.

Audrey Lawrence 00:25:11 So, we do need to support data updates and data deduplication if we are receiving duplicate data points, and then also support out-of-order data, or even data that could be really old. But a lot of times timeseries data does come in in order, and that allows us to really scale our system and not have a lot of conflicts when multiple different hosts are getting the exact same data points they need to ingest. We do have to deal with this case, though, and we do have a manner of doing batching and load balancing with our ingestion service.
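The deduplication and out-of-order handling just described could be sketched as a last-write-wins store keyed by series and timestamp. This is an illustration of the general technique, not Timestream's actual conflict policy.

```python
# Toy ingest store: at most one value per (series, timestamp).
# Duplicates and updates overwrite (last write wins); late-arriving
# points simply slot in by timestamp.
class SeriesStore:
    def __init__(self):
        self.points = {}                 # (series, timestamp) -> value

    def ingest(self, series, timestamp, value):
        self.points[(series, timestamp)] = value

    def scan(self, series):
        """Return the series in time order, regardless of arrival order."""
        items = [(t, v) for (s, t), v in self.points.items() if s == series]
        return sorted(items)

store = SeriesStore()
store.ingest("cpu", 3, 70.0)    # arrives first
store.ingest("cpu", 1, 40.0)    # out of order
store.ingest("cpu", 1, 40.0)    # exact duplicate, deduplicated
store.ingest("cpu", 3, 71.0)    # update to an existing point
```

Keying on (series, timestamp) is what makes duplicate delivery from multiple ingestion hosts harmless: the second copy lands on the same key.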

Philip Winston 00:25:52 So, I think I hear you saying that it's not just the volume of data, it's the dynamic range that you have to be able to support: the amount of data becoming more or less as it proceeds.

Audrey Lawrence 00:26:04 Yeah, that's true for systems such as Timestream, where we don't know upfront what workloads to expect from customers; it can really be dynamic. And we want to ensure that we're being cost-effective in the resources that we're using and also can scale up really quickly to support any fluctuations in workloads from customers.

Philip Winston 00:26:27 Okay. I want to talk a little bit about lifecycle management. When I was reading about timeseries databases, I found it interesting that many times customers choose to age out their data. Is this strictly because of storage costs, or are there other reasons?

Audrey Lawrence 00:26:44 Typically you see this as due to storage costs and the fact that you may not have a business use case for the older data. The nature of timeseries data is to continually grow as time moves forward and you're collecting more and more measurements, and sometimes the older data that you have can be large in volume and expensive to store. If you don't need to query it, then it can make sense to age it out of your system. Also, the raw data measurements that you're ingesting can sometimes be really high in volume, and it can be expensive to store all of the raw data and then also expensive to repeatedly scan all of that data at query time if you're running some of the same queries over time. So, you see use cases where, instead of storing all of the raw data for a long period of time, you run queries on roll-ups of that data and store those aggregations somewhere with a longer retention, and then have a shorter retention on the raw data.
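The roll-up pattern described here can be sketched as follows: aggregate raw points into hourly averages that you keep for a long time, and age raw points out after a short window. The window sizes and the choice of average are illustrative.

```python
# Toy lifecycle management: long-retention hourly roll-ups, short
# retention on the raw points they were computed from.
HOUR = 3_600

def rollup_hourly(raw_points):
    """raw_points: list of (timestamp_s, value) -> {hour_start: average}."""
    buckets = {}
    for ts, value in raw_points:
        buckets.setdefault(ts // HOUR * HOUR, []).append(value)
    return {start: sum(vs) / len(vs) for start, vs in buckets.items()}

def age_out(raw_points, now, retention_s):
    """Drop raw points older than the retention window."""
    return [(ts, v) for ts, v in raw_points if now - ts <= retention_s]

raw = [(100, 10.0), (200, 30.0), (3_700, 50.0)]
hourly = rollup_hourly(raw)                        # compact, kept long-term
recent = age_out(raw, now=4_000, retention_s=1_000)  # raw kept briefly
```

Dashboards and reports then query the small roll-up table instead of rescanning the raw data, which addresses both the storage and the repeated-scan cost mentioned above.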

Philip Winston 00:27:49 So, that being said, are there use cases where a hundred percent of the data is kept indefinitely?

Audrey Lawrence 00:27:57 Yes. There are certainly use cases where you would want to keep the data forever, thinking of the medical industry or financial industry, where you may have legal or compliance reasons to keep the data indefinitely. And then also the data could be really useful: a lot of times timeseries data can be used when you're training an ML model, and you'll want as much data, or as many unique events as possible, for training data. That's especially true if you have rare events in that historical data sample. So maybe if you're training a weather-forecasting model and have some really rare weather events, you'll want to keep that data for longer. I remember one time working in the self-driving car industry, the power went out in San Francisco, and we really wanted to ensure that we retained all of the data we collected during that time in which all of the traffic lights were down in the entire city. So, there are cases like that as well.

Philip Winston 00:29:00 That makes sense that different industries have different compliance or business use cases. So, I read about hot data versus cold data. Does hot data mean it's stored in RAM and cold data mean it's on disk, or is it not always that straightforward?

Audrey Lawrence 00:29:17 Typically with timeseries, when you're talking about hot data, you're talking about the online use case, where you're doing real-time analysis on the latest data or monitoring of the latest data. So, it's the latest data, the data that's just been written. You need really fast writes, and then you also want these queries to be really fast, returning in milliseconds. To get this type of performance, where you have really fast writes and really fast reads, typically you would store the data in memory. It is more expensive to store data there versus storing the data on disk, and that's why you would typically want to push the older, more historical data, which is serving a lot of the offline use cases, onto disk.

Audrey Lawrence 00:30:09 So, then you have your cold data on disk, and this data can cover larger time periods whose needs for querying aren't as low latency as the online use case, stored in a read-optimized format, since you're typically serving more reads off of this data. Timestream manages this with two different data stores that have two different retention periods. Timestream has two tiers for storage: one we call the memory store, which stores the most recent data in memory, and then we have a magnetic store, which stores the data on disk. When a customer is setting up their table for their timeseries data, they declare the retention periods for the memory store and the retention periods for the magnetic store, and then Timestream itself takes care of moving the data between the two tiers. This is all pretty transparent to the user, other than that the memory store will have faster query performance and will cost a bit more for storage as well.
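A sketch of the two-tier retention logic just described, with made-up retention values (the actual periods are whatever the customer configures per table): a point's age decides whether it lives in the memory store, the magnetic store, or has aged out entirely.

```python
# Toy tiering decision: per-table retention per tier, age picks the tier.
# The retention values below are illustrative, not defaults.
MEMORY_RETENTION_S = 6 * 3_600            # e.g. 6 hours in memory
MAGNETIC_RETENTION_S = 30 * 24 * 3_600    # e.g. 30 days on disk

def tier_for(timestamp_s, now_s):
    age = now_s - timestamp_s
    if age <= MEMORY_RETENTION_S:
        return "memory"        # fast writes, millisecond reads
    if age <= MAGNETIC_RETENTION_S:
        return "magnetic"      # cheaper, read-optimized storage
    return "expired"           # beyond all retention, aged out

now = 1_000_000
fresh = tier_for(now - 60, now)               # minute-old point
old = tier_for(now - 7 * 24 * 3_600, now)     # week-old point
```

The transparency Audrey mentions comes from the query layer consulting the same rule: a query's time range determines which tiers it touches, without the user addressing them separately.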

Philip Winston 00:31:17 So, you mentioned the real-time analysis of recent data. What is the relationship between timeseries databases and visualization tools like Grafana? What do you see customers using with their timeseries database?

Audrey Lawrence 00:31:32 Yeah, I think it's very natural for us, when we're thinking about timeseries data, to think in terms of how to visualize that data, even thinking of our example of financial data. You're probably thinking of the stock tracker, where you can very easily visualize with a line on a graph the stock price plotted over time, rising or decreasing, crazy spikes, and that type of thing. So, that's very common for all sorts of timeseries data, and having tools like Grafana to visualize this is really useful, especially in the case where you're monitoring or building reports for your data. For the operational metrics, when you're monitoring a system, Grafana is a common tool used here, as are other tools like it, where you can easily create a dashboard that allows you to have a lot of aggregate metrics over time.

Audrey Lawrence 00:32:28 And the query language for these is really natural for timeseries data, where you can do aggregate percentiles or averages, do groupings by different dimensions, do roll-ups by minute or hour or day; and they auto-refresh. They're much easier to use versus maybe handcrafting SQL and sending SQL queries against the data store, or looking at the raw timeseries data itself. There are also a number of business intelligence tools that you can use for reporting or doing analysis against timeseries databases. Timestream has integration with Amazon QuickSight and a number of other tools, and we do support a SQL query interface with a lot of extra timeseries-specific functions, and we have support for JDBC drivers. So, there are a number of different BI tools you can hook up to query your Timestream data.

Philip Winston 00:33:31 So, when we're talking about querying recent data or live data, what type of query performance do people expect compared to, say, doing a report on the historical data?

Audrey Lawrence 00:33:43 So, for querying live data, you would expect query responses in milliseconds, and for these you're querying typically maybe tens of gigabytes of data points. This is the data that you typically have stored in memory, which can really quickly serve these queries. Also, for this use case, you tend to see a lot of concurrent queries as well. If you have a Grafana dashboard that's monitoring your system, each time it refreshes, every minute or so, it may be sending tens or hundreds of queries to the data store that you want returned almost immediately. And then when you're doing more ad hoc analytical queries, or maybe fetching data for ML training, you can scan a lot more data, terabytes or even petabytes of data.

Audrey Lawrence 00:34:40 And the latency requirements aren't as strict here; seconds are acceptable. How we handle this with Timestream is having an adaptive, distributed query engine that will auto-scale based on the type of query that we see coming in. It does some estimation of how complex the query is and how much data, how many of these 2D tiles, the query is going to need to fetch, and then it can distribute that load out over a fleet of workers. What we've seen benchmarking this is that as the data volume that we query grows, the query latencies themselves don't grow nearly as fast.
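The kind of estimation described here might look, very roughly, like this: count the tiles a query's time range touches and size the worker pool to match. The tile width and per-worker share are invented numbers, and a real planner would also use dimension predicates to prune spatial stripes.

```python
# Toy query planning: estimate tiles touched by a time-range query,
# then scale out workers proportionally. All constants are illustrative.
TILE_TIME_WIDTH = 3_600      # seconds of time covered by one tile
TILES_PER_WORKER = 4         # tiles one worker scans

def tiles_touched(start_s, end_s, spatial_stripes=1):
    """Count tiles a [start, end) time-range query must read."""
    first = start_s // TILE_TIME_WIDTH
    last = (end_s - 1) // TILE_TIME_WIDTH
    return (last - first + 1) * spatial_stripes

def workers_needed(start_s, end_s, spatial_stripes=1):
    n = tiles_touched(start_s, end_s, spatial_stripes)
    return -(-n // TILES_PER_WORKER)     # ceiling division

short_query = workers_needed(0, 3_600)                    # one tile
long_query = workers_needed(0, 86_400, spatial_stripes=2) # a day, 2 stripes
```

Because workers scale with the tile count, doubling the scanned volume roughly doubles the fleet rather than the latency, which matches the benchmarking observation above.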

Philip Winston 00:35:21 We’ve been talking about performance this whole time, but let’s dig even deeper. I can imagine that ingestion performance and query performance are very different things. Which matters most, or how do you balance providing these two things?

Audrey Lawrence 00:35:39 So, what we see for a lot of timeseries use cases is that both really matter; both ingestion and query performance tend to matter. And you can see this with a number of different use cases. So, suppose we have some real-time analytics use case, or say we’re doing fraud detection on incoming data. For that, we need both the ability to act upon the latest data, so writes need to be really efficient, and also reads need to be really fast so that we can make our predictions very quickly. You see this too with a video streaming service that uses Timestream for monitoring: if they’re about to release a new TV show, they’ll expect a lot more users to be watching at the time it’s released, and that will produce a lot more data for monitoring. And those are also the critical times for their service to be up and running. So, what you tend to see in many of these use cases is that both really matter for performance. And that’s one of the things for Timestream where we’ve built the system so that we can scale both of those separately, and we also benchmark really high loads for both write and read together to test out these real-world cases.

Philip Winston 00:37:05 Yeah, I read you’re able to scale ingestion, query, and even storage separately. Is this to support different customer types, or is one customer going to need these different things over their lifetime?

Audrey Lawrence 00:37:19 It depends. I think having the ability to scale separately allows us to dynamically respond to what workloads we’re seeing, for ingestion, query, and storage. And there are use cases where you may see both heavy ingestion and query, like we just talked about. But there are also use cases where maybe you just have heavy volumes of queries. I’m thinking, if you have a use case where you’re doing a lot of training of ML models or really heavy analysis against the historical data, you’ll want to be able to still independently scale your system to support those workloads.

Philip Winston 00:38:04 Okay. We talked about storing data in RAM or on disk. How specifically does the developer configure where that split should happen? Is it just a question of time, like all data that’s older than one day is stored on disk, or is it more complicated?

Audrey Lawrence 00:38:21 It’s typically a matter of time when we’re looking at the table level. That’s how Timestream manages it: the developer setting up the table will set a memory retention period and a magnetic retention period, and once data is older than the memory retention period, we’ll age that data to the magnetic store. These values are just time durations. So, typically based on the use case, you’ll set maybe a day, or 30 days, or maybe even a year for your in-memory store, and that will be a sliding window from now, and data older than that period will age out. And this varies a lot based on the customer use case, how you’re using this data and what the query patterns are. What we see is that if you do have multiple use cases, then a lot of times those timeseries will naturally fall into maybe two different tables, and you can have different retention settings for those tables.
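As an illustration, here is a sketch of configuring those two retention periods. The `RetentionProperties` shape matches my understanding of the boto3 `timestream-write` `create_table` API; the database and table names are hypothetical, and the actual AWS call is left commented out so the sketch runs without credentials:

```python
# Sketch of memory vs. magnetic retention settings for a Timestream table.
# Database/table names are hypothetical examples.
def table_params(database: str, table: str,
                 memory_hours: int, magnetic_days: int) -> dict:
    """Build create_table parameters with the two retention periods."""
    return {
        "DatabaseName": database,
        "TableName": table,
        "RetentionProperties": {
            # Data newer than this stays in the in-memory store.
            "MemoryStoreRetentionPeriodInHours": memory_hours,
            # Aged data lives in the magnetic store, then expires after this.
            "MagneticStoreRetentionPeriodInDays": magnetic_days,
        },
    }

params = table_params("monitoring", "host_metrics",
                      memory_hours=24, magnetic_days=365)
# import boto3
# boto3.client("timestream-write").create_table(**params)
print(params["RetentionProperties"])
```

As described above, two tables with different query patterns would simply get two different parameter sets.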

Philip Winston 00:39:29 Okay. We talked about SQL databases early on, and how they’re likely not going to scale with a large timeseries application. But what specifically about queries is faster with a timeseries database compared to a SQL database?

Audrey Lawrence 00:39:45 Yeah, so what’s faster than conventional databases is how the data tends to be stored. Timeseries databases are really optimizing how they store data, such as storing all of the data points for one series together, and then storing similar series together, so that when you’re serving a query you’re able to read just the data that’s relevant to your query. And then really efficiently indexing this data as well, so that you can quickly look things up: we have a 2D index that maps into our 2D time-and-space grid where we store all of the data points. So, what we want to do when queries come in is really efficiently look up where the data is, quickly prune out the tiles that aren’t relevant to serve that query, and then have the minimum amount of non-relevant data points in the tiles that we do serve. And then another thing on the query engine side (this is probably common with conventional databases as well) is being able to really distribute queries, which is also really important to keep query performance good as the data set grows. Because with the timeseries use case, we know for a fact that as time goes on, we’ll continue growing and growing that dataset. So, being able to massively scale out data queries over time is also really important.
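Here is a toy sketch of the tile-pruning idea (illustrative only, not Timestream’s real storage layout): tiles live on a 2D grid of time spans and series partitions, and a query reads only the tiles that overlap its time range in the right partition:

```python
# Toy 2D tile grid: one axis is time (hour spans), the other is the
# series partition. Pruning keeps only tiles a query actually needs.
def prune_tiles(tiles, t_start, t_end, partition):
    """Keep tiles whose span overlaps [t_start, t_end) in one partition."""
    return [
        t for t in tiles
        if t["partition"] == partition and t["t0"] < t_end and t["t1"] > t_start
    ]

# 24 hourly tiles for each of two hypothetical series partitions.
tiles = [
    {"partition": p, "t0": h, "t1": h + 1}
    for p in ("cpu", "mem") for h in range(24)
]

# A query for CPU data between hours 10 and 13 reads only 3 of 48 tiles.
relevant = prune_tiles(tiles, 10, 13, "cpu")
print(len(relevant))  # 3
```

This is why pruning matters: the work a query does is proportional to the tiles it keeps, not to the size of the whole dataset.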

Philip Winston 00:41:24 So, moving on from performance Amazon Timestream is a serverless database. What advantage does a serverless implementation have for the user?

Audrey Lawrence 00:41:34 Yeah, so one big benefit of serverless, and this somewhat goes into performance and auto-scaling, is that the user of a serverless data store doesn’t have to think too much about what type of load they need to support in advance. There are no resources that they need to provision, so they don’t have to figure out ahead of time: how many nodes does my data store need to be on? What type of instance do I need to have good performance to serve the ingestion and read traffic for this data store? And they don’t need to manage scaling up when they see increased usage of the timeseries store, nor worry about scaling down when they see less usage and don’t want to pay for being over-provisioned. So, this is really helpful for users of databases, just to not have to manage the instances themselves and scale those up and down. You only pay for what you use, which is nice.

Audrey Lawrence 00:42:39 And I F I think this is really helpful, especially for different data teams at companies, because it’s really hard for timeseries data to do upfront capacity estimation, thinking of you have to work with all of the different teams at your company to understand what their timeseries usages, and then kind of plan ahead for what their ingestion or volume will be, what their storage will be over time, and then also how they expect to query that data and that that’s hard to do in an accurate manner. So, it’s then hard to ensure that you you know, have the capacity that you need and that you’re not paying to be. Over-provisioned, I’ve seen this before working on data teams too, that it’s, it’s really difficult to get accurate estimates, especially if teams aren’t yet using the solution. So, being serverless really helps with this paying for what you use. And then also it helps with just the operational load where you’re not managing the resources themselves. So, a lot of times managing databases is nontrivial amount of work where you have to keep it on the latest patches. You have to make sure it’s highly available. These are difficult operational tasks.

Philip Winston 00:43:48 Yes, I guess again, we’re talking about the dynamic range of how much data the company or the user is processing. And I can see if that varies a lot, that serverless is going to be potentially easier to manage. I read that Timestream is cell-based. Is that the same as tiles, or is that something else related to architecture?

Audrey Lawrence 00:44:11 That’s something else; that’s related more to the architecture and the infrastructure that Timestream runs on. So, this is a pattern that is used at AWS (I think Aurora uses this as well), but how we’ve architected Timestream itself is to run as a full system in multiple isolated units that we call cells. So, the system itself is segmented into multiple copies of itself, and customer traffic coming in will just go to one of these copies. What this allows is that each of these cells runs in an isolated fashion, so if there is an incident in one of the cells, it won’t impact any of the other cells. And why we do this is that, in the case that there is an incident, this really dramatically reduces the blast radius.

Philip Winston 00:45:04 We haven’t mentioned security at all. How does a serverless design impact security considerations?

Audrey Lawrence 00:45:11 Yeah, so a serverless design can help the customer of a database not have to worry as much about the different security implications. Since they’re not managing the resources themselves, they’re not managing the instances or the storage nodes, they don’t have to ensure that those are secure, running the latest software, or manage access to them. The data store manages all security at the top level. So, Timestream is secure by default: we encrypt data in transit and at rest by default, and customers can bring a key to encrypt the data with, or they can just leave that completely to Timestream to manage. So, being serverless really makes security, I think, a lot simpler for customers.

Philip Winston 00:46:04 Great. Okay. Let’s start wrapping up. What are some trends that you have noticed in timeseries databases? Where do you see things evolving in the next three to five years?

Audrey Lawrence 00:46:15 I think where we’ll see a lot of evolution in timeseries databases is just their growth and adoption. I think right now we’re just getting started in terms of usage of timeseries databases and the functionality that they can serve. And if you think about it, so much data out there across so many industries is timeseries in nature. Most data that you’re collecting has some sort of time element, and you tend to care not just about the current state of things; you want historical data as well. And especially as machine learning expands and we make devices and different operators smarter and smarter, giving them insight into this historical data is really important. And I think another area is that as timeseries databases become more widely used, we’ll see more use cases emerging across different industries. So, I think it’s a really exciting area, to see what we’ll all be using timeseries databases for in the future.

Philip Winston 00:47:27 Yeah, I can see adoption being something that’s going to accelerate. Is there anything we missed that you’d like to mention?

Audrey Lawrence 00:47:36 One thing we didn’t talk a lot about was just the types of queries that you typically see and the types of analysis that you do on timeseries data, and I think that’s a really interesting area. So, to talk a little bit about what types of queries you would run on timeseries data: a lot of times what you’re doing is aggregates over time, or maybe filtering by some of those dimension attributes. So, if we go back to our observability use case where you’re monitoring a system, you may be looking at, what is the 90th percentile of CPU utilization across the hosts in my fleet for this region? And you can do things such as alert if these values go above a certain threshold, or just plot these values out on your dashboard, and have a lot of them. But also what we see are some even more complex analyses that you do in real time on top of timeseries databases.

Audrey Lawrence 00:48:37 So, some of these examples are looking at forecasting. If you’re a ride-sharing service and demand picks up in a certain area, you can forecast how that will continue growing or not growing, and then maybe change the price of your service or send more drivers to that area. And there are certain query types too that are a bit more complex for timeseries data, such as looking at derivatives on top of timeseries data, so measuring rate of change. Or integrals, if you’re looking at totals over a period of time: say you have something like sales per second in your timeseries for your e-commerce site and want to total how many sales you have per day. You can also use correlation functions to compare two timeseries; that’s useful if you’re looking at trends between different, similar timeseries. Also, interpolation is another one that’s really helpful if your timeseries is missing data points and you need to fill those in. So, there are a lot of really cool queries that you see done on timeseries data that are worth looking into if you’re interested in this, and I think these are really powerful queries that we’ll also see evolve over the next three to five years as people leverage this timeseries data more.
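Two of the query types mentioned here, rate of change and interpolation, are easy to show in miniature. This is a plain-Python illustration of the math, not Timestream code; the sales data is made up:

```python
# Timeseries points are (timestamp_seconds, value) pairs.
def derivative(points):
    """Per-interval rate of change, i.e. a discrete derivative."""
    return [
        (t1, (v1 - v0) / (t1 - t0))
        for (t0, v0), (t1, v1) in zip(points, points[1:])
    ]

def interpolate(points, t):
    """Linearly interpolate the value at time t between surrounding samples."""
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        if t0 <= t <= t1:
            return v0 + (v1 - v0) * (t - t0) / (t1 - t0)
    raise ValueError("t outside sampled range")

# Hypothetical cumulative sales totals sampled at irregular times.
sales = [(0, 100.0), (60, 160.0), (180, 400.0)]
print(derivative(sales))        # rate of sales per second in each interval
print(interpolate(sales, 120))  # fill in the missing sample at t=120
```

A timeseries database exposes the same operations as built-in query functions, so you get them at scale without pulling the raw points out first.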

Philip Winston 00:50:12 I can see that’s interesting. So, there’s a dynamic range in the data being ingested, but it sounds like you’re saying there’s also this dynamic range in queries where you might be looking at second to second or minute to minute, but you might also be looking at trends that last months or even years.

Audrey Lawrence 00:50:29 Yep.

Philip Winston 00:50:31 Great. Where can listeners find out more about you or Amazon Timestream?

Audrey Lawrence 00:50:37 Listeners can find out more about me, probably the best way would be to look me up on LinkedIn. My LinkedIn URL is just audlaw, A-U-D law, and I’m Audrey Lawrence on LinkedIn. Also, a little bit of a plug: if any of these problems sound interesting to you, we are hiring. I’m hiring software development engineers and software development managers for my team, and there are many really exciting problems we’re working on in this space. So, please ping me on LinkedIn if you’re curious about this. And then for Timestream itself, we have public docs that detail different use cases and different integrations that we have. Also, it’s really easy to go and get started: with a couple of clicks in the AWS console you can create a database, create a table, and even seed it with some sample data that we have for a couple of different use cases, and then start playing around with queries if you want. And then there are, I think on GitHub, code samples in different languages too, if you want to directly use our APIs in any of the AWS SDKs.

Philip Winston 00:51:47 Great. I’ll put your LinkedIn in the show notes and some links to some of these Timestream resources. It was great to have you on the show. Audrey, this is Philip Winston for Software Engineering Radio. Thank you for listening.

[End of Audio]


SE Radio theme music: “Broken Reality” by Kevin MacLeod (incompetech.com)
Licensed under Creative Commons: By Attribution 3.0
