SE Radio 272: Frances Perry on Apache Beam

Venue: Internet
Jeff Meyerson talks with Frances Perry about Apache Beam, a unified batch and stream processing model. Topics include a history of batch and stream processing, from MapReduce to the Lambda Architecture to the more recent Dataflow model, originally defined in a Google paper. Dataflow overcomes the problem of event time skew by using watermarks and other methods discussed between Jeff and Frances. Apache Beam lets users define pipelines in a way that is agnostic of the underlying execution engine, similar to how SQL provides a unified language for databases. This seeks to solve the churn and repeated work that has occurred in the rapidly evolving stream processing ecosystem.


Show Notes

Related Links

Transcript

Transcript brought to you by innoQ

This is Software Engineering Radio, the podcast for professional developers, on the web at SE-Radio.net. SE-Radio brings you relevant and detailed discussions of software engineering topics at least once a month. SE-Radio is brought to you by IEEE Software Magazine, online at computer.org/software.
* * *
Jeff Meyerson: [00:01:06.11] Frances Perry is a tech lead and manager for Google Cloud Dataflow, a unified batch and streaming programming model. She was previously a developer on the Flume project, an internal data processing infrastructure, and she is a co-author of the Google Dataflow paper. Frances, welcome to Software Engineering Radio!
Frances Perry: [00:01:25.00] Thank you! Looking forward to chatting.
Jeff Meyerson: [00:01:27.06] Absolutely. What is streaming?
Frances Perry: [00:01:32.19] Stream processing is basically a style of data processing where you’re continually processing data. It’s coming in live, and you’re producing live results continuously.
Jeff Meyerson: [00:01:42.05] A year ago I was doing several shows about streaming frameworks and streaming systems, and I continually asked the question “What is the difference between batch and streaming?” Different people have different answers to that question. What is your response to that question?
Frances Perry: [00:02:00.18] I want to take that even one step further – that’s the wrong question to be asking. The way we encourage people to think about data processing is that you want to separate the shape of your data from how you choose to execute it. In that sense, we’re encouraging folks to use the terms “bounded” and “unbounded” when they refer to their data. If your data is bounded, you have it all right now; you can just grab some file pattern off some file system, and there’s the data that you wanna process. If your data is unbounded, it’s continually streaming in.
Given those two things, there are multiple ways in which you could choose to execute that data. There are the traditional batch-style systems that everyone’s familiar with (Hadoop, MapReduce) and there are the more real-time systems that people are familiar with. I don’t think of those as two separate things, I think of it as a spectrum. There’s a really big continuum between processing all your data at once and processing it continually in small records, and in the middle there’s the style that people call “incremental batch” or “micro-batch streaming”, where you’re processing small subsets of the data at a time. It’s like a cron job over little batch jobs.
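To make the bounded/unbounded distinction concrete, here is a minimal sketch using the Beam Java SDK, which comes up later in the episode. The file pattern and Pub/Sub topic names are hypothetical, and the unbounded read assumes Beam’s Google Cloud I/O module:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class BoundedVsUnbounded {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Bounded: a file pattern on a filesystem -- all the data exists right now.
    PCollection<String> bounded = p.apply("ReadFiles",
        TextIO.read().from("gs://my-bucket/scores-*.txt"));

    // Unbounded: a Pub/Sub topic -- data keeps arriving indefinitely.
    PCollection<String> unbounded = p.apply("ReadTopic",
        PubsubIO.readStrings().fromTopic("projects/my-project/topics/scores"));

    // Downstream transforms can be written once against either collection;
    // the runner decides how to execute the bounded or unbounded read.
    p.run();
  }
}
```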
Jeff Meyerson: [00:03:18.09] You’re talking about this unbounded data. In streaming systems the data is often unbounded, it never stops coming. From the point of view of the server, you’re getting infinite data. How does handling unbounded data streams differ from handling a finite data stream?
Frances Perry: [00:03:37.17] The thing that makes it really tricky and the place where it gets really fun is when we start looking into the distinction between event time and processing time. Event time is the time the data was generated, when the event happened; processing time is when it arrives in the system for processing.
To understand that, it’s really helpful to use an example. I didn’t grok this myself until I had a good, concrete use case. Let’s imagine we are building a mobile gaming app. We’ve got users on their mobile phones around the world, playing some sort of game, doing some sort of mind-numbing task over and over again, earning points; they’re crushing candy or slaying zombies, or something. Now, those users are earning points for their team, and we want to be able to process that data and figure out what the team scores are over time.
[00:04:26.00] If you look at that, when the user is playing the game, they crush the candy and they earn some points, that’s the event time – the time when the score happened. The processing time though is when that arrives in our system. If they’re online and nothing is wonky in the network, the data will actually arrive for processing very quickly. But there are many reasons that that can slow down. Perhaps there’s a bit of network congestion, and some of the elements arrive out of order in our system; we’re getting some elements from some users slightly delayed.
Even more commonly in mobile gaming though, we want these games to be working in offline mode. When the user playing this game is in seat 26B on an airplane flight (airplane mode) and they land eight hours later, that’s when we get the score. At this point, our system is processing data – some of it is fresh, happening right now, and some of it might actually be eight hours old, and it’s all coming in at the same time. That distinction between when the data happened and when it arrives makes streaming and real-time systems quite complex.
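As a minimal illustration of the two timestamps she describes, here is a plain-Java sketch (not Beam code, and the ScoreEvent type is made up for illustration; it assumes Java 16+ records):

```java
import java.time.Duration;
import java.time.Instant;

public class Skew {
    // A score event carries the time it happened on the phone (event time).
    record ScoreEvent(String team, int points, Instant eventTime) {}

    public static void main(String[] args) {
        // A score earned in airplane mode eight hours ago, arriving only now.
        ScoreEvent e = new ScoreEvent("red", 12, Instant.now().minus(Duration.ofHours(8)));
        Instant processingTime = Instant.now();  // when the system sees it
        Duration skew = Duration.between(e.eventTime(), processingTime);
        System.out.println("Event time skew: " + skew);  // roughly PT8H
    }
}
```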
Jeff Meyerson: [00:05:31.24] The gaming example is interesting; you can think of gaming as not a serious application, but it seems paradigmatically similar to the types of applications we might be building in the future, because there is such a large influx of data. Would you say that’s accurate? Are our applications going to become increasingly data-intensive and more “real-time”?
Frances Perry: [00:06:01.03] Absolutely. People start to figure out what kinds of insights you can gather from data, and they gather more and more because they see how valuable it can be. At the same time, we’re doing so many more things in a distributed fashion. You’re not just dealing with a couple of servers gathering data, you’ve got people on mobile devices and everything becomes much more distributed. When you’re dealing with distributed systems, you have to be able to deal with the uncertainties that come with that.
Jeff Meyerson: [00:06:29.21] One uncertainty that you have been talking about is event time skew. Can you define that in more detail?
Frances Perry: [00:06:37.16] Event time skew is the difference, for an element, between when it happened and when it arrives in the system. When you’re dealing with large amounts of data, there’s the skew that you’re generally seeing, and then some elements may be even worse than that.
Jeff Meyerson: [00:06:56.13] Another way of putting it – event time skew is the difference between the processing time and the actual event time. Would you say that’s accurate?
Frances Perry: [00:07:04.19] Yes.
Jeff Meyerson: [00:07:06.00] Okay. How significant of a problem is this event time skew?
Frances Perry: [00:07:11.08] It really depends on your application. For many situations you’re absolutely fine dealing with processing time, if that’s enough for the algorithm that you’re trying to process your data with. In other cases, if you’re trying to talk about events in relation to the other events that happened at the same time, event time skews are really going to mess you up.
Let’s say we’re trying to calculate the total score for one of our teams in a particular hour. We’ve calculated that result, and if we’re doing a streaming system – presumably low latency is important to us; we want to be able to get you these results in near real time.
[00:07:48.15] Maybe we’ve got a dashboard and we’re trying to show which teams are currently at the top of the ranking. If we go ahead and sum the scores from 12 to 1 o’clock and we give you that result, and then eight hours later one of those team members’ flights lands, they come back online and we get more scores – it’s now 8 PM and we need to figure out how to go back and deal with the fact that the result we generated at 1 PM was wrong, because we’ve got additional updates that came in and we’d like to refine that result.
Jeff Meyerson: [00:08:18.27] I want to spend more time talking about the event time skew and how we deal with that, but I’d like to take a step back and talk a little bit more about streaming systems in general and historically, particularly for the listeners who are a little unfamiliar with streaming systems. Streaming systems are sometimes associated with this idea that you get incomplete or approximate results because there is this giant influx of data. Why has there historically been a notion that streaming systems give these incomplete or approximate answers?
Frances Perry: [00:08:54.19] I think because it is so difficult to deal with these late-arriving elements and to reason about them in a way that makes sense. People have kind of thrown up their hands and said, “You know what, we’ll just assume that when those late elements come in after some threshold, we’ll just ignore them. I know that my system’s going to be slightly lossy, but it’s still great to get low latency, so that’s a tradeoff I’m willing to make.”
What we want people to start thinking about though is that that tradeoff isn’t necessary. With the right abstractions and ways of reasoning about these complex systems we can actually get both correctness and still be able to handle late-arriving data.
Jeff Meyerson: [00:09:33.22] One of the band-aids that people have used sometimes is this lambda architecture, which uses batch to provide eventually correct results that can complement that streaming inconsistency. Why has this been a thing that people have used? Why are people using batch to complement their streaming inconsistency?
Frances Perry: [00:10:02.10] When your streaming runtime isn’t able to give you guaranteed correct results, the tradeoff you’re often willing to make is that “Every night I’ll run a batch job to get my correct results. During the day, because I’d like to see something more real-time, I’ll be okay with approximate results. Then basically I’ll throw those away at the end of the day and rerun my batch program to get the correct updates.”
A system like that works, but we think this can be unnecessary. It’s also much more difficult to manage, because you have to manage both a streaming and a batch system. Often you end up having to write your computation in two different ways, with two different APIs and two different sets of concepts, and that’s unnecessary.
Jeff Meyerson: [00:10:48.16] Why is it unnecessary? Why isn’t the lambda architecture good enough? How can we roll the functionality of both a batch and a streaming system into one?
Frances Perry: [00:11:00.23] What it comes down to is that when you have the right set of abstractions and concepts for reasoning about these types of programs, you can do both in one. There’s just no need to do them separately.
That’s what’s at the root of what used to be called the Dataflow model. The Dataflow model is something that evolved out of a lot of work internally at Google. We actually had a very similar setup to the lambda architecture for a while, where we had MapReduce – which we’d used for many years – and we evolved that into a system called Flume. Flume put a higher-level, graph-style API on top of MapReduce to make it much easier to program. Then separately we evolved some real-time systems; MillWheel was our continuous real-time processing system.
[00:11:49.28] What we found is that so many teams inside Google had both – they had their MapReduce job they were running nightly for those correct results, and then they were also running a MillWheel system for more real-time results. MillWheel itself can actually do a really good job of not losing data. It started to develop those right concepts that let us realize that we could have actually correct streaming results. When we realized that, we started to merge the two systems and see if we could make it so that people didn’t have to do both. They could use one system and then tune – after they wrote their algorithm – the latency requirement that they need for their particular use case.
Jeff Meyerson: [00:12:31.06] It’s interesting that the lambda architecture was essentially in place at Google at the same time that people were talking about it outside of Google. The discussions around it originated from Nathan Marz, who I don’t think ever worked at Google… But it’s funny how these things develop in parallel at different places.
Frances Perry: [00:12:50.17] Exactly. Engineers see the problem… Everyone was trying to solve the same things. Internally, we did a huge amount to innovate in how we did big data processing, but we were heads down in doing it in our very internal, homogeneous environment. Obviously, the external community took the MapReduce paper and a number of other ideas and ran with that too, and it’s great to see they ended up going in similar directions.
Jeff Meyerson: [00:13:15.21] One way to articulate why the lambda architecture is probably not the best sustainable solution is that it has this assumption baked in that we can look at our data in a periodic fashion. We can just think of these daily jobs that process our data as an okay thing, but one quote from the Dataflow paper that counters this is “We as a field must stop trying to groom unbounded data sets into finite pools of information that eventually become complete. Instead, we must live and breathe under the assumption that we will never know if or when we have seen all of our data.” Could you explain this quote in more detail?
Frances Perry: [00:14:03.26] Yes. One of the things we want people to start thinking about is that even your traditional notion of batch processing on bounded data is often really processing unbounded data. When you have your log files and you’re taking your logs from Tuesday and running a nightly cron job on them, and then the logs from Wednesday and running a nightly cron job – people think of that as batch processing, but really what they’ve done is they’ve taken an infinite stream of data and forced it into these finite chunks; they’ve artificially divided it into finite pieces so that you can then process it with a batch system. There are actually a number of drawbacks to that.
[00:14:43.26] For example, you’re dealing with the latency of the batch system, so if you’re trying to calculate the top users per hour, by 1 AM you may have most of the data needed to do that for that hour. But if your nightly batch job isn’t going to run until midnight of that day, you’re waiting 23 unnecessary hours for that data.
Another issue is that by artificially putting in these boundaries between the days to separate your infinite, unbounded data into finite chunks, you may lose the ability to analyze certain portions of your data. There’s a concept called a session, which is a very common way that people analyze web logs and so on. A session basically gathers up a burst of user activity. If somebody was playing our mobile game and they were playing for 20 minutes, and then they went to get a cup of coffee and chat with someone, and then they came back and played for another few minutes, we might want to treat that as two different user sessions or activities.
[00:15:42.14] Usually, when you’re trying to analyze those, you’re separating them by the time the user was offline between the activities. But when we force our data into these fixed chunks, we lose the ability to track sessions that cross that boundary. For example, a user starts playing our game at 11:55 PM and plays until 1 AM. We might lose the fact that that was a single session if we process the first day’s data completely separately from the second day’s data.
Jeff Meyerson: [00:16:10.21] The Dataflow paper also talks about how the programmer is always making tradeoffs between correctness, latency and cost when you are programming against an unbounded data flow. Could you explain how these three desirable characteristics are typically traded off?
Frances Perry: [00:16:34.02] Completeness is exactly that notion of how to deal with late data. If I want to give you the total score for the teams in this hour, if I do it right when that hour ends, I may risk lack of completeness because there might be late data that comes in after that point. That’s a tradeoff I can make.
If I want to be completely sure that I’ve gotten all the data in that time period, I could wait. Maybe an hour is enough, maybe two, maybe a week. At some point you have to figure out how long you want to wait to ensure completeness, and you can see how that immediately counteracts latency. Because the longer I’m waiting for completeness, the longer I have to wait for any sort of result. Often, that’s not what you want. You want speculative results – “So far, I know there’s this score” – and then you want to evolve that result and refine it over time as additional data comes in.
[00:17:30.13] With both of those there’s going to be a cost aspect, because the more data you’re buffering, the more often you’re computing it, the more computing resources you need to put behind that.
Jeff Meyerson: [00:17:41.25] I want to come back to talking about dealing with the event time skew. You talked about sessions. There’s also this concept of windowing – partitioning our data set along temporal boundaries. Is sessioning a version of windowing?
Frances Perry: [00:18:08.06] Yes, absolutely. Let me separate that into two pieces. We always want to think about having these two time dimensions: event time and processing time. The notion of windowing in the Dataflow model divides your data up into finite chunks of event time. There are multiple ways to do that. Fixed windows are the most common – every hour’s worth of data, every minute’s worth of data. Again, this isn’t when the data arrives, this is when the data occurred. There are sliding windows – every hour, give me the last 24 hours’ worth of data. And there are sessions – “Give me bursts of user activity.”
[00:18:49.26] The other dimension that you might want to divide things up on is processing time. There’s a separate notion called “triggering” that we use to divide up processing time.
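Here is a hedged sketch of those three windowing strategies in the Beam Java SDK; the wrapper class, method names and durations are illustrative, while the windowing transforms themselves are real Beam APIs:

```java
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Sessions;
import org.apache.beam.sdk.transforms.windowing.SlidingWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

class WindowingExamples {
  // Fixed windows: one chunk per hour of *event* time.
  static PCollection<KV<String, Integer>> fixed(PCollection<KV<String, Integer>> scores) {
    return scores.apply(Window.<KV<String, Integer>>into(
        FixedWindows.of(Duration.standardHours(1))));
  }

  // Sliding windows: every hour, the last 24 hours' worth of data.
  static PCollection<KV<String, Integer>> sliding(PCollection<KV<String, Integer>> scores) {
    return scores.apply(Window.<KV<String, Integer>>into(
        SlidingWindows.of(Duration.standardHours(24)).every(Duration.standardHours(1))));
  }

  // Sessions: bursts of activity per key, closed by a gap of inactivity.
  static PCollection<KV<String, Integer>> sessions(PCollection<KV<String, Integer>> scores) {
    return scores.apply(Window.<KV<String, Integer>>into(
        Sessions.withGapDuration(Duration.standardMinutes(10))));
  }
}
```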
Jeff Meyerson: [00:19:00.29] As we’re talking about these different windowing strategies, these strike me as things that were around even in the streaming systems that acknowledged that we were getting some inconsistencies – the streaming systems that were complemented with batch in order to have the lambda architecture. How can we move towards a conversation about how windowing is used to give our streaming systems a higher degree of correctness?
Frances Perry: [00:19:32.26] People have used the notion of windowing for a very long time, but what’s been a more recent shift is this distinction between event time and processing time. Often when people used windowing in the past, they were just referring to “When my data arrived”, so more processing time-based windows. But we’ve really seen a shift recently in streaming systems towards recognizing that this distinction is important.
Jeff Meyerson: [00:19:56.23] I see. Can you give an example for how we might use windowing to improve the resolution of our data in that type of gaming application?
Frances Perry: [00:20:08.18] In that gaming application what you need to do is figure out first of all what kinds of results you’d like. There are multiple ways we could choose to aggregate. There are two basic types of processing that you wanna do. The first style is element-wise processing. That’s something like filtering or parsing or normalizing data – something that requires you only to look at a single element. We’re very used to that in MapReduce; that’s your map function. You’re just taking individual elements and processing them.
[00:20:42.04] Doing that in a streaming system is pretty easy. You just need to be able to scale to handle the data volumes, and then you just apply this function to each element as it goes past. It’s the aggregations that get more complicated, and this is where you’re looking at multiple elements at a time – the count of all scores for this team. You need to take all these individual scores that have the same team and sum them together. When you’re doing that, these aggregations, this is where you need some notion of dividing your data up so you can make progress.
[00:21:15.12] The question “What is the total score for this team over all time?” doesn’t make sense, because you’re not going to ever get to the point where you have that result. So windowing is a way to let you start doing those aggregations.
Jeff Meyerson: [00:21:28.29] Can we talk about watermarks and what watermarks do for our streaming systems?
Frances Perry: [00:21:34.14] Absolutely. This goes back to that event time skew. When you are trying to process data in fixed windows – “I want to process the data that happened in every hour. I want all the scores that happened between noon and 1 o’clock, and I’m going to sum those together.” Now, you immediately have to decide “When can I do that sum? At 1 o’clock I could sum it, but I probably won’t have all the data if I’m assuming my data can arrive out of order. So when do I know that it’s good enough? Should I go at 1:15, 1:30, 3 o’clock?” You have to pick that point when you think you’ve seen all the data.
[00:22:14.00] The watermark helps you track that. Conceptually, a perfect watermark would let you know exactly when you’ve seen all the data for a given time. “At this point you will see no more data that happened between 12 and 1 o’clock” is what the watermark would tell you.
A perfect watermark is very unlikely. There are some systems that could support it. For example, if you’ve got an append-only [unintelligible 00:22:38.01] log file and you’re time-stamping elements based on when they get appended into that log file, you are able to read those in order. But whenever you’re dealing with a distributed system you more likely have to use a heuristic watermark, which is the system’s best guess of data completeness.
[00:22:55.02] The system will track how data is flowing into the system and give its best guess that “Yes, now I think I’ve seen all the data from 12 to 1 o’clock.” Maybe that happens at 1:10, and at this point your system says “Great, I can go ahead and sum those results and give you that output.”
Jeff Meyerson: [00:23:11.22] So a definition of a watermark might be that a watermark lets us track how completely we have processed our event data with respect to event times. A watermark with a value of time X indicates that all input data with event times less than X have already been observed.
Frances Perry: [00:23:35.10] Exactly.
Jeff Meyerson: [00:23:36.22] Could you talk more about what the advantages of using watermarks are?
Frances Perry: [00:23:40.25] Watermarks let you very carefully track this distinction between event and processing time. Without them you have to just make a call, like “I’m going to wait two hours, and then I will go ahead and sum that score for that fixed window that I want, and produce it.” When you do that, you are not going to be able to adapt if the data becomes really late; maybe there’s a fiber cut across the Atlantic and a whole bunch of stuff slows down. You’ll still just go ahead and too eagerly sum stuff and send the result down the pipe.
[00:24:12.14] On the other hand, you may also have to be too conservative a lot of the time. You’re introducing this unnecessary latency. Without the watermark, you have to pick a time when you’re going to throw up your hands and just move on.
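A sketch of what this looks like in the Beam Java SDK: instead of a hard-coded delay, the trigger fires when the watermark says the hour is complete, and an allowed-lateness setting lets late scores (like the eight-hour flight) update the result instead of being dropped. The class name and the eight-hour value are illustrative choices, not from the episode:

```java
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.joda.time.Duration;

class WatermarkTrigger {
    // Emit each hourly sum when the watermark says that hour is complete,
    // then re-emit an updated sum for any late element that still arrives
    // within the allowed lateness, instead of silently dropping it.
    static Window<KV<String, Integer>> hourlyOnWatermark() {
        return Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardHours(1)))
            .triggering(AfterWatermark.pastEndOfWindow()
                .withLateFirings(AfterPane.elementCountAtLeast(1)))
            .withAllowedLateness(Duration.standardHours(8))
            .accumulatingFiredPanes();
    }
}
```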
Jeff Meyerson: [00:24:26.06] What is a trigger?
Frances Perry: [00:24:28.17] A trigger is the opposite of a window, in that it applies to processing time. There are some times when the event time isn’t necessarily what you want. If I’m just calculating a global score for each team, it doesn’t really matter to me when the elements happened; I can just say, “Every ten minutes of processing time, take whatever data I’ve gotten and sum it up, and go ahead and emit that result.” Triggering basically is that other axis, and it can also interact with event time.
[00:25:02.24] I might want to say, “I’m summing scores every day. I want to use the event time and get daily fixed windows.” That would basically say, “Once I’ve seen all the data for Tuesday and the watermark has passed and I think I’m done getting data for Tuesday, I’m going to emit that result.” But if I do that, I’ve got very high latency, so I can also use a trigger to say, “Every hour I’d like an update of how that sum for Tuesday is progressing.”
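Her daily-windows-with-hourly-updates example might look like this in the Beam Java SDK; a minimal sketch, with the durations taken from her description and the class name made up:

```java
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.joda.time.Duration;

class DailyWithHourlyUpdates {
    // Daily event-time windows, with a speculative result every hour of
    // processing time and a final result once the watermark passes the day.
    static Window<KV<String, Integer>> windowing() {
        return Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardDays(1)))
            .triggering(AfterWatermark.pastEndOfWindow()
                .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                    .plusDelayOf(Duration.standardHours(1))))
            .withAllowedLateness(Duration.ZERO)
            .accumulatingFiredPanes();
    }
}
```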
Jeff Meyerson: [00:26:04.11] Can you talk a little more about the architecture for how these events are being placed from the client to the server? For example, is it like the client is constantly streaming off events from the game, or from an Uber application or some other real-time application with unbounded data, and the server is aggregating all of these events in a queue and deciding when to process those events based on windowing? Give me a better picture of what that architecture looks like and how the server is determining where these windows are being imposed.
Frances Perry: [00:26:57.11] The watermark is actually a per-source concept. Each input source defines its own definition of a watermark and how you can track what’s going on with that source. The system will be asking each of the individual sources about their watermark, and then using that to calculate a global system watermark. We do want to try to separate here very much the Dataflow or Beam model concepts – which are generally how you want to describe this data – from the particular system details.
[00:27:29.11] At some point we should talk a little bit more about Apache Beam and the distinction between that and Dataflow, and I can go into details on that then.
Jeff Meyerson: [00:27:35.07] Absolutely, we should talk about that now. We’ve been talking about streaming in a somewhat general context, but what we are really talking about is the Dataflow model of streaming. Can you explain how this Dataflow model that we’ve been discussing contrasts with the other streaming models that were prominent before Dataflow?
Frances Perry: [00:28:02.10] The Dataflow model is what evolved out of this internal work at Google that took MapReduce, Flume, MillWheel and all these systems, and designed this one that unified batch and streaming use cases into a single model. This is where we realized that if we can come up with the right set of concepts for letting you reason about this style of programming, you won’t have to do it differently for streaming and batch.
[00:28:29.20] The Dataflow model really focuses on that unification, and it works based on four main questions. The first one is what you’re computing – what is the algorithm you’re doing? Are you doing joins, histograms, counts, sums, that kind of thing? The second axis is “Where?”, which is our shorthand for “Where in event time? What do you want to do about the event time of the data? Do you want fixed windows, hourly windows, sliding windows?” The next one is “When?” – when in processing time? How often do you want to trigger your results? And finally, “How?” What do you want to do when multiple results are emitted for a window? When we refine a result over time, should those accumulate? How should those relate?
[00:29:21.27] What we’ve done is taken all these concepts and put them into a very clear framework that lets you clearly tune each of those concepts. You can build the same data processing algorithm that you want to do – that’s your “What?”. The bulk of what you’re doing is the “What?” question: “What am I trying to compute over my data?” Then you tune the other three questions to adapt that single algorithm from a traditional batch-style use – where you’ve got all your data, you just want to process it and get a single result out – to a more continuous, real-time use case where you have to start tuning how often and how complete you want your results to be.
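In Beam’s Java SDK, those four questions map onto four distinct pieces of a pipeline. A hedged sketch, with the class name, durations and the incoming `scores` collection of team/points pairs all assumed for illustration:

```java
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

class FourQuestions {
    static PCollection<KV<String, Integer>> teamScores(PCollection<KV<String, Integer>> scores) {
        return scores
            .apply(Window.<KV<String, Integer>>into(
                    FixedWindows.of(Duration.standardHours(1)))           // Where in event time?
                .triggering(AfterWatermark.pastEndOfWindow()
                    .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                        .plusDelayOf(Duration.standardMinutes(10))))      // When in processing time?
                .withAllowedLateness(Duration.standardDays(1))
                .accumulatingFiredPanes())                                // How do refinements relate?
            .apply(Sum.integersPerKey());                                 // What are you computing?
    }
}
```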
Jeff Meyerson: [00:30:04.05] What is Apache Beam?
Frances Perry: [00:30:08.19] This is a lot of fun. We worked on Dataflow; it came out based on this internal evolution of work we’d been doing, and we also created Google Cloud Dataflow, which is part of Google Cloud Platform, and it is a system for executing these pipelines. There are two separate concepts here. There is the programming model, which is the SDKs and the APIs that let you express your data processing pipelines, and then there’s the service for actually executing those pipelines.
[00:30:40.07] What happened when we designed this is that the model itself is a key way to use the service. The model is what gives you these really intuitive ways of writing these pipelines, the right tools for reasoning about event and processing time, and it also has the right hooks so that the system can very efficiently execute these things. But by designing a new SDK and a new programming model, you really make it hard for users to adapt and start using a system, because there’s a huge barrier for users to rewrite their programs, to learn these concepts and to invest in that.
[00:31:18.18] What we realized is that this programming model itself has a huge amount of value, and if we contributed it to the open source community, we could let that model grow and flourish, and at the same time we could be one of the systems and places to run it. What we did is take that Dataflow model and donate it to the Apache Software Foundation. It’s currently in incubation as Apache Beam. Beam is now the name for what we used to call the Dataflow model.
[00:31:52.03] What we’re really excited about is we’re working with the community to build different runtime options. You can take this Beam program and you can run it on Cloud Dataflow, but you can also run it on other existing systems. You can run Beam pipelines on Apache Spark, you can run them on Apache Flink; we’ve just had a Gearpump runner added, and we’ve got interest in Apex and Storm and other systems.
Jeff Meyerson: [00:32:16.15] Is writing a program that is compliant with the Beam spec like writing a program in one specific language that can be put through an interpreter into a variety of other languages – where the analogues of the languages here would be these different processing systems like Flink or Spark?
Frances Perry: [00:32:39.29] Exactly. What’s in Beam is the description of your data. That’s where we use the terms “bounded” and “unbounded.” You’re talking about purely the properties of your data and how you’d like to process it. This is your API for expressing that. Then, when you’re ready to run that, you choose what you want to compile it into. Do you want to have it translated into Spark concepts and executed on Spark? Do you want to translate it into Flink and run it there? And so on.
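A sketch of how that choice is typically made at launch time with the Beam Java SDK: the pipeline code stays engine-agnostic, and the runner is selected via a flag. The class name is illustrative; the runner names in the comments are real Beam runners:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class RunAnywhere {
    public static void main(String[] args) {
        // The same pipeline definition; the engine is chosen at launch time:
        //   --runner=DirectRunner    (local, in-memory)
        //   --runner=FlinkRunner     (translate to Apache Flink)
        //   --runner=SparkRunner     (translate to Apache Spark)
        //   --runner=DataflowRunner  (run on Google Cloud Dataflow)
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline p = Pipeline.create(options);
        // ... build the engine-agnostic Beam pipeline here ...
        p.run().waitUntilFinish();
    }
}
```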
Jeff Meyerson: [00:33:08.06] Let’s go through an example with that game. If we wanted to use the Beam model on our game, what kinds of abstractions do we need to define across the events of our game?
Frances Perry: [00:33:29.13] Let’s start with answering the “What” question. We’re going to do simple score summing, so we’ll take our input data, we’ll parse each element to extract the team and the number of points scored for that team, and then we’ll do a sum to get the total score per team. So that’s our “What”, that’s the first question. We could just take that and run it in a traditional batch style over some large amount of input data. But now, if we want to evolve that, we can start answering those other questions; we can answer the “Where”, the windowing, and we can take that same algorithm that we wrote, add one or maybe two lines of code, depending on how long you like your lines, and say “Now I’ll do this in fixed windows, so every hour give me a different sum.” At this point we’re starting to take that single algorithm and evolve it.
[00:34:19.05] Then we can go further and adapt it into a streaming pipeline by saying how often we want to trigger those results. “I’d like hourly windows, and I’d like to know how those are progressing every ten minutes. Every ten minutes, give me an update on what my hourly windows look like.”
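A hedged sketch of that evolution in the Beam Java SDK; the “team,points” input format and the class and method names are illustrative:

```java
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

class HourlyTeamScores {
    // The "What": parse a raw line like "red,5" into a (team, points) pair.
    static class ParseScore extends DoFn<String, KV<String, Integer>> {
        @ProcessElement
        public void processElement(ProcessContext c) {
            String[] parts = c.element().split(",");
            c.output(KV.of(parts[0], Integer.parseInt(parts[1])));
        }
    }

    static PCollection<KV<String, Integer>> expand(PCollection<String> rawEvents) {
        return rawEvents
            .apply(ParDo.of(new ParseScore()))
            // The one added line -- the "Where" -- turns one global sum
            // into a separate sum per hour of event time.
            .apply(Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardHours(1))))
            .apply(Sum.integersPerKey());
    }
}
```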
Jeff Meyerson: [00:34:41.15] Why would it be useful to run this model on Storm, versus Spark, versus Flink, versus Dataflow? Once we define this unified programming model version of the streaming feature of our application, how would these different processing systems contrast in the way they would actually execute that model?
Frances Perry: [00:35:09.20] This comes down to user choice. Some folks like to run on premise, some folks like to run in the cloud; some folks like to use open source, some folks want to be part of an integrated platform. This is really where your choice comes up. The more choice users have for where they’re processing their data, the better things are. Because they’ve got their data in different places, they’ve got different requirements. They want the freedom to move between those systems and not be locked in.
Jeff Meyerson: [00:35:38.28] Is it just about the lock-in, or are there certain types of operations that might run faster on Spark, or might run faster on Dataflow?
Frances Perry: [00:35:48.09] Yes, absolutely. What we’re trying to do here is two things. We want to generalize how you reason about this style of programming into the most intuitive model that we can. That’s the first step. We drive the Beam model based purely on what makes sense for reasoning about this style of programming. Then we want to give you the portability of running that model across multiple environments, but each of those environments is a little bit different. Every one of those systems has things it can and cannot do as well as the others, so that’s where we work really hard to categorize which features of the Beam model each of those runtimes will be able to support.
Jeff Meyerson: [00:36:30.09] What are the streaming systems that do conform to the Beam model?
Frances Perry: [00:36:33.24] The closest one right now, with the cleanest alignment, is Apache Flink. It’s well documented that the Flink guys were really excited to read the Dataflow model paper, so they went off, took some of those concepts and integrated them deeply into Flink. Flink’s notions of event time and processing time align beautifully with the Beam model. Systems like Spark are moving in the right direction. Spark 2.0 has a lot of new features that will help align them, but it’s still getting there in terms of adding watermarks and so on.
Jeff Meyerson: [00:37:10.01] The Flink team was enthusiastic about how well the Dataflow model provided a desirable way of modeling those data processing pipelines. How did Flink’s data processing work prior to its implementation of the Beam model?
Frances Perry: [00:37:31.16] That’s a great question, and one I don’t have the background to answer.
Jeff Meyerson: [00:37:35.24] Okay, fair enough. Do you have much knowledge on how they went about making their processing systems Beam-compliant?
Frances Perry: [00:37:46.00] We definitely had some early chats with them when the paper got out, and then we ended up having some Hangouts to chat with them and talk about these concepts. The thing I loved is that the Flink guys are wicked smart. They grok this stuff deeply, so we were really able to have these deep, technical conversations with them, which was a lot of fun. They ended up evolving in a similar direction, and we were able to refine the two and really get them to align.
Jeff Meyerson: [00:38:15.04] Can you tell me what is involved in making a system Beam-compliant? For example, what do both Apache Flink and Apache Spark have to implement in order to make their APIs Beam-compliant?
Frances Perry: [00:38:30.22] I think the core concepts in Beam are things like element-wise processing. That’s very standard since the MapReduce days; most systems can do that very easily. Then we’ve got the different styles of aggregation and windowing, and this is where systems start to diverge a little bit in their ability to support the Beam model. Some will be able to do that and some won’t as much. We’ve laid out on the Beam site a really nice matrix of all the concepts and the current status of support in each of these different runners.
Jeff Meyerson: [00:39:04.27] Let’s talk about that term more. What is a Beam pipeline runner?
Frances Perry: [00:39:10.19] You write your Beam pipeline – and we want that to be only about the shape of your data and your algorithms and your business logic; then the Beam runner is the thing that takes that and actually executes it for you. Now, most Beam runners want to do this at scale, so they shell out to existing systems. Most runners will take that generalized Beam pipeline and translate it into the primitives used by a system like Spark or Flink and call out to that system for you.
[00:39:40.14] We do have one runner that actually does its own implementation. That’s called the direct runner, and that’s just your simple in-memory runner. It’s great for testing – unit tests, integration tests, development and so on. But most of the runners that actually do this at scale are really wrappers around other distributed processing environments.
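For instance, here is a sketch of a unit test that executes in-memory via the direct runner, using Beam’s TestPipeline and PAssert utilities (JUnit 4 style; the class name and test values are made up):

```java
import org.apache.beam.sdk.testing.PAssert;
import org.apache.beam.sdk.testing.TestPipeline;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.junit.Rule;
import org.junit.Test;

public class TeamScoreTest {
    @Rule public final transient TestPipeline p = TestPipeline.create();

    @Test
    public void sumsScoresPerTeam() {
        PCollection<KV<String, Integer>> sums = p
            .apply(Create.of(KV.of("red", 3), KV.of("red", 4), KV.of("blue", 7)))
            .apply(Sum.integersPerKey());

        PAssert.that(sums).containsInAnyOrder(KV.of("red", 7), KV.of("blue", 7));
        p.run();  // executes in-memory on the direct runner by default
    }
}
```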
Jeff Meyerson: [00:40:02.10] When I’m defining my Beam pipeline, what does it look like? Is there a file format or a configuration style that I use that will work across all of the different Beam pipeline runners that have been implemented?
Frances Perry: [00:40:23.10] To answer that, I’d like to cover the short term and the long term. In the short term, what you do to write a Beam pipeline is use the Java SDK, which lets you say things like, “I’m going to create a pipeline. I’m going to read from this type of data, I’m going to count it, I’m going to join it, I’m going to window it like this, I’m going to write it over there…” Basically, what you’re doing is writing a Java program that builds a DAG that describes the shape of your pipeline, and then the representation of that DAG is what is translated into each of the systems.
[00:40:56.23] That’s where we’re at right now – mainly Java-based. But our goal is to really support as many languages as you want. We’ve just actually gotten started on the Python SDK; it’s now in a feature branch on Beam. When we use the term “Beam programming model”, we want to use that instead of “SDK”, because the concepts in Beam are much more general. Which language you use to express those concepts is almost just an implementation detail.
[00:41:22.06] Where we’d like to get to is the part where we have a language-independent representation of a Beam pipeline, which means we can have Beam SDKs in multiple languages, that will go ahead and generate that graph description that describes the pipeline and the data shapes. Then the runners can pick that up and go ahead and run them. That will make it very easy for users to choose the language that works best for whatever else they’re doing, and the runtime that works best.
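A minimal sketch of that “read it, count it, write it” DAG in the Java SDK; the class name and the input and output paths are hypothetical:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class CountDag {
    public static void main(String[] args) {
        // Each apply() adds a node to the DAG; nothing runs until p.run().
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
        p.apply("Read", TextIO.read().from("gs://my-bucket/input-*.txt"))
         .apply("Count", Count.perElement())
         .apply("Format", MapElements.into(TypeDescriptors.strings())
             .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
         .apply("Write", TextIO.write().to("gs://my-bucket/output"));
        p.run().waitUntilFinish();
    }
}
```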
Jeff Meyerson: [00:41:50.18] There are plenty of companies that have a streaming system already in place, whether it’s Flink or Spark or Storm. Is one of the visions for Apache Beam to be a way to move forward with new streaming systems while maintaining backward compatibility with your old streaming systems?
Frances Perry: [00:42:14.24] Unfortunately it’s a little more complicated than that, because in order to use Beam there is a step of rewriting your logic into Beam. If you have a Flink cluster that you’ve got up and running and you’re really comfortable with it, you’re welcome to run Beam on that Flink cluster, but it is going to require some effort on your part to frontload the work to use Beam.
[00:42:38.27] It’s not as easy a transition when you’ve got a bunch of business logic in one system already, and that’s a real shame. That’s part of what makes this a hard project, and it’s where we really want to get the community involved – working to help folks understand why this model is an improvement over the previous ways people have expressed these programs.
Jeff Meyerson: [00:43:02.00] What are the discussions with the community like? Why is it important for Beam to be an Apache project?
Frances Perry: [00:43:10.06] The Apache Software Foundation is where all the big data stuff is. Everything’s there – Hadoop, Spark, data formats like Parquet and Avro, Storm, Cassandra, Flink… Plenty of things are there, so it’s absolutely the right group of data processing geeks for us to join in with. What we’re really looking forward to is what happens when we try to generalize this model across all these different systems.
[00:43:39.09] The beginnings of the Beam model obviously came out of Cloud Dataflow, so it was very much tied to how Cloud Dataflow chose to execute things. But when we generalize that and we see the diversity of the ways these different systems work, it really helps you to separate out what are properties of your data and your algorithm from what are properties of that specific runtime.
Jeff Meyerson: [00:44:00.13] Can you help me understand a little bit more — we touched on this question a little bit, but what are the pervasive strengths and weaknesses of these different streaming systems? Once we have this unified programming model that gives the top level description of how we are creating our data pipelines, how should we be evaluating which systems to run that data model on top of?
Frances Perry: [00:44:36.22] There it will come down to which features of Beam are really important for your use case. As we’ve already talked about, some systems do a better job of distinguishing event time and processing time. It’s very clear that the streaming community as a whole is moving in that direction and being able to distinguish those two, but some systems are slightly further along than others in their ability to do that.
[00:45:00.21] There’s also dimensions like “Do you need your processing to be at most once, at least once, or exactly once?” There are systems that are able to work in real time and get data out fast by potentially duplicating elements and double counting things, or potentially losing elements. For many applications, that’s okay. If you just want rough heuristics about your data, that may be a tradeoff you’re totally willing to make. But other systems are going to be able to provide exactly-once guarantees. That might be what you want if you’re running a billing pipeline, or something where money is involved.
Jeff Meyerson: [00:45:37.06] So this exactly-once processing – you could call it strong consistency – why is this form of correctness often challenging in streaming systems?
Frances Perry: [00:45:46.24] You have to be able to hand off the data and handshake with the next part of the system as the data goes through the pipeline. Basically, you’re reading an element, you’re processing it, and you want to make sure that you don’t ack back to whoever you read it from until you have persisted it somewhere, because machines fail. It happens all the time – things go offline, network connections go down… How are you going to make sure that failures don’t cause data loss? That’s really the trick.
Jeff Meyerson: [00:46:16.17] Is this the equivalent of three-phase commit in a database transaction?
Frances Perry: [00:46:20.25] Right, something like that. As you’re tracking [unintelligible 00:46:21.28] about these different time dimensions and buffering things across windows, it’s pretty complicated to keep track of this across different stages in the pipeline.
Jeff Meyerson: [00:46:35.29] Tell me more about your work on Cloud Dataflow. Cloud Dataflow is the Google-hosted version of Dataflow. What is the Cloud Dataflow project?
Frances Perry: [00:46:47.09] Cloud Dataflow is a part of Google Cloud Platform. You can think of it, as we finish this transition towards Apache Beam, as being a fully-managed service for executing Beam pipelines. In other words, you describe the shape of your data, and then we’ll deal with all the nastiness under the hood. We’ll provision the machines, spin them up, tear them down, and scale the number of machines dynamically as you’re executing, which is a huge deal.
[00:47:16.06] In a batch pipeline you often see that the amount of data that you’re processing changes throughout the life of the pipeline. You might start with a huge amount of data that you’re filtering down to a very small amount of data, which then somehow expands to a really large amount of data. Through those different phases you might want different numbers of machines. You don’t want to be over-provisioning and paying for more machines than you need. If you’re under-provisioning, then you’re going slower than you have to. Being able to dynamically adapt to that is a huge benefit.
Jeff Meyerson: [00:47:46.12] Is that to say that for the streaming applications we’re writing today, we have to write a lot of server management code that maybe won’t have to be written on top of the future cloud platforms?
Frances Perry: [00:48:03.17] Yes. We really want to get into the mode where the system is a no knobs experience. You don’t have to tune the amount of RAM you’re using, the number of machines you’re using, how they’re configured… You want to be focusing – as an application developer, as a data processing person – on your data and your algorithms. You don’t want to be futzing with the system that’s running that for you.
Jeff Meyerson: [00:48:26.01] What does building that as an automated feature look like from your point of view?
Frances Perry: [00:48:33.00] It’s a really interesting back and forth between the programming model and the service. If you think about what makes these systems complicated, it’s that you’re not just giving them a couple of instructions and letting them go off and calculate something for you; you’re giving them bits of your own serialized code to execute. These systems are handling getting the data from one place to another and running this stage and then that – they’re doing that management, but the code that describes what you’re doing to every element, how you’re aggregating, all that is written by the user.
[00:49:06.22] That means that when you implement a system to do this, the system’s almost dealing with all these little black boxes that are your user code. It’s taking data and giving it to your user code, and then taking it from that and giving it to something else, and what it doesn’t know is how that user code is going to behave. That makes it very difficult to, for example, parallelize. I don’t know how long your code is going to run, I don’t know how many machines to run it on or how to schedule it if I have no visibility into what it’s going to do.
[00:49:36.19] Designing an SDK and a programming model that gives the system just the right hooks to be able to efficiently schedule and rebalance that work is one of the really fun parts.
Jeff Meyerson: [00:49:50.15] In order for this project to reach the fruition that you want it to, do each of the popular streaming systems need to implement a Beam-compliant model?
Frances Perry: [00:50:06.00] We want enough to. I don’t think everyone needs to; at the core of it, what we really want to set up with Apache Beam is just this programming model, this set of concepts. Then we want to set up a framework that allows someone who is a language developer – who has a passion for Ruby or R or Go or Visual Basic – to come along and help build out the SDK for that language in Beam; to let you build Beam pipelines in that language, so that it feels native to people who live and breathe that language.
At the same time, we want to allow people who have a runtime environment – some sort of distributed processing system – to come and teach Beam how to run programs on that environment. What that will really do is let the community decide which paths are the most exciting.
[00:51:01.08] As Beam, we see our job as setting up this framework that allows these different things to flourish; as to which paths become the most popular, most well-loved and used, that will depend on the user communities.
Jeff Meyerson: [00:51:14.22] One of the increased expectations that we’ve had about our data processing that we’ve been discussing in this episode is the idea that you need improved fidelity on the difference between event time and processing time. What are the things that you think we’re going to be focusing more on with regards to our data as time goes on? Maybe geospatial data or some other aspect of the shape of our data. What things do you think are going to be changing in the next five to ten years?
Frances Perry: [00:51:50.01] Event time and processing time is a big one, and we’re also really seeing folks starting to understand that streaming is a runtime choice. You want to separate out those data shapes from where and how you execute them, and by giving people that layer of portability, it will make things much better in the future for the way you write your programs, without having to leak those underlying implementation details up into your user code.
Jeff Meyerson: [00:52:19.27] How can people get started with Dataflow, or with Google Cloud’s version of Dataflow?
Frances Perry: [00:52:27.29] If you want to come and run on Google Cloud Platform right now, just come check us out at cloud.google.com. We’ve got Getting Started free trials, and you can try out Dataflow. If that’s what you’re interested in, that is absolutely great – come do that. Also, I want to put out a call-to-action for Apache Beam. Beam is a community that is much broader than Google. We’ve got folks from multiple companies – Google included, but also data Artisans, Cloudera, PayPal, Intel – working on making that programming model really the most generalized, efficient, unified model that we can build.
[00:53:06.11] There we’re really trying to kickstart that development community. You can come to beam.incubator.apache.org and join the dev list, start listening to the discussions, read our contribution guide, grab a starter task – really dive in and start helping contribute to that community.
Jeff Meyerson: [00:53:25.15] Great. Where can people find out more about you and your work?
Frances Perry: [00:53:29.28] I’m on Twitter, @francesjperry – that’s probably the main place. And read the Dataflow Model paper from VLDB last year.
Jeff Meyerson: [00:53:40.03] Yes, that’s an illuminating paper. Also, some of the videos on YouTube about Dataflow and event-time processing.
Frances Perry: [00:53:52.20] There are some videos, and there’s also a series of blog posts written by a co-worker of mine, Tyler Akidau. If you search for O’Reilly’s “The world beyond batch: Streaming 101” and “Streaming 102”, he goes through a lot of these concepts and he’s got these magic animated gifs that really show you examples of watermarks progressing through event time and processing time. Those magic animated gifs are really the best way to understand this. It’s so hard to describe in words, but when you see it, it starts to make a lot more sense.
Jeff Meyerson: [00:54:23.16] As I was preparing for this show, that was one of the things I was struggling with – how to articulate this discussion of the watermarks and the event time versus processing time, and I do think that probably the visual representations are the best places to look. We’ll put those in the show notes.
Frances Perry: [00:54:43.02] Yes, that would be great.
Jeff Meyerson: [00:54:44.02] Was there anything in the Streaming 101 and 102 presentations about streaming that we didn’t cover that you’d like to touch on?
Frances Perry: [00:54:59.02] We hit the major points… That we want to start thinking about data shape – bounded and unbounded – separately from implementation details, and that by tuning these different axes of event time and processing time we can really cover the whole spectrum. Another thing we weren’t able to show visually right now is that the code you write in the Beam model is so beautifully modular. The questions – the What, Where, When and How – are literally four lines in a row. That makes it so easy to build these modular pipelines where you can tune each of those independently.
[00:55:33.14] If you go and look online, there’s a blog post comparing the Dataflow/Beam SDK to what it looks like to write similar programs in Spark, for example, and you can really see that these concepts become much more reusable and much more modular, and you don’t have to leak the implementation details into your user code.
Jeff Meyerson: [00:55:53.02] That’s great. Frances, thanks a lot for coming on this show. I really enjoyed this conversation.
Frances Perry: [00:55:57.03] Excellent, thanks for chatting.
Jeff Meyerson: [00:55:58.29] Okay, great.
* * *
Thanks for listening to SE Radio, an educational program brought to you by IEEE Software Magazine. For more information about the podcast, including other episodes, visit our website at se-radio.net.
To provide feedback, you can write comments on each episode on the website, or write a review on iTunes. Mention or message us on Twitter @seradio, or search for the Software Engineering Radio Group on LinkedIn, Google+ or Facebook. You can also e-mail us at [email protected]. This and all other episodes of SE Radio are licensed under the Creative Commons 2.5 license. Thanks again for your support!
