Episode 436: Apache Samza with Yi Pan
Yi Pan, lead maintainer of Apache Samza, discusses the internals of the Samza project as well as the stream processing ecosystem. Host Adam Conrad spoke with Pan about the three core aspects of the Samza framework and how it compares to other streaming systems like Spark and Flink, and got advice on how to handle stream processing for your own projects, both big and small.
Related Links
- Yi’s Twitter
- Samza official website
- Blog
- Samza official Twitter
- Awesome Streaming
- Auto-Scaling for Stream Processing at LinkedIn
- Episode 222 – Real-time Processing with Apache Storm
- Episode 346 – Streaming Architecture
- Episode 219 – Apache Kafka
Transcript brought to you by IEEE Software
Adam Conrad 00:00:21 This is Adam Conrad for Software Engineering Radio. Yi Pan has worked on distributed platforms for internet applications for over 11 years. He started at Yahoo on Yahoo's NoSQL database project, leading the development of multiple features such as real-time notification of database updates, secondary indexes, and live migration from legacy systems to the new NoSQL databases. Later he joined and led the distributed cloud messaging system project, which is used heavily as a pub/sub and transaction log for distributed databases at Yahoo. In 2014 he joined LinkedIn, where he has quickly become the lead of the Samza team as well as a committer and PMC member in Apache Samza. Thanks for coming on to Software Engineering Radio.
Yi Pan 00:01:05 Thanks. Thanks, Adam, for the introduction.
Adam Conrad 00:01:07 All right, so let's get started with Samza.
Yi Pan 00:01:10 So Samza actually is a stream processing platform built in LinkedIn. Before we get into more detail and the introduction of Samza itself, let's first introduce the concept of stream processing. Data processing in the big data area started from MapReduce, Hadoop, and HDFS, which process a large volume of data in batches. That technology is optimized for throughput, not latency, so usually people need to wait hours or days to get analytical results from the batch jobs. However, the emerging needs around dynamic, in-session interaction with the user for internet applications demand faster iteration and shorter latency in getting analytical results from real-time data. This requires us to process the events as they come in and generate output to serve online requests in minutes or seconds, which we call stream processing.
Yi Pan 00:02:11 That's the particular domain that Samza's technology is built in, right? So Samza, as one of those stream processing platforms, started in 2011 — sorry, 2013, late 2013 — in LinkedIn, right after Kafka became widely adopted in LinkedIn. The first application that motivated the whole idea is an internal application that tracks the call tree in LinkedIn's microservice stacks. We want to get all the invocations on microservices corresponding to a single user request received from the front end, and then be able to join them all together to build a call-graph tree, to understand how much each of the user requests costs and how much time we're spending in each of these microservice stacks. From inception, then, we have built a platform with the following concepts: A, it is tightly integrated with Kafka for high performance; B,
Yi Pan 00:03:12 it is a hosted execution platform to support multi-tenancy; and C, this platform needs to support high-throughput and short-latency state stores to support stateful stream processing with at-least-once semantics. So that sounds like quite a lot of requirements. To achieve all of the above, Samza is designed with the following techniques to address each of them. First, we have created an advanced wrapper of the I/O connector on Kafka to achieve per-partition flow control when consuming many topic partitions through a single consumer instance. So we were able, using a single consumer instance, to consume tens or even hundreds of topic partitions, parallelize processing on all of them, and have very fine-grained flow control for each of the topic partitions. So if one of the threads becomes slow for one of the topic
Yi Pan 00:04:11 partitions, it will not affect the others. Second, to provide a hosted execution platform for multi-tenancy, we integrate with YARN as the hosted execution platform to manage hardware and also the container execution for the user application. That way the user will not need to manage their own hardware, and will not need to monitor and manage their own process liveness; YARN is leveraged as the distributed and hosted execution platform to run these user applications. Third, to achieve high throughput and short latency for the state stores, we implement a unique architecture using Kafka as a write-ahead log and RocksDB as the local state store for Samza containers. That allows us to achieve really high throughput and low latency when we read and write the state stores, which are effectively a RocksDB database. In addition, Samza also leverages Kafka as intermediate shuffling queues between stages in the stream processing pipeline to achieve repartitioning with fault and failure isolation. Lately, we have advanced Samza's APIs a lot as well — adding SQL, adding Beam runners and multi-language support on top of Beam, especially Python — to allow easier usage and support richer windowing semantics via the Beam API. So that hopefully gives an overall view of what Samza is, and some details about how Samza is implemented.
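To make the architecture Yi describes a bit more concrete, here is a minimal sketch of a stateful job written against Samza's low-level Java API: Samza calls process() for every event on the partitions assigned to the task, and the running counts live in a local store that, in a typical configuration, is RocksDB backed by a Kafka changelog topic. The class names follow the public Samza API, but exact signatures and configuration keys differ between Samza versions, and the store, topic, and class names here are made up for the example.

```java
import org.apache.samza.config.Config;
import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskContext;
import org.apache.samza.task.TaskCoordinator;

/**
 * Counts page views per member. Samza invokes process() for each incoming
 * event on this task's input partitions; the running counts live in a local
 * key-value store that can be recovered from its Kafka changelog on failure.
 */
public class PageViewCounterTask implements StreamTask, InitableTask {
  private KeyValueStore<String, Long> counts;

  @Override
  @SuppressWarnings("unchecked")
  public void init(Config config, TaskContext context) {
    // "page-view-counts" must be declared in the job configuration.
    counts = (KeyValueStore<String, Long>) context.getStore("page-view-counts");
  }

  @Override
  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) {
    String memberId = (String) envelope.getKey();
    Long current = counts.get(memberId);
    counts.put(memberId, current == null ? 1L : current + 1L);
  }
}
```

In the job configuration, the store would then be wired to RocksDB and a changelog roughly along the lines of `stores.page-view-counts.factory=org.apache.samza.storage.kv.RocksDbKeyValueStorageEngineFactory` and `stores.page-view-counts.changelog=kafka.page-view-counts-changelog` (plus serde settings), which is what gives the "Kafka as write-ahead log, RocksDB as local store" behavior described above.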
Adam Conrad 00:05:48 Yes, that is a very thorough introduction. So I would love to take a higher view — let's pull it back to maybe a 20,000-foot view of stream processing in general. What constitutes big data in 2020? Are we talking terabytes of data, petabytes of data? Where does one start when they're looking at stream processing?
Yi Pan 00:06:08 So that depends on whether we are looking at the whole collective of all these pipelines in a company, or we're looking at a single one. As of today, LinkedIn as a whole is processing on the order of three to five million events per second collectively in a cluster. And some of the individual, larger jobs can each process close to one million messages per second, with a state store close to 10 terabytes and growing.
Adam Conrad 00:06:46 And why would I need to process streams of data? What kinds of use cases are people finding for stream processing?
Yi Pan 00:06:52 Stream processing, as I explained earlier, is usually applied to the cases where you want faster interaction and a shorter latency to get an aggregated result. For example, in your session, when a user opens a website and starts to read some articles or click on some interesting content, those user activities will be sent back in real time to the backend system. And usually within the session itself — within seconds or minutes — the platform needs to understand what the current user intention is, and based on that supply dynamic content, like further recommendations of more interesting content, or some advertisement pertaining to the user's interests. For example, the user is looking at the job market, and then they're probably interested in relevant job postings. For these kinds of interactions, you cannot suffer the latency of saying, okay, I have this user tracking activity sent back to my Hadoop cluster, and then wait for hours to get the result back to the user — by that time, the user is already gone.
Adam Conrad 00:08:10 So anytime I see a recommendation engine, where it's recommending to me something else to watch or listen to, there's likely stream processing going on underneath to ensure that that returns quickly. Yes, correct. And so then, when should I not consider using a stream processing framework to handle my data?
Yi Pan 00:08:26 So stream processing definitely favors latency versus throughput, and because of that it consumes a lot of resources to store and keep the partial and intermediate results in the system, and it runs 24/7. So if your analytical result does not really impact the in-session engagement with the user, then you can save a lot by doing batch processing, which dynamically allocates tons of resources and then shrinks them down and wraps up once it's done. So that's the trade-off: between running a stream process, which is long-running forever and produces analytical results as the data flows, versus a batch process where, basically, once the data is there, you can schedule a massively parallel job to churn through the data as fast as you can, then stop and release all the resources back to the system.
Adam Conrad 00:09:30 And is there a pivot point where batching just doesn't work and it makes more sense to switch to a stream processing framework? Because I think for a lot of companies, batching is definitely one of the first ways that they think about scaling the data that they're ingesting. So is there a signal that people can look for, to say, okay, batching just isn't working — something they might want to consider in advance before they start looking at stream processing frameworks?
Yi Pan 00:09:58 I think there are no hard-set rules for when to switch to stream processing, but definitely there are some signals. If the batch process starts to have a shorter and shorter cadence, then the batch process puts a heavier load on the scheduler — saying, okay, instead of scheduling once every hour, now every hour you need to schedule four times, so on average every 15 minutes you need to schedule something. Then it gets into the boundary area where, if that's the case, why don't you just schedule once, run 24/7, and allocate the resources there? Because even in that case, if you're running the 15-minute-window batch process and each of the batch runs also takes, for example, 14 minutes to finish, then literally you're not saving any resources in the batch mode of use-and-release anymore, because you are continuously using that amount of resource. So the sentiment here is, okay, with more and more of these short-cadence, bursty batch jobs,
Yi Pan 00:11:24 you actually are continuously using those resources anyway, and you're paying the cost of ramping up and ramping down. Why not just let it run all the time?
Adam Conrad 00:11:34 Great. And for more information on general streaming architectures, you should check out Episode 346, where we go into that in more detail. So that's a good overview of choosing those frameworks, and you did a really great job of explaining Samza at a very high level. So, talking more about those internals, I'd like to dive into each of those three parts of the core architecture. We've got the streaming provided by Kafka, we've got the execution provided by YARN, and the processing itself done by the Samza API. First I'd like to start with the streaming from Kafka. Are these two projects developed together?
Yi Pan 00:12:08 Very closely. So Samza, as I explained a little bit earlier, started right after Kafka became widely adopted in LinkedIn, and its development was heavily influenced by Kafka — even Jay Kreps, the co-creator of Kafka, is also a co-creator of the Samza project. So it's literally just a sister team in LinkedIn.
Adam Conrad 00:12:38 And so are they required to work together? Like, do you have to have Kafka in order to be able to use Samza?
Yi Pan 00:12:44 Not necessarily. Although Samza was almost co-developed with Kafka and leverages a lot of its architecture choices — the implementation leverages Kafka a lot — all the interfaces are pluggable. So with Samza you can plug in any pub/sub system: as long as you implement the system interfaces — the SystemProducer and SystemConsumer interfaces — then it will work. Examples would be that we implemented integrations of Samza with Event Hubs and Kinesis. Those are others.
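To make the pluggability concrete: a new system is wired in by declaring a factory class in configuration (e.g. `systems.mysystem.samza.factory=...`, where `mysystem` is a placeholder name), and that factory hands Samza consumer, producer, and admin implementations for the pub/sub system. The sketch below shows the shape of such a factory against the public Samza interfaces; the method signatures are quoted from memory and differ slightly across releases, and the bodies are left as stubs.

```java
import org.apache.samza.config.Config;
import org.apache.samza.metrics.MetricsRegistry;
import org.apache.samza.system.SystemAdmin;
import org.apache.samza.system.SystemConsumer;
import org.apache.samza.system.SystemFactory;
import org.apache.samza.system.SystemProducer;

/** Entry point Samza instantiates for a system declared in the job config. */
public class MySystemFactory implements SystemFactory {
  @Override
  public SystemConsumer getConsumer(String systemName, Config config, MetricsRegistry registry) {
    // Return a SystemConsumer that polls your broker and hands Samza
    // IncomingMessageEnvelopes for each registered stream partition.
    throw new UnsupportedOperationException("wire in your consumer here");
  }

  @Override
  public SystemProducer getProducer(String systemName, Config config, MetricsRegistry registry) {
    // Return a SystemProducer that writes outgoing envelopes to your broker.
    throw new UnsupportedOperationException("wire in your producer here");
  }

  @Override
  public SystemAdmin getAdmin(String systemName, Config config) {
    // Return a SystemAdmin that exposes partition and offset metadata.
    throw new UnsupportedOperationException("wire in your admin here");
  }
}
```

This is broadly how the Kafka, Kinesis, and Event Hubs integrations mentioned above plug in: each ships its own factory, and the job configuration selects which one to use.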
Adam Conrad 00:13:24 Okay, so there are examples of swapping out Kafka for a different framework. Yes. Okay. And then, is it just the fact that Samza came shortly after Kafka that they said, well, since we have to use streaming as a core service for this group, we might as well leverage Kafka, since it's part of the same organization and it's already part of the Apache umbrella?
Yi Pan 00:13:46 That's definitely one of the aspirations, but at the time, for a large distributed pub/sub system, Kafka was one of just one or two choices out there.
Adam Conrad 00:14:01 Out there for open source. So right now Samza is leveraging Kafka externally; it is its own sort of pluggable service, as you mentioned. Is there a plan where Samza can come pre-packaged with some dedicated streaming layer as part of the Samza project, or will it always have some sort of plug to a different streaming layer?
Yi Pan 00:14:23 Well, the current implementation already has Kinesis and Event Hubs connectors implemented and packaged together.
Adam Conrad 00:14:32 Oh, so I guess my question is: those are all external solutions from different companies or different projects, right? Is there an idea to have Kafka be replaced by something that is internal to Samza, so Samza owns the streaming layer?
Yi Pan 00:14:48 Got it. So there's no intention to replace Kafka with an internal layer, but we made the architecture open — that's the concept.
Adam Conrad 00:15:00 Okay. So this was a decided choice to ensure maximum pluggability and modularity across multiple systems. Got it. Okay. And then YARN, the execution layer — is that meant to stay with Samza?
Yi Pan 00:15:13 No, there are definitely extensions to that. We spent quite a bit of energy to move from a very tight integration with YARN — where basically Samza cannot run without YARN — to a mode where Samza can run with YARN, or you can run a standalone Samza. And I think in the recent year or two we already have some experimental deployments that try to run Samza on top of Kubernetes as well. That's definitely ongoing; we will integrate with more of the advanced, state-of-the-art distributed execution environments.
Adam Conrad 00:15:52 Okay. Now, is this the same Yarn I think of? I come from a front-end background; I do a lot of web development. Is this the same Yarn that we use for package management, or is this a different YARN system?
Yi Pan 00:16:02 So that is Apache YARN, which is part of the Hadoop ecosystem. It's the resource management system: if you submit a job asking for a set of resources to run a set of processes, it identifies the physical hosts to run the containers for you and manages their liveness.
Adam Conrad 00:16:24 Right. And so in general, for folks, this is not the Yarn you're thinking of from the NPM world. This is, like you said, resource management, job scheduling, job execution — it's ensuring that the jobs fed by the streamed data get executed appropriately. Right. So when I was looking at the architecture, I noticed that Hadoop does many of these same things as well, and it also seems to be the only service that you share. Was this intentional, to share Hadoop as that execution service?
Yi Pan 00:16:57 You mean specifically YARN, right? Correct. Yeah, so just to clarify, Hadoop is a big ecosystem; YARN is just the part of the project doing the resource scheduling, and we only leverage that part. And also, as I explained, Samza leverages YARN to do the coordination among all the processors in the same job — to have a single master allocate and assign all the work among all the worker processes, and then maintain the liveness and membership of this group. But it's not necessarily the only implementation. The standalone Samza implementation actually replaces that model with ZooKeeper-based coordination, which allows a set of processes to do leader election — electing a single leader for the whole quorum — and then do the same workload balancing and assignment among all the alive workers, in the case that you don't have YARN.
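As a rough idea of what the standalone, ZooKeeper-coordinated mode looks like from the user's side: instead of submitting to YARN, you run the processors yourself and point the job coordinator at a ZooKeeper ensemble, and the processes then elect a leader and balance partitions among themselves as Yi describes. The configuration keys below approximate what recent Samza releases use, but they are quoted from memory and worth checking against the documentation for your version; the host names, application name, topic, and task class are placeholders (in a real deployment these entries usually live in a .properties file).

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of a job configuration for standalone (ZooKeeper-coordinated) Samza. */
public final class StandaloneConfigSketch {
  public static Map<String, String> build() {
    Map<String, String> cfg = new HashMap<>();
    cfg.put("app.name", "page-view-counter");                    // placeholder application name
    cfg.put("job.coordinator.factory",
            "org.apache.samza.zk.ZkJobCoordinatorFactory");      // ZooKeeper-based coordination
    cfg.put("job.coordinator.zk.connect", "zk1:2181,zk2:2181");  // placeholder ZK ensemble
    cfg.put("task.class", "PageViewCounterTask");                // the task sketched earlier
    cfg.put("systems.kafka.samza.factory",
            "org.apache.samza.system.kafka.KafkaSystemFactory"); // Kafka as the input system
    cfg.put("task.inputs", "kafka.page-views");                  // placeholder input topic
    return cfg;
  }
}
```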
Adam Conrad 00:18:09 Okay, yes, that's what I was going to ask — you mentioned a standalone Samza. So that means, just like with Kafka, you could also plug in a different job scheduler outside of YARN. Yes. So I'm guessing the theme here — and this may just be a guess — is that in order to quickly accelerate the development of Samza in the beginning, it was so much easier to leverage Kafka or YARN, because they were already part of the Apache ecosystem, than to have to build them out yourself. Correct. Yep. Got it. But the actual processing part of the application is done through the Samza API. So I'm guessing, when we look at this entire three-part system, the Samza API itself is the most core and central to the actual Samza project, right? This is the area that has the most unique code compared to the other layers.
Yi Pan 00:18:56 Yeah. Well, in my mind, when we talk about the stream processing API, it involves not only the I/O wrapper — the I/O connector that reads and writes from the pub/sub system — but should also include the job checkpointing system and the state store system as well, because in stream processing, without checkpointing and without a state store management system, you cannot really achieve failure recovery, and you cannot achieve at-least-once or even exactly-once semantics.
Adam Conrad 00:19:42 Right. And so when we think about the entire project — this was actually one part that was confusing for me — what do I actually call Samza? Is Samza the entirety of Kafka plus YARN plus the Samza API, or is Samza just the API part of the whole stream processing architecture?
Yi Pan 00:19:59 Samza actually built a platform, or framework, to hide all these system-level integrations — with the pub/sub I/O system, with the scheduling system, and with the state store system — behind a set of APIs, and then it basically allows users to plug their user logic into the event loop exposed through this API.
Adam Conrad 00:20:28 Right, so the stitching layer happens all seamlessly through the API, and then there's a subset of that API that is revealed to the consumer who actually wants to handle that data and send it through either a container or some sort of job scheduled for later. Yes, that's right. Yep. And so that's actually a good transition point: when I started looking at the Samza API, essentially everything seems to be either a job or some sort of container. So I'm very curious to see how this relates to databases. When we think of databases, there are things like ACID guarantees — atomicity, consistency, isolation, durability. Are there those kinds of guarantees in a stream processing framework?
Yi Pan 00:21:13 In terms of state — if you're comparing a stateful stream processing system with a database, there's the famous stream-to-table duality that, I think, was written about a few years ago, right? In streaming, a table or database is always viewed as an ever-changing thing. And then, if you're thinking about ACID control, usually it is not guaranteed in a synchronous way, right? Most of these consistency guarantees are based on eventual consistency, or on being able to replay the same sequence and achieve idempotency as the final rule.
Adam Conrad 00:22:11 Yeah, and actually, you saying that makes a ton of sense. If this data is constantly changing, in essence there is no sense of consistency ever, because it's constantly changing. But like you said, eventual consistency is the right model there, because at some point the job you've scheduled will end and you're expecting some kind of result at the end of that job — but while it's processing, it's going to be constantly changing. So that makes sense from a consistency standpoint. Are the jobs themselves atomic? Could you run them again and again and expect the action to complete?
Yi Pan 00:22:47 Yeah, so atomicity is also a strong word that is, I think, only applicable to a single host — or a single CPU and a centralized locking system — whereas stream processing with Samza, as a distributed stream processing platform, was designed from the beginning such that atomicity is not the design principle. The design principles are, A, consistency, and, B, failure recovery with replayable idempotency. So the difference is: with atomicity, you basically have a few operations — as you said, you have a job, the job has 10 different workers, and each of them is doing something — and you have to say, okay, all 10 processes finish one operation atomically, then close the synchronized lock, and then proceed. Whereas in the eventually consistent and idempotent way, it works like this: each of these 10 processes can proceed at its own pace, but you keep the state, so that when you fail or when you shut down the process, you say, oh, I remember process one is at step three.
Yi Pan 00:24:11 And process two is at step one, and you have a vector of all those states for the whole set of processes. Once you restore the whole process, the checkpoint may always be at step zero for all 10 processes. Then, when the input is replayed, all these processes need to check with their own states and say: oh, this step is already applied from the previous round for me, so I skip it as a no-op — up to the point where they see a step that they have not executed in the previous run, and they continue. Right? So that's the way you achieve replay consistency and idempotency in a distributed system.
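The replay-plus-idempotency pattern Yi describes can be sketched in a few lines of plain Java, independent of any framework: each worker remembers the highest step it has applied, and on replay anything at or below that watermark is treated as a no-op. This is a conceptual illustration of the idea, not Samza's actual checkpointing code, and the in-memory map stands in for what would be a durable, changelog-backed store.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Conceptual sketch: after a failure, the input is replayed from the last
 * checkpoint (possibly step zero), and each worker skips the steps it already
 * applied, so the final state matches what it would be with no failure at all.
 */
public class IdempotentWorker {
  // Durable in a real system (e.g. backed by a changelog); in-memory here.
  private final Map<String, Long> lastAppliedStep = new ConcurrentHashMap<>();

  public void onReplayedEvent(String workerId, long step, Runnable sideEffect) {
    long applied = lastAppliedStep.getOrDefault(workerId, -1L);
    if (step <= applied) {
      return; // already applied in a previous run: no-op
    }
    sideEffect.run();                    // apply the update for this step
    lastAppliedStep.put(workerId, step); // advance this worker's watermark
  }
}
```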
Adam Conrad 00:25:01 And then, so it really doesn't make sense to think about it in terms of atomicity, because, like you said, these aren't short-running jobs — these could run for several hours or several days, or they just continuously run. And so —
Yi Pan 00:25:11 Let me just clarify a little bit on that as well. So atomicity — if you want to apply atomicity, you can still do that with a centralized locking system: say you apply a barrier, a centralized barrier, with ZooKeeper or something like that. But that greatly affects the throughput, and a lot of those stream processing applications do not really work well in that case, where you have to keep up with thousands or millions of incoming messages as input.
Adam Conrad 00:25:44 Right, so it almost defeats the purpose then. You could technically do it — the logging itself almost acts like a ledger then. So, like you said, if the service gets interrupted or it goes down for a bit of time, the ledger is acting sort of like the state machine: here's the state at a given time; when the service is back up, resume at this line, at this given state. But all that extra logging is going to affect throughput. And so, like you're saying, if you needed those kinds of requirements for that data, you probably wouldn't even be doing stream processing in the first place. Yeah, exactly. Right, that makes sense. Okay. So that covers, I think, the job aspect. For containers, essentially it says that you can execute arbitrary code. So does that mean that security is enforced by the customer, or is it expected of the customer? How is security handled for these containers, which could execute anything?
Yi Pan 00:26:39 Right, so that's a good point. I think for Samza applications, the word arbitrary is slightly misleading. It is just referring to: there's a set of Java APIs, and using this set of system-provided APIs you can do whatever you want in implementing your user logic. So if the security question is in the aspect of, okay, in the regular, non-hacky way, are you able to get beneath the API layer, down to the system-level resources and objects — then the security is there, yes. There's no exposure of the internal system objects to the user, except that there's a message that comes in, the messages are there; if you want to use a state store, the state store API is there, right?
Yi Pan 00:27:38 And if you want to do input and output — if you want to write to the system — the producer API is there. That's the level of security that is there. But if you're talking about the user writing arbitrary code — some Java reflection code trying to get some internal system instance through reflection — currently we don't really enforce that. In theory we can try to enforce this level of security by some access rules, or by different class loaders for system versus user space. But that is a good question; we're still working on this.
Adam Conrad 00:28:19 Right, yes. And it's important to point out, too, that because you can use Kafka right out of the box, and Kafka itself provides no security for those topics, by extension Samza doesn't as well. So what should people be doing to ensure that their streaming code is secure? Do you have any tips or suggestions?
Yi Pan 00:28:37 Don't do reflection, right? Don't do reflection, and always go through the Samza system API calls to interact with the system — including the I/O system, the state stores, and the coordination system. Right.
Adam Conrad 00:28:55 And are there validations within that API itself? You're saying the Samza API is the safest way — what kind of guarantees does that API provide that help to ensure security?
Yi Pan 00:29:07 You mean, does the Samza API itself validate whether there's an invalid access?
Adam Conrad 00:29:13 Right. So the APIs themselves have certain signatures for each function that you're using — I'm assuming there's a security layer baked into all of those functions. I don't think there is currently. Got it. And so then what benefit are you drawing from ensuring that you properly use that API?
Yi Pan 00:29:34 So if you properly use the set of APIs, then the code path is all controlled by the platform — by the Samza committers — and it does not go underneath the API to get hold of certain system objects and manipulate the system's internal objects directly.
Adam Conrad 00:29:55 Okay, so it's more for debugging, or, you know, if someone posts a bug to the Apache project, they can say, well, I used the X, Y, Z APIs, and then there's some sort of guarantee around trying to fix that, because you're going through the ecosystem rather than going around it.
Yi Pan 00:30:12 Yeah, because underneath this set of APIs there's no user code — it's all platform code.
Adam Conrad 00:30:17 Got it. Okay, great. So this is a very helpful introduction to the whole ecosystem for Samza. One really interesting thing I noticed about Apache in general is that it seems to own the streaming space overall, and there are a lot of tools for streaming in the Apache ecosystem. So I'd love to take a little bit of a detour and talk more about each of those and see where Samza fits into the overall landscape of streaming. Sure. So we've already talked about Kafka. Kafka itself is a streaming framework that you could use on its own — so while Samza allows you to plug into Kafka, Kafka doesn't necessarily need Samza plugged into it, right? So Kafka itself, if you needed to handle streaming of data, you could use standalone. Spark is another one. In what instance would I want to use Spark?
Yi Pan 00:31:08 So if you're talking about Spark — Spark historically implemented streaming in micro-batches, right? Only after, I think, 2.4 or 2.5, Structured Streaming kind of reinvented the whole API, so that from an API point of view you don't see the micro-batch anymore; but underneath, the execution environment for Spark Structured Streaming is still either micro-batch or another mode called continuous execution, and you have to switch between them. Micro-batch is the well-adopted Spark way of doing streaming, and continuous execution is not that popular yet. So that's the Spark side. If you compare Samza with Spark, Samza is a pure streaming engine. So the comparison is micro-batch versus pure streaming: with pure streaming you can do event-by-event stream processing and async commit with each of the events, right?
Yi Pan 00:32:17 And with micro-batch, you always need to synchronize on the batch boundary, and your checkpoint also needs to synchronize with the batch boundaries, right? In a lot of our use cases that is not convenient. So that's the short comparison between Samza and Spark. As for the comparison between Samza and Kafka Streams — that's actually an interesting story: Kafka Streams is literally a fork from Samza. The reason being, in the early discussions in the Samza community there were two different schools of thinking. One said: hey, Kafka is already a popular product, and there are tons of needs where, for small-to-medium use cases, I do not really want to stand up my own YARN cluster and my own ZooKeeper — I want to have stream processing as a standalone library as my primary use case. But on the other hand,
Yi Pan 00:33:22 for the larger deployments like LinkedIn — and at that time Netflix and other companies with big installations — having each of the different teams who develop stream processing pipelines run their own hardware, with many repeated SRE and operational efforts to monitor those 24/7 jobs across the company, is definitely not the way to go. So there was also a strong need to keep the hosted model — running stream processing as a service — as the primary offering in the Samza community. That's not saying that, as a company, we do not recognize the standalone way of deploying things.
Yi Pan 00:34:21 It's just that, priority-wise, that's different. So literally, at that time, Kafka wanted to serve the small-to-medium use cases first, so they spun off the standalone library faster, and it took us some while to get to the coordination mode implementing standalone Samza with ZooKeeper. But that's the main difference. Even if you look at Kafka Streams as a product today, it is still a library model — it doesn't really provide stream processing as a service.
Adam Conrad 00:35:00 Okay. So if I'm understanding you correctly, you'd want to use Kafka Streams for these small-to-medium instances, because it's all baked in together — it's just an easier use case. Whereas if you really want to use it to the full extent, you'd want to integrate Kafka with Samza. So these larger customers are going to be using the distinct pieces — Kafka for the streaming and Samza for the processing portions — but for the smaller use cases, they're going to go with Kafka Streams built into Kafka.
Yi Pan 00:35:30 Yeah, I think, to me, for the small-to-medium cases, Kafka Streams and standalone Samza are kind of interchangeable. Kafka Streams may be a little lighter-weight in terms of how you stand it up and those things. Yeah.
Adam Conrad 00:35:47 Got it. But LinkedIn themselves is using the full
Yi Pan 00:35:50 LinkedIn — oh, we use both.
Adam Conrad 00:35:53 So what instances internally are you using Kafka Streams for versus —
Yi Pan 00:35:58 No, no, not Kafka Streams. LinkedIn uses both Samza on YARN and standalone Samza, right. Yeah. Right. Got it. Okay.
Adam Conrad 00:36:06 That makes sense. So there's — I mean, there's way more than just those two. We've covered Spark, which is definitely more in the cluster computing, big data analytics space. We've got Kafka Streams, which is another sort of standalone service that's more integrated into Kafka. And it sounds like Spark — just to wrap up that piece — is probably used before you would want to use something like Samza, given that batching is definitely something you're going to be trying out before you try a stream processing framework. So do you find a lot of people coming from Spark and then trying out Samza, or is it just different use cases in general?
Yi Pan 00:36:41 So in the early days, the Spark side and the streaming side were mostly isolated silos: people who want to write batch jobs use Spark, and people who want real-time analytics write streaming jobs, right? But since 2016 or '17, we've seen more and more use cases that want both. And that's exactly also our journey toward trying to get a convergent API between the batch world and the stream world. So far our strategy is — we actually discovered in 2016 and later in 2017 that Beam is a good choice for us, because the Beam API strives to put a lot of rich stream processing semantics and syntax into the API, but the whole API layer is engine-agnostic.
Yi Pan 00:37:54 So Beam's architecture is: there's an API layer describing the user logic — what you want to do with stream processing — but underneath you can have different runners, which are an intermediate layer to translate the logic to a particular engine. For example, Spark can be an engine there, Samza can be an engine there, Flink can be an engine there, and there are even more, like Apex and all that, right? So we feel that is the best fit for our strategy, because if you compare the different engines in LinkedIn: Spark has tons of optimization for throughput and is integrated well with the batch ecosystem; on the stream side, Samza has its own tons of optimization in the runtime and integration with the nearline and online ecosystem in LinkedIn.
Yi Pan 00:38:51 So there's a huge gap if you want to replace one with the other — you would probably need to spend years of effort to replicate the optimizations done on the other side. But with Beam, since the Beam layer allows the user to just focus on the application logic, the optimization — and also how you run the same logic in a stream versus a batch environment — becomes kind of a hidden layer from the user. That allows us to have a quicker integration for the convergence use cases, and allows us to work on the optimizations for each different engine separately.
Adam Conrad 00:39:42 And this Beam layer you're talking about — is this the same one that powers Erlang and Elixir, the concurrency languages?
Yi Pan 00:39:48 No, this Beam is Apache Beam, which is, I think, a project started in 2016, and right now it has already become a top-level Apache project. Yeah.
Adam Conrad 00:40:00 The more I unpack this, the more you can see that Apache really has so many services that power this whole streaming space. And you also mentioned Flink earlier, which is more for distributed dataflow or parallelization. So that sounds like yet another system where it's not necessarily about replacing one thing with the other, but about complementing each other through this Beam layer.
Yi Pan 00:40:23 If I want to give a short comparison of the landscape, I usually categorize along four different aspects, right? So, talking about all the different state-of-the-art platforms and engines out there: one aspect is whether the API is pure streaming versus micro-batch. You can see that Flink and Samza are mostly pure streaming, because the API is event-by-event processing and allows async commit features — and the same goes for Dataflow, which Google developed — while Spark Structured Streaming mostly operates in the micro-batch mode. But that's the API-layer difference. The implementation, architecture-layer difference is RPC versus shuffling queues in the whole pipeline. If you compare all the different systems, Samza is very unique here: Flink, Dataflow, and Spark Structured Streaming, within a single job pipeline, all use RPC between the stages, which has to be restarted on timeout when the communication between the operators in different stages has issues, from the upstream to the downstream. Samza implements it a different way: it leverages the Kafka queue as an intermediate shuffling queue.
Yi Pan 00:41:51 So instead of RPC, if you have an intermediate stage — stage one produces something, stage two continues to process this output from stage one — it is all communicated through a Kafka topic. It has the advantage that you isolate the failures in the downstream from the upstream, and you guarantee the order of the sequences: once all the events you want to process are materialized, the downstream will always see the same sequence. The downside, of course, is that it adds some latency and also adds some persistence cost in the whole pipeline, right? That's the second aspect. The third aspect is large state store support. If you compare all of these, Flink and Samza actually come with very comparable feature sets: they both come with a native, free offering of a RocksDB-based local store for large states. Spark Structured Streaming by default offers
Yi Pan 00:43:00 an HDFS-based state store if you write a map-with-state operator in Spark, right? The issue is that — as is called out in the Databricks offering on Azure — the free state store implementation on HDFS has some performance issues when the size of the state goes beyond a certain size; I forget the exact number, but I think it's in the gigabyte range. In that case, you will have to purchase the RocksDB-based state store for the large state store from Databricks as a proprietary offering, right? So that's a difference: whereas for Flink and for Samza you get the large state store support with RocksDB for free, with Spark you need to pay money.
Yi Pan 00:43:53 Correct. And then the last aspect I would use to compare all of those is cloud integration, right? Whether you have a native integration where a public cloud vendor offers it as a service. On that front, Google Cloud Dataflow and Spark streaming have better integration: Cloud Dataflow is a native offering from Google, and Spark streaming basically has a first-class service offering from Azure. Flink and Samza are mostly free-standing, so you can just run your own service by yourself, but as of now there is no cloud-vendor-based offering for them yet. Hopefully that gives a good overall comparison.
Adam Conrad 00:44:42 Yeah. And there's yet another project that I think stitches some of them together. One of the topics you mentioned as a big differentiator between the different services is whether you're focusing on batching versus pure streaming. Apache Apex is yet another streaming tool that's mostly around processing either streams or batches. So — you'd mentioned this earlier with Kafka — if you have Spark, you could complement Spark with Samza, where one handles the micro-batching and one handles the streaming. It sounds like Apex could be that meta-processor that allows either streams or batches to complete through this system. How would you use Apex in that sense?
Yi Pan 00:45:28 Apex itself is an engine, so I don't think Apex itself is an API layer. What I mentioned earlier is that Apache Beam is an API layer, so it doesn't really demand a specific engine to run beneath it. Apex, I think, has a Beam runner — basically it runs Beam programs as an engine — and it is designed as a convergence engine, but I don't think that with Apex itself you can swap in any other engine underneath it.
Adam Conrad 00:46:01 Okay, yeah. So in that sense, if you didn't want to use Beam, Apex could run the whole thing for you, because it is itself the engine that's powering this.
Yi Pan 00:46:10 Yeah, you can choose to go all-in on Apex, but unfortunately I don't think the project is currently very active, and it also came along way later than what we already had, so we did not pick it up.
Adam Conrad 00:46:28 Got it, okay. So then, just rounding the corner here on Apache streaming projects: the one that actually sounded most similar to me — you had mentioned that there's a lot of similarity in pure streaming between Flink and Samza, but I also noticed Storm sounded very similar as well, because that handles stream processing computation. So I'd imagine that there's a lot of overlap with how it processes the jobs and the containers that are in those jobs, right? So it sounds like you might actually have to look at the comparisons.
Yi Pan 00:46:59 Storm was the earlier competitor to Samza that we analyzed. I think there are a few disadvantages we saw in Storm at that moment that stopped us from adopting it. One is Storm's topology management — the Nimbus-based topology management was not quite reliable at the time and often caused stalling of the pipelines. The other is that Storm doesn't really come with good large state store support with short latencies, right? For that aspect, in fact, Samza was the first one to integrate with RocksDB with native, first-class support; Flink actually started with the in-memory state store offering first and then came with a follow-up RocksDB implementation; Storm doesn't really have that at all.
Adam Conrad 00:47:59 So at this point, then, Storm is not something that people are actually actively trying to compare. If you're going to choose something in the competition space, you're going to be choosing Samza, and you're not going to have to worry about Storm at this point anymore.
Yi Pan 00:48:13 For us — in my opinion — currently, if I'm looking at what the top competitors in this space are, I'm going to be looking at Flink and then Spark Structured Streaming. Flink, in a lot of senses — function-wise, feature-wise — is very close to what Samza is offering. For Spark Structured Streaming, the main reason is that Spark has a big ecosystem on the batch side.
Adam Conrad 00:48:46 Right. So then how do I know which framework to choose if there are so many to evaluate? What am I actually looking for — what particular pieces help me choose the right system for me?
Yi Pan 00:48:54 Right. So, from our point of view, we always stress performance, right? I think it's hard to come up with general rules or reasons for people who want to choose A versus B, but I can explain several reasons why Samza is adopted as the main stream processing engine in LinkedIn. So, from the LinkedIn point of view, the first reason is performance. Samza's Kafka interface is thus far the most advanced I/O connector on Kafka among all the stream processing platforms. If you look at most of the stream processing platforms' integrations with Kafka, it's: okay, for each topic, or each topic partition, you have to have a consumer instance, right? Then this Kafka consumer instance will feed one task in your job, processing this in-order sequence coming from the partition.
Yi Pan 00:50:02 But then, if you're consuming from many different topic partitions — one pattern we've seen is that some of the platforms basically need to instantiate multiple Kafka consumer instances in the same container to consume multiple different Kafka topics and topic partitions, in order to have a one-to-one mapping of one input sequence to one thread processing it. That works under two conditions: A, the input topic partitions are a relatively static set, and B, the total memory footprint of all these consumer instances in a single process is not high pressure. But Samza works in a different manner: we actually instantiate a single consumer instance for the Kafka cluster.
Yi Pan 00:51:03 Even if you consume from 10 or a hundred topic partitions through this consumer, internally there's a multiplexer and also a per-partition flow control mechanism: the multiplexer takes the messages from all the topic partitions and sends them out to the different threads that are working on the different partitions. And if one of the threads is going slow, we just pause this particular input partition, which doesn't pause the whole consumer instance, as a lot of other systems would do, right? That allows us to have a very dense consumption model in a single consumer instance and yet maintain high throughput when it needs to fan out to multiple partitions. That gives a lot of benefit in terms of performance for Samza processing.
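The per-partition flow control described here — one consumer feeding many partitions, pausing only the slow ones — can be illustrated with the plain Kafka Java client, which exposes pause() and resume() per partition. This is only a conceptual sketch of the idea using the public KafkaConsumer API, not Samza's actual multiplexer; the broker address, topic name, and back-pressure threshold are made up, and a real job would have per-partition worker threads draining the queues.

```java
import java.time.Duration;
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.Queue;
import java.util.Set;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

/**
 * One consumer instance feeds many partitions; each partition gets its own
 * work queue, and only partitions whose queues back up are paused while the
 * rest keep flowing.
 */
public class PerPartitionFlowControlSketch {
  private static final int MAX_QUEUED = 1_000; // made-up back-pressure threshold

  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
    props.put("group.id", "flow-control-demo");
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

    Map<TopicPartition, Queue<ConsumerRecord<String, String>>> queues = new HashMap<>();

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      consumer.subscribe(List.of("page-views")); // placeholder topic with many partitions

      while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(200));
        for (ConsumerRecord<String, String> r : records) {
          TopicPartition tp = new TopicPartition(r.topic(), r.partition());
          queues.computeIfAbsent(tp, k -> new ArrayDeque<>()).add(r);
        }

        // Per-partition worker threads would drain these queues; here we only
        // decide which partitions to pause or resume based on queue depth.
        Set<TopicPartition> pause = new HashSet<>();
        Set<TopicPartition> resume = new HashSet<>();
        for (Map.Entry<TopicPartition, Queue<ConsumerRecord<String, String>>> e : queues.entrySet()) {
          (e.getValue().size() > MAX_QUEUED ? pause : resume).add(e.getKey());
        }
        consumer.pause(pause);   // stall only the backed-up partitions...
        consumer.resume(resume); // ...while the others keep consuming
      }
    }
  }
}
```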
Yi Pan 00:52:23 The other advantage, of course, is the native integration with RocksDB as a local embedded store, which yields sub-millisecond read/write latency and strong, synchronized read-after-write consistency in the state store. We had an earlier performance benchmark that showed Samza supports 1.2 million requests per second on a single host — a very old host with only a one-gig NIC, right? We have no reason not to believe that it can scale up on modern hardware, with 10-gig NICs and beefier CPUs and memory. That's one big aspect of LinkedIn choosing Samza. The other reason LinkedIn chose Samza is the improved APIs with rich windowing semantics. That comes from two aspects. In terms of the API: A, since 2016 we started supporting SQL. Samza SQL support allows many users, such as data scientists or AI engineers, to write stream processing without understanding the underlying system details.
Yi Pan 00:53:29 And B, we integrated with the Beam API. The Beam runner is one we have built since 2016, which allows us to leverage the rich semantics of the Beam API — specifically windowing, and also triggers, which is a very advanced API and concept in stream processing; you can read Tyler Akidau's Streaming 101 and 102 to understand what triggers and late arrivals mean, right? This also gets us the multi-language support from Beam, and recently we have successfully advanced it to support Python stream processing applications in LinkedIn. So that's the API side. Those are all built on top of Samza's high-level API, which allows you to specify multi-stage pipelines in a single application, and there's much more work still going on with SQL and Beam.
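To give a flavor of the windowing semantics the Beam API adds on top of an engine like Samza, here is a small Beam pipeline in the Java SDK that counts elements per five-minute window; selecting the Samza runner is, roughly, a matter of passing the runner option with the Beam Samza runner dependency on the classpath. The Beam classes are the standard ones, but the runner wiring is simplified, and the hard-coded input is a placeholder rather than a real Kafka source.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.joda.time.Duration;

public class WindowedCountExample {
  public static void main(String[] args) {
    // e.g. run with --runner=SamzaRunner (beam-runners-samza on the classpath)
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);

    p.apply("ReadEvents", Create.of("memberA", "memberB", "memberA")) // placeholder input
     .apply("FiveMinuteWindows",
            Window.<String>into(FixedWindows.of(Duration.standardMinutes(5))))
     .apply("CountPerElement", Count.perElement());

    p.run().waitUntilFinish();
  }
}
```

The same pipeline, unchanged, can be handed to a different runner (Flink, Spark, Dataflow), which is the engine-agnostic property Yi describes.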
Yi Pan 00:54:45 The third aspect I want to mention is that we have actually improved the operability at scale for stream processing in Samza quite a bit. The biggest challenge, of course, in stream processing is maintaining a large number of stream processing pipelines with strict SLOs, 24/7. We have put a lot of effort into many aspects of that; I will only mention two here. One is that we added advanced auto-scaling capabilities to the Samza platform. One of the biggest complaints from users, of course, is the difficulty of tuning the capacity configuration for a running pipeline as the incoming traffic changes. And when you deploy the same pipeline from data center A to data center B, the traffic differs and the hardware may be different, so tons of the tuning you have done in data center A doesn't work in data center B. This is a nuisance and creates a lot of pain for the users.
Yi Pan 00:55:42 This is especially true for SQL users, since SQL users are mostly not familiar with the operational details of the physical system. So Samza has put a lot of effort into providing tools and systems to help monitor and auto-scale the jobs, so our users won't need to carry the operational overhead. Our recent success in this effort you can find in a paper at HotCloud '20; the title is Auto-Scaling for Stream Processing at LinkedIn. We actually deployed the auto-scaling system together with our largest-scale operational Samza pipelines in LinkedIn, achieved great performance and also operational cost savings, and we view it as an important differentiator of our technology. The second aspect I want to mention in this operability improvement is large-state failure recovery and resilience.
Yi Pan 00:56:41 So, as I explained earlier, Samza leverages RocksDB for local state, and Flink follows, and even Spark Structured Streaming, when you need to have a large state, also provides RocksDB as the local storage. That provides fast access while the job is healthy, but Samza uses a Kafka topic as a changelog to recover the state store on failure, and this is heavily tuned toward high throughput in the healthy case — the recovery from the Kafka topic is relatively slow. So we actually implemented two solutions to speed up the state recovery. One is host affinity in YARN: we allow the job to schedule the container back to the same host in YARN, and if the host is still available, we can leverage whatever local snapshot exists on that YARN host for fast recovery.
Yi Pan 00:57:44 In production this actually achieves 50 times, or 20 times, faster recovery. And the other is to battle against the case where the host is totally gone. It's called the hot standby container: we implemented a way to stand up a hot standby container on another host, which keeps the state in sync and can immediately take over the work from the failed container if the mission is critical. That allows us to operate in the range of mission-critical jobs that cannot really suffer any downtime. There is work still in progress to improve the state recovery in case of host failures, trying to strike a balance between the cost you pay in having many standby containers versus the time you take to recover. So that's the third aspect. And that's the set of reasons that we chose Samza as the runtime platform in LinkedIn, right? There are also external users like eBay, Redfin, TripAdvisor, Optimizely — which is a marketing analysis company — and Slack, and I think most of their use cases chose Samza because of the above reasons.
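For reference, the two recovery features described here surface to users as job configuration. The keys below reflect roughly how recent Samza versions expose host affinity and hot-standby containers, but they are quoted from memory and should be verified against the documentation for your release.

```java
import java.util.Map;

/** Sketch of the recovery-related settings discussed above (key names approximate). */
public final class RecoveryConfigSketch {
  public static final Map<String, String> SETTINGS = Map.of(
      // Ask the cluster manager to place a restarted container back on the host
      // that already has the RocksDB snapshot, so state is restored from local
      // disk instead of replaying the whole Kafka changelog.
      "job.host-affinity.enabled", "true",
      // Keep a hot standby for each active container so a failed container's
      // work can be taken over immediately (2 = one active plus one standby).
      "job.standbytasks.replication.factor", "2");
}
```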
Adam Conrad 00:59:09 Great, and we'll link to all this documentation in the show notes, so you'll be able to see the Samza official website, the Twitter account, the link to other resources at Awesome Streaming, as well as the paper you referenced earlier, Auto-Scaling for Stream Processing at LinkedIn. So we're basically going to wrap things up. Yi, could you just summarize really quickly: when someone's ready to make the dive into Samza, what sort of parting advice could you give to folks who are looking to get started with stream processing, or Samza in general?
Yi Pan 00:59:40 So I think, first of all, when you're looking into a stream processing platform, you need to understand that in this domain, throughput, fast performance, and consistency are a big trade-off. Just keep in mind that in any distributed streaming platform you need to look for those trade-offs. And if you're specifically looking at a comparison between stream processing platforms, as I mentioned earlier, there are four aspects I would recommend. Look at the API — see whether it's pure streaming or micro-batch — and look back at your use case: whether you can take the micro-batch way of writing your logic, or you have to do it the streaming way, because some logic is hard to align with the batch boundary, for example session windows and those kinds of things, right?
Yi Pan 01:00:57 And then the other aspect is: for the whole pipeline, how much consistency are you going to need — not atomicity, but something more like a consistent snapshot, or exactly-once, that you want to achieve? If you want strong exactly-once across the whole pipeline, then you may pay the additional cost of synchronizing the states in different stages, in which case RPC probably gives you a better option. But if you're going with: I do not want to sacrifice my performance for that, I want an eventually consistent way and to be able to isolate failures in different stages without impacting the others — then a persistent shuffling queue between the stages of the pipeline may help to solve your problem. The size of the state store is another big question.
Yi Pan 01:01:59 If you're doing stateful stream processing, then I would recommend, if the sizes are big, to look for a RocksDB solution. Mostly that's it. And lastly, on the cloud vendor integration side, it is up to your appetite: do you want to have your own installation and have to run a stream processing cluster, or can you leverage some existing cloud vendor offerings? That's just a start — there are a lot more details to evaluate. Perfect. Yi, we're just about out of time. Thank you so much for coming on Software Engineering Radio today. No problem. Thank you a lot, Adam.
[End of Audio]
This transcript was automatically generated. To suggest improvements in the text, please contact content@computer.org.
SE Radio theme: “Broken Reality” by Kevin MacLeod (incompetech.com — Licensed under Creative Commons: By Attribution 3.0)
Tags: apache, apex, Flink, IEEE Computer Society, kafka, podcast, samza, SE-Radio, Spark, Storm, stream-processing, streaming, yarn