SE Radio 337: Ben Sigelman on Distributed Tracing

Ben Sigelman, CEO of LightStep and co-author of the OpenTracing standard, discusses distributed tracing, a form of event-driven observability useful in debugging distributed systems, understanding latency outliers, and delivering “white box” analytics. Host Robert Blumen spoke with Sigelman about the basics of tracing, why it is harder in a distributed system, the concept of tracing context, how context is propagated, how trace data is collected, interoperability in a polyglot environment, two approaches to collection (instrumentation versus injection), the architecture of a tracing back end, what type of databases are used, querying the back-end database, typical queries that a human user would run, other systems that query the back end, integration with monitoring and alerting, distributed tracing as a source of analytics for business insights, and adoption of distributed tracing in a software organization.

Show Notes

Transcript

Transcript brought to you by IEEE Software

[0:00:00]

Robert Blumen: This is Software Engineering Radio, the podcast for professional developers. On the web at SE-Radio.net. SE-Radio is brought to you by the IEEE Computer Society, IEEE Software Magazine, online at computer.org/software.

Digital Ocean is the easiest cloud platform to deploy, manage, and scale applications of any size, removing infrastructure friction and providing predictability so developers and their teams can deploy faster and focus on building software the customers love.

With thousands of in-depth tutorials and an active community, we provide the support you need. Digital Ocean stands out of the crowd due to its simplicity and high-performance with no billing surprises. Try Digital Ocean for free by getting a $100 infrastructure credit at DO.co/SERadio.

For Software Engineering Radio, this is Robert Blumen.
[0:01:00]
Today I have with me Ben Sigelman. Ben is the co-founder and CEO of LightStep, where he is building reliability management software. Ben holds a bachelor’s in Mathematics and Computer Science from Brown University. He is a co-author of the OpenTracing standard.

Previously he built Dapper, Google’s production systems tracing infrastructure, and Monarch, Google’s fleet-wide time-series collection and analytics system. Ben, welcome to Software Engineering Radio.

Ben Sigelman: It is a pleasure to be here. Thanks a lot for having me.

Mr. Blumen: Ben, today we’re going to be talking about distributed tracing. Prior to talking about distributed tracing, I’d like to talk about tracing in a monolithic or single instance system and then we’ll move on to look at what are the complexities that are added by a distributed system. Let’s start with what is tracing?

[0:02:00]

Mr. Sigelman: It’s a great question. I’m glad you started there because I think I often forget to do that myself and then people who have an idea of what that means can get themselves tied up in knots. The industry has used the word tracing to refer to a bunch of different things that all have some commonality but really are quite different. So you have things like stack traces which have that word trace in them.

And then you have things like kernel tracers, like DTrace and its kin. And then you have distributed tracing systems like Dapper, Zipkin, or Jaeger and so on and so forth, which are really a separate animal.

In terms of the common thread, I think really if you talk about all of monitoring and observability there are fundamentally two types of data. There is event data and there is statistical data. And that’s kind of it. Of course, those are really broad categories. The tracing data itself is definitely event data. So that’s something that we have in common across all these domains.

[0:03:00]
Typically, if you’re talking about kernel tracing, you’re looking at things that take a couple of nanoseconds and they happen very frequently. In DTrace, for example, you’d write a script that looks a lot like an AWK script and will take this very high-frequency stream of events and then generate a bunch of useful statistics which you can interpret.

In the case of stack traces and that sort of tracing you’re tracing up the stack. Of course, everyone who’s a developer has seen a stack trace. There is nothing mysterious about that, but it’s a totally different thing than kernel tracing or distributed tracing. And then distributed tracing, I mean obviously the word distributed is in there. What you’re really doing there is you’re looking at a single logical transaction in a system.

Mr. Blumen: Let me clarify. Statistical data could be things like amount of memory used or CPU load, whereas event data has a lot of particulars about what happened in this case. Would that be an accurate differentiation?
[0:04:00]

Mr. Sigelman: Yeah, I think event data has a time stamp that something happened, an event occurred, whereas something like a CPU level is essentially an aggregate. You’re saying over some period of time this is a summary of what happened, a statistical summary that 75% of the time your CPU is in a runnable state. So that’s an apt description. Other common statistics would be things like event rates or latency percentiles, things like that.
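
The distinction between event data and statistical data can be illustrated with a minimal sketch. The `Event` record, its field names, and the sample values are hypothetical and chosen only for illustration; the point is that rates and latency percentiles are summaries computed over timestamped events.

```python
import math
from collections import namedtuple

# Hypothetical event record: each event carries its own timestamp.
Event = namedtuple("Event", ["timestamp_s", "name", "latency_ms"])

# Made-up sample data covering a two-second window.
events = [
    Event(0.10, "GET /search", 12.0),
    Event(0.35, "GET /search", 15.0),
    Event(0.80, "GET /search", 11.0),
    Event(1.20, "GET /search", 240.0),
    Event(1.90, "GET /search", 14.0),
]


def event_rate(events, window_s):
    """Events per second over a window: a statistic derived from events."""
    return len(events) / window_s


def percentile(values, pct):
    """Nearest-rank percentile of a list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


latencies = [e.latency_ms for e in events]
print(event_rate(events, 2.0))    # events per second
print(percentile(latencies, 50))  # median latency
print(percentile(latencies, 99))  # tail latency
```

The individual events keep the particulars (which request, when, how long), while the statistics discard them in exchange for a compact summary.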

Mr. Blumen: We had another episode of Software Engineering Radio dealing with logging. This sounds like it might be similar but different. What is the distinction?

Mr. Sigelman: Yeah, absolutely logging is very much on the side of event monitoring. That word is also really problematic in that it means different things to different people, but I think logging has come to mean event monitoring where the cost is reasonable for centralization. You assume that you can pay to centralize your logs and then search over them, generally speaking.
[0:05:00]
So I think it’s really the events that you can afford to centralize. That’s the way I think of it. It would be great if you could get every single kernel. Like, every system call would be great to get all those into your single combined instance, but you can’t because it’s too expensive. So you don’t consider that part of your logging strategy.

Mr. Blumen: I’m not sure I at this point have a great idea of what tracing is. I understand it’s event data about what happened in your system. Could you drill down into more detail about the thing we call tracing? What differentiates that from other types of events?

Mr. Sigelman: Yeah, sure, tracing I think typically in a conventional, if we go back 15, 20 years, that kind of thing. Tracing usually involves automatic instrumentation of oftentimes things that happen at the kernel boundary.

So every system call would be traced or something like that, or you trace every function call; whereas, logging usually implies — again, these are all generalizations — usually implies that a programmer added a logging statement explicitly to their code.
[0:06:00]

So, I think there is a distinction there. Of course, there are exceptions to these rules.

Mr. Blumen: I got it. Then logging would be instrumentation that programmers add to the program to collect data whereas tracing is something that at some level, this is similar to a concept in aspect-oriented programming called a crosscutting concern where you can intercept the normal operation of your system and collect event data from that.

Mr. Sigelman: Yes and this gets very blurry when you start talking about more modern architectures, but if we’re talking about the historical context I think that’s totally accurate.

Mr. Blumen: Okay. So in a modern system, what are some of the primary use cases for tracing?

Mr. Sigelman: When we define modern it means that you’re building a complex piece of software that probably involves many teams who are working in concert, developing services that work in concert. So, you can call that serverless, or microservices, or what have you.
[0:07:00]

But if you’re doing that kind of distributed development and distributed architecture, that to me is what modern means. So, in that kind of modern system I think that the tracing that we’re doing is actually moving up the stack considerably because the pain point people are having is moving up the stack considerably. If you’re dealing with a distributed architecture, the kinds of questions you’re trying to answer are actually very primitive.

You’re trying to say things like, “This transaction was slow,” or, “This transaction had an error. I literally have no idea what happened. I have no idea what services it touched. I have no idea what happened in those services. I have no idea.”

You’d be thankful if you were at the point where you’re trying to understand what happened at the system call level. You’re actually just trying to figure out which services were involved. And that’s a very different question.

So with distributed tracing, the granularity is actually much coarser than kernel tracing, but it has this one huge advantage, which is that it can see across system boundaries. So if service A calls B, calls C, calls D, and D is having a bad day and it affects service A, you can actually figure that out very easily.

[0:08:00]
And that’s a profound thing if you didn’t have it before, but it’s a really different problem than kernel tracing.

Mr. Blumen: It sounds like if you’re either trying to figure out what happened in this case or why something took as long as it did, then tracing is primarily a troubleshooting or debugging tool. Is that accurate?

Mr. Sigelman: This is a great question. I think the answer to that is changing and certainly I hope it’s changing. Historically in terms of distributed tracing, it’s been used for performance analysis and for root cause analysis. The performance analysis is often something where you have a couple of months or quarters even to make a performance improvement and you want to make sure you focus your energies where it’s actually going to have an effect on the business.

So anecdotally when I was at Google before we deployed something like Dapper, we had people spending six months on 20% performance improvements that were off the critical path for the end user. So it literally made no difference.
[0:09:00]
And user latency was not affected by these performance improvements because they were in the wrong place. Distributed tracing can help you avoid that failure mode. The other use case is that you’re woken up at three in the morning.

You know that something bad is happening and you need to figure out where as quickly as possible, and if you are working in a distributed environment, a tracing system can be really helpful with that. In both cases you’re actually looking at individual traces to make these assessments.

I think that where I see the industry heading is to see this underlying data, the distributed tracing data, as a fount of knowledge about how these distributed systems interact, and to surface higher-level insights to developers, operators, and even management to better understand these systems.

And I can of course expand on that, but I think we’re really in early days and it’s going to be a lot more powerful than looking at individual traces in a UI.

Mr. Blumen: I do want to come back to that higher-level insights, but let’s stay on the troubleshooting and low level stuff for a bit. The idea of critical path analysis. Explain what that is.
[0:10:00]

Mr. Sigelman: Yeah thanks for asking. So the idea is that if you are operating in a modern environment, there is a lot of concurrency in the system that you’re developing. So if you call everything in serial, it would take days to get back to users if you’re operating with hundreds of services. So you do things in parallel.

Now if you were to call out to five services in parallel and you have to wait for all of them to return, the one that comes back last delays the end-user. It’s obvious.

The one that comes back last is on the critical path. That’s really the idea: whenever there is concurrency or parallelism, you need to go back, understand what the laggard was, and focus your analysis there. It’s an easy thing to do automatically and programmatically if you have the structural information, which distributed traces do, to help with that analysis.

So, a good tracing system should help you surface the critical path automatically so you don’t waste time analyzing things that aren’t actually holding up the end-user with a transaction at the top of the stack.
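
As a rough sketch of the critical-path idea described above, assume each span records a start time, an end time, and its child spans, and that a parent waits for all of its parallel children. The `Span` class, the service names, and the timings are hypothetical. Under those assumptions, the critical path can be found by following the last-finishing child at each level.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    """Hypothetical span: a timed segment of work with child spans."""
    name: str
    start_ms: float
    end_ms: float
    children: List["Span"] = field(default_factory=list)


def critical_path(root: Span) -> List[str]:
    """Follow the last-finishing child at each level of the span tree.

    Simplification: assumes a parent fans out to its children in
    parallel and cannot finish until the slowest one returns.
    """
    path = [root.name]
    node = root
    while node.children:
        node = max(node.children, key=lambda s: s.end_ms)
        path.append(node.name)
    return path


# Made-up trace: the frontend calls three services in parallel.
trace = Span("frontend", 0, 120, [
    Span("ads", 5, 40),
    Span("spellcheck", 5, 25),
    Span("index", 5, 110, [
        Span("shard-1", 10, 60),
        Span("shard-2", 10, 105),
    ]),
])

print(critical_path(trace))
```

In this made-up trace, speeding up `ads` or `spellcheck` would not change what the end user sees, because `index` and its slowest shard determine when the frontend can respond.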
[0:11:00]

Mr. Blumen: We did another show about latency and latency outliers. On that show we covered the idea that latency outliers are far more important to the human perception of the responsiveness of a system but these are things that don’t occur by definition very often. Is tracing something that would help you identify what are the one tenth of a percentile of the worst performing requests that you served?

Mr. Sigelman: Absolutely, especially if the tracing system is designed in a way that allows it to focus its energies in those areas. And I think that, again, is something that I think about all the time. It’s absolutely a useful technology for that.

Mr. Blumen: You said one of the use cases is to figure out what happened. I might say we built the system so we should know what it does, but I understand the person who’s on call when the alert goes off may not know that the system may call five other systems today that it didn’t call yesterday because no one told him.

[0:12:00]
Is tracing a good reverse engineering tool for operations?

Mr. Sigelman: I would argue it’s the best reverse engineering tool. That’s why people get excited about it. There are a lot of difficulties with tracing in practice, but that’s the value prop is that in that situation where you need the information it’s just right there in front of you. That’s almost by definition of what it’s doing.

Mr. Blumen: Okay. Now we’re going to drill down into more detail about what goes on in tracing and I promise we will get to distributed tracing, but I want to go through some fundamental concepts that will enable us to have this language to have this discussion. There are concepts that are in the literature I was reading to prepare for this show and I want you to explain what these are. The first one being a transaction.
[0:13:00]

Mr. Sigelman: That’s an interesting one. I have a definition of it that I feel very confident about, but I’ve seen other people use other definitions of it that are in conflict with my own, but I think a transaction ought to be considered as a single logical unit of work in its entirety. So I say in its entirety to emphasize the fact that the transaction may move from process to process, or from machine to machine, or thread to thread and it’s still the same transaction.

There are other people that defined a transaction to be scoped to a single VM. And then when it moves to another VM, even if it’s part of the same unit of work, that’s considered its own transaction. I think that is limited and limiting as a definition. So it’s not the one I subscribe to but hopefully that answers your question at least from my perspective.

Mr. Blumen: Can you give an example of a transaction, either something you worked on at Google or at another project?

Mr. Sigelman: Sure. I mean the web search is a perfectly fine example that we can all relate to. So you type your query.
[0:14:00]

The transaction actually begins in your browser. It goes through a number of front-end layers. It’s no secret that Google runs something called GWS, G-W-S, which is the Google Web Server, which is where the logic starts happening. It farms that out to dozens and dozens of different services that all take a crack at your query. Google takes the results from all those things and combines them into a result set and then presents it to you.

Now if you miss caches along the way, that transaction will involve literally thousands or tens of thousands of different VMs before it comes back to you in a quarter of a second. And that’s what a cache miss looks like for Google. A cache hit of course can be much cheaper, but these transactions are very, very elaborate.

Mr. Blumen: Does the transaction include the case where your request places some work on a queue to be done later, and then that work gets done later? Is that part of the transaction?

Mr. Sigelman: You’re asking all the hard questions. That is an area of debate.

[0:15:00]

I think if the queue is just a FIFO queue, one in one out kind of thing, yeah totally. If your queue is doing batching, so an example I would give is a lot of high-throughput storage systems like Cassandra, or HBase, or Bigtable at Google, things like that.

They’ll do some kind of log-structured write where you’ll do the write, it’ll transact, it will return to the user and you continue doing your thing, and then later it’ll take a sequence of writes and consolidate them into one read-optimized block. And that ends up being very expensive.

So, if you’re trying to profile the things, you need to understand where all the expense is coming from. And unfortunately it comes from many transactions because there is batching. So whenever there is batching, the notion of what a transaction actually is gets to be kind of difficult.

At the end of the Dapper paper that I helped to write, we talked about this as being a pernicious issue for us from an analytical standpoint, and I don’t think that there is a very clear way of saying one way or the other if that’s one transaction or many transactions.

[0:16:00]
It is true that if you consider it to be one transaction, your transactions get to be very large because it’s no longer a tree. It is a graph, and a full graph can suck up your entire system.

Mr. Blumen: I’m thinking if I care about latency and that’s the use case then work that’s going to be done later is not going to count toward latency. But if I was more interested in what are all those steps in serving this request, I would potentially care about work that’s going to be done later. Is that correct?

Mr. Sigelman: I wish it was. I really wish it was. It’s almost correct. It’s certainly correct in our mental model. Let me go back to the example I was just giving. So in my experience, if you have a system where latency degrades without a software release, I’m not going to put a number on it but let’s say almost all the time, it’s because something is overloaded. So most sudden production latency regressions are due to throughput. So, you have some kind of overwhelming throughput in a system and that creates a bottleneck.
[0:17:00]

In queuing theory, if there is a bottleneck and you start falling behind, things go haywire, and the haywire is reflected in very high latency. So the problem there is now you need to understand where that load came from. And going back to the example we were just speaking about, since we already went through it, if you have some kind of storage system and the CPU is getting really hot, it’s probably because of this kind of consolidation and batching, and that’s a matter of looking at all of these other requests.

You’re right in the sense that the requests can be thought of in isolation, but the requests from a couple of minutes ago that are getting batched and are causing you to saturate CPU are actually affecting the latency for requests that are happening minutes later, and understanding that kind of root cause analysis is very, very challenging, because the amount of data is so overwhelming that if you centralize all of it you’re not going to be able to afford your observability system, and if you don’t you literally lack the information to run that analysis. And that’s the problem.
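
The queuing-theory remark above can be made concrete with a textbook model. The sketch below uses the M/M/1 queue (a single server with random arrivals and service times), which is an assumption introduced here for illustration and not something named in the conversation. In that model the mean time a request spends in the system is 1 / (service rate − arrival rate), so latency climbs steeply as load approaches capacity.

```python
def mm1_mean_latency_s(arrival_rate: float, service_rate: float) -> float:
    """Mean time in system for an M/M/1 queue, in seconds.

    arrival_rate and service_rate are in requests per second.
    Returns infinity when the server cannot keep up.
    """
    if arrival_rate >= service_rate:
        return float("inf")
    return 1.0 / (service_rate - arrival_rate)


# A hypothetical server that can handle 100 requests per second.
for load in (50, 90, 99, 100):
    print(load, mm1_mean_latency_s(load, 100))
```

Going from 50% to 99% utilization in this model multiplies mean latency by fifty, which is one way to see why a modest increase in background batching work can surface as a sudden latency regression.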

[0:18:00]

Mr. Blumen: You just now mentioned the data that is collected. I’m going to come back to that also. I want to move on at this moment to another important concept in tracing, which is the idea of context and propagation. Could you address that?

Mr. Sigelman: Yeah totally. So in a tracing system, primarily a distributed tracing system but not necessarily, since this can happen in a single-host system as well, it’s important to understand the thread of execution. I probably shouldn’t use the word thread. It doesn’t necessarily have to be a single thread. In a JavaScript environment like Node, continuation-local storage is often where you put this context, and that will actually move from logical thread to logical thread.

But the idea that I’m trying to get at here is that you need to understand the sequence of events that affect a single transaction and the context object is usually the thing where you store some kind of central identifier or other state that you can use to tie that sequence of events together.

[0:19:00]
It gets tricky because the context ____ can join as we were saying earlier. To make things faster we do things concurrently so the context literally splits in half and then rejoins later. And there is a lot of theory around that, but the idea of context is incredibly important to understanding transaction traces.
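
A minimal sketch of a context that follows a transaction across logical threads: the conversation mentions continuation-local storage in Node, and this example swaps in Python’s `contextvars` module with `asyncio` as a rough analog. The variable name `trace_id_var` and the service names are hypothetical.

```python
import asyncio
import contextvars
import uuid

# Hypothetical context variable holding the current transaction's trace ID.
trace_id_var = contextvars.ContextVar("trace_id", default=None)


async def call_service(name):
    # Each task sees the context that was current when the task was created.
    await asyncio.sleep(0)
    return (name, trace_id_var.get())


async def handle_request():
    trace_id_var.set(uuid.uuid4().hex)
    # Fork: two concurrent calls. Join: gather waits for both to finish.
    results = await asyncio.gather(
        call_service("ads"),
        call_service("index"),
    )
    return trace_id_var.get(), results


def run_demo():
    return asyncio.run(handle_request())


trace_id, results = run_demo()
print(trace_id, results)
```

Both concurrent calls report the same trace ID as the request that spawned them, without the ID being passed as an explicit argument.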

Mr. Blumen: Let’s go back to this Google search example. What are some interesting fields or data that would be in the context?

Mr. Sigelman: So the approach that we took with Dapper was honestly pretty simplistic, but was effective, which was to have two unique IDs. One was called the trace ID, which lived for the entirety of this one transaction, and the other was called a span ID.

And a span is one logical segment of that trace that doesn’t have any internal forking and joining of its own and is “the right size” to measure in a system like this. That basically means that you’re not going to have a span for every system call but you probably will have a span for every remote procedure call or HTTP call or something like that.

[0:20:00]
So the trace ID is consistent for the entire trace. The span IDs are unique within that trace. And then you form, well in Dapper’s case a tree, more formally theoretically a graph of spans pointing to their parents. And that allows you to infer the structure of the transaction and get things like the critical path out. So the context will contain the trace and span IDs.
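
The trace ID and span ID scheme can be sketched as follows. The dictionary layout and the `parent_id` field name are assumptions made for illustration, not Dapper’s actual format. Spans may arrive in any order; grouping them by trace ID and following parent pointers recovers the tree.

```python
from collections import defaultdict

# Hypothetical span records as they might arrive at a collector, in any order.
# Span IDs only need to be unique within a trace, so "a" appears in both traces.
spans = [
    {"trace_id": "t1", "span_id": "c", "parent_id": "a", "name": "index"},
    {"trace_id": "t1", "span_id": "a", "parent_id": None, "name": "frontend"},
    {"trace_id": "t1", "span_id": "b", "parent_id": "a", "name": "ads"},
    {"trace_id": "t2", "span_id": "a", "parent_id": None, "name": "frontend"},
]


def build_tree(spans, trace_id):
    """Group one trace's spans by parent to recover the call structure."""
    children = defaultdict(list)
    root = None
    for span in spans:
        if span["trace_id"] != trace_id:
            continue
        if span["parent_id"] is None:
            root = span
        else:
            children[span["parent_id"]].append(span)
    return root, children


def render(root, children, depth=0):
    """Return the tree as indented lines, children sorted by name."""
    lines = ["  " * depth + root["name"]]
    for child in sorted(children[root["span_id"]], key=lambda s: s["name"]):
        lines.extend(render(child, children, depth + 1))
    return lines


root, children = build_tree(spans, "t1")
print("\n".join(render(root, children)))
```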

We went further eventually. The Census project at Google, not to be confused with the OpenCensus project which derives from it, added the originating service. So if the request initially came from a Google Web Search user, or a Gmail user, or a Calendar user, et cetera, that product ID was essentially encoded and used for resource accounting, which is actually a really profound application of this underlying tracing that has nothing to do with latency analysis.
[0:21:00]
And I think it’s important for people to understand the applications of this technology are much broader than observability and it really comes back to the context rather than the latency measurements in that case.

Mr. Blumen: I’m hoping to get something more concrete like the name of each service it visits, the IP address, maybe this stack trace or call stack on that process. What kind of things?

Mr. Sigelman: That’s not in the context. So let me try again. I’ll be less verbose. There are really two types of data that you want to record. One is the data that you record in band, which is to say if you’re sending a request from service A to service B, you have to pass some context along in band along with application data. And then the other data is out of band. The in-band data is very small. All you want to do is record unique IDs. So in Dapper we recorded a trace ID that was consistent for the entire transaction and what we called a span ID, which represents that one service call.
[0:22:00]
In the out-of-band channel you record much more detailed information: all the timing information, all of the tags, the names of things, the names of the services, names of endpoints, even a micro log of events that took place for each span.

That’s all sent out of band and buffered and that does not need to happen in real time. You just get it out of the process as efficiently as you can, but there is this very thick buffered out-of-band channel and this very thin svelte in-band context which just records unique IDs.
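
The split between the thin in-band context and the thick out-of-band channel can be sketched like this. The header names `x-trace-id` and `x-span-id`, the two toy services, and the record fields are all invented for this example and do not correspond to any particular tracing system’s wire format.

```python
import queue
import uuid

# Out-of-band channel: detailed span records are buffered here and
# shipped later, off the request path.
report_buffer = queue.Queue()


def inject(trace_id, span_id):
    """In-band: only the two small IDs travel with the request."""
    return {"x-trace-id": trace_id, "x-span-id": span_id}


def extract(headers):
    return headers["x-trace-id"], headers["x-span-id"]


def service_b(headers, payload):
    trace_id, parent_id = extract(headers)
    span_id = uuid.uuid4().hex[:16]
    result = payload.upper()  # stand-in for the real work
    report_buffer.put({
        "trace_id": trace_id, "span_id": span_id, "parent_id": parent_id,
        "service": "B", "endpoint": "/shout", "tags": {"bytes": len(payload)},
    })
    return result


def service_a(payload):
    trace_id = uuid.uuid4().hex
    span_id = uuid.uuid4().hex[:16]
    result = service_b(inject(trace_id, span_id), payload)
    report_buffer.put({
        "trace_id": trace_id, "span_id": span_id, "parent_id": None,
        "service": "A", "endpoint": "/entry", "tags": {},
    })
    return result


def flush():
    """Drain the buffer, as a background reporter might do periodically."""
    batch = []
    while not report_buffer.empty():
        batch.append(report_buffer.get())
    return batch
```

Calling `service_a("hi")` returns the application result immediately; a later `flush()` yields two span records that share a trace ID, with service B’s record pointing at service A’s span as its parent.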

Mr. Blumen: This sounds something like how log aggregators work, where you do not need to forward the log message to the log aggregator in-band during the work the program is doing, as long as it gets queued up and later gets collected.

Mr. Sigelman: Exactly the same.

Mr. Blumen: And you may have said this or maybe I’m inferring this: this ID is going to enable you to correlate all the different collections that occurred across many different servers, and then you’re going to have this more detailed data that will get stored with that key so that you’re able to match up all these different pieces of context.
[0:23:00]

Mr. Sigelman: Exactly.

Mr. Blumen: Okay. Do we trace exceptional conditions or errors as well as the thing that it was supposed to do, or that we wanted it to do?

Mr. Sigelman: We should. Yeah, I think generally speaking, it’s a goal for systems like this to have extra detail when things aren’t going well. And an error would be an example of that, both soft errors and hard errors, and some tracing systems are better at dealing with that situation than others. And that gets into sampling, which we haven’t discussed yet.

Mr. Blumen: How big in a case like the Google search example where you’re hitting a bunch of different servers. It might be ads, spellcheck, search indexes. How big is the data that’s collected for a single search?

Mr. Sigelman: Good question. It does vary but the rule of thumb is that an individual span which is a single segment of the trace is usually between a hundred and a thousand bytes.
[0:24:00]
So you can kind of do the math: if you have a couple thousand spans, you’re talking about something on the order of single-digit megabytes of data for one of these traces.
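
The back-of-the-envelope arithmetic can be written out, taking “a couple thousand spans” to mean 2,000 (an assumption) and using the stated range of 100 to 1,000 bytes per span.

```python
def trace_size_bytes(span_count: int, bytes_per_span: int) -> int:
    """Total trace size, assuming every span is the same size."""
    return span_count * bytes_per_span


low = trace_size_bytes(2_000, 100)     # smallest spans
high = trace_size_bytes(2_000, 1_000)  # largest spans

print(low / 1_000_000, "MB to", high / 1_000_000, "MB")
```

Under these assumptions a trace lands between roughly 0.2 MB and 2 MB, and a trace with more spans grows proportionally.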

Mr. Blumen: We’ve been talking now about the concepts involved in tracing and some of the plumbing of how it works. You have mentioned the distributed systems case where you’re propagating this ID across processes. I do want to dive into distributed tracing now. What makes that harder than tracing in a single process?

Mr. Sigelman: Well, earlier I was saying that in conventional single system tracing, it’s usually done via some kind of automatic instrumentation of function calls or system calls or something like that. In the case of distributed tracing, we wish very much that we had something like that to lean on.
[0:25:00]

And that’s actually what the OpenTracing project is attempting to accomplish for everyone’s benefit, but we don’t. So you end up either having to build some kind of software agent that attempts to do this through brute force and monkey patching and introspection, which I think is honestly just impossible to do without overhead, or you have to write a bunch of instrumentation.

So the number one problem with distributed tracing historically has been that instrumentation is hard. At Google we could cheat because things were so well factored that you could add instrumentation to a very small subset of the code base and get huge coverage. There are only a few companies in the world that operate at scale and have that property. So instrumentation is quite difficult for distributed tracing.

Mr. Blumen: In a typical distributed system these days you’re crossing over a lot of stuff you don’t own, like NGINX and HAProxy, and things written in different languages, which could be Go, Python, Java, Ruby.
[0:26:00]
And then when you get into code that you own it’s going through layers and layers of framework functions before it calls into your code. All of these things could be somehow relevant, not just the code that you wrote. How do you get instrumentation in all these layers that you didn’t build?

Mr. Sigelman: This is absolutely the right question to ask. The purpose of projects like OpenTracing is to address that problem. So OpenTracing actually does have an implementation that allows for introspection into NGINX and HAProxy and many of the other technologies you mentioned. And it’s supported in ten languages now. I said earlier that things are getting blurrier in the modern world.

What I meant by that is the separation between what I would consider to be business logic or application code and the libraries that serve the purpose of a kernel boundary but are not by any means a kernel is getting quite blurry.
[0:27:00]
There are a lot of things that we depend on, things like RPC libraries like gRPC or sidecar systems like Envoy and Istio. Things like this feel to me at an emotional level like the kernel used to feel, and yet they’re absolutely not the kernel. And those things need to be instrumented out of the box or you’re going to drive yourself crazy. And so we need to have a standard way of describing transactions.

That is literally the purpose of OpenTracing. That is why we did it. It’s so that you don’t need to have every developer hand-instrument all this stuff, much of which is incredibly unfriendly to instrumentation to begin with. And so we try to solve that problem once in a way that’s vendor neutral and implementation neutral. And that’s the purpose of that project, to address that, and I think it’s been pretty successful at that.

Mr. Blumen: Okay. I want to talk about OpenTracing, but I have one more question that is in this propagation area.
[0:28:00]
It is very common now for applications to interface with an API to something else. Let’s say, for example, I work on a product that integrates with Facebook. It’s calling Facebook APIs, waiting, and getting a response back. Would you ideally like to propagate your context all the way to Facebook and back, or is that a black box that becomes just one thing in the final result?

Mr. Sigelman: Well, there’s what you’d like and what’s realistic. I think that it would be wonderful to propagate that context to Facebook and back and have insight into their systems. That would mean that there’s a technical problem that we have to solve, and also that Facebook is willing to share that information with the outside world, which they’re definitely not.

So, I think that there is almost a business concern that’s going to prevent that kind of reality from taking hold. There is a nice standards effort that’s led by a number of different vendors, presently a W3C project, to standardize on the propagation formats in all of their excruciating detail.
[0:29:00]
And one of the end goals for that is to develop ways for vendors like Facebook in this case to return helpful information beyond how long this thing took to help you understand how you could make it faster or if you’re doing something wrong.

So something like an EXPLAIN query in SQL that’s just always on for any SaaS that you depend on I think would be a really useful thing for Facebook and for its developer users. And I would love to see that take place, and it seems realistic.

It is something that we think about at LightStep in that many of our customers are actually customers of each other. And since we’re collecting all the tracing information about both, we actually have the data. We have customer A calling customer B’s giant SaaS, and we have both sides of the transaction. Of course, we have contracts that forbid us from connecting the dots for them, but it’s ironic in that we actually have the whole thing. I’m sure it would be valuable to the caller to see that information.

[0:30:00]

Mr. Blumen: You mentioned OpenTracing a number of times as solving some of these issues that are otherwise hard to solve. What is OpenTracing and how does it solve the problems of implementing tracing?

Mr. Sigelman: Yeah, OpenTracing is an intentionally narrow project. I think that for people who haven’t had much exposure to distributed tracing, there is a tendency to focus on the UI aspects of it. But when you get into the implementation, especially the instrumentation side of it, the diversity of systems you need to integrate with to come up with a coherent, truly global trace is just overwhelming. And OpenTracing is specifically focused on that one aspect of the problem.

There are three problems in tracing. One is gathering data. One is centralizing that data, which probably involves a lot of trickery like sampling and things like that. And then one is actually analyzing that data and presenting it to human beings. So, if you ask people to draw a picture of a computer they draw a monitor, right? It’s the same thing.
[0:31:00]
We tend to think of the UI or the analytical features as tracing. But the first third of that problem, the instrumentation and data gathering piece, is just enormously complicated in practice, and OpenTracing really only concerns itself with that.

So it takes the form of a series of APIs that are standard across languages and are coherent and consistent, both with each other and with the languages themselves idiomatically. It also takes the form of a diaspora of instrumentation, at this point covering many hundreds of projects that support OpenTracing. And the idea is that if you depend on one of those projects, you don’t need to instrument that thing. It’s already done.

So again, you mentioned things like NGINX and HAProxy earlier. Those are examples of things that can support OpenTracing, but so can software layers: anything in the Java ecosystem like JDBC, or Dropwizard, or RxJava. The list goes on to Python, with Django and Flask; Node has Express support; and so on and so forth.
[0:32:00]
All of these major frameworks that people depend on have OpenTracing support either natively or via plugins. And that means if you depend on those, you don’t need to instrument them yourself. So as an application developer it takes away this massive frontloaded pain point of instrumentation, and that’s the value prop for OpenTracing.

Mr. Blumen: I want to go more into that collection. I’ll take Java as an example because I’m familiar with it. There’s the ability to attach something called an agent to the JVM, which can hook itself into what the JVM is doing and could collect information like this. I understand other languages have something similar to that. Is that how you would implement the tracing collection, by hacking into the language runtime?

Mr. Sigelman: So conventional APM vendors, AppDynamics and New ____ are great examples of this as a ____ trace, use things like Java agents almost to the exclusion of other techniques to gather this information.
[0:33:00]
And there are a lot of advantages to that. The most obvious one is that as a user of these systems you don’t need to make any code modifications to take advantage of these agents, and that’s actually really wonderful. The troubles are overhead, which can become an issue at high throughput and which I think is unacceptable for production applications, and the correctness, robustness, and durability of the instrumentation itself.

So if you’re trying to instrument something trivial, that probably works. But if you’re trying to instrument something where there’s a lot of internal concurrency and so on and so forth, the agent instrumentation that you’re writing has to actually match the particular minor version of the thing that you’re depending on, and that’s fragile. It’s just fragile.

There’s nothing more to it. That’s why there aren’t open sourced implementations of things like this that are very reliable because it requires a huge staff of people to maintain those instrumentations. And I think that that’s unsustainable. It’s also quite wasteful from just an industrial standpoint.
[0:34:00]
They’re spending most of their engineering effort maintaining these couplings between specific versions of libraries and their particular data formats, and a model like OpenTracing allows for a cleaner white box instrumentation of those systems. It does not preclude the idea that you could also have an agent.

I think Datadog has an APM that uses OpenTracing support and wraps it with a Java agent to prevent you from having to do any of the gluing of white box instrumentation to your application’s main line. These are not mutually exclusive techniques, but I think that the hand instrumentation of every library under the sun is really unsustainable and that’s the part that I would argue against.

Mr. Blumen: How does OpenTracing do it then without these agents?

Mr. Sigelman: It’s just literally code. It’s the same thing that you see with metrics if you look into software systems that are designed for distribution to a microservices environment. Metrics are now part of the code base. And I think instrumentation like this should also be part of the code ____ like unit tests or something like that.
[0:35:00]
I think it’s as important if not more important than a lot of those other things that we’ve come to accept will be part of that maintained code base.
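A minimal sketch of what "instrumentation as code" can look like. The `Tracer` class and its `span` method are hypothetical names invented for illustration and are not the OpenTracing API; the point is only that the trace statements live in the code base next to the logic they describe.

```python
import time
from contextlib import contextmanager


class Tracer:
    """Hypothetical in-process tracer (an illustration, not the OpenTracing API)."""

    def __init__(self):
        self.finished = []  # completed span records, in finish order
        self._stack = []    # currently active spans, innermost last

    @contextmanager
    def span(self, operation):
        # The parent is whatever span is active when this one starts.
        parent = self._stack[-1]["operation"] if self._stack else None
        record = {"operation": operation, "parent": parent, "start": time.monotonic()}
        self._stack.append(record)
        try:
            yield record
        finally:
            self._stack.pop()
            record["duration"] = time.monotonic() - record["start"]
            self.finished.append(record)


tracer = Tracer()


def load_user(user_id):
    # Instrumentation sits directly in the application code.
    with tracer.span("db.load_user"):
        return {"id": user_id}


def handle_request(user_id):
    with tracer.span("http.handle_request"):
        return load_user(user_id)
```

Calling `handle_request(7)` would leave two records in `tracer.finished`, with the database span naming the request span as its parent.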

Mr. Blumen: Does that mean then that if the NGINX maintainers want their product to be traceable, they would add some trace statements to NGINX, and I might run a slightly different version of it or set some flags in my conf that tell it to enable OpenTracing?

Mr. Sigelman: Yeah, I think that there are a couple of different approaches that people have taken. One is to design a general purpose, usually callback-driven way to observe a system like NGINX. And the other is to bake some kind of instrumentation, OpenTracing or otherwise, directly into the code base.

And then for the configuration of getting the data out, I believe that in NGINX’s case that could be done via environment variables or via configuration statements where you couple the instrumentation to some kind of downstream sink that you want to send it to, which is usually an OpenTracing-compliant tracing system or some kind of commercial APM vendor.
[0:36:00]

Mr. Blumen: I want to move on now and talk about the data, and then we’ll talk about the backend and getting data out of it. On the data that is collected, we had some discussion that there is the larger out-of-band data set that can get queued up and sent to the backend. How does the data get from the collection point to the backend? Is there some kind of sidecar-type process that runs on the instance and handles that communication?

Mr. Sigelman: That varies quite widely actually. There are different ways to do it. One of my personal axes to grind is that there are people trying to say that’s the right way of doing that. I think it entirely depends on what you’re trying to accomplish. There are people for whom it’s important, for instance, to make sure that if the process segfaults or has some kind of fatal exception, you don’t miss the transaction that had the fatal exception.
[0:37:00]
So naturally in that case, you can’t delay sending data out. You need to do it synchronously, which naturally has a lot of overhead. That’s not wrong. It’s just a requirement that you have, whereas there are other situations where you don’t care about that at all and you’re just concerned with making sure that you have the lowest overhead possible and that you’re always on in production. That demands a different type of application of data collection.

So I think there are many different ways to do it. If you can imagine an architecture, there’s probably someone who’s doing that right now. And I strongly urge the industry not to try and pick a single thing. I think it depends on the requirements, but it varies from synchronous offloading, to on dispatching, to sidecar processes, to remote collection. It’s all over the map.
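The trade-off between synchronous offloading and lower-overhead batching can be sketched as two reporters. Both class names and the `send` callback are assumptions made for illustration; a real system would replace `send` with a network call, a sidecar socket, or a log write.

```python
class SyncReporter:
    """Sends every span immediately: nothing is lost if the process dies,
    but each finished span pays the cost of a send."""

    def __init__(self, send):
        self.send = send  # hypothetical callable that ships a list of spans

    def report(self, span):
        self.send([span])


class BufferedReporter:
    """Queues spans and sends them in batches: lower overhead,
    but spans still in the buffer are lost on a crash."""

    def __init__(self, send, max_batch=100):
        self.send = send
        self.max_batch = max_batch
        self.buffer = []

    def report(self, span):
        self.buffer.append(span)
        if len(self.buffer) >= self.max_batch:
            self.flush()

    def flush(self):
        if self.buffer:
            self.send(self.buffer)
            self.buffer = []  # start a fresh list; the sent batch is untouched
```

Which reporter fits depends on the requirement described above: capturing the transaction that crashed the process, or keeping overhead as low as possible.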
[0:38:00]

Mr. Blumen: If I had a multi-threaded language, I could have a thread that was the tracing collector thread doing this, or I could dump it out to a log file and have something like a log aggregator that was tailing that, picking it up, and sending it somewhere. Those are all options.

Mr. Sigelman: Yeah, and there are systems like Sourcegraph’s open-source tracing system called Appdash that actually build the entire tracing system into the process. So it’s a Go-based system and you can actually go to a port and use a fully-fledged tracing system that’s linked into the process itself, and it does everything as part of the process, which of course has its disadvantages but it’s also really easy to deploy. So, again, the variation is pretty extreme.

Mr. Blumen: There is a lot of heterogeneity in how you might do this at each step. OpenTracing is aimed at creating interoperability in what part of the whole tracing infrastructure?
[0:39:00]

Mr. Sigelman: With OpenTracing, the goal is to be intentionally narrow and to describe the transaction and specify almost nothing about how that description is used. It’s possible to use that description to run a Dapper-style tracing system like Zipkin and Jaeger. It’s also possible to build completely different applications on top of that. Solo.io, which is a startup that has done some really great open source work, recently announced a distributed debugger that uses OpenTracing as the underlying firmament.

So you can create breakpoints and print statements in a distributed application and debug it in production using OpenTracing instrumentation as the lingua franca for describing stop points and things like that. There is another startup I’m aware of that’s going to be building a technology that allows you to write integration tests against a distributed system using OpenTracing instrumentation, again as the lingua franca.
[0:40:00]
So, you can say something like, using my example from earlier, service A calls B, calls C, calls D. You can write a test for service A that says if I call service D, I want to assert this invariant. That gets passed through the context mechanism in OpenTracing and the assertion is validated in service D and your test will fail if it’s not valid.
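A rough sketch of that idea, with the four services modeled as plain functions and the propagated context as a dictionary. Everything here is assumed for illustration: the `CHECKS` registry, the `"assert"` context key, and the service functions are hypothetical, and a real system would carry the context across process boundaries in request headers.

```python
# Named invariants that the downstream service knows how to evaluate (hypothetical).
CHECKS = {
    "amount_is_positive": lambda payload: payload["amount"] > 0,
}


def service_d(context, payload):
    # The assertion requested by the test arrives via the propagated context.
    name = context.get("assert")
    if name is not None and not CHECKS[name](payload):
        raise AssertionError(f"invariant {name!r} violated in service D")
    return "stored"


def service_c(context, payload):
    return service_d(context, payload)  # forwards the context unchanged


def service_b(context, payload):
    return service_c(context, payload)


def service_a(payload, context=None):
    # The entry point copies the caller's context and passes it down the chain.
    return service_b(dict(context or {}), payload)
```

A test against service A would pass `{"assert": "amount_is_positive"}` as context and fail if service D ever receives a payload that breaks the invariant.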

That’s a really powerful thing to be able to do that and at the moment most people write integration tests that do some horrible hack to get the data out of the process somehow, centralize it, and then do string comparison.

It’s very difficult to scale that type of operational thing, so a lot of people just don’t write their integration tests anymore. And things like OpenTracing can help you write that sort of application. And again, I’m trying to emphasize that these have nothing to do with latency analysis.

The common thread is that you need to understand how the transaction propagates. That is the only thing that OpenTracing does. It’s a huge problem. It’s more than enough for one project and I think it’s really important that the scope is super, super narrow for that reason.

Mr. Blumen: Continuing more into the back end. The data we’ve been talking about gets collected at each node and gets forwarded to some kind of backend. What does that backend look like?
[0:41:00]

Mr. Sigelman: Generally, it varies. If you’re talking about someone who is running a “conventional” distributed tracing system, I think that the backend usually has a couple of pieces. There is a piece that’s trying to absorb the segments of these traces and assemble them into their actual distributed transactions, and then you store that somewhere.

It’s typically in some kind of key-value store. It doesn’t make a whole lot of difference. And then there’s some other piece of it that at a high level is indexing those, and that can mean a lot of different things. I think in the case of ____ it’s literally a parametric search where you say I want this service between these times and above this latency, and then you hit search and it comes back with a list.

In other cases it’s something a lot more elaborate and you try to derive time series data or histograms from this stuff, it really varies quite widely, but the sky’s the limit analytically I think.
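The two backend pieces described above, assembling segments into traces and answering a parametric search, can be sketched in a few lines. The span fields (`trace_id`, `service`, `start`, `duration`) and both function names are assumptions for illustration; a real backend would keep this in a key-value store with a separate index rather than in memory.

```python
from collections import defaultdict


def assemble(spans):
    """Group span segments by trace ID into whole distributed transactions."""
    traces = defaultdict(list)
    for span in spans:
        traces[span["trace_id"]].append(span)
    return dict(traces)


def search(traces, service, start, end, min_latency):
    """Parametric search: traces with a span from `service` that started in
    [start, end) and took at least `min_latency` seconds."""
    hits = []
    for trace_id, spans in traces.items():
        for s in spans:
            if (s["service"] == service
                    and start <= s["start"] < end
                    and s["duration"] >= min_latency):
                hits.append(trace_id)
                break  # one matching span is enough to include the trace
    return sorted(hits)
```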
[0:42:00]

Mr. Blumen: The backend then to restate, it’s some kind of database. It is capturing the data and indexing it to support queries that implement the different use cases that we’ve been discussing.

Mr. Sigelman: Yes, and I would caution the world against talking about tracing as a product. Tracing is a technology. If it’s not properly integrated with other technologies you will not have the use cases you need. So I think there is a mistake right now where we’re often talking about tracing as if it’s a product. It’s a technology, period.

And if it’s not properly integrated with metrics and other things like that you will be unsatisfied from a workflow standpoint, and I honestly think many people are unsatisfied with tracing because they deploy something that’s only capable of showing a trace and that’s not a workflow.

So it’s important that we as consumers of this technology, that we’re thinking about the workflow we’re trying to satisfy and choosing our tools and technologies appropriately to satisfy those workflows and not to assume, “Well if I deploy tracing then I’ll be able to do root cause analysis.” It doesn’t work like that.
[0:43:00]

Mr. Blumen: In cases you’ve seen, what are some of the more popular databases that people are using for the backend? Would this be Elasticsearch or Cassandra, or what are some examples?

Mr. Sigelman: I think for open source tracing systems, typically that’s exactly right. I think that by default Zipkin and Jaeger both store the trace data in Cassandra, and I could be wrong, but I think both support Elasticsearch as an indexing system to surface individual traces upon making some kind of parametric query.

That is in my mind an implementation detail of those products if we want to use that terminology. And I don’t think it should be something that people are necessarily particularly concerned with as consumers of that except obviously if you want to save some operational effort and not spin up an additional subservice.
[0:44:00]

Mr. Blumen: Based on our discussion so far I’m trying to think of what are some queries I might like to run against the backend. I could imagine if I have a given request I could say, “Given this request ID let me see all the services that it touched or let me see all the requests in this period of time when the system was badly behaving that hit a particular service. Let me see the worst latency performing requests and where did they spend most of their time.” Am I going in the right direction with these ideas?

Mr. Sigelman: I think you are, although I don’t mean this to sound like a product pitch for LightStep, but there are a lot of things that you can do with LightStep that you can’t do with a conventional tracing system, and they’re designed to answer questions that I think are of higher-level interest. So an example might be something like this: My company has a million accounts.
[0:45:00]
A hundred of them generate 75% of my revenue. I want to know within seconds of a single one of those 100 accounts moving outside of its SLA, I want to see the history of performance for that account for the operations they care the most about, I want to see examples of transactions that violate the SLA from the last ten seconds, and I want you to do that all the time.

And then you would tell me when I’m violating my SLA. A tracing system can do that too. I think that’s a much higher-level business need, and a better application of that technology than forcing a developer to cough up a request ID or something like that. I mean, they may have it, in which case that’s great, but it’s probably not something to be counted on.
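As a sketch of that kind of query, the function below scans recent spans for the high-value accounts and returns example trace IDs that exceeded a latency threshold in the last ten seconds. The span fields (`account`, `trace_id`, `end`, `duration`) and the function name are hypothetical, and this says nothing about how LightStep implements it.

```python
def sla_violations(spans, key_accounts, sla_seconds, now, lookback=10):
    """Map each key account to the trace IDs that exceeded the latency
    threshold within the last `lookback` seconds."""
    violations = {}
    for s in spans:
        recent = now - lookback <= s["end"] <= now
        if s["account"] in key_accounts and recent and s["duration"] > sla_seconds:
            violations.setdefault(s["account"], []).append(s["trace_id"])
    return violations
```

Run continuously, a non-empty result is both the alert and the set of example transactions to look at.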

You haven’t asked me about sampling, but sampling is also really important. Most tracing systems have to do some kind of sampling. The amount of data you start with is so vast that you can’t store and centralize all of it. And I don’t think there is an exception to that rule. The way that sampling is performed is incredibly important here if you have a specific request ID that you’re interested in. In Dapper we sampled one out of 10,000 requests.
[0:46:00]
It was done randomly. So almost for sure that one request ID is nowhere to be found. So if sampling isn’t done with some kind of intelligence and with the benefit of retroactive analysis, it’s unlikely that you’re going to have the specific transaction you’re looking for. So sampling also comes into play here in a major way.
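The arithmetic behind that point is short. Assuming each request is sampled independently at the one-in-10,000 rate quoted above, the function below (a hypothetical helper) gives the chance that at least one of `n` interesting requests was kept.

```python
SAMPLE_RATE = 1 / 10_000  # the Dapper rate quoted in the interview


def chance_any_kept(n, rate=SAMPLE_RATE):
    """Probability that at least one of n independently sampled requests is kept."""
    return 1 - (1 - rate) ** n
```

For a single request ID the chance is the sample rate itself, 0.01%. Even 100 occurrences of the same problem leave roughly a 1% chance that any one of them was captured, which is why uniform random sampling rarely contains the specific transaction someone asks about.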

Mr. Blumen: Yeah, I could see that would be a way to control the vast amount of data you would have to collect if you tried to collect everything. Going back now to these use cases. The ones I came up with were all individual user, “I have a problem. I’m trying to get information about it.”

I think what you’re talking about is that you can have a machine being the user of this data, and it could be extracting metrics, and those could feed into a monitoring and alerting system where you wouldn’t be able to get this information out of conventional monitoring, which generally deals with a single instance or single request.
[0:47:00]
So it’s a way of getting more useful or actionable data out of your system by processing it through tracing. The examples that I thought of were an individual user: “I have a problem. I’m trying to get some information out of it.” It sounds like where you’re seeing some great use cases is that you have machine users, other systems which pull information out of tracing that then goes into monitoring or alerting. Am I going in the right direction with that?

Mr. Sigelman: Absolutely. And again, that’s exactly what I’m saying about workflows. I think the workflows are that you want to know early about leading indicators of unhealthy behavior in your system. And that’s typically the domain of monitoring. There is literally no reason that we shouldn’t be building monitoring signals off of this raw data from tracing systems. That is absolutely the right way to do it. And this is what I’m getting at with tracing is a technology. It’s not a product.
[0:48:00]
The tracing technology must be integrated into those workflows. So, if you get woken up at three in the morning, imagine the difference between a page that just says, “Latency is high. Good luck. Open your laptop,” and a page that says, “Latency is high. Here are five or 15 or 50 examples of transactions from the last minute that exemplify this violation of your SLA and that will show you how you bottomed out on a specific that’s overloaded.”

It’s just night and day. Absolutely night and day, and I mean from a customer standpoint, I know for sure that this improves root cause analysis times by 90-something percent. It’s a significant change in behavior.

Mr. Blumen: You mentioned earlier, one of the potential families of use cases is higher level insights and analytics. Give an example of how you could derive some useful insights from analytics based on this data that you’ve collected. You mentioned sampling a few times, I feel I should ask you to give an overview of the relationship of sampling to tracing.
[0:49:00]

Mr. Sigelman: Yeah, so the trouble with tracing is that the amount of data we’re collecting is absolutely overwhelming at a theoretical level. So somewhere there needs to be some sampling before you get to a centralized repository that’s indexed and analyzed and kept forever. The traditional approach was to just do random sampling. So at Dapper we sampled one out of 10,000 requests.

Some APM vendors have done sampling where they’ll wait for a bad request and then sample the next one in hopes that it has the same issue. In LightStep we’ve decided to collect all the data for some period of time and then do retroactive sampling based on features that we discover. There are different approaches but sampling ends up allowing or disallowing you to do different analytical things down the line and is one of the most important design decisions in a tracing system.
[0:50:00]

Mr. Blumen: I want to come back to earlier you said one of the use cases is gaining higher-level insights from analytics. Can you give some examples of that?

Mr. Sigelman: Yeah so teeing off of the sampling discussion, I think a lot of it depends on what sort of sampling strategy you took. If you went to some kind of random or pseudo random sampling, it’s often difficult to do much more than look at individual transactions. So hopefully what you can do is find ones that are interesting and bubble up things like critical path analysis which we referred to earlier and things like that.

You can also infer a system diagram pretty easily from tracing data in aggregate. It may be missing a transaction here or there, but the general picture will be accurate. The thing that’s interesting if you have an approach that allows you to see all the transactions, going back to this I think very common case of having a high latency event really boiling down to an overloaded system.

Tracing can do some really profound things in that case. So let’s imagine that you depend on a SQL database and that SQL database is getting slow. It’s almost certainly because something is hammering the SQL database. If something suddenly changes in SQL it’s almost always because something’s hammering SQL.
[0:51:00]
So imagine a system that can automatically detect that the SQL database is slow, which is easy to observe because everything that depends on it is now slow as well, but it can also look at changes in the ingress. So you can use the distributed tracing information to find the pattern and what changed in the data going into SQL.

What changed in the query pattern, not just locally but where did that come from? And you can even pinpoint a specific software release way up the stack where something started a cascade of events that resulted in overwhelming your SQL database. That can be truly automated if you have both downward and upward information about tracing and can be flexible about sampling, to go back and do that analysis with the benefit of full fidelity.
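One simple form of that ingress comparison is to count who called the slow service before and after the slowdown and report the caller whose traffic grew the most. The span fields (`service`, `caller`) and the convention of tagging the caller with its release, as in `checkout@v42`, are assumptions for this sketch.

```python
from collections import Counter


def caller_counts(spans, target):
    """Count calls into `target`, keyed by the calling service and release."""
    return Counter(s["caller"] for s in spans if s["service"] == target)


def biggest_increase(before, after, target):
    """Return (caller, growth) for the caller whose call volume into `target`
    grew the most between the two windows, or None if nothing called it."""
    old = caller_counts(before, target)
    new = caller_counts(after, target)
    growth = {caller: new[caller] - old[caller] for caller in new}
    if not growth:
        return None
    return max(growth.items(), key=lambda item: item[1])
```

Because the caller key includes the release, a new release that suddenly multiplies its queries shows up as the largest increase.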
[0:52:00]

Mr. Blumen: I’ve sort of worked on problems like that and one of the challenges is you have to be looking at the system when the problem is happening. What I think you’re saying is if you capture this data then you can work backwards from the problem upstream to the cause ex-post using the data that you collected.

Mr. Sigelman: Exactly. LightStep’s approach is to keep a circular buffer that is usually at least five minutes long. It can be much longer if customers want to provision for it, but you basically have several minutes to discover that something bad is happening. By “you” I mean the system not a human being.

So the system will observe that latency is not healthy and then will say, “Oh this is bad. I’m going to do a very deep analysis to understand what happened.” And then just automatically just collect a battalion of these traces and do aggregate analysis on them in reaction to those events. So it’s a combination of the two fundamental activities in monitoring; one is measuring symptoms and the other is explaining them.
[0:53:00]
If you can measure symptoms continuously and automatically detect an anomaly, which is not rocket science from a stats standpoint, then you can tie that with automatic root cause analysis. And tracing data is kind of a skeleton key for doing that in a distributed system.
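A toy version of the circular-buffer idea: keep every span for a fixed window, and only when a symptom is detected decide which traces to retain. The class, its method names, and the span fields are hypothetical, spans are assumed to arrive in order of end time, and this is not a description of LightStep's implementation.

```python
from collections import deque


class RecentSpans:
    """Holds all spans from the last `window_seconds`, dropping older ones."""

    def __init__(self, window_seconds=300):
        self.window = window_seconds
        self.spans = deque()

    def add(self, span):
        # Assumes spans arrive in non-decreasing order of their end time.
        self.spans.append(span)
        cutoff = span["end"] - self.window
        while self.spans and self.spans[0]["end"] < cutoff:
            self.spans.popleft()

    def retain_slow_traces(self, threshold):
        """Retroactive sampling: keep every buffered span that belongs to a
        trace containing at least one span slower than `threshold`."""
        slow_ids = {s["trace_id"] for s in self.spans if s["duration"] > threshold}
        return [s for s in self.spans if s["trace_id"] in slow_ids]
```

The sampling decision is made after the fact, so the traces that explain the anomaly are still available when the system goes looking for them.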

Mr. Blumen: I want to move on. This will be our last main topic: adoption. Suppose we built a system and did not build tracing in from the ground up, and now we want to adopt it. What are the steps in adoption?

Mr. Sigelman: So in an environment where you have developers that are actively contributing to these code bases, it’s not a lot of work to add tracing. I think it’s important that it’s done intelligently in that it’s not done as a company-wide crosscutting effort where everyone instruments their systems in the same day.

You need to start at the core of your business where there’s actually a lot of value to be had from performance information. And again, it can be managed. It’s not that bad. In a system where you have a bunch of legacy code where there is not a developer staffed to maintain the code base, that gets harder, and I think things like software agents actually have an important part to play in that story.
[0:54:00]
That’s a longer road for sure. And then in systems where it’s not just legacy code it’s just literally not yours, like you depend on a vendor or something like that to provide just a black box. That’s the kind of worst-case scenario from an instrumentation standpoint, and then we rely on standards efforts like the ones we referred to earlier to get those vendors to provide some level of transparency into what’s happening within their particular black box.

So it’s kind of a spectrum. The thing that’s maybe a gray area is open source components that are under active development but not by you. And that’s an area where I think standards projects like OpenTracing can have a huge amount of value in that they can factor that problem out for everyone’s benefit. The open source developers, the app developers at some large organization, and of course the vendors and tracing systems that want to consume that data all benefit from some kind of standard at that layer.
[0:55:00]

Mr. Blumen: If we’re adopting tracing in an existing system, is it more on the end of the spectrum where you get proportional benefit as you roll it out or more like you don’t really get any benefit until you’ve really done it nearly everywhere?

Mr. Sigelman: Thankfully it is not the latter, but the anti-pattern I see is that people are naturally conservative. And so if you want to integrate a new technology, you’ll start at the periphery of your system. And that’s a bad idea. If you start with something peripheral, nobody cares about what you find.

So if you’re instrumenting something that is of no business value, even if you discover something interesting from a latency standpoint, it’s not interesting from a business standpoint. So the most important thing is to be conservative about it, start small, but start at the core of your system. Start somewhere where it really matters. When LightStep was just getting kicked off as a company, our customers were naturally skeptical because we had nothing to lean on except for just our groveling and promises.
[0:56:00]
So they would integrate this into the periphery of their systems and it was difficult to deliver value there, but then when they moved us into the core, it was really only 5% of their system or something like that. It wasn’t a big instrumentation effort but it delivered a huge amount of value that was differentiated against what they had previously.

And so I think for most companies that are adopting tracing, whether it’s from a vendor or using Jaeger or whatever, the strategy that works best is to think about it really from a business standpoint: which transactions are most valuable to my business? And those are the areas that you should focus on.

So you should think about how to answer questions with these traces that will actually matter to somebody from a KPI standpoint, and then the effort goes much more smoothly. Even if there is a gap in the trace and you have an area where something on a critical path is clearly missing information, I guarantee you getting someone to add that data is going to be a lot easier if you can say, “Hey, this is really important to our top customers that we understand what happened within your service.” That’s going to be much more motivating than just asking nicely without that evidence.
[0:57:00]

Mr. Blumen: Within a software development organization, who are the advocates of tracing adoption? Is that usually developers, DevOps, ops? Who is really pushing for this?

Mr. Sigelman: DevOps is one of those words that means a lot of different things to different people, but if we’re talking about people who both write code and carry pagers, for instance (just morally speaking; I know that it’s probably just their phone or whatever), that audience definitely cares about this kind of stuff. If you’re getting woken up by a system and you know how to make changes to the code base, you definitely care about this kind of thing.

We also see a lot of enthusiasm for this type of technology from engineering management actually, which stands to reason in that Conway’s law dictates that your org chart is going to resemble your system architecture, and with microservices that’s just even more obvious. It’s an important managerial tool as well because, going back to the service A, B, C, D thing, if service D is having a really bad day, you don’t want to pin that on services A, B, and C even though they’re also having a bad day because they depend on service D.
[0:58:00]
So it’s important from a managerial standpoint to be able to understand within your organization where the latency problems are coming from. And tracing is actually a pretty valuable tool for that since the systems are getting so large that the developers don’t often understand these dependencies themselves.

Mr. Blumen: Then moving into the wrap up here, is there anything else you’d like to say about this topic that we haven’t covered?

Mr. Sigelman: I think you’ve actually asked really excellent questions, and I appreciate especially that you started with the high level of what tracing is, what distributed tracing is, and how to differentiate them. The one thing that I would maybe want to emphasize, and that I think is not well understood, concerns the movement towards “observability”, which is the modern way of talking about how you make sense of these systems we’re creating. There is a lot of talk about logging, metrics, and tracing as being the three pillars, and I think that’s baloney.
[0:59:00]
I really do. I like a lot of people who say it, but I think it’s total baloney. I think we need to talk about use cases and workflows. That’s it. There’s only two of them really. You measure symptoms and you explain them, but we need to think about the technologies we use in the frame of those workflows. It’s the most important point.

If I could make one point, it’s that point: if you deploy tracing without understanding in detail how those workflows are going to play out, you’ll probably be disappointed. I think it’s the single most important thing to think about when you’re talking about adopting this type of technology.

And then the other thing I would maybe add as a secondary note is that instrumentation seems daunting but it’s not that bad. I think if you rely on standards, OpenTracing has made things a lot better. If you rely on standards and you’re focused in the way you adopt this stuff and don’t try to do a crosscutting change across your entire company, it’s actually quite doable and delivers a lot of value really quickly.

Mr. Blumen: Ben, if people would like to see any content you produce or reach out to you where is the best place to look?
[1:00:00]

Mr. Sigelman: Well you can always reach out to me. I like email from strangers and my email address can probably be distributed with the show as far as I’m concerned, but it’s [email protected] or [email protected] which I got as a favor from a friend when I was still working at Google, but you’re welcome to send me an email.

In terms of talks and so on, I did a number of talks at KubeCon over the last couple of years that I think have covered the high level of tracing, and I’ve talked about some of the nuances that we’ve discussed in the show but with video and demos and so on and so forth. So those are probably things I’d recommend.

Mr. Blumen: You mentioned LightStep, your company, and mentioned OpenTracing. Where could people get more involved with OpenTracing?

Mr. Sigelman: OpenTracing is a really vibrant open source project that has contributors on the OpenTracing core APIs, on the vendor and open source tracing system side, and in the diaspora of software systems that support tracing. And contributions are welcome in any part of that. They should go to OpenTracing.io and just send someone a note. We use Gitter really heavily.
[1:01:00]
For LightStep you’re welcome to reach out to me if you want, but if you go to our website it’s easy to experience the product through that channel as well.

Mr. Blumen: Ben Sigelman, thank you very much for speaking to Software Engineering Radio.

Mr. Sigelman: Thank you so much for having me.

Mr. Blumen: This has been Robert Blumen for Software Engineering Radio and thank you for listening.

Mr. Sigelman: Thanks for listening to SE Radio, an educational program brought to you by IEEE Software Magazine. For more about the podcast including other episodes, visit our website at SE-Radio.net. To provide feedback you can comment on each episode on the website or reach us on LinkedIn, Facebook, Twitter or through our Slack channel at SERadio.slack.com. You can also email us at [email protected]. This and all other episodes of SE Radio are licensed under Creative Commons license 2.5. Thanks for listening.
[1:02:00]

[End of Audio]


1 comment

Praful says:

September 11, 2018 at 7:43 pm

I haven’t heard this one but wanted to leave some feedback for Robert. I absolutely love him as a host. I look forward to episodes where he is hosting. I love his style where he starts with the basics and goes into complex topics *gradually* while clarifying/rephrasing complex concepts along the way. Also his slow speaking style helps me to digest things especially given I am a non native English speaker. Love you Robert! 😊
