Putting Services First; A New Scorecard for Observability
Ben Sigelman, LightStep Co-Founder and CEO
Observing software used to be straightforward. So much so, that we didn’t even need a word for it. Of course today, where distributed systems are commonplace, we have observability: a long mouthful of a word to describe how we understand a system’s inner workings by its outputs. In the previous post in this series (Three Pillars with Zero Answers), we presented a critique of the conventional wisdom about “the three pillars of observability.” We argued that “metrics, logs, and traces” are really just three pipes that carry three partially overlapping streams of telemetry data. None of the three is an actual value proposition or solution to any particular problem, and as such, taken on their own, they do not constitute a coherent approach to observability. We now present part two: an alternative, value-driven approach to independently measuring and grading an observability practice — a new scorecard for observability.
Microservices Broke APM
It’s worth remembering why we got ourselves into this mess with microservices: we needed our per-service teams to build and release software with higher velocity and independence. In a sense, we were too successful in that effort. Service teams are now so thoroughly decoupled that they often have no idea how their own services depend upon, or affect, the others. And how could they? The previous generation of tools wasn’t built to understand or represent these deeply layered architectures.
We transitioned to microservices to reduce the coupling between service teams, but when things break — and they often do — we can’t expect to find a solution by guessing-and-checking across the entire system. Yes, a service team’s narrow scope of understanding enables faster, more independent development, but during a slowdown, it can be quite problematic.
In the example above, Service E is having an issue, and Services A, B, C, and D all depend on E to function properly.
If you depend on Service E but only have tools to observe Service B and its neighbors, then you have no way to discover what went wrong. It’s virtually impossible to re-instrument on the fly: you can’t redeploy Service C or Service D to understand what they are doing. (Organizations designed to “ship their org chart” are at an even greater disadvantage, as reduced communication across teams leads to a reduced understanding of services for which a given team is not directly responsible.) Conventional APM is unable to provide an answer to the problem in the example above. Yet, if this were a monolith and not a microservices architecture, all it would take is a simple stack trace from Service E, and it would be fairly obvious how you depended on it.
Tracing Is More Than Traces
Distributed traces are the only way to understand the relationships between distant actors in a multilayered microservices architecture, as only distributed traces tell stories that cross microservices boundaries. A single trace shows the activity for a transaction within the entire application, all the way from the browser or mobile device down through to the database and back. Individual distributed traces are fascinating and incredibly rich, recursive data structures. That said, they are absolutely enormous, especially when compared with individual structured logging statements or metric increments.
[Figure: a single distributed trace]
The sheer size and complexity of distributed traces lead to two problems:
1. Our brains are not powerful enough to effectively process them without help from machines, real statistics, or ML.
2. When we consider the firehose of all of the individual distributed traces, we are unable to justify the ROI of centralizing and storing them. Hence the proliferation of sampling strategies (related: why your ELK and/or Splunk bills are so unreasonable after a move to microservices). It’s simply too costly to store all of this data for the long term without some form of (hopefully intelligent) summarization.
That’s where distributed tracing comes into play. It’s the science and the art of making distributed traces valuable. By aggregating, analyzing, correlating, and visualizing traces, we can understand patterns that no single trace reveals: service dependencies; areas of high latency, error rate, and throughput; and even the critical path.
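As a rough illustration of what "aggregating traces" can mean, the sketch below derives service dependency edges and a simplified critical path from a handful of spans. The `Span` type, its field names, and the "follow the child that finishes last" heuristic are all assumptions made for this example; they are not taken from any particular tracing product.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Span:
    """Hypothetical, minimal span record (not a real tracing API)."""
    span_id: str
    parent_id: Optional[str]
    service: str
    start_ms: int
    end_ms: int


def dependency_edges(spans: List[Span]) -> Counter:
    """Count caller -> callee edges between different services."""
    by_id = {s.span_id: s for s in spans}
    edges: Counter = Counter()
    for s in spans:
        parent = by_id.get(s.parent_id) if s.parent_id else None
        if parent is not None and parent.service != s.service:
            edges[(parent.service, s.service)] += 1
    return edges


def critical_path(spans: List[Span]) -> List[str]:
    """Simplified critical path for ONE trace: from the root span,
    repeatedly follow the child span that finishes last."""
    children: Dict[str, List[Span]] = {}
    root: Optional[Span] = None
    for s in spans:
        if s.parent_id is None:
            root = s
        else:
            children.setdefault(s.parent_id, []).append(s)
    path: List[str] = []
    node = root
    while node is not None:
        path.append(node.service)
        kids = children.get(node.span_id, [])
        node = max(kids, key=lambda k: k.end_ms) if kids else None
    return path


# One illustrative trace: A calls B and C; C calls E.
TRACE = [
    Span("1", None, "A", 0, 100),
    Span("2", "1", "B", 5, 40),
    Span("3", "1", "C", 10, 95),
    Span("4", "3", "E", 20, 90),
]

EDGES = dependency_edges(TRACE)
PATH = critical_path(TRACE)
print(dict(EDGES))
print(" -> ".join(PATH))
```

Run over many traces rather than one, the same edge counting yields a dependency map. Without it, a team that owns only Service A has no direct way to see that its latency runs through C and E.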
A New Scorecard for Observability
Before we can improve the performance and reliability for any given service, and take advantage of the insights offered by distributed tracing, we must first define performance and reliability for that service. As such, the single most important concept in a service-centric observability practice is the Service Level Indicator, or SLI.
Service Level Indicator (SLI): a measurement of a service’s health that the service’s consumers or customers would care about.
Most services have only a small number of SLIs that really matter and are worth measuring. These usually take the form of latency, throughput, or error rate.
For example, an SLI could be the length of time it takes a message to get in and out of a Kafka queue for a particular topic. If that latency crosses a key threshold, then the services that depend on the queue would be significantly impacted. By contrast, average CPU usage across the microservice instances would not be an appropriate SLI, as it’s an implementation detail; nor would the health of any particular downstream dependency (for the same reason).
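To make the queue example concrete, here is a minimal sketch of such an SLI computed from plain timestamp pairs. It does not use any Kafka client API; the function names, the millisecond timestamps, and the 100 ms threshold are assumptions chosen for illustration.

```python
from typing import Iterable, List, Tuple


def queue_latencies_ms(events: Iterable[Tuple[int, int]]) -> List[int]:
    """Convert (enqueued_at_ms, dequeued_at_ms) pairs into latencies in ms."""
    return [dequeued - enqueued for enqueued, dequeued in events]


def fraction_within(latencies_ms: List[int], threshold_ms: int) -> float:
    """Share of messages that left the queue within the threshold."""
    if not latencies_ms:
        return 1.0  # no traffic: treat the SLI as met (a judgment call)
    ok = sum(1 for v in latencies_ms if v <= threshold_ms)
    return ok / len(latencies_ms)


# Hypothetical measurements for one topic: (enqueue, dequeue) in ms.
EVENTS = [(1000, 1040), (2000, 2030), (3000, 3250), (4000, 4045)]
THRESHOLD_MS = 100  # hypothetical latency threshold consumers care about

LATENCIES = queue_latencies_ms(EVENTS)
SLI = fraction_within(LATENCIES, THRESHOLD_MS)
print(LATENCIES, SLI)
```

The measurement is expressed entirely in terms a consumer of the queue would notice (how long messages wait), with no reference to CPU usage or other implementation details.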
Observability: Two Fundamental Goals
Service-centric observability is structured around two fundamental goals:
1. Gradually improving an SLI (i.e., “optimization”)
2. Rapidly restoring an SLI (i.e., “firefighting”)
For a mature system, improving the baseline for an SLI often involves heavy lifting: adding new caches, batching requests, splitting services, merging services, and so on. The list of possible techniques is very long. “Gradually improving an SLI” can take days, weeks, or months. It requires focus over a period of time. (If you are working on an optimization but don’t know which SLI it is supposed to improve, you can be fairly confident that you are working on the wrong thing.)
By contrast, “rapidly restoring an SLI” is invariably a high-stakes, high-stress scramble where seconds count. Most of the time our goal is to figure out what changed (often far away in the system and the organization) and un-change it ASAP. If we’re unlucky, it’s not that simple. For instance, organic traffic may have taken a queue past its breaking point, leading to pushback and the associated catastrophes up and down the microservice stack. Regardless, time is of the essence, and we are in a bad place if we suddenly realize that we need to recompile and deploy new code as part of the restoration process.
Observability: Two Fundamental Activities
In pursuit of our two fundamental goals, practicing observability consists of two fundamental activities:
1. Detection: measuring SLIs precisely
2. Refinement: reducing the search space for plausible explanations
So, how do we model and assess these two activities? We prefer a rubric that presupposes nothing about our “observability implementation.” This reduces our risk of over-fitting to our current tech stack, especially during a larger replatforming effort to move to microservices or serverless. We also need to measure outcomes and benefits, not features — and certainly not the bits, bytes, and UI conventions of “the three pillars” per se.
Modeling and Assessing SLI Detection
Great SLI detection boils down to our ability to capture arbitrarily specific signals with high fidelity, all in real time. As such, we can assess SLI detection by its level of specificity, fidelity, and freshness.
Specificity, at its core, is a function of stack coverage and cardinality.
Stack coverage is an assessment of how far up and down the stack you can make measurements: Can you measure mobile and web clients in the same way you measure microservices? (If your goal is, let’s say, lower end-user latency, then this would be nearly mandatory.) Can you look below app code and into open source dependencies and managed services to understand how failures at that level are propagating up into the application layers? Can you understand off-the-shelf OSS infrastructure like Kafka or Cassandra? In effect, stack coverage is your ability to observe any layer of your system and to understand the connections between them.
Cardinality refers to the number of distinct values for a particular metric tag. It’s an expression of the granularity with which you can view your data. Since there is often a literal dollar cost associated with cardinality (a single tag can have hundreds of millions of values), it’s important to understand your cardinality needs when structuring your metrics strategy. How fine-grained should the criteria be for reviewing the performance of a host, user, geography, release version, specific customer, etc.?
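A small sketch can show why cardinality carries a cost: the number of distinct time series a metric can produce is bounded by the product of its per-tag cardinalities. The tag names and sample points below are hypothetical, and real metrics backends differ in how they count and bill series.

```python
from math import prod
from typing import Dict, Iterable, List, Mapping, Set


def tag_cardinalities(points: Iterable[Mapping[str, str]]) -> Dict[str, int]:
    """Number of distinct values observed for each tag key."""
    seen: Dict[str, Set[str]] = {}
    for tags in points:
        for key, value in tags.items():
            seen.setdefault(key, set()).add(value)
    return {key: len(values) for key, values in seen.items()}


def max_series(cardinalities: Mapping[str, int]) -> int:
    """Upper bound on distinct series: product of per-tag cardinalities."""
    return prod(cardinalities.values())


def observed_series(points: Iterable[Mapping[str, str]]) -> int:
    """Distinct tag combinations that actually occurred."""
    return len({tuple(sorted(tags.items())) for tags in points})


# Hypothetical tag sets attached to one metric.
POINTS: List[Dict[str, str]] = [
    {"host": "h1", "region": "us", "release": "v1"},
    {"host": "h2", "region": "us", "release": "v1"},
    {"host": "h1", "region": "eu", "release": "v1"},
    {"host": "h1", "region": "us", "release": "v1"},  # repeat of the first
]

CARDS = tag_cardinalities(POINTS)
UPPER_BOUND = max_series(CARDS)
OBSERVED = observed_series(POINTS)
print(CARDS, UPPER_BOUND, OBSERVED)
```

Adding one more tag with many values (a customer ID, say) multiplies the upper bound rather than adding to it, which is why cardinality needs are worth settling before choosing tags.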
Fidelity represents access to accurate, high-frequency statistics. Accurate statistics may seem like a given, but unfortunately, that is often not the case. Many solutions, even quite expensive ones, measure something as fundamental as p99 incorrectly. For example, it’s commonplace for p99 to be averaged across different hosts or shards of a monitoring service, rendering the data effectively worthless. (If you haven’t been storing histograms or meaningful summaries, there is no way to compute the p99 globally.)
[Figure: p99 latency at 1s, 5s, 10s, and 25s granularity]
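The problem with averaging percentiles can be seen with a small, made-up dataset. The sketch below compares the average of two per-host p99 values with the p99 of the combined data, and then shows that merging per-host histograms recovers the global figure. The bucket bounds, host data, and nearest-rank percentile definition are assumptions for this example.

```python
from bisect import bisect_left
from collections import Counter
from typing import List, Sequence

# Hypothetical bucket upper bounds in ms; values are assumed to be <= 1000.
BOUNDS_MS = [10, 50, 100, 500, 1000]


def rank(p: int, n: int) -> int:
    """Nearest-rank position ceil(p * n / 100), using integer math."""
    return -(-p * n // 100)


def percentile(values: Sequence[int], p: int) -> int:
    """Exact nearest-rank percentile of raw values."""
    ordered = sorted(values)
    return ordered[rank(p, len(ordered)) - 1]


def to_histogram(values: Sequence[int]) -> Counter:
    """Count values per bucket, keyed by the bucket's upper bound."""
    hist: Counter = Counter()
    for v in values:
        hist[BOUNDS_MS[bisect_left(BOUNDS_MS, v)]] += 1
    return hist


def percentile_from_histogram(hist: Counter, p: int) -> int:
    """Upper bound of the bucket that contains the p-th percentile."""
    target = rank(p, sum(hist.values()))
    seen = 0
    for bound in sorted(hist):
        seen += hist[bound]
        if seen >= target:
            return bound
    raise ValueError("empty histogram")


# Host A serves most traffic and is fast; host B is small and partly slow.
HOST_A: List[int] = [10] * 1000
HOST_B: List[int] = [10] * 90 + [1000] * 10

AVERAGED_P99 = (percentile(HOST_A, 99) + percentile(HOST_B, 99)) / 2
GLOBAL_P99 = percentile(HOST_A + HOST_B, 99)
MERGED_P99 = percentile_from_histogram(
    to_histogram(HOST_A) + to_histogram(HOST_B), 99
)
print(AVERAGED_P99, GLOBAL_P99, MERGED_P99)
```

With this data the averaged value (505 ms) describes neither host nor the system as a whole, while the merged histograms agree with the exact global p99 (10 ms). Histograms can be combined across hosts because bucket counts add; percentiles cannot.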
But fidelity isn’t simply an assessment of a calculation’s accuracy. For detection to work well, you need to be able to detect the difference between intermittent and steady-state failures, and that’s only made possible through high-frequency data. The images above are from the exact same p99 data from an internal system. The only difference is the smoothing interval that we used to compute these percentiles. As the smoothing interval lengthens, your ability to detect outliers diminishes. Conversely, as the smoothing interval shortens, you’re able to detect patterns of failure you wouldn’t otherwise have seen.
For example, at 10-second granularity, it’s difficult to tell whether failure is steady-state or intermittent, but at 1-second granularity, it becomes clear that this is indeed intermittent.
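The effect of the smoothing interval can be reproduced with synthetic data. In the sketch below, one second out of every ten is slow; the traffic pattern, latency values, and function names are invented for illustration and are not the internal data behind the figures.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def p99(values: Sequence[int]) -> int:
    """Nearest-rank 99th percentile using integer math."""
    ordered = sorted(values)
    position = -(-99 * len(ordered) // 100)  # ceil(0.99 * n)
    return ordered[position - 1]


def windowed_p99(samples: Sequence[Tuple[int, int]], window_s: int) -> List[int]:
    """p99 per window for (timestamp_s, latency_ms) samples."""
    buckets: Dict[int, List[int]] = defaultdict(list)
    for ts, latency in samples:
        buckets[ts // window_s].append(latency)
    return [p99(buckets[key]) for key in sorted(buckets)]


# Synthetic minute of traffic: 100 requests per second, and every tenth
# second all requests take 1000 ms instead of 10 ms (intermittent failure).
SAMPLES = [
    (t, 1000 if t % 10 == 0 else 10)
    for t in range(60)
    for _ in range(100)
]

ONE_SECOND = windowed_p99(SAMPLES, 1)
TEN_SECOND = windowed_p99(SAMPLES, 10)
print(ONE_SECOND[:12])
print(TEN_SECOND)
```

At 1-second granularity the series shows six isolated spikes among otherwise healthy seconds. At 10-second granularity every window reports 1000 ms, which is indistinguishable from a steady-state failure.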
Freshness is an expression of how long you have to wait to access your SLIs. An SLI is only useful during an emergency if it can be accessed immediately. This is especially true for our “SLI restoration” goal (and firefighting use cases), though we should never have to wait more than a few seconds to see if a change made a difference. The less “fresh” our data, the less relevant and helpful it becomes, regardless of its accuracy.
In Our Next Installment
Now that we have a framework for measuring SLIs precisely, we can better understand the severity of an issue, but that doesn’t necessarily help us understand where it may be. In our next and final post, we’ll cover SLI refinement: the process of reducing the search space for a plausible explanation to resolve an issue.
Root cause analysis can be particularly difficult to automate in a microservices architecture, simply because there are so many possible root causes. But in a refined search space, it becomes much easier to identify the root cause — and to automate systems to continuously and systematically reduce MTTR.
About LightStep
LightStep’s mission is to deliver confidence at scale for those who develop, operate and rely on today’s powerful software applications. Its products leverage distributed tracing technology (initially developed by a LightStep co-founder at Google) to offer best-of-breed observability to organizations adopting microservices or serverless at scale. LightStep is backed by Redpoint, Sequoia, Altimeter Capital, Cowboy Ventures and Harrison Metal and is headquartered in San Francisco, CA. For more information, visit https://lightstep.com or follow @LightStepHQ.
Try It Now
Start a free trial of LightStep Tracing today.
© 2019 LightStep, Inc. LightStep is a registered trademark and the LightStep logo is a trademark of LightStep, Inc. All other product names, logos, and brands are property of their respective owners.
LS-2019-04