Distributed Tracing: We Did It Wrong

Translator's note: The author of this article is Cindy Sridharan, an engineer at imgix who works on API development and, in particular, on testing microservices. In it, she shares her detailed view of the real problems in distributed tracing, a field that, in her opinion, lacks truly effective tools for solving pressing problems.

[Figure: illustration adapted from another article about distributed tracing]

Distributed tracing is widely considered difficult to implement, and the return on it dubious at best. The "problem" of tracing is attributed to many causes, most often the effort of configuring every component of the system to propagate the appropriate headers with each request. While this problem does exist, it is by no means insurmountable. It also does not explain why developers dislike tracing so much even once it is up and running.

The main difficulty with distributed tracing is not collecting the data, nor standardizing formats for propagating and presenting the results, nor deciding when, where, and how to sample. I am not trying to portray these problems as trivial; in fact, there are quite significant technical and (if we aim for genuinely open source standards and protocols) political challenges that must be overcome before they can be considered solved.

However, even assuming all these issues are resolved, there is a good chance that nothing will change significantly in terms of the end-user experience. Tracing may still be of little practical use in the most common debugging scenarios, even after it has been deployed.

The many faces of tracing

Distributed tracing includes several disparate components:

  • instrumenting applications and middleware;
  • distributed context propagation;
  • trace collection;
  • trace storage;
  • trace retrieval and visualization.

Much of the discussion of distributed tracing treats it as a single, monolithic operation whose sole purpose is to enable whole-system diagnostics. This is largely a product of how distributed tracing evolved historically. In the blog post published when Zipkin was open-sourced, it was framed as making Twitter faster. The first commercial tracing offerings were likewise marketed as APM tools.

Translator's note: To make the rest of the text easier to follow, here are two basic terms as defined in the documentation of the OpenTracing project (a short instrumentation sketch follows the definitions):

  • A Span is the basic unit of distributed tracing: a description of some unit of work (for example, a database query) with a name, start and end timestamps, tags, logs, and a context.
  • Spans usually contain references to other spans, which allows many spans to be assembled into a single Trace: a representation of the life of a request as it travels through a distributed system.
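
To make the terms concrete, here is a minimal sketch using the OpenTracing Python API. The operation names, tags, and query are illustrative assumptions; in a real setup a concrete tracer (for example, a Jaeger client) would be registered as the global tracer.

```python
import opentracing

# global_tracer() returns a no-op tracer unless a concrete implementation
# (e.g. a Jaeger client) has been registered, so this sketch is safe to run.
tracer = opentracing.global_tracer()

# A span describes one unit of work with a name, timestamps, tags and logs.
with tracer.start_active_span("fetch_front_page") as scope:
    scope.span.set_tag("http.method", "GET")

    # A span started while another span is active becomes its child; these
    # parent/child references are what stitch spans together into a trace.
    with tracer.start_active_span("mysql.query") as child:
        child.span.set_tag("db.statement", "SELECT * FROM articles LIMIT 10")
        child.span.log_kv({"event": "query_started"})
```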

Traces contain incredibly valuable data that can help with tasks such as testing in production, running disaster recovery drills, fault-injection testing, and more. In fact, some companies already use tracing for exactly these purposes. To begin with, universal context propagation has uses beyond simply shipping spans to storage (a minimal propagation sketch follows the list below):

  • Uber, for example, uses tracing data to distinguish test traffic from production traffic.
  • Facebook uses trace data for critical path analysis and for shifting traffic during its regular disaster recovery drills.
  • The same company also uses Jupyter notebooks that let developers run arbitrary queries over trace data.
  • Practitioners of LDFI (Lineage Driven Fault Injection) use distributed traces for fault-injection testing.
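
As an illustration of context propagation beyond span shipping, here is a minimal sketch in the OpenTracing Python API: a baggage item marking a request as test traffic is injected into HTTP headers on the client and read back on the server. The "traffic-type" key and the routing decision are assumptions for illustration, not a standard; a concrete tracer would need to be registered for the baggage to actually propagate.

```python
import opentracing
from opentracing.propagation import Format

tracer = opentracing.global_tracer()

# Client side: mark the request as test traffic and inject the span context
# into the outgoing HTTP headers.
with tracer.start_active_span("checkout") as scope:
    scope.span.set_baggage_item("traffic-type", "test")
    headers = {}
    tracer.inject(scope.span.context, Format.HTTP_HEADERS, headers)
    # `headers` would now be sent along with the outgoing request.

# Server side: extract the context and read the baggage to decide, for
# example, whether to route the request to a shadow datastore.
incoming_ctx = tracer.extract(Format.HTTP_HEADERS, headers)
with tracer.start_active_span("process_checkout", child_of=incoming_ctx) as scope:
    if scope.span.get_baggage_item("traffic-type") == "test":
        pass  # e.g. write to a test datastore instead of production
```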

None of the use cases listed above relate directly to the debugging scenario, in which an engineer tries to solve a problem by looking at a trace.

When it does come to the debugging scenario, the primary interface remains the traceview diagram (some also call it a "Gantt chart" or "waterfall chart"). By traceview I mean all the spans and associated metadata that together make up a trace. Every open source tracing system, as well as every commercial tracing solution, offers a traceview-based user interface for visualizing, inspecting, and filtering traces.

The problem with every tracing system I have seen so far is that the resulting visualization (the traceview) almost exactly mirrors how the trace was generated. Even when alternative visualizations are offered (heatmaps, service topologies, latency histograms), in the end they still boil down to the traceview.

I have complained in the past that most "innovations" in tracing UI/UX seem limited to including additional metadata in the trace, embedding high-cardinality information in it, or allowing drill-down into specific spans and inter- and intra-trace queries. Throughout all of this, the traceview remains the primary means of visualization. As long as this state of affairs persists, distributed tracing will (at best) take fourth place as a debugging tool, behind metrics, logs, and stack traces, and at worst turn out to be a waste of money and time.

The problem with the traceview

The purpose of the traceview is to provide a complete picture of a single request's journey through all the components of the distributed system it touches. Some more advanced tracing systems allow drilling down into individual spans and viewing the time breakdown within a single process (when spans are emitted at function boundaries).

A core premise of microservices architecture is that the organizational structure grows with the needs of the company. Microservices proponents argue that splitting different business concerns into separate services allows small, autonomous development teams to own the entire lifecycle of those services, with the ability to build, test, and deploy them independently. The drawback of this decomposition, however, is the loss of visibility into how each service interacts with the others. In such conditions, distributed tracing claims to be an indispensable tool for debugging complex interactions between services.

If you really do have a staggeringly complex distributed system, no single person can hold a complete picture of it in their head. In fact, building a tool on the assumption that this is even possible is something of an anti-pattern (an inefficient and counterproductive approach). Ideally, debugging calls for a tool that helps narrow the search space, so that engineers can focus on the subset of dimensions (services, users, hosts, and so on) relevant to the problem at hand. When determining the cause of a failure, engineers should not have to understand what happened across all services at once; such a requirement would run counter to the very idea of a microservice architecture.

Yet the traceview is exactly that. Yes, some tracing systems offer condensed traceviews when the number of spans in a trace is too large to display in a single visualization. But even such a stripped-down view contains so much information that engineers are still forced to "sift" through it, manually narrowing the selection down to the set of services that are the source of the problem. Alas, at this task machines are far faster than humans, less error-prone, and their results are more repeatable.

Another reason I think the traceview approach is wrong is that it is poorly suited to hypothesis-driven debugging. At its core, debugging is an iterative process that begins with a hypothesis, followed by checking various observations and facts obtained from the system along different dimensions, drawing conclusions and generalizations, and then assessing whether the hypothesis holds.

The ability to test hypotheses quickly and cheaply, and to refine the mental model accordingly, is the cornerstone of debugging. Any debugging tool should be interactive and should narrow the search space or, when the trail turns out to be false, let the user step back and focus on a different area of the system. The ideal tool does this proactively, immediately drawing the user's attention to potential problem areas.

Alas, the traceview is hardly an interactive tool. The best you can hope for when using it is to find some source of high latency and inspect whatever tags and logs are attached to it. It does not help an engineer spot patterns in traffic, such as peculiarities in the latency distribution, or detect correlations between different measurements. Aggregate analysis of traces can work around some of these problems. Indeed, there are examples of successful analyses that use machine learning to identify anomalous spans and to pinpoint the subset of tags that may be associated with the anomalous behavior. However, I have yet to see a compelling visualization of machine learning or data analysis findings applied to spans that differs significantly from a traceview or a DAG (directed acyclic graph).
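
As a rough illustration of the kind of aggregate analysis mentioned above, here is a minimal sketch that flags anomalous spans by a robust z-score on duration and then ranks the tags that are over-represented among the outliers. The span record shape is an assumption for illustration and is not tied to any particular tracing backend.

```python
from collections import Counter
from statistics import median

def anomalous_spans(spans, threshold=3.5):
    """Flag spans whose duration deviates strongly from the median.

    `spans` is assumed to be a list of dicts like
    {"service": "video", "duration_ms": 87.0, "tags": {"region": "us-east-1"}}.
    Uses a robust (MAD-based) z-score so a few huge outliers do not mask themselves.
    """
    durations = [s["duration_ms"] for s in spans]
    med = median(durations)
    mad = median(abs(d - med) for d in durations) or 1e-9
    return [s for s in spans if 0.6745 * abs(s["duration_ms"] - med) / mad > threshold]

def suspicious_tags(spans, outliers):
    """Rank tag key/value pairs by how concentrated they are among the outliers."""
    def tag_counts(group):
        counts = Counter()
        for s in group:
            counts.update(f"{k}={v}" for k, v in s.get("tags", {}).items())
        return counts

    all_counts, out_counts = tag_counts(spans), tag_counts(outliers)
    return sorted(
        ((tag, out_counts[tag] / max(all_counts[tag], 1)) for tag in out_counts),
        key=lambda item: item[1],
        reverse=True,
    )
```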

Spans are too low-level

The fundamental problem with the traceview is that spans are too low-level a primitive for both latency analysis and root cause analysis. It is like examining individual CPU instructions to diagnose an exception when a much higher-level tool, the backtrace, is far more convenient to work with.

Moreover, I will go so far as to say that, ideally, we do not need the full picture of everything that happened during the lifecycle of a request, which is what modern tracing tools present. What is needed instead is some form of higher-level abstraction containing knowledge of what went wrong (analogous to a backtrace), along with some context. Instead of looking at an entire trace, I would rather see just the part of it where something interesting or unusual is happening. Currently this search is done by hand: the engineer is handed a trace and has to comb through the spans looking for anything interesting. Having people stare at spans in individual traces in the hope of spotting suspicious activity does not scale at all, especially when they have to make sense of all the metadata encoded in the various spans (span ID, RPC method name, span duration, logs, tags, and so on).
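
As a rough illustration of what such a higher-level abstraction might compute, here is a minimal sketch that, instead of rendering the whole trace, extracts the chain of spans dominating the root span's latency. The span layout is assumed, and the "always follow the slowest child" rule is a crude stand-in for a proper critical-path computation.

```python
def slowest_path(spans):
    """Walk from the root span down, always following the slowest child.

    `spans` is assumed to be a list of dicts like
    {"id": "b7", "parent_id": "a1", "service": "video", "duration_ms": 412.0},
    with parent_id=None for the root span.
    Returns the chain of spans that dominates end-to-end latency.
    """
    children = {}
    for s in spans:
        children.setdefault(s.get("parent_id"), []).append(s)

    path = []
    current = children.get(None, [None])[0]  # the root span has no parent
    while current is not None:
        path.append(current)
        kids = children.get(current["id"], [])
        current = max(kids, key=lambda s: s["duration_ms"], default=None)
    return path
```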

Alternatives to the traceview

Tracing data is most useful when it can be visualized in a way that gives non-trivial insight into what is happening in interconnected parts of the system. Until that is the case, the debugging process remains largely reactive and depends on the user's ability to spot the right correlations, probe the right parts of the system, and piece the puzzle together, rather than on the tool helping the user formulate those hypotheses.

I am not a visual designer or a UX expert, but in the sections that follow I would like to share a few ideas about what such visualizations might look like.

Focus on specific services

With the industry consolidating around the ideas of SLOs (service level objectives) and SLIs (service level indicators), it seems reasonable that individual teams should make meeting those targets for their own services a priority. It follows that a service-oriented visualization is the best fit for such teams.

Traces, especially unsampled ones, are a treasure trove of information about every component of a distributed system. This information can be fed to a clever processor that surfaces service-oriented findings to users. Those findings can be computed ahead of time, even before the user has looked at a single trace (a minimal sketch follows the list below):

  1. Latency distribution charts for outlier requests only;
  2. Latency distribution charts for cases when the service's SLO targets are missed;
  3. The most "common", "interesting", and "weird" tags in the requests that recur most often;
  4. Latency breakdowns for cases when the service's dependencies miss their SLO targets;
  5. Latency breakdowns across different downstream services.
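
Here is a minimal sketch of how items 2 and 5 above might be computed from raw span data. The record shape, the SLO threshold, and the front-page service name are assumptions for illustration, not any particular vendor's schema.

```python
from collections import defaultdict

SLO_MS = 300.0  # assumed SLO target for the front-page service

def slo_breach_breakdown(traces, service="front-page"):
    """For traces where `service` misses its SLO, break latency down by downstream service.

    `traces` is assumed to be a list of traces, each a list of span dicts like
    {"service": "video", "parent_service": "front-page", "duration_ms": 120.0}.
    Returns the average downstream time per service over the breaching traces.
    """
    breakdown = defaultdict(float)
    breaches = 0
    for trace in traces:
        root = next((s for s in trace if s["service"] == service), None)
        if root is None or root["duration_ms"] <= SLO_MS:
            continue  # no root span found, or SLO met: skip this trace
        breaches += 1
        for span in trace:
            if span.get("parent_service") == service:
                breakdown[span["service"]] += span["duration_ms"]
    return {svc: total / breaches for svc, total in breakdown.items()} if breaches else {}
```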

Built-in metrics simply cannot answer some of these questions, forcing users to pore over spans instead. The result is an extremely user-hostile mechanism.

This raises a question: what about complex interactions among many services owned by different teams? Isn't the traceview considered precisely the right tool for illuminating such situations?

Mobile developers, owners of stateless services, owners of managed stateful services (such as databases), and platform owners may each want a different representation of the distributed system; the traceview is too generic a solution for these fundamentally different needs. Even in a very complex microservice architecture, service owners rarely need deep knowledge of more than two or three upstream and downstream services. In most scenarios it is enough for users to answer questions about a limited set of services.

It is like examining a small subset of services under a magnifying glass. This lets the user ask more pointed questions about the complex interactions between those services and their direct dependencies. It is the service-world analogue of a backtrace, where the engineer knows what is wrong and has enough context about the surrounding services to work out why.

The approach I am advocating is the exact opposite of the top-down, traceview-based approach, where analysis starts from the entire trace and gradually descends to individual spans. A bottom-up approach instead starts by analyzing a small area close to the potential cause of the incident and expands the search space only if necessary (possibly pulling in other teams to analyze a wider set of services). The latter approach is much better suited to quickly testing initial hypotheses; once concrete results are in hand, one can move on to a more focused, detailed analysis.

Building a topology

Service-specific views can be incredibly useful once the user knows which service or group of services is responsible for the increased latency or is the source of errors. In a complex system, however, identifying the offending service at the time of a failure can be a non-trivial task, especially if none of the services reported an error.

Building a service topology can go a long way toward figuring out which service is seeing the spike in error rate or latency that is noticeably degrading the system. When I talk about building a topology, I do not mean a service map that displays every service in the system, the kind notorious for its "death star" architecture diagrams. Such a representation is no better than a traceview laid out as a directed acyclic graph. What I would like to see instead is a dynamically generated service topology, driven by attributes such as error rate, response time, or any user-specified parameter that helps narrow things down to specific suspect services.

Let's look at an example. Imagine a hypothetical news site. The front-page service talks to Redis, a recommendation service, an advertising service, and a video service. The video service fetches videos from S3 and metadata from DynamoDB. The recommendation service reads metadata from DynamoDB, loads data from Redis and MySQL, and writes messages to Kafka. The advertising service reads data from MySQL and writes messages to Kafka.

Below is a schematic representation of this topology (many commercial tracing products build such maps). It can be useful when you need to understand service dependencies. During debugging, however, when a particular service (say, the video service) shows elevated response times, such a topology is of little help.

[Figure: service diagram of the hypothetical news site]

The diagram below would be better: it places the problematic service (video) right at the center, where the user notices it immediately. This visualization makes it clear that the video service is misbehaving because of increased S3 response times, which in turn slows the loading of part of the front page.

[Figure: dynamic topology displaying only the "interesting" services]

Dynamically generated topology diagrams can be far more useful than static service maps, especially in elastic, autoscaling infrastructures. The ability to compare and contrast service topologies lets the user ask more relevant questions, and more precise questions about the system are more likely to lead to a better understanding of how it works.
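
A minimal sketch of how such a dynamic topology could be derived from span data: build the call graph, keep only the services whose error rate or average latency crosses a threshold, and retain their direct neighbors for context. The span record shape and the thresholds below are assumptions for illustration.

```python
from collections import defaultdict

def interesting_topology(spans, max_error_rate=0.01, max_avg_latency_ms=250.0):
    """Return the edges of a reduced call graph containing only "interesting" services.

    `spans` is assumed to be an iterable of dicts like
    {"service": "video", "parent_service": "front-page",
     "duration_ms": 412.0, "error": False}.
    """
    stats = defaultdict(lambda: {"count": 0, "errors": 0, "total_ms": 0.0})
    edges = set()
    for s in spans:
        st = stats[s["service"]]
        st["count"] += 1
        st["errors"] += int(s.get("error", False))
        st["total_ms"] += s["duration_ms"]
        if s.get("parent_service"):
            edges.add((s["parent_service"], s["service"]))

    def is_hot(svc):
        st = stats[svc]
        return (st["errors"] / st["count"] > max_error_rate
                or st["total_ms"] / st["count"] > max_avg_latency_ms)

    hot = {svc for svc in stats if is_hot(svc)}
    # keep only edges touching a hot service, so its direct neighbours stay visible
    return {(a, b) for (a, b) in edges if a in hot or b in hot}
```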

Comparative display

Another useful visualization would be a comparative view. Today traces are not well suited to side-by-side comparison, so the usual practice is to compare spans instead; and the central argument of this article is precisely that spans are too low-level to extract the most valuable information from tracing data.

Comparing two traces does not require a fundamentally new visualization; something like a histogram conveying the same information as a traceview would suffice. Surprisingly, even this simple approach can yield far more insight than studying the two traces separately. Even more powerful would be the ability to compare traces in aggregate. It would be extremely useful to see how a recently deployed database configuration change (say, enabling GC, garbage collection) affects the response times of downstream services over a window of several hours. If what I am describing sounds like A/B analysis of the impact of infrastructure changes across many services using tracing data, you are not far off the mark.
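
A minimal sketch of this kind of aggregate comparison, assuming two sets of root-span durations collected before and after a change; the chosen percentiles are arbitrary and for illustration only.

```python
from statistics import quantiles

def compare_latency(before_ms, after_ms, percents=(50, 90, 99)):
    """Compare two sets of request durations (in ms) at a few percentiles.

    `before_ms` and `after_ms` are assumed to be lists of root-span durations
    collected, say, in the hours before and after a deployment.
    Returns {percentile: (before, after, delta)}.
    """
    def pct(values, p):
        # quantiles(n=100) yields the 1st..99th percentile cut points
        return quantiles(values, n=100)[p - 1]

    return {
        p: (pct(before_ms, p), pct(after_ms, p), pct(after_ms, p) - pct(before_ms, p))
        for p in percents
    }

# usage sketch:
# report = compare_latency(before_window, after_window)
# for p, (b, a, d) in report.items():
#     print(f"p{p}: {b:.1f} ms -> {a:.1f} ms ({d:+.1f} ms)")
```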

Conclusion

I am not questioning the usefulness of tracing itself. I sincerely believe that no other method collects data as rich, causal, and contextual as that contained in a trace. I also believe, however, that every tracing solution uses this data remarkably inefficiently. As long as tracing tools remain fixated on the traceview, they will be limited in how much value they can extract from the information contained in traces. Worse, there is a risk of continuing to develop a thoroughly unfriendly and unintuitive visual interface that severely limits the user's ability to troubleshoot errors in the application.

Debugging complex systems is incredibly difficult even with the latest tools. Tools should help the developer formulate and test hypotheses by actively surfacing relevant information, flagging outliers, and highlighting peculiarities in the latency distribution. For tracing to become developers' tool of choice when troubleshooting production failures or problems spanning multiple services, it needs novel user interfaces and visualizations that better match the mental model of the developers who build and operate those services.

Designing a system that presents the various signals available in tracing data in a form optimized for analysis and inference will take serious intellectual effort. Thought needs to go into how to abstract the system's topology during debugging in a way that helps the user overcome blind spots without having to dig into individual traces or spans.

We need good abstraction and layering capabilities (especially in the UI), ones that fit naturally into hypothesis-driven debugging, where questions can be asked and hypotheses tested iteratively. They will not automatically solve every observability problem, but they will help users sharpen their intuition and ask more thoughtful questions. I am calling for a more thoughtful and innovative approach to visualization; there is real room to broaden horizons here.


Source: habr.com
