A beginner’s guide to Distributed Tracing

Why we need Distributed Tracing

What is Distributed Tracing?

Let’s begin by talking about what Distributed Tracing is.

You can think of Distributed Tracing as monitoring for Microservices.

Distributed tracing is a method of tracking application requests as they flow through your services. This gives you visibility into your system.

Tracing is agnostic to your programming language and can be used with almost every type of application or service.

For a more in-depth understanding of Distributed Tracing, I recommend reading these two books:

Distributed Tracing in Practice

Mastering Distributed Tracing

Why do we need Distributed Tracing?

Imagine a Microservice architecture in which a single request from the client passes through many services.

In such an architecture, a request can go through tens or hundreds of network hops. This makes it very difficult to know the entire path a request takes, and very complicated to troubleshoot if you only have logs and metrics.

When something fails, there are several questions that you must answer.

  • How can we find out the root cause?
  • How can we keep an eye on all the services it went through?

Distributed Tracing helps you see the interaction between the services during the whole request and gives you insight into the full lifecycle of requests in your system.

The trace data, in the form of spans, carries metadata that helps you understand how and why latency or errors are occurring and what impact they are having on the entire request.

What problems does it solve?

Understanding how each individual service contributes to the overall performance of a request, and therefore how the system behaves in production, is extremely helpful for developers.

This can uncover issues with the system that we were never able to see before.

Having visibility into the operation of your application helps in cases such as:

  • Analysing the root cause of failures or other incidents.
  • Reducing the time to detect a problem.
  • Debugging and troubleshooting.
  • Resolving performance issues in your application.
  • Discovering problems that you wouldn’t otherwise realise you had.

The result is that developers spend less time troubleshooting and debugging, and the user experience of your system improves.

Distributed Tracing can help us answer the following questions:

  • Where are the performance bottlenecks?
  • Was there an error, and if so, where did it start?
  • Which services did a request pass through?
  • What occurred in each service for a given request?
  • Is there latency, and if so, where?
  • Who should I page (i.e. the on-call person)?

Spans and Traces

Traces

A trace is a view into a request as it moves through a distributed system. Each trace has a unique id that identifies one action, say a customer purchasing an item.

A trace is made of one or more Spans.

Spans

We can create spans in our instrumentation to describe what is happening in our application. A span represents a unit of work or operation. The first span, called the Root Span, represents a request from start to finish. It creates a trace id and passes it to all downstream services as a header.

Each span belongs to exactly one trace and each span underneath the parent contains a more in-depth context of what occurs during a request.

It sort of paints a picture of what happened during the time in which that operation was executed.
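To make the relationship between a trace, its root span and its child spans more concrete, here is a minimal sketch in Python. The Span class and the start_root_span / start_child_span helpers are hypothetical names made up for illustration, not the API of any tracing library; the id lengths (16 random bytes for the trace id, 8 for the span id) are an assumption that mirrors the ids in the OpenTelemetry example.

```python
import secrets
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """A hypothetical, minimal span: one unit of work inside a trace."""
    name: str
    trace_id: str                      # shared by every span in the trace
    span_id: str                       # unique to this span
    parent_id: Optional[str] = None    # None marks the root span
    start_time: float = field(default_factory=time.time)
    end_time: Optional[float] = None

    def end(self) -> None:
        # Record when the unit of work finished.
        self.end_time = time.time()


def start_root_span(name: str) -> Span:
    # The root span is the one that creates the trace id.
    return Span(name=name,
                trace_id=secrets.token_hex(16),
                span_id=secrets.token_hex(8))


def start_child_span(parent: Span, name: str) -> Span:
    # A child reuses the parent's trace id and remembers the parent's span id.
    return Span(name=name,
                trace_id=parent.trace_id,
                span_id=secrets.token_hex(8),
                parent_id=parent.span_id)


root = start_root_span("purchase-item")
child = start_child_span(root, "charge-card")
child.end()
root.end()
print(child.trace_id == root.trace_id, child.parent_id == root.span_id)
```

Both spans carry the same trace id, and the child points back at the root through parent_id. That pair of fields is all a tracing backend needs to rebuild the picture of the request.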

For more information about traces, see Traces | OpenTelemetry

Sampling

Sampling is about deciding which traces are kept and which are discarded, and therefore how much trace data you store.

There are several types of sampling:

  • Head-based sampling
  • Tail-based sampling
  • Adaptive sampling

Juraci Paixão Kröhling wrote an excellent post about this, called The role of sampling in distributed tracing, which I recommend reading.
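As a rough illustration of head-based sampling, the sketch below makes the keep-or-drop decision at the start of a trace, using nothing but the trace id. The should_sample function and the 25% ratio are assumptions chosen for the example; real tracers ship their own samplers, so treat this as the idea rather than any library's implementation.

```python
import secrets

SAMPLE_RATIO = 0.25  # assumed for the example: keep roughly 25% of traces


def should_sample(trace_id: str, ratio: float = SAMPLE_RATIO) -> bool:
    """Head-based decision derived from the trace id alone.

    Every service that sees the same trace id reaches the same decision,
    so a trace is either kept whole or dropped whole.
    """
    # Read the last 16 hex digits (64 bits) of the trace id as a number
    # in [0, 2**64) and keep the trace if it falls below the threshold.
    value = int(trace_id[-16:], 16)
    return value < ratio * 2**64


trace_ids = [secrets.token_hex(16) for _ in range(10_000)]
kept = sum(should_sample(t) for t in trace_ids)
print(f"kept {kept} of {len(trace_ids)} traces")
```

Because the decision depends only on the trace id, it does not need to look at what happens later in the request. Tail-based sampling takes the opposite approach: it waits until the trace is complete, which lets it keep, for example, every trace that contains an error.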

Context propagation

Distributed Tracing tracks the progression of a single request across the services that make up your application. Context propagation carries the identifying information of the current trace and span along with that request, both within a process and between services. This allows spans to be connected to their parent/child spans so that a trace follows the flow of execution.

Without context propagation, you would find several orphaned spans in your traces.
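To show what an orphaned span looks like, here is a small sketch that links spans through their parent ids. The span records, their short ids and the build_tree / render helpers are all hypothetical and exist only for this example.

```python
from collections import defaultdict

# Hypothetical span records: only the ids needed to link spans together.
spans = [
    {"name": "checkout", "span_id": "a1", "parent_id": None},
    {"name": "charge-card", "span_id": "b2", "parent_id": "a1"},
    # The parent of this span was never reported, e.g. because the
    # context was lost between two services.
    {"name": "send-receipt", "span_id": "c3", "parent_id": "zz"},
]


def build_tree(spans):
    """Return (roots, children, orphans) for the spans of one trace."""
    known_ids = {s["span_id"] for s in spans}
    children = defaultdict(list)
    roots, orphans = [], []
    for span in spans:
        parent = span["parent_id"]
        if parent is None:
            roots.append(span)                # no parent: a root span
        elif parent in known_ids:
            children[parent].append(span)     # parent is part of the trace
        else:
            orphans.append(span)              # parent is missing
    return roots, children, orphans


def render(span, children, depth=0):
    """Return the span and its descendants as indented lines."""
    lines = ["  " * depth + span["name"]]
    for child in children[span["span_id"]]:
        lines.extend(render(child, children, depth + 1))
    return lines


roots, children, orphans = build_tree(spans)
for root in roots:
    print("\n".join(render(root, children)))
print("orphans:", [s["name"] for s in orphans])
```

The charge-card span hangs neatly under checkout, while send-receipt cannot be attached anywhere, which is what a broken propagation chain tends to look like in a tracing UI.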

Let’s have a look at an example trace from the OpenTelemetry page. This trace shows what happens when a request is made by a user or an application.

{
  "name": "Hello-Greetings",
  "context": {
    "trace_id": "0x5b8aa5a2d2c872e8321cf37308d69df2",
    "span_id": "0x5fb397be34d26b51"
  },
  "parent_id": "0x051581bf3cb55c13",
  "start_time": "2022-04-29T18:52:58.114304Z",
  "end_time": "2022-04-29T18:52:58.114435Z",
  "attributes": {
    "http.route": "some_route1"
  },
  "events": [
    {
      "name": "hey there!",
      "timestamp": "2022-04-29T18:52:58.114561Z",
      "attributes": {
        "event_attributes": 1
      }
    },
    {
      "name": "bye now!",
      "timestamp": "2022-04-29T22:52:58.114561Z",
      "attributes": {
        "event_attributes": 1
      }
    }
  ]
}
{
  "name": "Hello-Salutations",
  "context": {
    "trace_id": "0x5b8aa5a2d2c872e8321cf37308d69df2",
    "span_id": "0x93564f51e1abe1c2"
  },
  "parent_id": "0x051581bf3cb55c13",
  "start_time": "2022-04-29T18:52:58.114492Z",
  "end_time": "2022-04-29T18:52:58.114631Z",
  "attributes": {
    "http.route": "some_route2"
  },
  "events": [
    {
      "name": "hey there!",
      "timestamp": "2022-04-29T18:52:58.114561Z",
      "attributes": {
        "event_attributes": 1
      }
    }
  ]
}
{
  "name": "Hello",
  "context": {
    "trace_id": "0x5b8aa5a2d2c872e8321cf37308d69df2",
    "span_id": "0x051581bf3cb55c13"
  },
  "parent_id": null,
  "start_time": "2022-04-29T18:52:58.114201Z",
  "end_time": "2022-04-29T18:52:58.114687Z",
  "attributes": {
    "http.route": "some_route3"
  },
  "events": [
    {
      "name": "Guten Tag!",
      "timestamp": "2022-04-29T18:52:58.114561Z",
      "attributes": {
        "event_attributes": 1
      }
    }
  ]
}

The trace contains three spans:

  • Hello-Greetings
  • Hello-Salutations
  • Hello

Hello is the root span (its parent_id is null), and the other two spans are its children, because their parent_id matches the span_id of Hello.

If you now look at the context field, you will see that each span has the same trace_id. This means that all of the information can be tied together, providing a trail through the request’s various routes and timestamps.

"context": {
"trace_id": "0x5b8aa5a2d2c872e8321cf37308d69df2",

Propagation is the means by which context is carried between services so that their spans can be correlated into one trace. The data is first injected on the client side, often via HTTP headers, and then extracted from the headers on the server side, where it is deserialised.

Some propagators that can be used to send trace information are the W3C Trace-Context HTTP Propagator and the B3 Zipkin HTTP Propagator.

Put another way, propagation is the process of sending the contents of the context as metadata via a network request.

It is important that the propagation settings are the same on all systems. Otherwise a service cannot read the context written by its caller, and the trace is broken at that point.
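The inject and extract steps can be sketched with the traceparent header defined by W3C Trace Context, whose value has the form version-traceid-parentid-flags in lowercase hex. The inject and extract functions below are hypothetical helpers written for this example, not the API of any tracing SDK. They only handle version 00 and assume the header dictionary uses lowercase keys.

```python
import re
from typing import Dict, Optional, Tuple

# traceparent layout: version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject(headers: Dict[str, str], trace_id: str, span_id: str,
           sampled: bool = True) -> None:
    """Client side: write the current context into the outgoing headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"


def extract(headers: Dict[str, str]) -> Optional[Tuple[str, str, bool]]:
    """Server side: read the context back, or None if absent or malformed."""
    match = _TRACEPARENT.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_span_id, flags = match.groups()
    # All-zero ids are not valid in a traceparent header.
    if trace_id == "0" * 32 or parent_span_id == "0" * 16:
        return None
    sampled = bool(int(flags, 16) & 0x01)
    return trace_id, parent_span_id, sampled


# Ids taken from the example trace (without the 0x prefix).
outgoing: Dict[str, str] = {}
inject(outgoing, "5b8aa5a2d2c872e8321cf37308d69df2", "051581bf3cb55c13")
print(outgoing["traceparent"])
print(extract(outgoing))
```

The receiving service would use the extracted trace id for its own spans and the extracted span id as their parent_id, which is how spans created in different services end up in the same trace.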

There is obviously a lot more to read about Context Propagation. I recommend reading the following pages for a better understanding of it.

Embracing context propagation

Lightstep

Signoz

OpenTelemetry

Conclusion

In this post we talked about Distributed Tracing at a very high level. We also looked at why we need it and what problems it solves.

We then went over traces, spans, sampling and context propagation.

Thank you for reading my article. If you found this useful, please hit that clap button and follow me to get more articles on your feed.

A beginner’s guide to Distributed Tracing was originally published in FAUN Publication on Medium, where people are continuing the conversation by highlighting and responding to this story.