
Honeycomb Users Are Living in the Future, Part 1: Sampling


Any sufficiently advanced technology is indistinguishable from magic.

Arthur C. Clarke

When we talk to new Honeycomb users, a few things stand out as sounding downright magical. Sometimes we’ll hear, “Wow, is that a new feature?” and we’ll say that no, it’s been like that for years. Clearly we need to get the word out!

This is the first installment of a blog series I’ll be writing, covering areas of Honeycomb that elicit reactions of awe and disbelief from new users. And today, I’d like to talk about one of my favorite topics: how Honeycomb handles sampling traces and logs.

Many engineers and leaders we talk to, even at the staff and VP level, are under pressure to apply sampling to save costs. At the same time, they have concerns about the impact on data quality, or worry about missing something important. It takes a good bit of reassurance to remind everyone that it's just math: specifically, statistics.


Automatically adjusting for sample rate

To the best of our knowledge, and please correct me if I’m wrong, there are only two observability vendors that can correct all data (query results, chart lines, alerts, and SLOs) for sample rates, and Honeycomb has been doing it since 2016. In a Honeycomb chart, every single datapoint will be automatically corrected for sampling so you don’t even have to think about it.

But if you do want to think about it, because you want to understand the impact of sampling on your margin of error, you can reason about it by switching into Usage Mode. In Usage Mode, Honeycomb reveals the sample rate for each event. We express sample rates as 1/N rather than a percentage (e.g., a sample rate of 20 = 1/20 = 5%), which makes the reverse mental math much easier. Every event's contribution to COUNT(), SUM(), etc. is multiplied by its sample rate.
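To make that concrete, here's a minimal Python sketch of the correction arithmetic, assuming each kept event carries its 1/N sample rate. The field names are illustrative, and this is the underlying math, not Honeycomb's actual implementation:

```python
# Sample-rate correction: each kept event stands in for sample_rate
# original events. Field names here are illustrative.

def corrected_count(events):
    """COUNT corrected for sampling: weight each event by its rate."""
    return sum(e["sample_rate"] for e in events)

def corrected_sum(events, field):
    """SUM corrected for sampling: weight each value by its rate."""
    return sum(e[field] * e["sample_rate"] for e in events)

# Three kept events at a sample rate of 20 represent ~60 original events.
events = [
    {"sample_rate": 20, "duration_ms": 105},
    {"sample_rate": 20, "duration_ms": 98},
    {"sample_rate": 20, "duration_ms": 250},
]
print(corrected_count(events))               # 60
print(corrected_sum(events, "duration_ms"))  # 9060
```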

We find it astonishing that pretty much all other tools would leave you in the dark on this crucial data.


Tail-based sampling

Now that we have this superpower of fully granular and customizable sampling, what kinds of awesome things can we do with it?

Our favorite trick is to combine it with tail-based sampling of tracing data, which ensures that all errors and slow traces are kept, or at least kept at a much higher rate than normal, boring, high-volume successful traces.

Refinery, our powerful sampling proxy, waits for all the spans of a distributed trace to arrive before evaluating customizable rules, so any criteria you can think of can drive a different-than-normal sampling decision.

This is particularly useful in customer support and service situations where you want to ensure that all tracing data related to user-facing errors or SLO violations is kept. In fact, these conditions can get very specific (there's a code sketch after the list). For example:

  • Keep all traces where a gRPC error occurred
  • Keep all traces where any span was longer than 5000 ms
  • Keep all traces where customer ID is 55209
  • Keep all traces where a span named processRetryQueueError exists
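
Here's a toy Python sketch of what such a tail-based decision looks like once a whole trace has arrived. The attribute names are OpenTelemetry-flavored but illustrative, and this is not Refinery's actual rule engine or configuration format; in Refinery you'd express these as rules, not code:

```python
import random

DEFAULT_SAMPLE_RATE = 100  # keep ~1 in 100 ordinary, successful traces

def sample_rate_for(trace):
    """Choose a sample rate for a completed trace (a list of span dicts)."""
    for span in trace:
        if span.get("rpc.grpc.status_code", 0) != 0:
            return 1  # keep all traces where a gRPC error occurred
        if span.get("duration_ms", 0) > 5000:
            return 1  # keep all traces with a span longer than 5000 ms
        if span.get("customer.id") == 55209:
            return 1  # keep all traces for this customer
        if span.get("name") == "processRetryQueueError":
            return 1  # keep all traces containing this span
    return DEFAULT_SAMPLE_RATE

def keep(trace):
    """Return (kept?, rate); kept events carry the rate for correction."""
    rate = sample_rate_for(trace)
    return random.random() < 1.0 / rate, rate
```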

Dynamic sampling

The value of your telemetry data comes in all shapes and sizes. Perhaps it's not just errors and slow requests that matter. What if you could ensure that your most unique combinations of API endpoints and customer types also had boosted priority when it comes to sampling decisions?

Enter EMA Dynamic Sampling, thanks again to Refinery. In this mode, Refinery constantly adjusts sample rates for various traces in order to achieve a target sample rate or event throughput rate. Now we can unleash the full potential of Honeycomb’s sampling correction! Behold the chart below, representing nearly 300 distinct sample rates spread across one billion events pre-sampling.

[Chart: nearly 300 distinct sample rates across one billion pre-sampled events]
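
For intuition, here's a toy Python sketch of EMA-based dynamic sampling, loosely modeled on the log-weighted approach in Honeycomb's open source dynsampler-go library. The class and field names are made up for illustration, and Refinery's real implementation differs:

```python
import math
from collections import defaultdict

class ToyEMADynamicSampler:
    """Toy EMA-based dynamic sampler: common traffic gets sampled
    harder, rare traffic is kept at (or near) 1/1, and the average
    sample rate trends toward the goal."""

    def __init__(self, goal_sample_rate=20, alpha=0.5):
        self.goal = goal_sample_rate        # target average rate (1/N)
        self.alpha = alpha                  # EMA smoothing factor
        self.ema_counts = defaultdict(float)
        self.current_counts = defaultdict(int)
        self.rates = {}

    def record(self, key):
        """Count traffic per key, e.g. 'POST /checkout:enterprise'."""
        self.current_counts[key] += 1

    def update_rates(self):
        """Call once per interval: fold counts into the EMA, then
        re-derive per-key rates from log-weighted counts so a single
        huge key can't starve the rare ones."""
        for key in set(self.ema_counts) | set(self.current_counts):
            count = self.current_counts.get(key, 0)
            self.ema_counts[key] = (
                self.alpha * count + (1 - self.alpha) * self.ema_counts[key]
            )
        self.current_counts.clear()
        # Drop keys whose traffic has decayed to ~nothing.
        self.ema_counts = defaultdict(
            float, {k: v for k, v in self.ema_counts.items() if v > 0.5}
        )

        total = sum(self.ema_counts.values())
        log_sum = sum(math.log10(c + 1) for c in self.ema_counts.values())
        if not log_sum:
            return
        keep_budget = total / self.goal  # events we can afford to keep
        for key, c in self.ema_counts.items():
            key_budget = keep_budget * math.log10(c + 1) / log_sum
            self.rates[key] = max(1, round(c / key_budget))

    def sample_rate(self, key):
        return self.rates.get(key, 1)
```

After an interval or two, a chatty endpoint might land at 1/200 while a rare error path stays at 1/1, and every kept event still carries its rate so Honeycomb can correct the charts automatically.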

Enhance!

What if you decide after the fact that you really did need some data that was dropped, say, for a customer-reported issue? Fear not: our newly launched Enhance button provides data lake-like functionality to automatically pull those events out of an S3 bucket.


Want to learn more?

We’re proud to have launched our Honeycomb Academy course for Refinery (and sampling). I hope you enjoy these workshops as much as we enjoyed building them! 

Also, don’t forget to check out my recent blog post on Data Strategy for SREs and Observability Teams.

New to Honeycomb? Get your free account today.
