The New Rules of Sampling
It is not feasible to run an observability infrastructure that is the same size as your production infrastructure. Past a certain scale, the cost to collect, process, and save every log entry, every event, and every trace that your systems generate dramatically outweighs the benefits. The question is how to scale back the flood of data without losing the crucial information your engineering team needs to troubleshoot and understand your systems' behavior in production.
Starting at the highest level, there are three ways to approach the problem:
- Measure fewer things: Focus is good, and it’s certainly possible that you’re tracking some things that aren’t directly critical to your business outcomes, such as noisy debug logs. Once that low-hanging waste has been eliminated, however, further reducing the coverage of your telemetry is rarely desirable: measuring less means lower fidelity.
- Send aggregates (metrics): Your business needs a high-level view of your systems, and you should be using dashboards as a starting point before diving into the individual events and traces that generate them. That said, pre-aggregating the data before it reaches your debugging tool means you simply can’t dig in any further than the granularity of the aggregated values. The cost in terms of flexibility is too high.
- Sample: Modern, dynamic approaches to sampling allow you to achieve greater visibility into the most important parts of your infrastructure while still transmitting only a portion of your volume to your instrumentation service. Collapsing many similar events into a single representative data point, while retaining unusual behavior verbatim, lets you get more value out of each incremental unit of telemetry data (see the sketch after this list).
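
To make the idea concrete, here is a minimal sketch of the decision a dynamic sampler makes. It is not any particular vendor's API: the field names (`error`, `duration_ms`), the 1,000 ms latency threshold, and the fixed 1-in-100 rate are assumptions chosen only for illustration. Unusual events are kept verbatim, routine events are sampled, and each kept event carries its sample rate so aggregates can be reconstructed downstream.

```python
import random

# Assumed sampling rate for routine traffic; tune to your event volume.
KEEP_EVERY_NTH_ROUTINE_EVENT = 100


def should_keep(event: dict) -> tuple[bool, int]:
    """Return (keep?, sample_rate) for a single telemetry event."""
    # Always keep events that describe unusual behavior, verbatim.
    if event.get("error") or event.get("duration_ms", 0) > 1000:
        return True, 1  # rate 1: this event represents only itself
    # Routine events: keep 1 in N; each kept event then stands in
    # for N similar events when counts are re-aggregated later.
    if random.randint(1, KEEP_EVERY_NTH_ROUTINE_EVENT) == 1:
        return True, KEEP_EVERY_NTH_ROUTINE_EVENT
    return False, KEEP_EVERY_NTH_ROUTINE_EVENT


def process(event: dict, send) -> None:
    """Decide whether to forward an event to the telemetry backend."""
    keep, rate = should_keep(event)
    if keep:
        # Recording the rate lets the backend multiply counts back
        # up to their true totals despite the dropped events.
        event["sample_rate"] = rate
        send(event)
```

Carrying the sample rate with each kept event is what preserves the high-level view: the backend can weight each sampled event by its rate to approximate the original totals, while the unusual events remain available at full fidelity for debugging.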
