Meet the Author

Nick Travaglini
Senior Technical Customer Success Manager
Nick is a Technical Customer Success Manager with years of experience working with software infrastructure for developers and data scientists at companies like Solano Labs, GE Digital, and Domino Data Lab. He loves a good complex, socio-technical system. So much so that the concept was the focus of his MA research. Outside of work, he enjoys exercising, reading, and philosophizing.

Preempting Problems in a Sociotechnical System
Here at Honeycomb, we emphasize that organizations are sociotechnical systems. At a high level, that means that “wet-brained” people and the stuff they do are irreducible to “dry-brained” computations. That cashes out as the inability to ultimately remove or replace people in organizations with computers, in spite of what artificial general intelligence (AGI) ideologues would have you believe. The best that such artifacts can do is “relieve labor-intensive toil,” as my colleagues Charity and Phillip put it.

Determining a CoPE’s Efficacy—and Everything After
As discussed in the first article in this series, a Center of Production Excellence (CoPE) is a more or less formal, provisional subsystem within an organization. Its purpose is to act from within to change that organization so that it’s more capable of achieving production excellence. The series has, to date, focused mainly on how best to construct such a subsystem and what activities it should pursue. In this concluding post, however, I want to return to the point of a CoPE, discuss signs of success, and evaluate the impacts it’s having.

A CoPE’s Duty: Indexing on Prod
Building a center of production excellence (CoPE) starts with indexing on production. Here’s why. Odds are that a software engineer today is really focused on one place: pre-prod. Short for “pre-production,” this is slang for an environment where software code operates in a prototype phase of its development lifecycle. Common sense would have one believe that this is a safe space, a workbench of sorts, where problems can be found and remediated. Then, once engineers are reasonably certain everything’s working properly, they advance it to a matching environment called production, where the code behaves like it did in pre-prod and it merely needs to be managed by an operations team. That story is a comforting lie.

An Ode to Events
At this point, it’s almost passé to write a blog post comparing events to the three pillars. Nobody really wants to give up their position. Regardless, I’m going to talk about how great events are and use some analogies to try to get that across. Maybe these will help folks learn to really appreciate them and to depreciate a certain understanding of the three pillars. Or maybe not.

A CoPE’s Guide to Alert Management
Alerts are a perennial topic, and a CoPE will need to engage with them. The bounds of this problem space are formed by two types of alerts. Reactive alerts (in Honeycomb, we call these Triggers) fire after some event, such as crossing a predetermined boundary. Proactive alerts (Burn Alerts, based on Honeycomb’s SLO feature) give notice before a threshold is crossed; in the case of SLOs, that means before failing to meet the stated objective.
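
To make the contrast between the two alert types concrete, here is a minimal Python sketch. It is not Honeycomb's implementation or API. The function names, the idea of expressing the error budget as a fraction, and the simple "budget burned in the last hour" extrapolation are all assumptions made for illustration.

```python
def reactive_alert(error_rate: float, threshold: float) -> bool:
    """Reactive: fires only once the observed value has already crossed the boundary."""
    return error_rate > threshold


def hours_to_exhaustion(budget_remaining: float, budget_burned_last_hour: float) -> float:
    """Estimate how many hours remain until the error budget runs out.

    Both arguments are fractions of the total error budget (hypothetical units).
    Assumes the burn observed over the last hour continues unchanged.
    """
    if budget_burned_last_hour <= 0:
        return float("inf")  # nothing is burning, so no exhaustion is projected
    return budget_remaining / budget_burned_last_hour


def proactive_alert(budget_remaining: float, budget_burned_last_hour: float,
                    notice_hours: float) -> bool:
    """Proactive: fires while budget remains, if exhaustion is projected within notice_hours."""
    return hours_to_exhaustion(budget_remaining, budget_burned_last_hour) <= notice_hours


if __name__ == "__main__":
    # The error rate is already above the boundary, so the reactive alert fires.
    print(reactive_alert(error_rate=0.02, threshold=0.01))
    # Half the budget is left, but at the current burn it lasts only 4 hours,
    # so the proactive alert fires before the objective is actually missed.
    print(proactive_alert(budget_remaining=0.5, budget_burned_last_hour=0.125, notice_hours=24))
```

The reactive check only tells you that a boundary has been crossed, while the proactive check trades some certainty (it relies on extrapolating recent behavior) for advance notice.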

The CoPE and Other Teams, Part 2: Custom Instrumentation and Telemetry Pipelines
The previous post laid out the basic idea of instrumentation and how OpenTelemetry’s auto-instrumentation can get teams started. However, you can’t rely only on auto-instrumentation. This post will discuss the limitations in more detail and how a CoPE can help teams overcome them.

The CoPE and Other Teams, Part 1: Introduction & Auto-Instrumentation
The CoPE is made to affect, meaning change, how things work. The disruption it produces is a feature, not a bug. That disruption pushes things away from a locally optimal, comfortable state that generates diminishing returns. It sets things on a course of exploration to find new terrains which may benefit it more—and for longer.

Independent, Involved, Informed, and Informative: The Characteristics of a CoPE
In part one of our CoPE series, we analogized the CoPE with safety departments. David Woods says that those safety departments must be independent, involved, informed, and informative. In this post, we’ll elaborate on what each of those characteristics means, why the CoPE should also match those qualifications, and how to achieve that status.

Establishing and Enabling a Center of Production Excellence
Software is in a crisis. This is nothing new. Complex distributed systems are perpetually in a state far from equilibrium, operating in what Richard Cook has called a “degraded mode.” It’s through a combination of technical artifacts, organizational practices and policies, and pure gumption that they manage to maintain themselves through time.

Autocatalytic Adoption: Harnessing Patterns to Promote Honeycomb in Your Organization
When an organization signs up for Honeycomb at the Enterprise account level, part of their support package is an assigned Technical Customer Success Manager. As one of these TCSMs, part of my responsibilities is helping a central observability team develop a strategy to help their colleagues learn how to make use of the product. At a minimum, this means making sure that they can log in, that relevant data is available, that they receive training on how to query, and perhaps that they collaborate with the rest of Honeycomb’s CS department to solve problems as they arise.