## Overview
Honeycomb is an observability company that provides data analysis tools for understanding applications running in production. Their platform is essentially an evolution of traditional APM (Application Performance Monitoring) tools, allowing users to query telemetry data generated by their applications. Philip Carter, a Principal Product Manager at Honeycomb and a maintainer in the OpenTelemetry project, shared their journey of building and deploying a natural language query assistant powered by LLMs.
The core challenge Honeycomb faced was user activation. While their product had excellent product-market fit with SREs and platform engineers, average software developers and product managers struggled with the query interface. Every major pillar of their product involves querying in some form, but the unfamiliar tooling caused many users to simply leave without successfully analyzing their data. This was the largest drop-off point in their product-led growth funnel.
## The Solution Architecture
When ChatGPT launched and OpenAI subsequently reduced API prices by two orders of magnitude, Honeycomb saw an opportunity to address this activation problem through natural language querying. They set an ambitious one-month deadline to ship a production feature to all users.
The system works by taking natural language input and producing a JSON specification that matches their query engine's format. Unlike SQL translation, Honeycomb queries are represented as JSON objects with specific rules and constraints—essentially a visual programming language serialized as structured data. The system also incorporates dataset definitions as context, including canonical representations of common columns like error fields. This allows the model to understand, for example, whether an error column is a Boolean flag or a string containing error messages, and select the appropriate one based on user intent.
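As an illustration, here is a minimal sketch of that parse-and-validate step, using an invented, simplified query shape and dataset definition (Honeycomb's actual schema is richer and not public):

```python
import json

# Invented dataset definition: canonical columns the prompt is seeded with.
# In this hypothetical schema, "error" is a boolean flag and "error.message"
# is a string payload, which is exactly the distinction the model must resolve.
DATASET_COLUMNS = {
    "error": "boolean",
    "error.message": "string",
    "duration_ms": "float",
    "http.route": "string",
}

# Hypothetical model output for "which routes have the most errors?"
llm_output = """
{
  "calculations": [{"op": "COUNT"}],
  "filters": [{"column": "error", "op": "=", "value": true}],
  "breakdowns": ["http.route"],
  "orders": [{"op": "COUNT", "order": "descending"}]
}
"""

def parse_and_validate(raw: str) -> dict:
    """Parse the model's output and reject references to unknown columns."""
    spec = json.loads(raw)  # malformed output fails here and is simply dropped
    referenced = [f["column"] for f in spec.get("filters", [])]
    referenced += spec.get("breakdowns", [])
    unknown = [c for c in referenced if c not in DATASET_COLUMNS]
    if unknown:
        raise ValueError(f"unknown columns: {unknown}")
    return spec

query_spec = parse_and_validate(llm_output)  # ready to hand to the query engine
```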
One interesting architectural decision was the use of embeddings to reduce prompt size. Rather than passing entire schemas to the model, they use embeddings to pull out a subset of relevant fields (around the top 50), capturing metadata about this selection process for observability purposes.
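A minimal sketch of that selection step, assuming the OpenAI embeddings endpoint and cosine similarity over column names (the real system presumably embeds richer per-column metadata):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def top_columns(user_input: str, columns: list[str], k: int = 50) -> list[str]:
    """Return the k dataset columns most similar to the user's question."""
    resp = client.embeddings.create(
        model="text-embedding-ada-002",
        input=[user_input] + columns,
    )
    vectors = np.array([item.embedding for item in resp.data])
    query_vec, column_vecs = vectors[0], vectors[1:]
    # Ada embeddings are unit-length, so a dot product is cosine similarity.
    scores = column_vecs @ query_vec
    best = np.argsort(scores)[::-1][:k]
    return [columns[i] for i in best]
```

Only the selected subset of the schema then goes into the prompt, along with metadata about the selection itself for the instrumentation described below.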
## Iterative Improvement Through Production Data
A key insight from their experience was that working with real production data was far faster than hypothetical prompt engineering. Once in production, they could directly examine user inputs, LLM outputs, parsing results, and validation errors. This data-driven approach helped them escape the trap of optimizing for hypothetical scenarios.
They instrumented the system thoroughly, capturing around two dozen fields per request on spans within traces, including user input, LLM output, OpenAI errors, parsing/validation errors, embedding metadata, and user feedback (yes/no/unsure). This allowed them to group by error types and identify patterns. For example, many users asking "what is my error rate" triggered the same parsing error; fixing it through prompt engineering produced a 6% improvement in success rate within a week.
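A minimal sketch of that instrumentation using the OpenTelemetry Python API; the attribute names are invented, since Honeycomb has not published their exact field list:

```python
from opentelemetry import trace

tracer = trace.get_tracer("query-assistant")

def record_request(user_input: str, llm_output: str, parse_error: str | None,
                   num_selected_columns: int, user_feedback: str | None) -> None:
    """Attach the fields used for later analysis as attributes on one span."""
    with tracer.start_as_current_span("query_assistant.generate") as span:
        span.set_attribute("app.user_input", user_input)
        span.set_attribute("app.llm.output", llm_output)
        span.set_attribute("app.embedding.num_columns", num_selected_columns)
        span.set_attribute("app.parse.success", parse_error is None)
        if parse_error is not None:
            span.set_attribute("app.parse.error", parse_error)
        if user_feedback is not None:  # "yes" / "no" / "unsure"
            span.set_attribute("app.user_feedback", user_feedback)
```

Grouping spans by an attribute like `app.parse.error` is what surfaces patterns such as the "what is my error rate" failure mode.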
Their service level objective (SLO) approach was particularly interesting. They initially set a 75% success rate target over a seven-day window, essentially accepting that a quarter of inputs would fail. The actual initial rate was slightly better, at 76-77%. Through iterative improvements combining prompt engineering with corrective code that statically fixed outputs that were "basically correct", they improved to approximately a 94% success rate.
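Assuming each request carries a boolean success field as in the instrumentation sketch above, the SLO itself reduces to a ratio over a rolling seven-day window; a standalone sketch:

```python
from datetime import datetime, timedelta

SLO_TARGET = 0.75          # initial target: three quarters of requests succeed
WINDOW = timedelta(days=7)

def success_rate(events: list[tuple[datetime, bool]], now: datetime) -> float:
    """events are (timestamp, succeeded) pairs, one per query-assistant request."""
    recent = [ok for ts, ok in events if now - ts <= WINDOW]
    return sum(recent) / len(recent) if recent else 1.0

# Alert when the rolling rate drops below target, e.g.:
# if success_rate(events, datetime.utcnow()) < SLO_TARGET: page_someone()
```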
## Cost and ROI Analysis
The cost analysis was refreshingly practical. Carter projected costs based on the volume of manual queries (about a million per month) and estimated approximately $100,000 per year in OpenAI costs. This was framed as less than the cost of a single engineer and comparable to conference sponsorship costs. Importantly, this was only viable because they used GPT-3.5; GPT-4 would have cost tens of millions per year, making it economically unfeasible.
Rate limiting was noted as an indirect cost concern—OpenAI's tokens-per-minute limits per API key would eventually become a constraint if the feature became successful enough. This motivated work to reduce prompt sizes and input volumes.
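A back-of-envelope version of both calculations, using the stated query volume and purely illustrative assumptions for tokens per request and per-token price:

```python
queries_per_month = 1_000_000   # stated volume of manual queries
tokens_per_request = 3_500      # assumption: prompt + schema subset + completion
price_per_1k_tokens = 0.002     # assumption: roughly GPT-3.5-era pricing, USD

annual_cost = queries_per_month * 12 * tokens_per_request / 1000 * price_per_1k_tokens
print(f"~${annual_cost:,.0f} per year")          # on the order of $100K

# The same volume expressed as average throughput, which is what per-key
# tokens-per-minute limits constrain (actual limits vary by model and tier).
tpm = queries_per_month * tokens_per_request / (30 * 24 * 60)
print(f"~{tpm:,.0f} tokens per minute on average")
```

Under these assumptions the average load already sits in the tens of thousands of tokens per minute, which is why shrinking prompts (via the embedding-based column selection) also buys rate-limit headroom, not just cost savings.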
The ROI was measured through two channels. First, qualitative feedback from the sales team indicated the feature was reducing the hand-holding required during the sales process. Second, and more quantifiably, they tracked activation metrics: the percentage of new teams creating "complex queries" (with group-bys and filters) jumped from 15% to 36% for query assistant users, and the percentage adding queries to boards jumped from 6% to nearly 17%. These metrics were identified as leading indicators that correlate strongly with conversion to paying customers.
## UI Design Decisions
An important product decision was deliberately not building a chatbot. Early prototypes included chat capabilities that could explain concepts, suggest hypothetical queries for missing data, and engage in multi-turn conversations. They scoped this out for several reasons.
First, sales team feedback indicated users wanted to get a Honeycomb query as fast as possible, not chat with something. Second, they observed that once the query UI was filled out by the assistant, users preferred to click and modify fields directly rather than returning to text input. Third, and critically, a chatbot is "an end user reprogrammable system" that enables rapid iteration by bad actors attempting prompt injection attacks.
By constraining the interface to single-shot query generation rather than conversational interaction, they made attacks significantly more annoying to execute while still providing the core value proposition.
## Security and Prompt Injection Mitigations
Honeycomb took prompt injection seriously given they handle sensitive customer data. Their production data already showed attack-like patterns: grouping telemetry by the least frequent unique values surfaced suspicious entries such as script tags embedded in field values. Their multi-layered approach included:
- Parsing all LLM outputs, which naturally drops malformed content
- Existing SQL injection and security protections in their query execution layer
- Complete isolation of the LLM from their main database containing user/team info and API key validation
- Rate limiting (initially 25 uses per day, later adjusted upward), since most legitimate users never exceeded this
- Input length limits
- Parameterizing the prompt inputs
- Using a simulated conversation pattern in prompts (alternating user input and LLM output) that appears to keep the model oriented toward staying in that pattern (see the sketch after this list)
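A minimal sketch of the input-length limit, parameterization, and simulated-conversation ideas, with invented limits, column names, and example turns (this is not Honeycomb's actual prompt):

```python
MAX_INPUT_CHARS = 500   # illustrative input length limit

def build_messages(user_input: str, columns: list[str]) -> list[dict]:
    """Keep untrusted text confined to 'user' turns in a fixed few-shot pattern."""
    if len(user_input) > MAX_INPUT_CHARS:
        raise ValueError("input too long")
    system = (
        "You translate questions about telemetry into a JSON query. "
        f"Available columns: {', '.join(columns)}. Respond with JSON only."
    )
    return [
        {"role": "system", "content": system},
        # Simulated conversation: example input/output pairs establish the
        # pattern the model is expected to stay in.
        {"role": "user", "content": "which routes are slowest?"},
        {"role": "assistant", "content":
            '{"calculations": [{"op": "P99", "column": "duration_ms"}], '
            '"breakdowns": ["http.route"]}'},
        # The real (untrusted) input is passed as its own user turn in that
        # pattern; it is never concatenated into the system instructions.
        {"role": "user", "content": user_input},
    ]
```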
The philosophy was making attacks "too difficult to get really juicy info out of" rather than claiming complete security—acknowledging there are easier targets for bad actors.
## Challenges with Function Calling
When OpenAI introduced function calling, they tested it but found it unsuited for their use case. Their system needed to handle highly ambiguous inputs: users pasting error messages, trace IDs (16-byte identifiers rendered as hex strings), or even expressions from Honeycomb's derived-column DSL. Their existing prompting approach could generally produce something from these inputs, but function calling more frequently produced nothing, because the model couldn't conform its output to the required shape. This highlighted that prompting techniques are not universally transferable between different OpenAI features.
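For context, a hedged sketch of what the function-calling route looks like with the current OpenAI Python SDK (the successor to the 2023-era `functions` parameter); the function name and schema are invented, and the trace ID is the W3C documentation example:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

# Invented, simplified JSON Schema for the query object.
query_schema = {
    "type": "object",
    "properties": {
        "calculations": {"type": "array", "items": {"type": "object"}},
        "filters": {"type": "array", "items": {"type": "object"}},
        "breakdowns": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["calculations"],
}

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "Turn the user's request into a query."},
        # Ambiguous inputs like a pasted trace ID are where this tends to break down.
        {"role": "user", "content": "4bf92f3577b34da6a3ce929d0e0e4736"},
    ],
    tools=[{
        "type": "function",
        "function": {
            "name": "run_query",
            "description": "Run a query against the current dataset.",
            "parameters": query_schema,
        },
    }],
)

calls = response.choices[0].message.tool_calls
if not calls:
    # The model declined to call the function: the "produced nothing" case.
    print(response.choices[0].message.content)
else:
    print(calls[0].function.arguments)  # JSON string, still needs parse + validation
```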
## Perspective on Open Source Models
At the time of the discussion, Honeycomb was not considering open source models as a replacement for OpenAI. The reasoning was pragmatic: fine-tuning and deploying open source models wasn't yet easy enough to justify the investment, especially when OpenAI regularly releases improved models. They acknowledged motivations for self-hosting (control over model updates, latency improvements) but felt the ecosystem wasn't mature enough for their needs. They expressed interest in a hypothetical marketplace of task-specific models with easy fine-tuning workflows, but this didn't exist at the time.
## Agents and Chaining
Carter explicitly stated they would never use agents or chaining for the query assistant feature. The reasoning was that accuracy compounds negatively with each step—a 90% accuracy rate becomes much worse across multiple chained calls. However, they did see potential for agents in other parts of their product where the trade-off is different: "compile time" features where latency of several minutes is acceptable in exchange for higher quality results, versus the "runtime" concern of query generation where speed matters and users can easily correct minor issues.
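The arithmetic behind that concern, assuming independent and equally accurate steps:

```python
# End-to-end success probability of a chain where each step is 90% accurate.
for steps in (1, 2, 3, 5):
    print(steps, round(0.9 ** steps, 2))   # 0.9, 0.81, 0.73, 0.59
```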
## Organizational Dynamics
An interesting meta-observation was about the convergence of skills needed. Carter advocated for ML engineers becoming more product-minded—doing user interviews, identifying problems worth solving, and understanding business metrics—while product managers should become more data-literate, understanding embeddings, LLM limitations, and data pipelines. The ease of calling an LLM API has shifted the complexity from model training to data quality, instrumentation, and understanding when to snapshot production data for evaluation systems.