## Overview
Hostinger, a web hosting company, has developed an LLM evaluation framework to systematically assess and improve their customer support chatbots. This case study, presented by a member of Hostinger's AI team, provides insight into their approach to LLMOps, specifically focusing on the evaluation component of the LLM lifecycle. The framework is designed to support multiple teams across the organization and represents an early-stage but thoughtful approach to production LLM quality assurance.
## Problem Statement
The Hostinger AI team faced a common challenge in production LLM systems: how to reliably evaluate chatbot quality both during development and in production. They identified three specific goals for their evaluation framework:
- Evaluate changes made during development against predefined datasets before deployment
- Continuously assess the performance of chatbots already serving live customers
- Identify and prioritize emerging issues to inform the next iteration of improvements
This reflects a mature understanding of the LLMOps lifecycle, recognizing that evaluation is not a one-time activity but an ongoing process that connects development to production and back again.
## Evaluation Framework Architecture
The evaluation framework consists of several interconnected components that together provide a comprehensive view of chatbot quality.
### Golden Examples
At the core of the evaluation approach is the concept of "golden examples" – carefully curated pairs of inputs and expected ideal outputs that represent the kind of responses the chatbot should provide. These serve as benchmarks against which actual chatbot responses are compared. The speaker emphasizes that these golden examples will "organically grow" over time as the team encounters new scenarios and edge cases, suggesting an iterative approach to building evaluation datasets rather than trying to create a comprehensive set upfront.
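The case study doesn't show the underlying data format, but a golden example can be as simple as a user message paired with the ideal answer, plus metadata for slicing results later. The entries below are illustrative placeholders, not Hostinger's actual data:

```python
# Illustrative golden examples (hypothetical content, not Hostinger's real data).
# Each entry pairs a user input with the ideal "golden" answer the chatbot should
# approximate, plus metadata that is useful for slicing evaluation results later.
golden_examples = [
    {
        "input": "How do I point my domain to my hosting plan?",
        "expected": (
            "Open the domain's DNS settings at your registrar and replace the "
            "current nameservers with the ones listed in your hosting panel. "
            "Changes can take up to 24 hours to propagate."
        ),
        "metadata": {"topic": "domains", "source": "support_ticket"},
    },
    {
        "input": "My website shows a 403 Forbidden error. What should I check?",
        "expected": (
            "Check file permissions (directories 755, files 644), confirm an index "
            "file exists in the site's root folder, and review any .htaccess deny rules."
        ),
        "metadata": {"topic": "hosting", "source": "edge_case_review"},
    },
]
```

Keeping metadata alongside each pair makes it easier to see which topics regress when a new chatbot version is evaluated.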
### Automated Metrics
The framework employs multiple automated checks and metrics to evaluate chatbot outputs:
- Basic component checks for output correctness and URL validation
- Factuality metrics to assess whether responses contain accurate information
- "General direction" metrics to evaluate how closely the response aligns with the expected golden answer
These metrics leverage LLMs themselves to evaluate outputs, a technique sometimes called "LLM-as-judge." The speaker notes there is "a lot of flexibility to kind of ask an LLM to evaluate whatever way we want," indicating they're not limited to predefined metrics but can create custom evaluation criteria as needed.
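The metric definitions themselves aren't shown in the talk. As a rough sketch of the pattern, a deterministic check (such as URL validation) can sit alongside an LLM-as-judge metric built with a library like autoevals; the prompt, names, and choice scores below are assumptions for illustration, not Hostinger's actual metrics:

```python
# Sketch of the two kinds of automated checks described: a deterministic URL check
# and an LLM-as-judge metric. The prompt, names, and choice scores are illustrative
# assumptions, not Hostinger's actual metric definitions.
import re

import requests
from autoevals import LLMClassifier


def check_urls_resolve(output: str) -> float:
    """Return 1.0 if every URL mentioned in the chatbot output resolves, else 0.0."""
    urls = re.findall(r"https?://[^\s)\"']+", output or "")
    if not urls:
        return 1.0  # nothing to validate
    try:
        return float(all(requests.head(url, timeout=5).status_code < 400 for url in urls))
    except requests.RequestException:
        return 0.0


# LLM-as-judge metric: does the answer point the customer in the same general
# direction as the golden answer?
general_direction = LLMClassifier(
    name="GeneralDirection",
    prompt_template=(
        "You are reviewing a customer-support answer.\n"
        "Question: {{input}}\n"
        "Golden answer: {{expected}}\n"
        "Actual answer: {{output}}\n"
        "Does the actual answer point the customer in the same direction as the "
        "golden answer? Answer Y or N."
    ),
    choice_scores={"Y": 1, "N": 0},
    use_cot=True,
)
```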
### Human Review and User Feedback
Recognizing the limitations of purely automated evaluation, the framework incorporates human judgment in multiple ways:
- Domain experts can provide scores and assessments through the evaluation UI
- User feedback from the production chatbot frontend is incorporated into the same evaluation framework
This multi-signal approach to evaluation is a best practice in LLMOps, as automated metrics may miss nuances that humans catch, while human review alone doesn't scale to the volume of interactions a production chatbot handles.
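How the frontend feedback flows into the framework isn't shown. One plausible wiring, assuming Braintrust's production logging is used, is to attach a thumbs-up/down score and optional comment from the chatbot frontend to the logged response; the feedback call and field names below are assumptions based on the platform's logging features, not confirmed implementation details:

```python
# Hypothetical sketch of routing end-user feedback from the chatbot frontend into
# the same evaluation platform. The log_feedback call and field names are
# assumptions for illustration; the exact feedback API should be taken from the
# platform's documentation.
import braintrust

logger = braintrust.init_logger(project="support-chatbot")


def record_user_feedback(request_id: str, thumbs_up: bool, comment: str = "") -> None:
    """Attach a user's rating (and optional comment) to the logged chatbot response."""
    logger.log_feedback(
        id=request_id,                          # id of the logged response/span
        scores={"user_rating": 1 if thumbs_up else 0},
        comment=comment,
    )
```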
## Tooling: Braintrust Integration
Hostinger adopted Braintrust, a third-party evaluation platform, to help automate and orchestrate their evaluation workflow. The speaker describes Braintrust as helping them "add more functionality" to their evaluation process. Based on the demo described, Braintrust provides:
- Experiment tracking and visualization
- Comparison between experiments to show improvements or regressions
- Summary scores and detailed drill-down capability for individual examples
- Dataset management for golden examples (see the sketch after this list)
- Programmatic API for integrating with CI/CD pipelines
- UI for manual review and annotation
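As a small example of the dataset-management piece, a newly curated golden example can be appended to a centrally managed dataset through the SDK, which is one way the set could "organically grow" over time. The project and dataset names below are placeholders:

```python
# Sketch of appending a newly curated golden example to a centrally managed dataset,
# e.g. after triaging a production conversation. Project and dataset names are
# placeholders, and the example content is invented for illustration.
from braintrust import init_dataset

dataset = init_dataset(project="support-chatbot", name="golden-examples")

dataset.insert(
    input="How do I enable a free SSL certificate for my site?",
    expected=(
        "Open the hosting panel, go to the SSL section, and activate the free "
        "certificate for the domain; issuance normally completes within an hour."
    ),
    metadata={"topic": "ssl", "added_from": "production_review"},
)
```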
It's worth noting that while the speaker is positive about Braintrust, this is a relatively early implementation ("first iteration"), and the long-term effectiveness of this tooling choice remains to be proven. The LLM evaluation tool landscape is evolving rapidly, and there are multiple alternatives in this space.
## Development Workflow Integration
A key aspect of the framework is its integration into the development workflow. The evaluation can be run programmatically as part of GitHub Actions, meaning that changes to the chatbot can be automatically evaluated before deployment. The speaker describes a workflow where:
- A developer has an idea for an improvement
- They run the evaluation with golden examples and defined metrics
- If the evaluation looks good, a new version is created
- The new version is deployed to production
- Production conversations are then evaluated to monitor quality and identify further improvements
This creates a feedback loop where production insights drive development priorities, and development changes are validated before reaching production. This is a solid LLMOps practice that helps catch regressions and ensures continuous improvement.
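The pass/fail mechanics of the GitHub Actions step aren't described. A common pattern, sketched below with a hypothetical `run_golden_eval` helper and made-up thresholds, is to fail the pipeline whenever any metric's average drops below an agreed floor:

```python
# Hypothetical CI gate: run the golden-example evaluation and fail the build if any
# metric's average drops below its threshold. run_golden_eval() and the threshold
# values are illustrative assumptions, not Hostinger's actual pipeline.
import sys

THRESHOLDS = {"factuality": 0.85, "general_direction": 0.80, "url_validity": 0.95}


def run_golden_eval() -> dict[str, float]:
    """Placeholder: run the evaluation suite and return the average score per metric."""
    raise NotImplementedError("wire this to the evaluation run")


def main() -> int:
    averages = run_golden_eval()
    failures = {
        name: avg for name, avg in averages.items() if avg < THRESHOLDS.get(name, 0.0)
    }
    if failures:
        print(f"Evaluation gate failed: {failures}")
        return 1  # non-zero exit fails the GitHub Actions job
    print(f"Evaluation gate passed: {averages}")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

Failing the job on a threshold breach is what turns the evaluation from a dashboard into an actual gate in the development workflow.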
## Cross-Team Adoption
The speaker mentions a goal of making this framework usable by "multiple teams" at Hostinger for "different sorts of services." This suggests an approach to LLMOps infrastructure as a shared platform, which can help ensure consistent evaluation practices across the organization and reduce duplicated effort. However, it's early days, and the speaker acknowledges the framework "will evolve" from this starting point.
## Strengths of the Approach
Several aspects of Hostinger's evaluation framework represent LLMOps best practices:
- **Multi-signal evaluation**: Combining automated metrics, human review, and user feedback provides a more complete picture of chatbot quality than any single approach.
- **Development-production feedback loop**: The framework explicitly connects production monitoring to development priorities, enabling data-driven improvement.
- **CI/CD integration**: Programmatic evaluation through GitHub Actions helps catch issues before deployment.
- **Experiment comparison**: The ability to compare evaluation results between versions helps quantify the impact of changes.
- **Growing datasets**: The plan to organically grow golden examples over time acknowledges that evaluation data is never "complete" and should evolve with the product.
## Limitations and Considerations
As an early-stage implementation presented in a demo format, there are aspects that are not fully addressed:
- **Scale and performance**: No details are provided about how the evaluation performs at scale or with large volumes of production conversations.
- **Metric validation**: There's no discussion of how they validate that their automated metrics actually correlate with human judgment or business outcomes.
- **Alert and triage systems**: While they mention identifying issues, the mechanisms for alerting on quality degradation or prioritizing issues aren't detailed.
- **A/B testing**: The framework appears focused on offline evaluation; it's unclear how they handle A/B testing of different chatbot versions in production.
- **Latency considerations**: No mention is made of how evaluation impacts chatbot response times or whether evaluation is done synchronously or asynchronously.
## Technical Implementation Details
From the code snippet described, the implementation appears relatively straightforward (a sketch approximating it follows the list):
- A function retrieves golden example data (inputs and expected outputs)
- The actual chatbot engine is called to generate outputs
- Metrics are defined, some using Braintrust's built-in metrics and some custom LLM-based evaluations
- Results are logged and visualized in the Braintrust UI
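The snippet itself isn't reproduced in the talk; the sketch below approximates the structure described using Braintrust's Python SDK, with the project name, dataset, chatbot wrapper, and custom scorer standing in as assumptions:

```python
# Sketch approximating the described structure: pull golden examples, call the
# chatbot engine, and score the outputs with built-in and custom metrics.
# The project/dataset names and the chatbot_engine import are assumptions.
from braintrust import Eval, init_dataset
from autoevals import Factuality

from chatbot import chatbot_engine  # hypothetical wrapper around the production chatbot


def general_direction(input, output, expected):
    """Stand-in for the LLM-as-judge 'general direction' metric sketched earlier,
    simplified here to keyword overlap with the golden answer."""
    expected_terms = set(expected.lower().split())
    output_terms = set(output.lower().split())
    return len(expected_terms & output_terms) / max(len(expected_terms), 1)


Eval(
    "support-chatbot",  # Braintrust project name (placeholder)
    data=init_dataset(project="support-chatbot", name="golden-examples"),
    task=lambda input: chatbot_engine(input),   # generate the actual chatbot answer
    scores=[Factuality, general_direction],     # built-in metric + custom check
)
```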
The simplicity of this implementation is actually a strength – it makes the evaluation framework more accessible to multiple teams and easier to maintain.
## Conclusion
Hostinger's LLM evaluation framework represents a thoughtful early-stage approach to LLMOps evaluation. By combining automated metrics, human review, user feedback, and tight integration with their development workflow, they've laid the groundwork for systematic improvement of their customer support chatbots. The adoption of a third-party platform (Braintrust) provides structure and tooling, while the emphasis on organic growth of golden examples and cross-team adoption suggests a pragmatic, iterative approach to building evaluation capabilities. As with any early-stage initiative, the proof will be in how well this framework scales and evolves as the team's needs grow.