Hostinger's AI team developed a systematic approach to LLM evaluation for their chatbots, implementing a framework that combines offline development testing against golden examples with continuous production monitoring. The solution integrates Braintrust as a third-party tool to automate evaluation workflows, incorporating both automated metrics and human feedback. This framework enables teams to measure improvements, track performance, and identify areas for enhancement through a combination of programmatic testing and user-feedback analysis.
Hostinger, a web hosting company, has developed an LLM evaluation framework to systematically assess and improve their customer support chatbots. This case study, presented by a member of Hostinger’s AI team, provides insight into their approach to LLMOps, specifically focusing on the evaluation component of the LLM lifecycle. The framework is designed to support multiple teams across the organization and represents an early-stage but thoughtful approach to production LLM quality assurance.
The Hostinger AI team faced a common challenge in production LLM systems: how to reliably evaluate chatbot quality both during development and in production. They identified three specific goals for their evaluation framework:
This reflects a mature understanding of the LLMOps lifecycle, recognizing that evaluation is not a one-time activity but an ongoing process that connects development to production and back again.
The evaluation framework consists of several interconnected components that together provide a comprehensive view of chatbot quality.
At the core of the evaluation approach is the concept of “golden examples” – carefully curated pairs of inputs and expected ideal outputs that represent the kind of responses the chatbot should provide. These serve as benchmarks against which actual chatbot responses are compared. The speaker emphasizes that these golden examples will “organically grow” over time as the team encounters new scenarios and edge cases, suggesting an iterative approach to building evaluation datasets rather than trying to create a comprehensive set upfront.
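The golden-example idea can be sketched in a few lines of Python. The dataclass, the sample questions and answers, and the `evaluate` helper below are all illustrative assumptions, not Hostinger's actual code:

```python
from dataclasses import dataclass

@dataclass
class GoldenExample:
    """A curated input paired with the ideal chatbot response."""
    input: str
    expected: str

# Hypothetical seed set; in practice this grows organically as the team
# encounters new scenarios and edge cases.
GOLDEN_EXAMPLES = [
    GoldenExample(
        input="How do I point my domain to my hosting account?",
        expected="Update the nameservers at your registrar to the ones shown in your hosting control panel.",
    ),
    GoldenExample(
        input="Can I upgrade my hosting plan without downtime?",
        expected="Yes, plan upgrades are applied in place and do not take your site offline.",
    ),
]

def evaluate(chatbot, examples):
    """Run the chatbot over every golden example and collect
    (input, expected, actual) triples for downstream scoring."""
    return [(ex.input, ex.expected, chatbot(ex.input)) for ex in examples]
```

Each triple can then be fed to automated scorers or surfaced for human review, so the same dataset serves both development-time benchmarking and later regression testing.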
The framework employs multiple automated checks and metrics to evaluate chatbot outputs:
These metrics leverage LLMs themselves to evaluate outputs, a technique sometimes called “LLM-as-judge.” The speaker notes there is “a lot of flexibility to kind of ask an LLM to evaluate whatever way we want,” indicating they’re not limited to predefined metrics but can create custom evaluation criteria as needed.
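A minimal LLM-as-judge scorer might look like the following sketch. The prompt wording, the 1-5 rating scale, and the `call_llm` callable are assumptions for illustration, not Hostinger's actual implementation:

```python
import json

JUDGE_PROMPT = """You are evaluating a customer-support chatbot.
Question: {question}
Ideal answer: {expected}
Actual answer: {actual}
Rate the actual answer from 1 (wrong) to 5 (matches the ideal answer)
and reply as JSON: {{"score": <int>, "reason": "<short explanation>"}}"""

def llm_judge(question, expected, actual, call_llm):
    """Score one chatbot response using an LLM as the judge.

    `call_llm` is any callable that takes a prompt string and returns the
    model's text completion (e.g. a thin wrapper around an API client).
    Returns a score normalized to 0-1 plus the judge's explanation."""
    prompt = JUDGE_PROMPT.format(question=question, expected=expected, actual=actual)
    verdict = json.loads(call_llm(prompt))
    # Normalize the 1-5 rating to 0-1 so it can be averaged with other metrics.
    return (verdict["score"] - 1) / 4, verdict["reason"]
```

Because the criteria live in the prompt, swapping in a different rubric (tone, policy compliance, factual grounding) is just a prompt change, which is the flexibility the speaker alludes to.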
Recognizing the limitations of purely automated evaluation, the framework incorporates human judgment in multiple ways:
This multi-signal approach to evaluation is a best practice in LLMOps, as automated metrics may miss nuances that humans catch, while human review alone doesn’t scale to the volume of interactions a production chatbot handles.
Hostinger adopted Braintrust, a third-party evaluation platform, to help automate and orchestrate their evaluation workflow. The speaker describes Braintrust as helping them “add more functionality” to their evaluation process. Based on the demo described, Braintrust provides:
It’s worth noting that while the speaker is positive about Braintrust, this is a relatively early implementation (“first iteration”), and the long-term effectiveness of this tooling choice remains to be proven. The LLM evaluation tool landscape is evolving rapidly, and there are multiple alternatives in this space.
A key aspect of the framework is its integration into the development workflow. The evaluation can be run programmatically as part of GitHub Actions, meaning that changes to the chatbot can be automatically evaluated before deployment. The speaker describes a workflow where:
This creates a feedback loop where production insights drive development priorities, and development changes are validated before reaching production. This is a solid LLMOps practice that helps catch regressions and ensures continuous improvement.
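Such a pre-deployment gate can be sketched as a small check that a GitHub Actions job runs against the evaluation results. The threshold and function below are hypothetical, not taken from the talk:

```python
# Hypothetical quality bar: block the merge if the average score over the
# golden examples drops below this threshold.
MIN_AVG_SCORE = 0.8

def gate(scores):
    """Return the exit code CI should use: 0 passes the check, 1 blocks the merge."""
    avg = sum(scores) / len(scores)
    print(f"average eval score: {avg:.2f} over {len(scores)} golden examples")
    return 0 if avg >= MIN_AVG_SCORE else 1
```

A workflow step would run the evaluation suite against the candidate chatbot build, pass the resulting scores to `gate`, and call `sys.exit()` with its return value so that a regression fails the pull-request check before the change reaches production.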
The speaker mentions a goal of making this framework usable by “multiple teams” at Hostinger for “different sorts of services.” This suggests an approach to LLMOps infrastructure as a shared platform, which can help ensure consistent evaluation practices across the organization and reduce duplicated effort. However, it’s early days, and the speaker acknowledges the framework “will evolve” from this starting point.
Several aspects of Hostinger’s evaluation framework represent LLMOps best practices:
As an early-stage implementation presented in a demo format, there are aspects that are not fully addressed:
From the code snippet described, the implementation appears relatively straightforward:
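The snippet itself is not reproduced here, but a harness in that spirit, mirroring the data/task/scores shape of Braintrust's `Eval` API, might look like this sketch (all names below are illustrative):

```python
def run_eval(data, task, scores):
    """Minimal eval harness in the spirit of Eval(data, task, scores):
    run `task` on each example and apply every scorer to the output."""
    results = []
    for example in data:
        output = task(example["input"])
        results.append({
            "input": example["input"],
            "output": output,
            "scores": {name: fn(output, example["expected"]) for name, fn in scores.items()},
        })
    return results

# Illustrative scorer: crude exact-match; a real setup would use
# LLM-as-judge scorers alongside or instead of this.
def exact_match(output, expected):
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0
```

Everything reduces to three pluggable pieces (a dataset, a task function, and a dictionary of scorers), which is what keeps the harness easy for other teams to pick up.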
The simplicity of this implementation is actually a strength – it makes the evaluation framework more accessible to multiple teams and easier to maintain.
Hostinger’s LLM evaluation framework represents a thoughtful early-stage approach to LLMOps evaluation. By combining automated metrics, human review, user feedback, and tight integration with their development workflow, they’ve laid the groundwork for systematic improvement of their customer support chatbots. The adoption of a third-party platform (Braintrust) provides structure and tooling, while the emphasis on organic growth of golden examples and cross-team adoption suggests a pragmatic, iterative approach to building evaluation capabilities. As with any early-stage initiative, the proof will be in how well this framework scales and evolves as the team’s needs grow.
This comprehensive case study examines how multiple enterprises (Autodesk, KPMG, Canva, and Lightspeed) are deploying AI agents in production to transform their go-to-market operations. The companies faced challenges around scaling AI from proof-of-concept to production, managing agent quality and accuracy, and driving adoption across diverse teams. Using the Relevance AI platform, these organizations built multi-agent systems for use cases including personalized marketing automation, customer outreach, account research, data enrichment, and sales enablement. Results include significant time savings (tasks taking hours reduced to minutes), improved pipeline generation, increased engagement rates, faster customer onboarding, and the successful scaling of AI agents across multiple departments while maintaining data security and compliance standards.
AI practitioners Aishwarya Raanti and Kiti Bottom, who have collectively supported over 50 AI product deployments across major tech companies and enterprises, present their framework for successfully building AI products in production. They identify that building AI products differs fundamentally from traditional software due to non-determinism on both the input and output sides, and the agency-control tradeoff inherent in autonomous systems. Their solution is a phased approach called Continuous Calibration/Continuous Development (CCCD), which recommends starting with high human control and low AI agency, then gradually increasing autonomy as trust is built through behavior calibration. This iterative methodology, combined with a balanced approach to evaluation metrics and production monitoring, has helped companies avoid common pitfalls like premature full automation, inadequate reliability, and user trust erosion.
Abundly.ai developed an AI agent platform that enables companies to deploy autonomous AI agents as digital colleagues. The company evolved from experimental hobby projects to a production platform serving multiple industries, addressing challenges in agent lifecycle management, guardrails, context engineering, and human-AI collaboration. The solution encompasses agent creation, monitoring, tool integration, and governance frameworks, with successful deployments in media (SVT journalist agent), investment screening, and business intelligence. Results include 95% time savings in repetitive tasks, improved decision quality through diligent agent behavior, and the ability for non-technical users to create and manage agents through conversational interfaces and dynamic UI generation.