AI-Powered Customer Feedback Analysis System for Container Shipping

Hapag-Lloyd 2026

Hapag-Lloyd, a global container shipping company, transformed their manual and time-consuming customer feedback analysis process into an automated AI-powered system using Amazon Bedrock. Previously, product managers spent hours or days manually categorizing sentiment and themes from hundreds of feedback comments exported as CSV files. The new solution automatically ingests customer feedback, performs sentiment classification using Claude Sonnet 4.6, generates embeddings, indexes data in OpenSearch, and provides stakeholders with interactive dashboards and an AI chatbot for natural language queries. The system now processes over 15,000 feedback items monthly with 95% accuracy on sentiment classification, enabling teams to move from insight to action within days instead of weeks, and has already driven measurable improvements in product decisions and user satisfaction.

Overview

Hapag-Lloyd, one of the world’s leading container shipping companies with 313 ships, 2.5 million TEU capacity, and approximately 14,000 employees across 140 countries, implemented a production AI system to automate customer feedback analysis. Their Digital Customer Experience and Engineering team, distributed between Hamburg and Gdańsk, faced a significant operational challenge: manually analyzing customer feedback was taking hours to days, especially before review ceremonies when product managers had to export CSV files, read through hundreds of comments, and manually categorize sentiment and themes. This manual process, while valuable for product decisions, was repetitive, time-consuming, and difficult to scale.

The team developed a comprehensive generative AI solution using Amazon Bedrock as the foundation, combined with OpenSearch for vector storage and retrieval, and orchestrated using LangChain and LangGraph frameworks. This case study represents a well-documented example of moving from manual workflows to production AI systems, with specific attention to orchestration, data management, responsible AI practices, and monitoring.

Architecture and Production Infrastructure

The production system is deployed using AWS CloudFormation, demonstrating infrastructure-as-code practices crucial for LLMOps. The architecture consists of several interconnected components working together in a daily operational workflow.

The feedback collection layer captures user ratings and text comments from web and mobile applications serving hundreds of thousands of customers monthly. This represents the data ingestion point for the LLM pipeline. An AWS Lambda function executes once daily to fetch new feedback entries from the feedback repository into Amazon S3, establishing a batch processing pattern rather than real-time streaming. This design choice reflects practical considerations around cost, complexity, and actual business needs—feedback analysis doesn’t require millisecond latency but does require consistent, reliable processing.
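
As a rough illustration of this ingestion step, a daily Lambda handler along the following lines would stage new feedback in S3 for downstream processing. This is a minimal sketch under assumptions: the feedback repository's API, the environment variable names, and the record format are all hypothetical, since the case study does not specify them.

```python
import json
import os
from datetime import datetime, timezone

import boto3
import urllib3

s3 = boto3.client("s3")
http = urllib3.PoolManager()

# Hypothetical endpoint and bucket names; the case study does not specify them.
FEEDBACK_API = os.environ["FEEDBACK_API_URL"]
BUCKET = os.environ["FEEDBACK_BUCKET"]


def handler(event, context):
    """Runs once daily: fetch new feedback entries and stage them in S3."""
    resp = http.request("GET", f"{FEEDBACK_API}/feedback?window=1d")
    records = json.loads(resp.data)

    # Partition by date so each daily batch lands under its own prefix.
    day = datetime.now(timezone.utc).strftime("%Y/%m/%d")
    key = f"raw-feedback/{day}/feedback.jsonl"
    body = "\n".join(json.dumps(r) for r in records)
    s3.put_object(Bucket=BUCKET, Key=key, Body=body.encode("utf-8"))
    return {"ingested": len(records), "s3_key": key}
```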

The processing pipeline then categorizes feedback using Amazon Bedrock for sentiment detection. Each comment is classified as positive, negative, mixed, or neutral through LLM inference calls. The processed records are indexed in Amazon OpenSearch Service, which serves as both a full-text search engine and a vector database. This dual-use pattern is increasingly common in production LLM systems, where semantic search capabilities need to coexist with traditional keyword-based queries.
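
The classification call itself could look something like the following sketch using the Bedrock Converse API via boto3. The inference profile ID shown is an assumption based on the EU cross-region naming convention, and the prompt wording is illustrative rather than taken from the case study.

```python
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="eu-central-1")

# Assumed cross-region inference profile ID; the exact identifier is not
# given in the case study.
MODEL_ID = "eu.anthropic.claude-sonnet-4-6"

LABELS = {"positive", "negative", "mixed", "neutral"}


def classify_sentiment(comment: str) -> str:
    """Classify one feedback comment into the four categories used by the pipeline."""
    response = bedrock.converse(
        modelId=MODEL_ID,
        messages=[{
            "role": "user",
            "content": [{"text": (
                "Classify the sentiment of this customer feedback as exactly one of: "
                "positive, negative, mixed, neutral. Reply with the label only.\n\n"
                f"Feedback: {comment}"
            )}],
        }],
        inferenceConfig={"maxTokens": 10, "temperature": 0.0},
    )
    label = response["output"]["message"]["content"][0]["text"].strip().lower()
    return label if label in LABELS else "neutral"  # fall back on unexpected output
```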

Model Selection and Orchestration

The team selected Claude Sonnet 4.6 as their primary model, accessed through Amazon Bedrock. The choice is explicitly justified in the case study: Claude Sonnet 4.6 offers “frontier performance across coding and agentic workflows” and “excels in multi-turn conversational exchanges and agentic workflows.” For a chatbot that requires reliable performance across both single and multi-turn interactions with stakeholders, these characteristics align well with requirements. The model’s “precise workflow management capabilities” and ability to serve in both lead agent and subagent roles made it suitable for their multi-agent architecture.

An interesting production consideration is their use of Cross-Region Inference Service (CRIS) endpoints to manage traffic bursts by distributing compute across multiple EU AWS Regions. This demonstrates awareness of availability and resilience patterns critical for production LLM deployments. The geographic distribution keeps data within EU boundaries while providing elasticity during peak usage periods.

Orchestration is handled through two frameworks working in concert. LangChain provides modular, reusable components for calling models, transforming data, and integrating with external systems like OpenSearch. The integration with Amazon Bedrock models is straightforward using LangChain’s native support. For the internal chatbot, LangGraph implements a multi-agent architecture where each assistant is defined declaratively with its own logic and tools. This design choice moves away from rigid step-by-step flows toward an agent-based approach where the LLM dynamically selects appropriate tools and actions to answer queries. This architectural pattern represents modern LLMOps thinking—building flexible, composable systems rather than hardcoded pipelines.
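
The case study does not include the agent code, but the tool-selecting pattern it describes maps naturally onto LangGraph's prebuilt ReAct-style agent. In the sketch below, the search_feedback tool and the model identifier are illustrative assumptions:

```python
from langchain_aws import ChatBedrockConverse
from langchain_core.tools import tool
from langgraph.prebuilt import create_react_agent


@tool
def search_feedback(query: str) -> str:
    """Search the OpenSearch feedback index for comments matching a query."""
    # Placeholder: in the real system this would run a full-text or k-NN
    # query against the OpenSearch index.
    return f"[top feedback hits for: {query}]"


llm = ChatBedrockConverse(model="eu.anthropic.claude-sonnet-4-6")  # assumed profile ID

# The agent decides at runtime whether and how to call its tools,
# rather than following a hardcoded step-by-step pipeline.
agent = create_react_agent(llm, tools=[search_feedback])

result = agent.invoke(
    {"messages": [("user", "What pain points do customers mention most often?")]}
)
print(result["messages"][-1].content)
```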

The code example provided shows how guardrails are applied directly through LangChain configuration during model initialization, demonstrating practical integration patterns that other practitioners could adopt.
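
That example is not reproduced in this write-up, but the pattern, as supported by langchain_aws, looks roughly like the sketch below; the guardrail identifier and version are placeholders for values exported by the CloudFormation stack.

```python
from langchain_aws import ChatBedrock

# Placeholder guardrail ID/version; in the real system these come from
# the CloudFormation stack that defines the guardrail.
llm = ChatBedrock(
    model_id="eu.anthropic.claude-sonnet-4-6",  # assumed CRIS profile ID
    region_name="eu-central-1",
    guardrails={
        "guardrailIdentifier": "gr-feedback123",
        "guardrailVersion": "1",
        "trace": True,  # include guardrail trace output for observability
    },
)

# Every call through this client now passes through the guardrail on
# both input and output.
print(llm.invoke("Summarize this week's negative feedback themes.").content)
```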

Data Management and Processing Workflow

The daily Lambda function represents a critical operational component. It fetches feedback data, triggers sentiment classification through Bedrock, and manages the indexing process into OpenSearch. While the case study doesn’t provide granular details on error handling, retry logic, or data validation steps, these would be necessary considerations for production reliability. The batch processing approach simplifies some operational concerns compared to real-time streaming but introduces different challenges around processing failures and recovery.

The system processes over 15,000 feedback items monthly, which translates to roughly 500 items per day on average. At this scale, the batch processing approach is reasonable, though the team would need to consider scaling strategies if volume increases significantly. The case study reports 95% accuracy for sentiment classification on a labeled test dataset, which suggests the team has implemented proper evaluation practices including maintaining labeled ground truth data for ongoing validation.
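
Maintaining that labeled set makes regression checking straightforward. A minimal evaluation harness, reusing the classify_sentiment sketch from earlier and assuming a hypothetical JSONL test file where each line holds a "comment" and its gold "label", could be:

```python
import json


def evaluate(test_path: str) -> float:
    """Replay a labeled test set through the classifier and report accuracy."""
    with open(test_path) as f:
        examples = [json.loads(line) for line in f]

    correct = sum(
        1 for ex in examples
        if classify_sentiment(ex["comment"]) == ex["label"]
    )
    return correct / len(examples)


accuracy = evaluate("sentiment_testset.jsonl")  # hypothetical file name
print(f"sentiment accuracy: {accuracy:.2%}")
```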

OpenSearch serves as both the operational data store and the knowledge base for the chatbot. Product managers and support teams can access real-time feedback insights through OpenSearch Dashboards, which provide visualizations of sentiment distribution, rating scores, feedback volume, and trends over time. The dashboards support filtering by multiple dimensions including time period, sentiment, product version, and application features. This enables targeted analysis—for example, visualizing how sentiment around a specific app update changed week over week, or drilling into negative comments about particular features.
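
A single OpenSearch index can back both query styles. The mapping sketch below, using the opensearch-py client, is an assumption about the schema: the case study does not publish its field names, and the embedding dimension depends on whichever embedding model the team uses.

```python
from opensearchpy import OpenSearch

# Endpoint is a placeholder; authentication (e.g., SigV4) omitted for brevity.
client = OpenSearch(
    hosts=[{"host": "search-feedback.eu-central-1.es.amazonaws.com", "port": 443}],
    use_ssl=True,
)

# One index backs both keyword search (text/keyword fields) and semantic
# search (knn_vector field). Field names and the 1024-dim embedding size
# are assumptions, not details from the case study.
client.indices.create(
    index="feedback",
    body={
        "settings": {"index.knn": True},
        "mappings": {
            "properties": {
                "comment": {"type": "text"},
                "sentiment": {"type": "keyword"},
                "rating": {"type": "integer"},
                "created_at": {"type": "date"},
                "embedding": {"type": "knn_vector", "dimension": 1024},
            }
        },
    },
)
```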

Multi-Agent Chatbot Architecture

The internal chatbot, built with LangGraph, queries the OpenSearch index as its knowledge base and allows stakeholders to ask natural language questions like “What pain points do customers mention most often?” This represents a retrieval-augmented generation (RAG) pattern, though the case study doesn’t explicitly describe the retrieval strategy, chunk sizes, or how context is assembled for the LLM. Product managers and support teams use the chatbot for interactive exploration of feedback, which represents a significant shift from static reports to dynamic, conversational analytics.
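
Since the retrieval details aren't documented, the sketch below shows one common way to implement that step over the feedback index sketched in the previous section: embed the question, run a k-NN query, and concatenate the hits into context for the LLM. The embedding model ID is an assumption.

```python
import json

import boto3

# Reuses the OpenSearch `client` from the index sketch above.
bedrock = boto3.client("bedrock-runtime", region_name="eu-central-1")


def retrieve_context(question: str, k: int = 5) -> str:
    """Embed the question and pull the k most similar feedback comments."""
    # Embedding model is an assumption; the case study doesn't name one.
    emb = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=json.dumps({"inputText": question}),
    )
    vector = json.loads(emb["body"].read())["embedding"]

    hits = client.search(
        index="feedback",
        body={"size": k, "query": {"knn": {"embedding": {"vector": vector, "k": k}}}},
    )["hits"]["hits"]

    # Concatenate retrieved comments into the context block handed to the LLM.
    return "\n".join(h["_source"]["comment"] for h in hits)
```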

The multi-agent design using LangGraph suggests that different types of queries might be routed to specialized agents with different tool access or processing logic. While the case study doesn’t detail the specific agents implemented, this architecture pattern enables composability and separation of concerns—agents focused on sentiment analysis might differ from those handling trend detection or comparative analysis across time periods.

Responsible AI and Guardrails

Amazon Bedrock Guardrails are applied at multiple points in the pipeline to enforce content moderation policies and ensure responses align with brand and compliance standards. The case study provides concrete CloudFormation infrastructure-as-code examples showing how guardrails are defined declaratively. The configuration blocks harmful content categories including hate speech, insults, sexual content, violence, misconduct, and prompt attacks. Each category is configured with input/output strength settings (HIGH in most cases) and output actions (BLOCK).
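
The CloudFormation snippets themselves are not reproduced here, but the same content policy can be expressed through Bedrock's control-plane API. A boto3 sketch of an equivalent guardrail definition, with an illustrative name and blocked-message text:

```python
import boto3

bedrock_ctl = boto3.client("bedrock", region_name="eu-central-1")

content_filters = [
    {"type": t, "inputStrength": "HIGH", "outputStrength": "HIGH"}
    for t in ["HATE", "INSULTS", "SEXUAL", "VIOLENCE", "MISCONDUCT"]
]
# Prompt-attack detection applies only to input, so its output strength must be NONE.
content_filters.append(
    {"type": "PROMPT_ATTACK", "inputStrength": "HIGH", "outputStrength": "NONE"}
)

guardrail = bedrock_ctl.create_guardrail(
    name="feedback-guardrail",  # illustrative name
    contentPolicyConfig={"filtersConfig": content_filters},
    wordPolicyConfig={"managedWordListsConfig": [{"type": "PROFANITY"}]},
    blockedInputMessaging="This request cannot be processed.",
    blockedOutputsMessaging="The response was blocked by policy.",
)
print(guardrail["guardrailId"], guardrail["version"])
```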

Importantly, the team also validates raw user input programmatically before passing it to the LLM, implementing a defense-in-depth strategy against prompt injection and misuse. The validation function calls the apply_guardrail API with the user’s question and checks whether the guardrail intervened. If blocked, the system logs details including user_id and the question itself for audit purposes. This proactive input validation prevents problematic prompts from ever reaching the model, saving inference costs and reducing risk.
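
A sketch of that validation step, using the real apply_guardrail API but with assumed function and logger names:

```python
import logging

import boto3

logger = logging.getLogger("chatbot.guardrails")
bedrock_rt = boto3.client("bedrock-runtime", region_name="eu-central-1")


def validate_user_input(user_id: str, question: str,
                        guardrail_id: str, guardrail_version: str) -> bool:
    """Return True if the question may be sent to the LLM, False if blocked."""
    result = bedrock_rt.apply_guardrail(
        guardrailIdentifier=guardrail_id,
        guardrailVersion=guardrail_version,
        source="INPUT",  # validate the user's question before inference
        content=[{"text": {"text": question}}],
    )
    if result["action"] == "GUARDRAIL_INTERVENED":
        # Log for audit, as described in the case study.
        logger.warning("guardrail blocked input", extra={
            "user_id": user_id, "question": question,
        })
        return False
    return True
```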

The guardrail configuration includes managed word lists for profanity and filters for various content types. The explicit inclusion of a PROMPT_ATTACK filter (configured with output strength NONE, since prompt-attack detection applies only to model input) suggests awareness of adversarial input patterns. However, it's worth noting that guardrails, while valuable, are not foolproof: determined users may find ways to bypass filters, and their effectiveness depends on the sophistication of the guardrail implementation and ongoing tuning based on observed attack patterns.

Automated Reporting and Integration with Product Workflows

Beyond the interactive chatbot and dashboards, the system generates biweekly automated reports. A second Lambda function aggregates and analyzes feedback trends over the two-week period, generates a report with key metrics, highlights, and sentiment breakdowns, and delivers it automatically to product managers and product owners. This automation directly feeds into sprint planning and roadmap discussions, creating a tight integration between AI-generated insights and product development workflows.
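
The aggregation-plus-summarization core of such a report Lambda might look like the sketch below, which reuses the OpenSearch client and Bedrock names from the earlier sketches; the delivery step (for example, email via Amazon SES) is omitted, and all field names are assumptions.

```python
def build_biweekly_report() -> str:
    """Aggregate two weeks of feedback and ask the model for a narrative summary."""
    buckets = client.search(
        index="feedback",
        body={
            "size": 0,
            "query": {"range": {"created_at": {"gte": "now-14d/d"}}},
            "aggs": {"by_sentiment": {"terms": {"field": "sentiment"}}},
        },
    )["aggregations"]["by_sentiment"]["buckets"]

    breakdown = ", ".join(f'{b["key"]}: {b["doc_count"]}' for b in buckets)
    response = bedrock.converse(
        modelId=MODEL_ID,
        messages=[{"role": "user", "content": [{"text": (
            "Write a brief feedback report for product managers covering the "
            f"last two weeks. Sentiment breakdown: {breakdown}. Include key "
            "metrics, highlights, and notable trends."
        )}]}],
    )
    return response["output"]["message"]["content"][0]["text"]
```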

The case study provides specific examples of how these insights drove product decisions. The “Preview” functionality in Shipping Instructions was prioritized directly in response to high volumes of negative feedback about the lack of preview capability. After release, AI-driven reports tracked user reactions in detail, showing that feedback related to this feature shifted positively as the core user need was addressed. Another example is the ability to upload cargo data via Excel files, a feature repeatedly highlighted by AI recommendations, which is now available and expected to reduce manual effort for large shipments.

These examples demonstrate measurable impact—the feedback loop closes when AI insights inform decisions, features are built, and subsequent feedback validates (or challenges) those decisions. The team also created new OpenSearch-based dashboards to help verify and analyze user-reported issues, showing how the AI system’s outputs spawn additional operational tooling.

Monitoring and Observability

The production system incorporates monitoring through Amazon CloudWatch, which collects raw data and processes it into near real-time metrics. Model invocation logging is enabled to capture invocation logs, model input data, and model output data for all calls. This provides visibility into full request data, response data, and metadata associated with inference calls—critical for debugging, performance optimization, and cost management.
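
Invocation logging is an account-level setting enabled through the Bedrock control-plane API (or the console). A boto3 sketch with placeholder log group and role:

```python
import boto3

bedrock_ctl = boto3.client("bedrock", region_name="eu-central-1")

# Log group and IAM role are placeholders; both must exist beforehand.
bedrock_ctl.put_model_invocation_logging_configuration(
    loggingConfig={
        "cloudWatchConfig": {
            "logGroupName": "/bedrock/feedback-analysis",
            "roleArn": "arn:aws:iam::123456789012:role/BedrockLoggingRole",
        },
        "textDataDeliveryEnabled": True,       # capture prompts and completions
        "embeddingDataDeliveryEnabled": True,  # capture embedding inputs/outputs
        "imageDataDeliveryEnabled": False,
    }
)
```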

Amazon Bedrock also integrates with AWS CloudTrail to capture API calls as events. This generates an audit trail and insights that can be used for further optimization, such as improving response latency or reducing costs. While the case study doesn’t detail specific monitoring dashboards or alerts configured, these logging and tracing capabilities form the foundation for observability practices essential to production LLM systems.

The case study doesn’t mention specific metrics tracked beyond the 95% sentiment classification accuracy, but one would expect monitoring of latency distributions, error rates, token consumption, cost per feedback item processed, and potentially custom business metrics like time-to-insight or report generation success rates.

Critical Assessment and LLMOps Maturity

This case study demonstrates several hallmarks of mature LLMOps practices: infrastructure as code through CloudFormation, modular orchestration with LangChain and LangGraph, responsible AI controls with guardrails, comprehensive logging and monitoring, automated batch processing, and tight integration with business workflows through automated reporting. The team’s explicit framing of their journey toward becoming “AI-native” suggests organizational commitment beyond this single use case.

However, several areas warrant closer examination. The case study doesn’t discuss prompt engineering practices, version control for prompts, or how prompt iterations are tested before deployment. It’s unclear whether A/B testing is performed on different prompt strategies or model configurations. The evaluation methodology, while mentioned (95% accuracy on a labeled test dataset), isn’t described in detail—how large is the test set, how is it maintained, and how often is it refreshed?

The batch processing approach, while appropriate for the current scale, might introduce challenges if real-time feedback analysis becomes a requirement. The daily Lambda execution means insights lag by up to 24 hours, which may be acceptable for strategic product decisions but less suitable for operational incident response.

Cost management isn’t discussed—with over 15,000 feedback items monthly requiring sentiment classification and embedding generation, plus interactive chatbot queries and biweekly report generation, the inference costs could be substantial. Monitoring and optimizing these costs would be an important operational concern.

The team’s use of OpenSearch for both full-text and vector search is pragmatic, but the case study doesn’t detail embedding strategies, retrieval quality metrics, or how semantic search performance is validated. In RAG systems, retrieval quality directly impacts final output quality, so this would be a critical area for ongoing measurement and tuning.

Broader AI Strategy and Future Direction

Hapag-Lloyd positions this feedback analysis solution as one example within a broader AI-Native Umbrella Program, which serves as a single source of truth for AI adoption across the organization. Their next focus is establishing a shared, robust AI foundation with Amazon Bedrock, providing standardized infrastructure, security, and guardrails that enable every role—engineering, product, delivery, UX/design, and operations/support—to create their own AI “spaces” safely and independently while accessing best-in-class foundation models.

This strategic vision reflects mature thinking about organizational AI adoption: rather than one-off projects, building reusable platforms and guardrails that enable safe experimentation at scale. The emphasis on lowering barriers to experimentation while maintaining consistency, responsibility, and scalability demonstrates awareness of the tension between innovation velocity and governance that many organizations face with generative AI.

Production Impact and Business Outcomes

The quantitative and qualitative impacts reported are significant. Teams that previously spent hours reviewing raw feedback now receive structured summaries in seconds. Decision cycles compressed from weeks to days. In several areas, actions based on AI insights resulted in more positive comments and reduced negative feedback. The Preview functionality example shows a complete loop: negative feedback identified by AI, feature prioritized and built, AI reports validated improvement.

During review sessions, stakeholders see top positive and negative comments in real time alongside AI-generated recommendations, creating more informed and productive discussions. The reduction in manual effort for repetitive analysis work frees product managers to focus on strategy, innovation, and user experience improvements rather than data wrangling.

That said, the case study is published on the AWS blog and co-authored by AWS solution architects alongside Hapag-Lloyd engineers, which means it serves partly as promotional content for AWS services. While there’s no reason to doubt the reported outcomes, the presentation emphasizes successes and doesn’t discuss challenges, failures, or limitations encountered during development and deployment. Real-world LLM projects invariably face issues with prompt reliability, hallucinations, retrieval quality, cost overruns, or user adoption challenges that aren’t surfaced in this account.

Conclusion

This case study represents a well-executed production deployment of LLMs for automating analytical workflows at meaningful scale. Hapag-Lloyd demonstrates practical application of modern LLMOps practices including model orchestration with LangChain/LangGraph, responsible AI controls through guardrails, automated batch processing, integration with vector databases, comprehensive monitoring, and infrastructure as code. The tight integration with business processes through automated reporting and the chatbot interface shows how AI systems can embed into operational workflows rather than existing as standalone experiments. The team’s broader vision for an AI-native organization with shared infrastructure and governance suggests this deployment is part of a more ambitious transformation, which will be interesting to track as it matures.
