## Overview
ZURU Tech represents an ambitious case study in applying large language models to complex spatial and architectural design problems. The company operates in the construction technology space with their Dreamcatcher platform, which aims to democratize building design by allowing users of any technical background to collaborate in the design and construction process. The core challenge they addressed was developing a text-to-floor plan generation system that could translate natural language descriptions like "Create a house with three bedrooms, two bathrooms, and an outdoor space for entertainment" into accurate, mathematically correct floor plans within their 3D design environment.
This case study is particularly valuable from an LLMOps perspective because it demonstrates the evolution from initial experimentation to production-ready systems, showcasing both the technical challenges of deploying LLMs for specialized domains and the operational considerations required to maintain and improve such systems over time. The collaboration with AWS Generative AI Innovation Center and AWS Professional Services also illustrates how organizations can leverage cloud provider expertise to accelerate their AI adoption journey.
## Technical Architecture and LLMOps Implementation
The project demonstrates sophisticated LLMOps practices through its multi-faceted approach to model optimization and deployment. The team implemented two primary optimization strategies: prompt engineering using Amazon Bedrock and fine-tuning using Amazon SageMaker, each requiring different operational considerations and infrastructure approaches.
For the prompt engineering approach, ZURU built a comprehensive architecture leveraging multiple AWS services in a serverless, API-driven configuration. The system uses Amazon Bedrock as the primary inference endpoint, specifically utilizing Mistral 7B for feature extraction tasks and Claude 3.5 Sonnet for the main floor plan generation. This multi-model approach reflects mature LLMOps thinking, where different models are optimized for specific subtasks rather than attempting to use a single model for all operations.
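A minimal sketch of this split-model pattern using the Bedrock Converse API is shown below. The model IDs, prompt wording, and inference parameters are illustrative assumptions, not ZURU's published configuration.

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

# Model IDs are illustrative; the exact versions ZURU deployed are not public.
FEATURE_MODEL_ID = "mistral.mistral-7b-instruct-v0:2"               # lightweight feature extraction
GENERATION_MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"   # main floor plan generation

def invoke(model_id: str, prompt: str) -> str:
    """Send a single-turn prompt to a Bedrock model and return the text reply."""
    response = bedrock.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": 2048, "temperature": 0.2},
    )
    return response["output"]["message"]["content"][0]["text"]

user_request = "Create a house with three bedrooms, two bathrooms, and an outdoor space."
features = invoke(FEATURE_MODEL_ID, f"Extract the room requirements from this request as JSON:\n{user_request}")
floor_plan = invoke(GENERATION_MODEL_ID, f"Generate a floor plan satisfying these requirements:\n{features}")
```

Routing the cheap extraction step to a smaller model and reserving the stronger reasoning model for generation is the essence of the pattern, whatever the exact models used.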
The architecture implements dynamic few-shot prompting, which represents an advanced LLMOps pattern that goes beyond static prompt templates. The system retrieves the most relevant examples at runtime based on input prompt characteristics, using Amazon Bedrock Knowledge Bases backed by Amazon OpenSearch Serverless for vector search capabilities. This approach requires careful operational consideration around latency, as the system must perform feature extraction, vector search, data retrieval, generation, and reflection in sequence while maintaining acceptable response times for end users.
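The retrieval step can be approximated with the Bedrock Knowledge Bases Retrieve API. The knowledge base ID, metadata fields, and prompt template below are hypothetical placeholders used only to illustrate the dynamic few-shot pattern.

```python
import boto3

kb_client = boto3.client("bedrock-agent-runtime")

def retrieve_examples(extracted_features: str, kb_id: str = "EXAMPLE_KB_ID", k: int = 3):
    """Vector search over the knowledge base for the k most similar stored requests."""
    response = kb_client.retrieve(
        knowledgeBaseId=kb_id,
        retrievalQuery={"text": extracted_features},
        retrievalConfiguration={"vectorSearchConfiguration": {"numberOfResults": k}},
    )
    # Each hit carries the matched text plus metadata (e.g. a key into the floor plan store).
    return [(hit["content"]["text"], hit.get("metadata", {})) for hit in response["retrievalResults"]]

def build_few_shot_prompt(user_request: str, examples) -> str:
    """Assemble a prompt whose examples are selected at runtime rather than hard-coded."""
    shots = "\n\n".join(
        f"Request: {text}\nFloor plan: {meta.get('floor_plan_id', '')}" for text, meta in examples
    )
    return f"{shots}\n\nRequest: {user_request}\nFloor plan:"
```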
The data architecture splits search and retrieval operations across purpose-built services, with Amazon OpenSearch handling low-latency search operations and Amazon DynamoDB providing key-value retrieval of actual floor plan data. This separation demonstrates mature system design thinking that considers both performance optimization and operational maintenance, as each service can be scaled and optimized independently based on its specific usage patterns.
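The retrieval side of that split can be sketched as a simple key-value lookup, assuming a hypothetical table name and key schema:

```python
import boto3

dynamodb = boto3.resource("dynamodb")
floor_plan_table = dynamodb.Table("floor-plans")  # hypothetical table name

def get_floor_plan(plan_id: str) -> dict:
    """Fetch the full floor plan document by the key returned from vector search."""
    result = floor_plan_table.get_item(Key={"plan_id": plan_id})  # hypothetical key schema
    return result.get("Item", {})
```

Keeping the bulky plan geometry out of the vector index keeps search payloads small, while DynamoDB serves the full documents with low-latency point reads.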
## Model Selection and Experimentation
The case study reveals a methodical approach to model selection that evolved from initial experimentation with generative adversarial networks to settling on LLM-based approaches. The team initially experimented with GPT-2 and found promising results, which informed their decision to pursue LLM-based solutions rather than traditional computer vision or GAN approaches. This decision process illustrates important LLMOps considerations around model architecture selection based on empirical results rather than theoretical preferences.
The prompt engineering experiments focused on Claude 3.5 Sonnet, chosen specifically for its strong benchmarks in deep reasoning and mathematical capabilities. This model selection demonstrates domain-specific considerations in LLMOps, where mathematical accuracy and spatial reasoning capabilities were critical success factors. The team implemented prompt decomposition, breaking complex tasks into smaller, manageable components, which not only improved accuracy but also created more maintainable and debuggable systems.
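The exact decomposition ZURU used is not published, but the pattern can be illustrated as a chain of narrower prompts, each consuming the previous stage's output. The model ID and stage prompts below are assumptions for illustration.

```python
import boto3

bedrock = boto3.client("bedrock-runtime")
MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"  # illustrative

def ask(prompt: str) -> str:
    response = bedrock.converse(
        modelId=MODEL_ID,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return response["output"]["message"]["content"][0]["text"]

request = "Create a house with three bedrooms, two bathrooms, and an outdoor space."

# Stage 1: enumerate the rooms and their purposes.
rooms = ask(f"List every room implied by this request as JSON:\n{request}")
# Stage 2: decide which rooms should be adjacent before any geometry is produced.
adjacency = ask(f"Given these rooms, propose an adjacency list as JSON:\n{rooms}")
# Stage 3: only now generate coordinates and dimensions for each room.
layout = ask(f"Produce wall coordinates and dimensions for this adjacency plan as JSON:\n{adjacency}")
```

Each stage can be inspected and tested in isolation, which is where the maintainability and debuggability benefit comes from.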
For fine-tuning experiments, the team compared full parameter fine-tuning against Low-Rank Adaptation (LoRA) approaches using Llama 3.1 8B and Llama 3 8B models. The full fine-tuning approach required significant computational resources, utilizing an ml.p4d.24xlarge instance with 320 GB of GPU memory and taking 25 hours to complete training with 200,000 samples over eight epochs. In contrast, the LoRA approach demonstrated operational efficiency, requiring only 2 hours of training time and producing a much smaller 89 MB checkpoint, though with some performance trade-offs.
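For context, a LoRA setup along these lines can be expressed in a few lines with Hugging Face PEFT. The base model ID, rank, and target modules below are illustrative defaults, not ZURU's reported hyperparameters.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Base model and hyperparameters are illustrative, not ZURU's reported settings.
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

lora_config = LoraConfig(
    r=16,                                  # low-rank dimension, small relative to the hidden size
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],   # adapt only the attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()      # only a small fraction of the 8B parameters are trainable
model.save_pretrained("lora-adapter/")  # saves just the adapter weights, on the order of tens of MB
```

The small adapter checkpoint is what makes the two-hour training runs and lightweight artifact management possible, at the cost of the accuracy gap discussed later.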
## Data Operations and Quality Management
The case study demonstrates sophisticated data operations practices that are critical for successful LLMOps implementations. The team gathered floor plans from thousands of houses from publicly available sources, but implemented a rigorous quality control process involving in-house architects. They built a custom application with a simple yes/no decision mechanism to accelerate the architectural review process while maintaining clear decision criteria.
The data preparation process included filtering out 30% of low-quality data by evaluating metric scores against ground truth datasets, removing data points that didn't achieve 100% accuracy on instruction adherence. This aggressive quality filtering approach improved training efficiency and model quality by more than 20%, demonstrating the critical importance of data quality in LLMOps workflows.
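In practice this kind of filter reduces to a single predicate over the scored dataset. The file and column names below are hypothetical.

```python
import pandas as pd

# Each row pairs a prompt with an annotated floor plan and an
# instruction-adherence score computed against the ground truth.
dataset = pd.read_json("floor_plan_samples.jsonl", lines=True)

# Keep only samples whose floor plan satisfies every instruction in the prompt.
clean = dataset[dataset["instruction_adherence"] == 1.0]
print(f"Dropped {1 - len(clean) / len(dataset):.0%} of samples")
```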
The team also addressed data leakage concerns through careful analysis of their dataset structure. They discovered that the dataset contained prompts that could match multiple floor plans and floor plans that could match multiple prompts. To prevent data leakage, they moved all related prompt and floor plan combinations to the same data split, ensuring robust evaluation and preventing artificially inflated performance metrics.
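One way to implement such a leakage-safe split is to treat prompts and floor plans as nodes of a bipartite graph and keep each connected component in a single split. The sketch below, using networkx and scikit-learn, is an assumed technique for illustration rather than ZURU's actual code.

```python
import networkx as nx
from sklearn.model_selection import GroupShuffleSplit

def assign_groups(pairs):
    """Map each (prompt_id, plan_id) link to the connected component it belongs to."""
    graph = nx.Graph()
    graph.add_edges_from((f"p:{p}", f"f:{f}") for p, f in pairs)
    group_of = {}
    for gid, component in enumerate(nx.connected_components(graph)):
        for node in component:
            group_of[node] = gid
    return [group_of[f"p:{p}"] for p, _ in pairs]

# Toy example: prompt p2 matches two plans, and plan f1 matches two prompts.
pairs = [("p1", "f1"), ("p2", "f1"), ("p2", "f2"), ("p3", "f3")]
groups = assign_groups(pairs)

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(pairs, groups=groups))
# All pairs sharing a component receive the same group ID, so none are split across train and test.
```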
## Evaluation Framework and Metrics
One of the most sophisticated aspects of this LLMOps implementation is the comprehensive evaluation framework developed to measure model performance across multiple dimensions. The framework focuses on two key criteria: semantic understanding (how well the model understands rooms, their purposes, and spatial relationships) and mathematical correctness (adherence to specific dimensions and floor space requirements).
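The published write-up does not include the metric definitions, but the two axes can be illustrated with simplified scoring functions over a structured representation of requested versus generated rooms; the tolerance and room schema are assumptions.

```python
def semantic_score(requested_rooms: dict, generated_rooms: dict) -> float:
    """Fraction of requested room types present with at least the requested count."""
    hits = sum(1 for room, count in requested_rooms.items()
               if generated_rooms.get(room, 0) >= count)
    return hits / len(requested_rooms)

def math_score(requested_areas: dict, generated_areas: dict, tol: float = 0.01) -> float:
    """Fraction of rooms whose generated area falls within tolerance of the request."""
    hits = sum(1 for room, area in requested_areas.items()
               if room in generated_areas
               and abs(generated_areas[room] - area) / area <= tol)
    return hits / len(requested_areas)

requested = {"bedroom": 3, "bathroom": 2}
generated = {"bedroom": 3, "bathroom": 2, "kitchen": 1}
assert semantic_score(requested, generated) == 1.0

requested_areas = {"bedroom": 12.0, "bathroom": 6.0}   # square metres, illustrative
generated_areas = {"bedroom": 12.0, "bathroom": 5.5}
math_score(requested_areas, generated_areas)           # 0.5: the bathroom misses the 1% tolerance
```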
The evaluation system implements prompt deduplication to identify and consolidate duplicate instructions in test datasets, reducing computational overhead and enabling faster iteration cycles. This operational efficiency consideration is crucial for maintaining development velocity in LLMOps environments where multiple experiments and iterations are common.
The framework uses a distribution-based performance assessment that evaluates a filtered set of unique test cases and relies on statistical analysis to keep that sample representative of the full dataset. This allows the team to project results across the full dataset while maintaining statistical validity, balancing evaluation thoroughness with computational efficiency.
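A compact sketch of the dedup-then-project idea is shown below, with `evaluate_fn` standing in for whatever per-prompt metric is computed; weighting unique-case scores by their duplicate counts is one plausible way to project back to the full distribution.

```python
from collections import Counter

def evaluate_with_dedup(test_prompts, evaluate_fn):
    """Score each unique prompt once, then weight results by how often it appears."""
    counts = Counter(test_prompts)
    scores = {prompt: evaluate_fn(prompt) for prompt in counts}        # one call per unique prompt
    total = sum(counts.values())
    projected = sum(scores[p] * n for p, n in counts.items()) / total  # projected full-dataset score
    return projected, len(counts), total
```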
## Production Deployment Considerations
The prompt engineering approach demonstrates production-ready architecture patterns with its serverless, API-driven design using Amazon Bedrock. This approach provides natural scalability and reduces operational overhead compared to managing custom model deployments. The system implements reflection techniques, where the model is asked to self-assess and correct its generated content, adding an additional quality control layer that's particularly important for specialized domains like architectural design.
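A reflection loop of this kind can be sketched as a generate-critique-revise cycle against Bedrock; the prompts, stopping criterion, and model ID below are assumptions for illustration, not the production implementation.

```python
import boto3

bedrock = boto3.client("bedrock-runtime")
MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"  # illustrative

def converse(prompt: str) -> str:
    response = bedrock.converse(
        modelId=MODEL_ID,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return response["output"]["message"]["content"][0]["text"]

def generate_with_reflection(request: str, max_rounds: int = 2) -> str:
    """Generate a plan, then ask the model to critique and revise its own output."""
    plan = converse(f"Generate a floor plan for:\n{request}")
    for _ in range(max_rounds):
        critique = converse(
            f"Check this floor plan against the request. Do room counts, dimensions, "
            f"and floor areas match? Answer exactly 'OK' or list the problems.\n\n"
            f"Request: {request}\n\nPlan: {plan}"
        )
        if critique.strip().upper().startswith("OK"):
            break
        plan = converse(f"Revise the floor plan to fix these problems:\n{critique}\n\nOriginal plan: {plan}")
    return plan
```

Each extra round adds latency and token cost, so the number of reflection passes is itself an operational tuning knob.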
The fine-tuning approach requires more complex operational considerations, particularly around model versioning, deployment, and resource management. The team's experience with different instance types and training configurations provides valuable insights for organizations considering similar approaches. The 25-hour training time for full fine-tuning presents operational challenges around training pipeline management and resource scheduling.
## Results and Business Impact
The quantitative results demonstrate significant improvements over baseline performance, with instruction adherence accuracy improving by 109% for both prompt engineering and fine-tuning approaches. Mathematical correctness showed more variation, with fine-tuning achieving a 54% improvement while prompt engineering showed minimal improvement over baseline. These results highlight the importance of selecting optimization approaches based on specific performance requirements and domain characteristics.
The LoRA-based approach showed trade-offs typical of efficient fine-tuning methods, achieving 20% lower instruction adherence scores and 50% lower mathematical correctness compared to full fine-tuning, but with dramatically reduced computational requirements. This trade-off analysis is valuable for LLMOps practitioners making decisions about resource allocation and performance requirements.
## LLMOps Lessons and Best Practices
This case study demonstrates several critical LLMOps practices that extend beyond the specific use case. The importance of domain-specific evaluation frameworks cannot be overstated, as traditional language model metrics would not capture the spatial and mathematical accuracy requirements crucial for architectural applications.
The multi-model approach, using different models optimized for different subtasks, represents mature system architecture thinking that balances performance optimization with operational complexity. The careful consideration of data quality and preprocessing demonstrates that successful LLMOps implementations often require as much attention to data operations as to model operations.
The comparison between prompt engineering and fine-tuning approaches provides valuable insights for organizations deciding between these optimization strategies. While both approaches achieved similar improvements in instruction adherence, the fine-tuning approach was superior for mathematical correctness, suggesting that domain-specific requirements should drive optimization strategy selection.
The operational efficiency considerations, from prompt deduplication to service architecture choices, demonstrate that successful LLMOps implementations require careful attention to system performance and cost optimization alongside model accuracy improvements. The team's experience with different training configurations and resource requirements provides practical guidance for similar implementations in specialized domains.