An art institution implemented a sophisticated multimodal search system for its collection of 40 million art assets using vector databases and LLMs. The system combines text and image-based search capabilities, allowing users to find artworks based on various attributes including style, content, and visual similarity. The solution evolved from using basic cloud services to a more cost-effective and flexible approach, reducing infrastructure costs to approximately $1,000 per month per region while maintaining high search accuracy.
This case study comes from a presentation by a team member at Actum Digital discussing their implementation of a multimodal search system for an art collection. The transcript quality is poor (likely due to automatic transcription issues), but key technical details can be extracted about how they approached building a production AI search system for a large art catalog containing approximately 40 million assets.
The system appears to serve the needs of art collection managers, auctioneers, and potentially end users who need to discover artworks through various search modalities including text-based search and image similarity search. The use case is particularly interesting because art assets have unique characteristics including multilingual metadata (Spanish, French, etc.), varying titles (including self-portraits with generic naming), materials, techniques, and visual similarities that traditional keyword search cannot adequately capture.
The team faced several challenges when building a search system for a massive art collection, including multilingual metadata, generically titled works, and the cost of embedding tens of millions of assets.
The solution is built on AWS infrastructure with several key components:
AWS Bedrock: Used for accessing embedding models that transform both text and images into vector representations. Bedrock's managed service approach was chosen because the team is small and didn't want to manage model infrastructure themselves; the presenter describes leveraging Bedrock as a managed interface that handles their embedding needs.
OpenSearch: Serves as the vector database for storing and querying embeddings. The choice of OpenSearch allows for both traditional keyword search and vector similarity search to be combined in hybrid queries.
Data Pipeline: They built data pipelines and process functions to handle the ingestion and transformation of art assets into searchable vector representations.
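To make the pipeline concrete, here is a minimal sketch of how an asset might be turned into an embedding request through Bedrock. The talk does not name the specific embedding model, so this assumes Amazon Titan Multimodal Embeddings (`amazon.titan-embed-image-v1`), one multimodal option available through Bedrock; the field names follow that model's request format.

```python
import base64
import json

# Assumed model: Amazon Titan Multimodal Embeddings (not named in the talk).
MODEL_ID = "amazon.titan-embed-image-v1"

def build_embedding_request(text=None, image_bytes=None):
    """Build the JSON body for a Bedrock multimodal embedding call.

    Either field may be omitted; Titan accepts text, image, or both,
    which is what enables a shared vector space for hybrid search.
    """
    body = {}
    if text is not None:
        body["inputText"] = text
    if image_bytes is not None:
        body["inputImage"] = base64.b64encode(image_bytes).decode("utf-8")
    return json.dumps(body)

# With boto3, the invocation would look roughly like:
#   client = boto3.client("bedrock-runtime", region_name="us-east-1")
#   resp = client.invoke_model(modelId=MODEL_ID,
#                              body=build_embedding_request(text="self-portrait"))
#   embedding = json.loads(resp["body"].read())["embedding"]
```

The same request shape serves both ingestion (embedding stored images) and query time (embedding the user's text or uploaded image).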
The system supports multiple search modalities (text-based search, image similarity search, and hybrid combinations of the two) that can be used independently or together.
The presentation demonstrates searching for paintings with specific visual characteristics like “the blaze in the middle” and finding artworks with similar visual elements from different angles or styles. This multimodal capability appears to be a core differentiator of their approach.
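A hybrid query of this kind can be expressed directly in OpenSearch by combining a keyword clause with a k-NN clause in one boolean query. The sketch below is illustrative, not taken from the talk: the field names (`title`, `embedding`) and index name are assumptions.

```python
# Sketch of a hybrid OpenSearch query: keyword relevance on textual
# metadata plus approximate nearest-neighbour search on the embedding.
def build_hybrid_query(text, query_vector, k=10):
    return {
        "size": k,
        "query": {
            "bool": {
                "should": [
                    # Classic keyword scoring (BM25) on metadata.
                    {"match": {"title": text}},
                    # Vector similarity on the multimodal embedding field.
                    {"knn": {"embedding": {"vector": query_vector, "k": k}}},
                ]
            }
        },
    }

# The resulting dict can be passed to opensearch-py, e.g.:
#   client.search(index="artworks", body=build_hybrid_query("self-portrait", vec))
```

Because both clauses sit in one `should` block, documents that match on either modality are returned, with scores from both contributing, which is exactly where the score-normalization problem discussed later arises.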
Cost optimization was a major focus of this implementation and is a critical LLMOps concern for production systems.

The presenter describes a cost structure with two main components: the one-time embedding (indexing) cost and the ongoing infrastructure cost.

With 40 million assets requiring embedding, the initial indexing cost was substantial, mentioned as "thousands of dollars" for a full reindex. The ongoing infrastructure cost is approximately $1,000 per month per region.
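A back-of-envelope calculation shows how quickly indexing costs accumulate at this scale. The per-embedding price below is purely illustrative (the talk only says "thousands of dollars"); substitute the current price of whichever embedding model is actually used.

```python
# Illustrative indexing-cost estimate; the unit price is an assumption.
ASSETS = 40_000_000
PRICE_PER_EMBEDDING = 0.0001  # assumed USD per image embedding

full_reindex_cost = ASSETS * PRICE_PER_EMBEDDING
print(f"Full reindex: ${full_reindex_cost:,.0f}")  # → Full reindex: $4,000
```

Even at a fraction of a cent per asset, every full reindex lands in the thousands of dollars, which explains why avoiding unnecessary re-embedding became a priority.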
Several optimization strategies were discussed:
Reducing re-indexing: Since images don’t change frequently in an art collection, they optimized to avoid unnecessary re-embedding of assets that haven’t changed. This is a practical insight - the team recognized that unlike dynamic content, art images are relatively static, so they can reduce their embedding costs significantly.
Fine-tuning considerations: The team explored using fine-tuned models as a potential optimization strategy. While this offers more cost-effective inference, they acknowledged drawbacks including increased model management complexity and the need for more sophisticated infrastructure.
Architectural simplification: They mention reducing logic and simplifying their architecture to lower costs while maintaining functionality.
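The "reduce re-indexing" strategy can be sketched with a simple change-detection step: keep a content hash per asset and only re-embed when the bytes actually changed. This is a minimal illustration of the idea, not the team's actual pipeline code.

```python
import hashlib

# Sketch of change detection to avoid re-embedding unchanged art assets.
def assets_needing_reembedding(assets, stored_hashes):
    """assets: dict of asset_id -> image bytes.
    stored_hashes: dict of asset_id -> hex digest from the last indexing run.
    Returns the ids that are new or whose content changed."""
    changed = []
    for asset_id, data in assets.items():
        digest = hashlib.sha256(data).hexdigest()
        if stored_hashes.get(asset_id) != digest:
            changed.append(asset_id)
            stored_hashes[asset_id] = digest  # record for the next run
    return changed
```

Since art images are essentially static, a second indexing run over the same collection embeds almost nothing, turning a "thousands of dollars" reindex into a near-zero incremental cost.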
A technical challenge highlighted in the presentation involves score normalization when combining results from different search modalities. When mixing text-based search scores with image similarity scores, the raw scores may be on different scales.
The presenter mentions normalizing scores to a range of 0 to 1, and references an article explaining the details of how to approach this. This is a common challenge in hybrid search systems where different retrieval methods produce scores that aren’t directly comparable.
Different normalization techniques may be more appropriate depending on the use case; the presenter suggests that the choice of technique depends on whether you are doing document ranking and recommendation or more classical problems such as financial applications.
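One common way to bring both score sets into the 0-to-1 range the presenter mentions is min-max normalization, after which the two lists can be combined with a weight. The weighting scheme here is illustrative; the presenter defers the full details to a referenced article.

```python
# Min-max normalization to [0, 1], making BM25 keyword scores and
# vector-similarity scores comparable before combining them.
def min_max_normalize(scores):
    lo, hi = min(scores), max(scores)
    if hi == lo:  # all scores equal: treat them as equally relevant
        return [1.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def combine(text_scores, image_scores, text_weight=0.5):
    """Weighted blend of normalized text and image scores, per document."""
    t = min_max_normalize(text_scores)
    i = min_max_normalize(image_scores)
    return [text_weight * a + (1 - text_weight) * b for a, b in zip(t, i)]
```

Min-max is sensitive to outliers, which is one reason the appropriate technique varies by use case, as noted above.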
An interesting operational concern raised in the Q&A portion relates to infrastructure reliability. The presenter mentions that they run infrastructure in multiple regions for redundancy, noting that when their “first infrastructure went down” they had a backup. This highlights the importance of high availability in production AI systems, especially for business-critical search functionality.
The presenter expresses some frustration that even services marketed as “very reliable” can have issues, emphasizing the need for redundancy and not relying on a single infrastructure deployment for critical business operations.
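The multi-region setup the presenter describes amounts to a failover pattern: query the primary region first and fall back to a replica when it is unavailable. The region names and search callables below are illustrative placeholders, not details from the talk.

```python
# Sketch of region failover for the search service.
def search_with_failover(query, backends):
    """backends: ordered list of (region_name, callable) pairs.
    Each callable takes the query and returns results, or raises on failure.
    Returns (region_used, results); raises only if every region fails."""
    errors = {}
    for region, run_search in backends:
        try:
            return region, run_search(query)
        except Exception as exc:  # any backend failure triggers failover
            errors[region] = exc
    raise RuntimeError(f"all regions failed: {errors}")
```

The trade-off is cost: each standby region adds to the roughly $1,000-per-month-per-region infrastructure bill, so redundancy is priced in from the start.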
Several practical insights emerge from this case study:
Start simple, iterate: The team appears to have started with a simpler architecture leveraging Bedrock’s managed services, then evolved their approach as they learned more about their specific requirements and cost structure.
Balance managed services vs. control: Using AWS Bedrock provided simplicity for a small team, but they’re exploring more controlled approaches (like fine-tuning) as their needs mature.
Domain-specific considerations matter: Art collections have unique characteristics (static images, multilingual metadata, importance of visual similarity) that influenced their technical choices.
Cost awareness from day one: With large asset collections, understanding the cost implications of different embedding and search strategies is essential before committing to an architecture.
The presentation included a live demonstration, which showed searching for artworks with specific visual characteristics such as hand positions and finding similar items across the collection with different angles but similar compositional elements.
It's important to note that this case study has limitations, most notably the poor transcript quality, which obscures many implementation details. Despite these limitations, it provides valuable insights into the practical considerations of building production multimodal search systems at scale, particularly around cost optimization and infrastructure reliability.
Notion AI, serving over 100 million users with multiple AI features including meeting notes, enterprise search, and deep research tools, demonstrates how rigorous evaluation and observability practices are essential for scaling AI product development. The company uses Braintrust as its evaluation platform to manage the complexity of supporting multilingual workspaces, rapid model switching, and maintaining product polish while building at the speed of AI industry innovation. Their approach emphasizes that 90% of AI development time should be spent on evaluation and observability rather than prompting, with specialized data specialists creating targeted datasets and custom LLM-as-a-judge scoring functions to ensure consistent quality across their diverse AI product suite.
LinqAlpha, a Boston-based AI platform serving over 170 institutional investors, developed Devil's Advocate, an AI agent that systematically pressure-tests investment theses by identifying blind spots and generating evidence-based counterarguments. The system addresses the challenge of confirmation bias in investment research by automating the manual process of challenging investment ideas, which traditionally required time-consuming cross-referencing of expert calls, broker reports, and filings. Using a multi-agent architecture powered by Claude 3.7 Sonnet and Claude Sonnet 4 on Amazon Bedrock, integrated with Amazon Textract, Amazon OpenSearch Service, Amazon RDS, and Amazon S3, the solution decomposes investment theses into assumptions, retrieves counterevidence from uploaded documents, and generates structured, citation-linked rebuttals. The system enables investors to conduct rigorous due diligence at 5-10 times the speed of traditional reviews while maintaining the auditability and compliance requirements critical to institutional finance.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.