Wealthsimple developed an internal LLM Gateway and suite of generative AI tools to enable secure and privacy-preserving use of LLMs across their organization. The gateway includes features like PII redaction, multi-model support, and conversation checkpointing. They achieved significant adoption with over 50% of employees using the tools, primarily for programming support, content generation, and information retrieval. The platform also enabled operational improvements like automated customer support ticket triaging using self-hosted models.
Wealthsimple is a Canadian fintech company focused on helping Canadians achieve financial independence through a unified app for investing, spending, and saving. This case study, presented by a member of their team, describes how the company built internal LLM infrastructure and tools to boost employee productivity while maintaining strong security and privacy standards. The presentation covers their LLM Gateway, internal tools ecosystem, build vs. buy philosophy, and lessons learned from adoption.
Wealthsimple organizes their LLM efforts into three streams: employee productivity (the original thesis for LLM value), operations optimization (using LLMs to improve client experience), and an LLM platform that acts as an enablement function supporting the first two pillars. Their philosophy centers on three themes: accessibility, security, and optionality. The team wanted to make the secure path the path of least resistance, enabling freedom to explore while protecting company and customer data. They also recognized that no single model or technique would be best for all tasks, so they aimed to provide optionality across different foundation models and techniques.
The LLM Gateway is Wealthsimple’s central internal tool for interacting with LLMs. It was developed in response to concerns about fourth-party data sharing when ChatGPT first became popular, as many companies inadvertently overshared sensitive information with OpenAI. The Gateway sits between all LLMs (both external providers like OpenAI, Cohere, and Google Gemini, as well as self-hosted models) and Wealthsimple employees.
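The Gateway's core idea can be sketched in a few lines: scrub PII from prompts before they leave the company, but pass prompts to self-hosted models untouched. This is an illustrative sketch only; Wealthsimple's actual redaction uses a trained model, not the simple regexes standing in for it here, and the function names are ours.

```python
import re

# Hypothetical PII patterns standing in for Wealthsimple's redaction model.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "SIN": re.compile(r"\b\d{3}[-\s]?\d{3}[-\s]?\d{3}\b"),
}

def redact_pii(text: str) -> str:
    """Replace detected PII spans with typed placeholder tokens."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

def gateway_request(prompt: str, provider: str) -> dict:
    """Route a prompt through the gateway.

    External providers receive redacted input; self-hosted models,
    running inside the company's own cloud, receive it as-is.
    """
    safe_prompt = prompt if provider == "self-hosted" else redact_pii(prompt)
    return {"provider": provider, "prompt": safe_prompt}
```

The key design point is that redaction happens at the gateway, not in each tool, so every downstream integration inherits the same protection.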
The Gateway was initially built in just five days, though significant iteration followed. Key features include PII redaction, support for multiple models (both external providers and self-hosted), and conversation checkpointing.
The engagement metrics show strong adoption: daily, weekly, and monthly active users have all been trending upward since tracking began. Over half the company uses the Gateway. The team noted interesting patterns such as lower usage during December holidays followed by increased adoption after New Year evangelism efforts.
The team has deployed four open-source LLMs within their own cloud environment (primarily AWS). They’ve built platform support for fine-tuning and model training with hardware acceleration, though at the time of the presentation they hadn’t yet shipped fine-tuned models. The ability to self-host models addresses concerns about PII masking—employees who need to work with PII can use self-hosted models without redaction since data never leaves the company’s VPC.
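The routing policy described above can be expressed as a small decision rule: PII-bearing work must stay on in-VPC models, while everything else may go to external providers with redaction applied. Model names and function names below are hypothetical, not Wealthsimple's actual deployment list.

```python
# Hypothetical set of models deployed inside the company VPC.
SELF_HOSTED = {"llama-2-13b", "codellama-7b"}

def choose_route(model: str, handles_pii: bool) -> str:
    """Pick a route for a request under a PII-stays-in-VPC policy."""
    if handles_pii and model not in SELF_HOSTED:
        raise ValueError(
            f"{model} is external; PII workloads must use a self-hosted model"
        )
    # Self-hosted traffic never leaves the VPC, so no redaction is needed.
    return "in-vpc" if model in SELF_HOSTED else "external-with-redaction"
```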
For code-related use cases, they leverage both GitHub Copilot (under special licensing agreements) and self-hosted code-specialized models. They also self-host Whisper, OpenAI's open-source speech-to-text model, within their cloud environment for converting audio to text.
Beyond the Gateway, Wealthsimple developed “Booster Pack,” a popular internal tool built on their data applications platform. Booster Pack uses Retrieval Augmented Generation (RAG) to ground conversations against uploaded context, and users can create three types of knowledge bases.
This differs from a multimodal approach: rather than enriching conversations with varied input types, it grounds them against specific uploaded context, yielding more reliable and relevant responses.
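The grounding step can be sketched as retrieve-then-prepend: rank the knowledge base's chunks against the query, then place the best matches in the prompt. The naive word-overlap scorer below is a stand-in; Booster Pack's actual retriever is not described in the source.

```python
def score(query: str, chunk: str) -> int:
    """Naive relevance score: count of shared lowercase words."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))

def build_grounded_prompt(query: str, knowledge_base: list[str], top_k: int = 2) -> str:
    """Assemble a RAG prompt from the top-k most relevant chunks."""
    ranked = sorted(knowledge_base, key=lambda c: score(query, c), reverse=True)
    context = "\n".join(ranked[:top_k])
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

In production, the scorer would typically be an embedding similarity search, but the prompt-assembly shape stays the same.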
The presentation highlighted a concrete example of LLMs in production for operations optimization. Wealthsimple previously had a Transformer-based ML model that automated routing of customer support tickets to appropriate teams based on topic and subtopic classification. This was a significant improvement over the previous manual process handled by dedicated agents.
They extended this system with LLM capabilities in two ways.
The team emphasized that they approach LLM integration organically rather than forcing it into every workflow—looking for natural extensions where the technology adds clear value.
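One natural shape for LLM-assisted triage is constrained classification: ask the model to choose from a fixed taxonomy, and route anything outside it to a human queue. The taxonomy and the `llm` callable below are illustrative assumptions, not Wealthsimple's actual routing categories.

```python
# Hypothetical topic taxonomy for support-ticket routing.
TAXONOMY = {"trading", "transfers", "tax-documents", "account-access"}

def triage(ticket: str, llm) -> str:
    """Classify a ticket via an LLM, falling back to human review.

    `llm` is any callable that maps a prompt string to a response string.
    """
    prompt = (
        f"Classify this support ticket into one of {sorted(TAXONOMY)}. "
        f"Reply with the label only.\n\nTicket: {ticket}"
    )
    label = llm(prompt).strip().lower()
    # Anything outside the known taxonomy goes to a human queue.
    return label if label in TAXONOMY else "human-review"
```

The fallback branch is what makes this safe to deploy alongside an existing classifier: unexpected outputs degrade to the old manual process rather than a misroute.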
Wealthsimple’s decision framework for building vs. buying LLM tools considers three factors.
They observed two industry trends influencing their decisions: more vendors are offering GenAI integrations (requiring more strategic purchasing decisions), and general security/privacy awareness among vendors is improving (making buying more attractive). While they might not choose to build the same tools again today given these trends, the internal learnings, expertise, and guardrails developed have been valuable.
From surveys and user interviews, the team gathered several insights.
Two key behavioral lessons emerged from this research.
When asked about guardrails against hallucination, the team acknowledged they haven’t integrated specific technical checks within their retrieval systems yet. Their approach focuses on education about appropriate use cases, best practices for structuring problems to get reliable answers, grounding through RAG, and the future potential of fine-tuning for more control. They also incorporated prompt engineering to inform the model about PII masking behavior.
Notably, the ML engineering team responsible for all GenAI tools and production use cases, as well as the broader ML platform, consists of only three people. This small team size necessitates careful prioritization and extensive automation. For example, while the first self-hosted model deployment took two weeks, process optimization reduced subsequent deployments to 20 minutes. Roadmap decisions are driven by stakeholder conversations, end-user interviews, team interests, and alignment with higher business priorities.
Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.
Yahoo! Finance built a production-scale financial question answering system using multi-agent architecture to address the information asymmetry between retail and institutional investors. The system leverages Amazon Bedrock Agent Core and employs a supervisor-subagent pattern where specialized agents handle structured data (stock prices, financials), unstructured data (SEC filings, news), and various APIs. The solution processes heterogeneous financial data from multiple sources, handles temporal complexities of fiscal years, and maintains context across sessions. Through a hybrid evaluation approach combining human and AI judges, the system achieves strong accuracy and coverage metrics while processing queries in 5-50 seconds at costs of 2-5 cents per query, demonstrating production viability at scale with support for 100+ concurrent users.
Telus developed Fuel X, an enterprise-scale LLM platform that provides centralized management of multiple AI models and services. The platform enables creation of customized copilots for different use cases, with over 30,000 custom copilots built and 35,000 active users. Key features include flexible model switching, enterprise security, RAG capabilities, and integration with workplace tools like Slack and Google Chat. Results show significant impact, including 46% self-resolution rate for internal support queries and 21% reduction in agent interactions.