AI-Powered Hyper-Personalized Email Marketing System

HubSpot 2023

HubSpot developed an AI-powered system for one-to-one email personalization at scale, moving beyond traditional segmented, cohort-based approaches. The system uses GPT-4 to analyze user behavior, website data, and content interactions to understand user intent, then automatically recommends and personalizes relevant educational content. The implementation produced dramatic improvements: an 82% increase in conversion rates, a 30% improvement in open rates, and an increase of more than 50% in click-through rates.

Industry

Tech

Overview

This case study comes from a podcast discussion featuring HubSpot’s CMO Kipp Bodnar and VP of Marketing Emmy Jonathan, where they discuss real-world AI experiments being conducted within HubSpot’s marketing organization. The primary use case examined is the transformation of their “first conversion nurturing” email flow—a high-volume automated email sequence sent to leads who download educational content—from traditional cohort-based personalization to true one-to-one personalization powered by large language models.

HubSpot, as a CRM and marketing automation platform, has significant resources and existing infrastructure that gave them advantages in implementing this solution, including a large library of educational content (courses, guides, templates) and robust data collection on user behavior. The case study provides valuable insights into both the organizational approach to prioritizing AI initiatives and the technical architecture of a production LLM system.

Organizational Framework for AI Prioritization

Before diving into the technical implementation, it’s worth noting HubSpot’s approach to prioritizing AI use cases, as this represents a practical framework for LLMOps project selection. The team received over 100 AI project ideas from across the marketing organization and needed a systematic way to evaluate them.

They used a 2x2 matrix framework with two axes: the potential impact on customer demand (revenue) and the potential impact on internal operational efficiency.

This framework helped them balance revenue-impacting use cases against internal efficiency improvements. The email personalization use case scored highly on the demand impact axis due to the massive volume of leads flowing through the first conversion nurturing system—representing their largest cohort of prospects (those with educational intent, which they estimate is at least 10x larger than the cohort actively evaluating software).

The prioritization process was kept deliberately lightweight and agile, using Slack messages and Google Forms for idea collection. They maintained bi-weekly review meetings to allow for rapid reprioritization as new technologies or market conditions emerged. This “perfect is the enemy of good” philosophy extended to their implementation approach as well.

The Problem: Limitations of Cohort-Based Personalization

HubSpot’s existing first conversion nurturing workflow used traditional segmentation-based personalization. When a lead downloaded an educational content offer (ebook, template, guide), they would be placed into a segment based primarily on the topic of the content they downloaded.

The system would then send emails with content tailored to that segment—for example, marketing-related content for leads who downloaded marketing resources. However, after years of A/B testing and optimization, they had reached a plateau with only incremental gains possible. This is a common pattern in conversion optimization where initial tests yield significant improvements but returns diminish over time.

The fundamental limitation was that they were doing “group guessing”—placing people into cohorts and making assumptions about what the group might want, rather than understanding individual needs.

Technical Architecture

The AI-powered solution uses GPT-4 from OpenAI combined with a vector database to achieve one-to-one personalization at scale. The architecture follows a RAG (Retrieval-Augmented Generation) pattern with several distinct processing stages:

Data Collection and Context Building

When a lead enters the system, the collected context includes the specific content offer they downloaded, their broader website behavior, and their history of content interactions.

This multi-source data collection provides rich context for personalization that goes far beyond simple demographic segmentation.
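As a rough illustration of how these signals might be bundled before being handed to the model, here is a hypothetical context structure in Python; the field names are illustrative and not HubSpot’s actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class LeadContext:
    """Hypothetical container for the signals described above; field names are illustrative."""
    company_description: str                                        # e.g. derived from the lead's website
    downloaded_offer: str                                           # the content offer that triggered the flow
    page_views: list[str] = field(default_factory=list)             # recent website behavior
    content_interactions: list[str] = field(default_factory=list)   # prior guides, templates, courses

    def as_prompt_context(self) -> str:
        """Flatten the collected signals into a text block an LLM prompt can consume."""
        return (
            f"Company: {self.company_description}\n"
            f"Downloaded offer: {self.downloaded_offer}\n"
            f"Recent pages viewed: {', '.join(self.page_views) or 'none'}\n"
            f"Previous content interactions: {', '.join(self.content_interactions) or 'none'}"
        )
```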

Job-to-be-Done Inference

The LLM’s first task is to synthesize all available information and generate a summary of what the person is likely trying to accomplish—their “job to be done.” This is a critical insight from the case study: the key to success was accurately inferring intent, not just personalizing surface-level copy.

An example provided in the discussion shows how the system analyzed a small online coffee company whose user downloaded influencer marketing content and subsequently showed interest in content calendars. The LLM generated a summary interpreting this as preparation for seasonal promotions and a strategic approach to brand growth, connecting the dots between different behavioral signals.
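A minimal sketch of this inference step, assuming the OpenAI chat completions client and the hypothetical LeadContext structure from the previous sketch; the prompt wording is illustrative, not HubSpot’s.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def infer_job_to_be_done(lead: LeadContext) -> str:
    """Ask the model to synthesize behavioral signals into a single intent summary."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "You analyze a lead's behavior and describe, in a short paragraph, "
                        "the job they are most likely trying to get done."},
            {"role": "user", "content": lead.as_prompt_context()},
        ],
    )
    return response.choices[0].message.content
```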

Ideal Content Generation

Rather than immediately searching the existing content library, the LLM first imagines what a “perfect course” would look like to help this specific person accomplish their inferred goal—regardless of whether such content exists. This is an interesting approach that allows the system to reason about ideal outcomes before constraining to available resources.
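This is similar in spirit to the hypothetical-document (HyDE-style) retrieval pattern, where a generated ideal document serves as the search query. Continuing the sketch under the same assumptions as above:

```python
def describe_ideal_course(job_to_be_done: str) -> str:
    """Have the model imagine the perfect course for this goal, ignoring what actually exists."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Describe the ideal educational course that would help this person "
                        "accomplish their goal. Do not limit yourself to courses that exist."},
            {"role": "user", "content": job_to_be_done},
        ],
    )
    return response.choices[0].message.content
```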

Vector Database Retrieval

The hypothetical ideal course description is then sent to a vector database containing embeddings of all HubSpot’s actual courses and their relationships. The database returns the top 10 most similar real courses based on semantic similarity to the ideal course.
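The case study does not name the vector database used, so the sketch below substitutes a simple in-memory cosine-similarity search over pre-computed course embeddings (via the OpenAI embeddings API) to illustrate the top-10 retrieval step.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Embed text with the OpenAI embeddings API (the model choice here is an assumption)."""
    result = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(result.data[0].embedding)

def top_k_courses(ideal_course: str, course_vectors: dict[str, np.ndarray], k: int = 10) -> list[str]:
    """Return the k course titles whose embeddings are most similar to the ideal course description."""
    query = embed(ideal_course)
    scores = {
        title: float(np.dot(query, vec) / (np.linalg.norm(query) * np.linalg.norm(vec)))
        for title, vec in course_vectors.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]
```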

Final Recommendation Selection

The LLM reviews the candidate courses in the context of everything it knows about the user and selects the single best option. This multi-stage filtering approach (generate ideal → retrieve candidates → select best match) appears more sophisticated than a simple single-pass retrieval.
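A sketch of this re-ranking pass, with the same caveats as above (illustrative prompt, hypothetical helper names):

```python
def select_best_course(lead: LeadContext, job_to_be_done: str, candidates: list[str]) -> str:
    """Ask the model to pick the single most relevant course from the retrieved candidates."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Given everything known about this lead, choose exactly one course "
                        "from the candidate list that best helps them accomplish their goal. "
                        "Reply with the course title only."},
            {"role": "user",
             "content": f"{lead.as_prompt_context()}\n\n"
                        f"Job to be done: {job_to_be_done}\n\n"
                        "Candidate courses:\n" + "\n".join(f"- {c}" for c in candidates)},
        ],
    )
    return response.choices[0].message.content.strip()
```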

Personalized Copy Generation

Finally, the system generates personalized email copy that references the recommended course, speaks directly to the user’s inferred job-to-be-done, and reflects the specifics of their business rather than generic segment language.

The example shown generated copy like “Turn every sip into a story that captivates and converts” for the coffee company prospect—demonstrating genuine personalization rather than simple mail-merge style token replacement.
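A sketch of the final copy-generation call; the constraints in the prompt reflect the qualities described above rather than HubSpot’s actual prompt.

```python
def generate_email_copy(lead: LeadContext, job_to_be_done: str, course: str) -> str:
    """Generate email copy that ties the recommended course to the lead's inferred goal."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Write a short nurturing email that recommends the given course, "
                        "speaks directly to the lead's business and goal, and avoids "
                        "generic mail-merge phrasing."},
            {"role": "user",
             "content": f"{lead.as_prompt_context()}\n\n"
                        f"Job to be done: {job_to_be_done}\n\n"
                        f"Recommended course: {course}"},
        ],
    )
    return response.choices[0].message.content
```

In a full pipeline these sketches would simply be chained: infer the job to be done, describe the ideal course, retrieve the top candidates, select the best match, and generate the copy.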

Results and Iteration Process

The system achieved impressive results: an 82% increase in conversion rates, a 30% improvement in open rates, and an increase of more than 50% in click-through rates.

The team emphasized that these results took approximately two months of iteration to achieve. A critical learning was that their initial hypothesis was incorrect. They first assumed the personalized email copy would drive conversion improvements, but discovered that the real value came from accurately inferring the job-to-be-done and recommending truly relevant content. The personalized copy was “icing on the cake” but not the primary driver.

This finding has important implications for LLMOps practitioners: it suggests that investing in better retrieval and recommendation logic may yield higher returns than investing in more sophisticated text generation.

Key LLMOps Learnings

Several practical LLMOps lessons emerge from this case study:

Ship early and iterate: The team repeatedly emphasized that AI models cannot be perfected in isolation—they need real user feedback to improve. Waiting to launch until the system is “perfect” is counterproductive because perfection is impossible without real-world data.

Combine domain expertise with AI expertise: The project paired Josh Bliss (AI/technical expertise) with Jordan Douglas (email automation and persona domain expertise). This pairing of subject matter experts with AI practitioners was highlighted as essential to success.

Start with the right problem: By focusing on a high-volume, high-impact use case (first conversion nurturing), the team ensured that even modest percentage improvements would translate to significant absolute gains.

Infrastructure matters: HubSpot’s existing library of educational content gave them a significant advantage. The more content available for matching, the more likely the system can find something truly relevant for each user. Organizations considering similar implementations should assess their content assets.

Measurement and validation are critical: The team took “double takes and triple takes” to validate their results, recognizing that 82% conversion improvements sound almost too good to be true.

Caveats and Considerations

While the results are impressive, several contextual factors should be considered when evaluating this case study:

The discussion is from a podcast where HubSpot is promoting their own AI capabilities, so there may be selection bias toward highlighting successful experiments. The specific definitions of “conversion” and baseline performance levels are not provided, which makes it difficult to fully contextualize the improvements.

Additionally, HubSpot has significant advantages that may not be available to smaller organizations: a large content library, substantial first-party behavioral data, dedicated AI resources, and existing marketing automation infrastructure. The transferability of these results to organizations without similar assets is unclear.

The compute costs, latency considerations, and operational complexity of running this system at scale are not discussed. For organizations considering similar implementations, these operational factors would be important to evaluate.

Team and Resources

The implementation was led by a centralized team within HubSpot’s Marketing Technology group, headed by Mark. Dave G. was brought on to lead initial AI efforts, and the team grew over time. The core implementation appeared to involve a small pairing of specialists, Josh Bliss on the AI/technical side and Jordan Douglas on email automation and personas, supported by the central Marketing Technology team.

This relatively lean resourcing suggests that similar implementations may be achievable for mid-sized organizations with the right technical talent.
