Production Skills Framework for Agentic LLM Workflows

WorkOS 2026

WorkOS developed a comprehensive approach to productionizing LLM workflows through "skills" - reusable, composable units of work that encapsulate specific tasks, constraints, and domain knowledge in markdown files with optional scripts. The problem addressed was the repetitive nature of LLM interactions where context must be reloaded from scratch in every conversation, leading to inconsistent outputs and wasted time. Their skills framework enables teams to codify workflows once, share them across team members and projects, and achieve more consistent, deterministic results. The solution has been applied across multiple use cases including code installation automation, content generation, image/video creation, and internal tooling, with WorkOS shipping production tools like their CLI that leverage skills to automate developer onboarding and authentication setup.

Industry: Tech

Overview

This case study documents WorkOS’s approach to operationalizing LLM-based workflows through a “skills” framework. The company’s Applied AI team, represented by developer experience engineers Nick Nissi and Zach Proer, presented their methodology for creating reusable, shareable units of LLM context and instructions that can be deployed across different models, environments, and team members. The framework addresses fundamental LLMOps challenges around context management, consistency, team collaboration, and the integration of deterministic components into non-deterministic LLM conversations.

The Core Problem

The fundamental challenge identified was that every LLM conversation starts from zero context. Users must repeatedly provide the same information about coding conventions, project structures, preferences, and domain knowledge with each new interaction. In production scenarios this repetition wastes time, produces inconsistent outputs, and lets hard-won context evaporate between sessions.

The presenters noted they hadn’t written code without AI assistance in six to eight months, making these productivity issues particularly acute for teams heavily invested in agentic development.

The Skills Framework Solution

Architecture and Design

Skills are discrete, portable units of work defined primarily through markdown files with YAML front matter. At their simplest, a skill consists of a single skill.md file: YAML front matter carrying metadata such as a name and the description used for routing, followed by the markdown instructions and constraints themselves.

The framework supports progressive disclosure, where skills can reference additional markdown files that are only loaded when specific conditions are met. This prevents context window bloat while maintaining access to detailed information when needed.
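
To make this concrete, here is a minimal sketch of a skill.md; the name field and the referenced checklist file are assumptions for illustration, not WorkOS's actual skill:

```markdown
---
name: commit-messages
description: Use when writing git commit messages in this repository.
---

# Commit Messages

Write the subject line in the imperative mood, under 72 characters.

Never reference AI assistance in a commit message.

For release commits only, also read `references/release-checklist.md`.
```

The final line is progressive disclosure in miniature: the checklist enters the context window only when a release commit is actually being written.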

Implementation Patterns

Script Interpolation: Skills can use bang-backtick syntax to execute scripts and inject their output directly into the LLM context. For example, instead of asking the LLM to “find the last 10 commits,” a skill might execute a specific git command and provide the formatted results directly. This keeps the deterministic part of the workflow deterministic: the model reasons over exact command output instead of improvising its own tool calls.
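
A sketch of what this might look like inside a skill body, using the bang-backtick syntax described above; the exact command and flags are illustrative:

```markdown
## Recent History

The last 10 commits are:

!`git log --oneline -10`

Summarize the themes in these commits before suggesting any changes.
```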

Confidence Scoring: Skills can enforce iterative refinement by requiring the LLM to assess its confidence before proceeding. The ideation skill demonstrated in the workshop uses a 100-point rubric across dimensions like problem clarity, goal definition, success criteria, scope boundaries, and consistency. The LLM only proceeds when confidence exceeds a threshold (typically 90-95%); otherwise it asks clarifying questions. The presenters acknowledged the math is “fuzzy,” but the value lies in forcing iterative clarification.
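
A sketch of how such a rubric might read in skill markdown, using the dimensions named above; the point weights and threshold wording are invented for illustration:

```markdown
## Confidence Check

Score your understanding of the request out of 100:

- Problem clarity (0-25)
- Goal definition (0-25)
- Success criteria (0-20)
- Scope boundaries (0-20)
- Consistency (0-10)

If the total is below 90, do not proceed. Ask the user the single most
clarifying question, then re-score.
```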

Progressive Disclosure: Skills can conditionally load additional context. For example, the WorkOS skill router only loads framework-specific documentation when working with that framework. A testing skill might only load detailed testing rubrics when actually performing test analysis. This pattern is expressed through simple conditional statements in the skill markdown.
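
In practice the conditional can be as plain as a sentence in the skill body; the file path here is hypothetical:

```markdown
## Testing

If the task involves writing or reviewing tests, first read
`references/testing-rubric.md`. Otherwise skip it to keep context small.
```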

Routing and Descriptions: The description field is critical for automatic skill selection. When multiple skills are available, LLMs use descriptions to route tasks to the appropriate skill. A team can maintain several image generation skills whose descriptions trigger on context, such as “personal blog uses pixel art” versus “work projects use S3-hosted professional imagery.”
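
Concretely, two such skills might differ only in their front matter, with routing falling out of the descriptions alone (a sketch; the name field is an assumption):

```markdown
---
name: blog-images
description: Generate images for the personal blog, which uses pixel art.
---
```

```markdown
---
name: work-images
description: Generate images for work projects, which use S3-hosted
  professional imagery.
---
```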

Distribution and Loading

Skills can be loaded from multiple locations, which allows them to be shared at the appropriate scope - individual, team, organization, or public - while maintaining version control through standard git workflows.
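
As one concrete arrangement (the paths below reflect Claude Code's conventions as best understood here, and should be treated as an assumption):

```
~/.claude/skills/            # personal skills, available in every project
my-app/.claude/skills/       # project skills, checked into the repo
team-skills/  (git repo)     # shared skills, installed as a plugin
```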

Production Applications

WorkOS CLI Installation Tool

The flagship production implementation is WorkOS’s CLI tool, which automates authentication setup in web applications. Running npx workos install launches an agent that sets up WorkOS authentication in the target application end to end.

The entire intelligence layer runs on the Claude Agent SDK (programmatic Claude Code), with all domain knowledge encoded in skills stored in the public WorkOS skills repository. This architecture keeps the installer’s behavior in versioned, publicly inspectable markdown rather than baking it into the tool itself.

Content and Media Generation

Skills enable non-technical team members to perform complex tasks through simple prompts:

Recruiting workflows: The recruiting team uses skills that aggregate data from Slack, Notion, and recruiting software to generate candidate reports. Skills understand the specific format and criteria the team needs without requiring manual data collection.

Video generation: A remotion skill demonstrates the power of chaining API calls within a skill. For a simple text prompt like “child running through field,” the skill chains the generation and rendering API calls end to end, producing a finished video clip.

This same workflow was used to generate all interstitial scenes for a 32-minute film, with the skill even launching a localhost browser-based editor for real-time refinement.

Image generation wrapper: A Python-based skill wraps the Imagen API with intelligent prompt enhancement, reducing generation time to under 7 seconds and enabling consistent style application across a project.

Developer Productivity Tools

Code review automation: Skills that load framework-specific best practices and organizational conventions provide consistent code review across team members. One engineer noted using a Codex skill to have OpenAI’s model review Claude’s output automatically, eliminating manual copy-paste between tools.

Git analysis and repository health: The workshop’s example “repo roast” skill demonstrates automated analysis of a repository’s health and history, pairing deterministic git inspection with LLM commentary.

Slack-to-Linear automation: A skill monitors Slack for work requests, checks Linear for duplicate tickets, and creates new tickets automatically, eliminating context-switching overhead for developers trying to maintain flow.
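
A hedged sketch of how that skill’s instructions might read; the steps paraphrase the description above, and the Slack and Linear access is assumed to come from whatever integrations the host agent has:

```markdown
---
name: slack-to-linear
description: Use when a Slack message describes work that needs a ticket.
---

1. Extract the request from the thread: what, for whom, and any deadline.
2. Search Linear for an existing ticket covering the same request.
3. If a duplicate exists, reply in the thread with a link to it.
4. Otherwise create a Linear ticket and post the link back to the thread.
```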

Evaluation and Quality Assurance

WorkOS implements formal evaluation frameworks for production skills, particularly those in their public repository. The evaluation approach includes:

Baseline comparison: Running tasks with and without the skill to measure improvement. Skills must demonstrate measurable benefit over the model’s baseline capabilities.

Rubric-based scoring: Skills are evaluated against confidence thresholds (typically 80-90% success rate) across multiple test runs. If a skill fails to meet the threshold or performs worse than baseline, it fails the evaluation.

Model evolution tracking: As new models release, skills are re-evaluated to ensure they still provide value. In one case, evaluation revealed that a Next.js skill was actually degrading performance because Claude had become inherently proficient with Next.js and the skill’s prescriptive instructions were counterproductive.

Continuous improvement loop: The recommended workflow is to use a skill on real work, preserve the conversation transcripts, analyze failures, refine the skill, and re-run the evaluations.

The evaluation framework is described as “fuzzy math” - not scientifically rigorous but providing directional guidance similar to fitness tracking. The value is in establishing baselines and tracking relative improvement.
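
As a rough illustration of the baseline comparison, an evaluation spec could be as simple as the following; the format and fields here are assumptions, not WorkOS’s actual harness:

```markdown
---
skill: nextjs-conventions
runs-per-task: 5
pass-threshold: 0.85   # pass if at least 85% of runs satisfy the rubric
---

For each task, run once with the skill loaded and once without.
The skill fails if it scores below the threshold, or below the
no-skill baseline.

## Tasks
1. Add a protected route to an existing Next.js app.

## Rubric
- [ ] Output follows the project's routing conventions
- [ ] No manual steps left for the user that the skill claims to automate
```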

Multi-Model and Multi-Environment Support

A key advantage emphasized is that skills work across multiple LLM providers and tools rather than being tied to any single vendor’s agent product.

This portability is achieved through the simple markdown-based format with minimal provider-specific dependencies. Skills become organizational knowledge artifacts that transcend specific tooling choices.

Team Collaboration Challenges

The workshop surfaced significant real-world challenges in scaling skills across organizations:

Version control and governance: With 60 engineers potentially creating skills, questions arise about who owns shared skills, how changes are reviewed, and how versions are pinned.

Context bloat: As skill libraries grow, loading all available skills becomes counterproductive, raising the question of which skills to expose in which contexts.

Maintenance over time: As models evolve, skills may need updating or retiring; the Next.js evaluation above shows a skill that had quietly become counterproductive.

WorkOS’s approach involves multiple skill repositories at different scopes (public, internal organization-wide, team-specific, individual), with plugin-based versioning similar to npm packages. However, this remains an evolving area without fully mature tooling.

Technical Implementation Details

Skill structure: A skill is actually a folder containing skill.md plus any referenced files, scripts, and resources. The folder can be versioned, distributed via git, or packaged as a .skill file.
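
A typical folder might look like this (file names hypothetical):

```
repo-roast/
├── skill.md              # front matter plus instructions
├── scripts/
│   └── stats.sh          # deterministic analysis, interpolated via !`...`
└── references/
    └── roast-rubric.md   # loaded only through progressive disclosure
```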

Marketplace ecosystem: Three primary skill marketplaces were mentioned as distribution channels, though the surrounding tooling remains immature.

Script execution: When a skill contains script interpolation (!`command`), the LLM executes the script and replaces the reference with its output before continuing. This enables hybrid deterministic-generative workflows.

Skill calling skills: While technically possible, the general recommendation is to avoid this pattern as it can lead to unpredictable behavior and maintenance challenges.

Sub-agents and context management: For complex workflows, skills can spawn sub-agents with isolated context windows to perform specific tasks (like code review) without bloating the main conversation context. Results are then summarized back to the primary agent.
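
In skill markdown the delegation can be stated directly; the phrasing below is illustrative, and the sub-agent mechanics depend on the host tool:

```markdown
## Review

Spawn a sub-agent to review the diff against `references/review-rubric.md`.
Have it return only a short summary of findings, so neither the full diff
nor the rubric enters this conversation's context.
```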

Lessons and Best Practices

Constraints over prescription: Effective skills provide boundaries and never-do-this rules rather than step-by-step instructions, allowing the LLM flexibility in execution while maintaining quality guardrails.

Ask the LLM: When uncertain about skill design, routing logic, or improvements, asking the LLM to analyze and suggest improvements is often the fastest path to better results. The skill-builder skill in Claude is specifically designed for this meta-task.

Minimal viable skills: Skills can be as simple as 30 lines of markdown and still provide significant value. Starting small and iterating based on usage is more effective than trying to create comprehensive skills upfront.

Progressive disclosure is key: Rather than front-loading all possible context, use conditional loading to provide detailed information only when the specific subtask requires it.

Context is gold: All conversation transcripts, including failures and frustrations, represent valuable data for skill refinement. The development workflow should preserve and analyze this data rather than discarding it.

Voice and workflow integration: While not strictly part of skills, the presenters demonstrated integration with tools like Whisper Flow for voice-based coding, showing how skills fit into broader agentic development workflows.

Organizational Impact

The skills framework represents a shift in how LLM knowledge is managed within organizations: away from ad hoc per-conversation prompting and toward shared, versioned knowledge artifacts.

The workshop format itself (teaching skills creation through building a “repo roast” skill) demonstrates the pedagogical value of the framework for sharing LLMOps knowledge across teams.

Future Directions

Several emerging patterns and potential developments were discussed:

Agent teams vs. sub-agents: The ecosystem is developing multiple abstractions for multi-agent coordination, with skills potentially serving as the reusable building blocks.

Memory and dreaming: Integration with long-term memory systems (like Claude’s autodream feature) could enable skills to evolve autonomously based on usage patterns.

CLI as skill delivery: The pattern of wrapping skills in zero-friction CLI tools that handle authentication and model access could become a standard distribution mechanism.

Biometric integration: Experimental work monitoring developer biometrics to provide health and wellness interventions suggests skills might expand beyond purely technical domains.

Skill marketplaces maturation: As version control, scoping, and distribution tooling improves, skill marketplaces may become as central to LLMOps as package managers are to traditional development.

This case study illustrates how WorkOS has moved beyond point-in-time prompt engineering to create a systematic framework for managing LLM capabilities in production, addressing key LLMOps challenges around consistency, team collaboration, evaluation, and the integration of deterministic and generative components.
