Formal Verification and Verified AI for Mathematical Reasoning at Scale

Axiom Math 2025
View original source

Axiom Math is building AI systems for superhuman mathematical reasoning by combining formal verification with large language models. Their approach uses Lean, a formal proof verification language, to ground AI-generated mathematical proofs and code, achieving verified generation that offers better sample efficiency than informal approaches. The company achieved a perfect score on the Putnam exam in December 2025, scoring 120/120 points compared to the best human's 110 and the best informal LLM's 103. Their system, Axiom Prover, uses post-trained foundation models with reinforcement learning on Lean data, enabling recursive decomposition of proof goals and learning to backtrack. Beyond mathematics, they view formal verification as foundational infrastructure for verified reasoning across software and hardware domains, positioning it as critical for AI collaboration and super intelligence rather than merely a compliance mechanism.

Industry

Research & Academia

Technologies

Overview

Axiom Math represents a distinctive approach to productionizing LLMs for mathematical reasoning and formal verification. Founded in early 2025, the seven-to-eight-month-old company raised a $200 million Series A at a $1.6 billion valuation, reflecting strong investor conviction in formal verification as critical infrastructure for AI systems. The company’s core thesis is that verified AI generation—combining informal reasoning with formal proof verification—provides superior performance characteristics and sample efficiency compared to purely informal approaches, making it feasible for a startup with limited compute resources to match or exceed frontier lab capabilities on superhuman tasks.

The company views verification not as a tax or compliance burden for closed industries, but rather as a mechanism for scaling and compounding intelligence. This perspective draws on historical mathematical practice where formalization helped brilliant mathematicians like Ramanujan extend their capabilities. Axiom positions formal verification as enablement for human-AI and future AI-AI collaboration, moving from a defensive posture about hallucinations to an offensive strategy about performance gains.

Production System Architecture

Axiom Prover is an ensemble system composed of multiple models that undergo post-training. The architecture starts with off-the-shelf foundation models, with a preference for open-source base models that have coding and natural language capabilities. These base models are then subjected to continued pre-training or fine-tuning, followed by reinforcement learning specifically for formal mathematics.

The system heavily relies on Lean data—formal mathematical proofs written in the Lean programming language. Lean serves as both a functional programming language and a formal verification system based on the Curry-Howard correspondence, which establishes an equivalence between proofs and programs. This dual nature is critical: mathematicians can use Lean purely for coding (one candidate reportedly implemented autograd in Lean during an interview), purely for formal mathematics, or for the integrated approach Axiom employs.

The production pipeline involves standard techniques from the RL for formal math community, but with significant innovations. The team found that scaling inference has almost no wall—they can recursively decompose proof goals into many sub-goals and learn to backtrack effectively. This recursive decomposition is central to handling complex mathematical problems at scale.

Verification Infrastructure and Tooling

Axiom recently released Axiom Lean Engine (AXL), a suite of 14 proof validation and manipulation tools built in Lean’s meta-programming language. These tools are provided free to the community and have been rapidly adopted. Key capabilities include:

The infrastructure handles massive proof trees, with Axiom Prover demonstrated to scale from 40 nodes to 4,000 nodes in tree complexity. On the Code Murina benchmark—designed to evaluate code generation with formal proof—Axiom achieved 99% accuracy (187 of 189 problems solved with both code and proof), compared to their competitor’s 96% on proof-only tasks and 11-12% for other formal math systems on the combined task.

Performance Characteristics and Results

The production system achieved several notable benchmarks demonstrating superhuman performance:

These results validate the core LLMOps hypothesis that verified generation offers performance gains through better sample efficiency. With orders of magnitude less data than informal systems, the formal approach achieved superior results on superhuman tasks.

Data Strategy and Synthetic Data Generation

Axiom maintains what they describe as a “really massive database” of Lean proofs, much of it synthetically generated. This represents a time-based competitive moat rather than a permanent advantage—the company executes rapidly to maintain a buffer, but acknowledges that proprietary data accumulation is fundamentally about execution speed rather than exclusive access.

The data strategy includes:

The interplay between informal and formal approaches is evident in the data generation strategy. While some advocate for pure self-play approaches (like AlphaZero for math), Axiom pursues a hybrid path where informal reasoning and formal verification work together, with ongoing work on recursive self-improvement where Axiom Prover’s daily mathematical work feeds back into model improvement.

Deployment Model and API Strategy

Axiom released a verification API that developers can integrate into their workflows. Early adopters report using Claude combined with Axiom’s tools as their standard setup for Lean development. This represents a strategic positioning: rather than compete directly with frontier labs on general coding capabilities, Axiom offers specialized verification as an API service that complements coding-focused LLMs.

The deployment philosophy treats verification capability as infrastructure that other AI systems can call, similar to how frontier labs partner with search startups like Exa and Perplexity for specialized search capabilities. The value proposition for frontier labs is to “call Axiom API for verification” rather than spinning up dedicated formal verification teams—especially given the specialized talent requirements (meta-programming expertise in Lean is particularly rare) and strategic focus needed.

For enterprise applications, the vision extends to verified code generation where AI generates both programs and formal proofs that the programs satisfy specifications. This decomposition approach envisions:

Challenges and Limitations in Production

Several significant challenges emerged from deploying formal verification systems:

Specification Problem: Even with perfect verification capability, specifying what needs to be proven remains fundamentally difficult. Humans struggle to fully specify requirements, and anything unspecified cannot be proven. The company views this as an interactive process where informal reasoning helps generate specification proposals, with AI suggesting test cases and edge cases (“have you thought about this case?”) to refine specifications.

Auto-formalization Grounding: Converting informal problem statements (like competition problems written in English) to formal Lean specifications lacks grounding signals when the problem hasn’t been solved yet. Test cases and input-output pairs provide some grounding, but formalizing statements remains harder than formalizing proofs.

Proof Size Overhead: Current systems require approximately 20 lines of Lean proof code for each line of program code. The scaling relationship between program complexity and proof complexity remains unclear—it’s an open question whether proof length grows linearly, super-linearly, or hits fundamental bounds based on base model capabilities.

Computational Complexity Concerns: Rice’s theorem establishes that non-trivial properties of programs cannot be verified for all programs. While theoretical limits exist, the practical stance is that formally verifying the majority of useful programs provides substantial value even if universal verification is impossible.

Context Window and Summarization: For very large systems with massive Lean code bases, fitting everything into context windows becomes challenging. The company addresses this through auto-informalization—converting Lean proofs back to informal natural language summaries, which can be verified for correctness through round-trip formalization and equivalence checking.

Distribution Shift: Systems trained on number theory may not perform well on topology if the domain lacks definitional infrastructure. However, across domains with adequate MathLib coverage (algebra, number theory, commutative algebra, algebraic geometry, discrete math, probability), the system shows reasonable generalization.

Model Training and Post-Training Approach

The post-training pipeline reflects standard practices with domain-specific innovations:

The system learns to navigate between high-level intuition spaces and low-level rigorous deduction. Lean’s tactics (like the grind tactic for low-level proofs) handle mechanical aspects, freeing the AI to operate at higher abstraction levels. This mirrors how human mathematicians use proof assistants to manage low-level details while maintaining focus on high-level insights.

Knowledge Management and Retrieval

Knowledge graphs and knowledge bases represent critical but under-discussed components. Axiom faced challenges with literature search and problem provenance:

The knowledge infrastructure challenge extends beyond simple retrieval to understanding equivalences, extensions, and indirect solutions—problems that affect both human mathematicians and AI systems.

Business Model and Market Positioning

Axiom’s market thesis challenges conventional thinking about formal verification:

The $200 million raise (approaching the annual US math research budget of $250 million) reflects investor belief that formal verification becomes critical infrastructure as AI systems become more capable and autonomous. The vision includes scenarios where autonomous AI agents handling regulated operations require verifiable behavior—not as mandatory compliance, but as a choice enabled by sufficiently good verification performance.

Future Directions and Technical Roadmap

Several technical frontiers emerged from the discussion:

Elegance and Taste as Alignment: Training models to generate elegant proofs—conceptually orthogonal approaches to the same theorem—represents an alignment problem. Human mathematical taste guides which proofs deserve attention and compute resources, suggesting a role for human judgment even in automated systems.

Blueprint Autogeneration: Large formalization projects currently rely on human-written blueprints that decompose problems and assign subtasks. Automating blueprint generation for complex theorems represents a significant technical bottleneck that multiple teams are pursuing.

Self-Improvement Design: The recursive self-improvement vision suggests that Axiom Prover’s daily mathematical work should feed back into model improvement, creating a virtuous cycle. The company is exploring how proof systems can learn from their own verified outputs.

Hardware Verification: Extending to GPU and chip verification represents a domain where partial credit doesn’t exist—verification must be perfect. This “hardcore verification” market has hundreds of humans and thousands of software licenses dedicated to single verification problems, with design-to-verification ratios of 1:3 or 1:4 in team size and duration.

Multi-domain Expansion: Beyond mathematics, the roadmap includes software verification, hardware verification, and potentially AI for science applications. The key question is whether to pursue breadth (transfer learning across domains) versus depth (recursive self-improvement toward AGI).

Organizational and Execution Philosophy

At approximately 30 people, Axiom emphasizes execution speed and singular focus as competitive advantages against frontier labs where direction changes frequently due to organizational dynamics. The observation that formal math teams at Google DeepMind (AlphaProof) and other frontier labs dissolved or shifted focus due to non-technical reasons suggests that startup structures enable sustained focus on long-horizon problems.

The team composition is deliberately interdisciplinary: expert mathematicians who are users of the systems they build, MathLib contributors and Lean language experts, applied ML practitioners from organizations like Meta and specific compiler and code generation expertise. This mix enables rapid iteration between mathematical insights and engineering implementation.

The company views fragmentation as a major bottleneck for the broader AI landscape, though AI for math specifically has avoided this through team consolidation. The ability to attract talents like contributors to Frontier Math benchmarks and AI for mathematical discovery pioneers onto a single team creates combinatorial advantages—proving and construction capabilities working together synergistically.

Lessons for LLMOps Practitioners

Several principles emerge from Axiom’s production deployment:

Verification as Performance Tool: Rather than viewing formal methods as overhead, the system demonstrates that constraints can improve generation through better reward signals and sample efficiency.

Hybrid Formal-Informal Approaches: Pure formal or pure informal approaches have limitations; the production system bridges between high-level informal reasoning and low-level formal verification.

Infrastructure as Competitive Moat: Building robust verification infrastructure (like AXL tools) that handles edge cases, achieves high performance, and integrates smoothly creates value beyond model weights alone.

Domain Knowledge Integration: Having domain experts (mathematicians) as builders and users accelerates iteration and ensures the system solves real problems rather than proxy benchmarks.

API-First Deployment: Rather than building monolithic systems, offering verification as a composable service enables integration with existing LLM workflows and creates partnership opportunities with frontier labs.

The case study illustrates how specialized AI capabilities can achieve superhuman performance through domain-specific architectures and training approaches, offering an alternative to scaling general-purpose models. The verified generation paradigm may represent a blueprint for other domains where correctness and reliability are paramount, from legal reasoning to scientific computation to autonomous systems.

More Like This

Agentic AI Copilot for Insurance Underwriting with Multi-Tool Integration

Snorkel 2025

Snorkel developed a specialized benchmark dataset for evaluating AI agents in insurance underwriting, leveraging their expert network of Chartered Property and Casualty Underwriters (CPCUs). The benchmark simulates an AI copilot that assists junior underwriters by reasoning over proprietary knowledge, using multiple tools including databases and underwriting guidelines, and engaging in multi-turn conversations. The evaluation revealed significant performance variations across frontier models (single digits to ~80% accuracy), with notable error modes including tool use failures (36% of conversations) and hallucinations from pretrained domain knowledge, particularly from OpenAI models which hallucinated non-existent insurance products 15-45% of the time.

healthcare fraud_detection customer_support +90

Building Economic Infrastructure for AI with Foundation Models and Agentic Commerce

Stripe 2025

Stripe, processing approximately 1.3% of global GDP, has evolved from traditional ML-based fraud detection to deploying transformer-based foundation models for payments that process every transaction in under 100ms. The company built a domain-specific foundation model treating charges as tokens and behavior sequences as context windows, ingesting tens of billions of transactions to power fraud detection, improving card-testing detection from 59% to 97% accuracy for large merchants. Stripe also launched the Agentic Commerce Protocol (ACP) jointly with OpenAI to standardize how agents discover and purchase from merchant catalogs, complemented by internal AI adoption reaching 8,500 employees daily using LLM tools, with 65-70% of engineers using AI coding assistants and achieving significant productivity gains like reducing payment method integrations from 2 months to 2 weeks.

fraud_detection chatbot code_generation +57

Fine-Tuning LLMs for Multi-Agent Orchestration in Code Generation

Cosine 2025

Cosine, a company building enterprise coding agents, faced the challenge of deploying high-performance AI systems in highly constrained environments including on-premise and air-gapped deployments where large frontier models were not viable. They developed a multi-agent architecture using specialized orchestrator and worker models, leveraging model distillation, supervised fine-tuning, preference optimization, and reinforcement fine-tuning to create smaller models that could match or exceed the performance of much larger models. The result was a 31% performance increase on the SWE-bench Freelancer benchmark, 3X latency improvement, 60% reduction in GPU footprint, and 20% fewer errors in generated code, all while operating on as few as 4 H100 GPUs and maintaining full deployment flexibility across cloud, VPC, and on-premise environments.

code_generation high_stakes_application regulatory_compliance +33