Building an Interpretability Playground with Activation Steering

Ramp 2026

Ramp built Steer, an interpretability playground that uses activation steering to modify LLM behavior at inference time without retraining. The system lets users steer models toward specific concepts by adding steering vectors to activations at targeted layers during the forward pass. After initially struggling with Qwen 2.5 7B's instability and limited capacity, the team migrated to Gemma 3 27B-IT and ran a sweep of 1,280 generations across 40 layer-and-multiplier settings to determine layer selection and multiplier calibration. The final implementation targets sparse global attention layers with carefully calibrated multipliers, deployed on Modal's serverless GPU infrastructure with snapshot capabilities that reduce cold starts from 60-120 seconds to 5-12 seconds.

Industry: Tech

Overview

Ramp developed Steer, an interpretability playground that demonstrates production-ready activation steering for large language models. This case study illuminates a sophisticated approach to modifying LLM behavior at inference time without fine-tuning or retraining, offering insights into both the technical challenges of deploying such systems and the architectural differences between modern transformer models. The project serves as both an experimental platform for understanding model internals and a practical demonstration of cost-effective LLM serving for non-revenue-generating research projects.

The core innovation centers on activation steering, a technique that modifies a model’s internal representations during the forward pass by adding concept-specific steering vectors at targeted layers. Rather than altering model weights through fine-tuning, this approach enables lightweight, reversible interventions that reveal how models organize information internally. The system accepts arbitrary concepts as input—from “Existentialism” to “Elon Musk” to “Rick Sanchez”—generates corresponding steering vectors, and produces steered models that can be interacted with conversationally. When steered toward a concept like expense management, the model will connect any topic back to receipt reconciliation, demonstrating how deeply the steering affects the model’s reasoning patterns.
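As a minimal illustration of the mechanism described above (not Ramp's implementation), the sketch below builds a steering vector and adds it to a single hidden state. The extraction method shown, a difference of mean activations between concept prompts and neutral prompts, is an assumption: the source does not say how Steer derives its vectors. All function names and toy values are hypothetical.

```python
from statistics import fmean


def mean_vector(rows):
    """Average a list of equal-length activation vectors, dimension by dimension."""
    return [fmean(col) for col in zip(*rows)]


def steering_vector(concept_acts, neutral_acts):
    """Assumed extraction method: mean concept activation minus mean neutral activation."""
    concept_mean = mean_vector(concept_acts)
    neutral_mean = mean_vector(neutral_acts)
    return [c - n for c, n in zip(concept_mean, neutral_mean)]


def apply_steering(hidden, vector, multiplier):
    """Add the scaled steering vector to one hidden state (one token at one layer)."""
    return [h + multiplier * v for h, v in zip(hidden, vector)]


# Toy 4-dimensional activations (hypothetical values, not real model data).
concept_acts = [[1.0, 0.0, 2.0, 0.0], [3.0, 0.0, 2.0, 0.0]]
neutral_acts = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
hidden = [0.5, 0.5, 0.5, 0.5]

vec = steering_vector(concept_acts, neutral_acts)
steered = apply_steering(hidden, vec, multiplier=0.5)
unsteered = apply_steering(hidden, vec, multiplier=0.0)
```

Setting the multiplier to zero returns the original hidden state, which is what makes the intervention reversible: the weights are never touched. In a real model the same addition would typically be applied inside the forward pass at each targeted layer.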

Initial Implementation Challenges with Qwen 2.5 7B

The project initially deployed Qwen 2.5 7B Instruct, a 7.61-billion-parameter model from Alibaba featuring 28 transformer layers and pretraining on up to 18 trillion tokens across 29+ languages. This choice exposed three critical production challenges that any team working with activation steering would need to address.

The first and most striking issue was the pretraining reversion problem. Qwen’s multilingual pretraining corpus appears to have substantial Chinese representation, with English fluency achieved through instruction tuning layered atop this foundation. When steering vectors pushed the model’s internal representations too aggressively, the instruction-tuned behavior destabilized and the model reverted to its pretraining distribution. In practice, a model steered toward “pizza” would suddenly switch to Mandarin mid-response, a language switch that resembles a stress response, analogous to a system falling back to its default state. This behavior is a significant operational risk in production systems where output reliability and language consistency are critical requirements.

The second challenge was the size problem. With only 7 billion parameters, Qwen had limited redundancy in its internal representations, constraining how much those representations could be perturbed before coherence collapsed. The model would become incoherent relatively quickly under moderate steering, particularly for abstract or complex concepts. This finding has important implications for LLMOps: smaller models, while more efficient to serve, offer less headroom for advanced inference-time interventions.

The third issue was the calibration problem. The team initially conducted a magnitude sweep on the “Ramp” concept, selected three thresholds (low, medium, strong) that produced good results, and attempted to reuse those exact thresholds across all concepts. This approach failed because different concepts occupy different regions of the model’s representation space and respond differently to identical magnitudes. A multiplier that gently nudges the model toward “expense management” might completely destroy coherence when applied to “17th-century Dutch painting.” This discovery highlights a broader challenge in LLMOps: hyperparameters that work for one use case often don’t generalize, requiring concept-specific or category-specific calibration.

Through experimentation with Qwen’s layer architecture, the team discovered that layer selection matters enormously. Steering early layers corrupted syntax and grammar, suggesting these layers handle low-level language processing that shouldn’t be interfered with. Steering late layers disrupted output fluency without meaningfully changing the model’s reasoning, indicating these layers translate internal representations into tokens and perturbations break the generation process. The effective zone was in the mid-layers where semantic concepts appeared to be encoded, but this usable window was narrow—a small band of effective layers combined with a tight multiplier range created a fragile operating envelope where pushing past boundaries would trigger the Mandarin reversion behavior.

Migration to Gemma 3 27B-IT

To address these fundamental limitations, Ramp migrated to Gemma 3 27B-IT, a 27-billion-parameter model from Google with 62 transformer layers and a distinctive interleaved attention pattern: each group of 5 local sliding-window attention layers (with a 1,024-token window) is followed by 1 global self-attention layer that attends to the full context. Although Gemma is also multilingual (140+ languages), it did not exhibit the language-switching behavior observed in Qwen under heavy steering.

The results were immediately positive. The language-switching instability vanished entirely: when Gemma degraded under excessive steering, its output lost quality but stayed in English and remained legible and on-topic. The jump to 27 billion parameters provided substantially more capacity to absorb steering vector perturbations without losing coherence, and baseline response quality improved dramatically. However, Gemma introduced a new challenge: it proved significantly more sensitive to steering vector application than Qwen. The magnitudes and layer selections that were effective for Qwen were far too aggressive for Gemma, producing over-steered models that were almost unusable. This required re-solving both layer selection and multiplier calibration from scratch.

Comprehensive Layer Selection Experiments

With Gemma’s 62 layers (compared to Qwen’s 28), the team faced a more complex optimization space. Based on their accumulated understanding of layer function, they scoped experiments to layers 16-53, representing 38 layers spanning from early concept formation through late-stage reasoning. Layers below 16 appeared to handle syntax and embedding, while layers above 53 handled output formatting and token generation. Steering either end corrupted basic language abilities rather than reshaping thinking, so these were excluded from training consideration.

The team conducted a rigorous experimental sweep testing eight different layer configurations, five multiplier values (0.05 to 0.75), four distinct concepts, and eight diverse prompts, generating 1,280 total outputs. Each generation was evaluated by an LLM judge on coherence, keyword density, and a composite quality score. This represents a production-grade approach to hyperparameter optimization: systematic evaluation across the full parameter space rather than manual tuning or intuition-based selection.
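The arithmetic behind the sweep size can be checked with a short sketch that enumerates the grid. The configuration, concept, and prompt labels are hypothetical placeholders, since the source gives counts rather than full lists. The five multiplier values are the ones quoted in this write-up (0.05 through 0.75) and are assumed to be the complete grid.

```python
from itertools import product

# Hypothetical placeholder labels: the source reports counts, not the actual names.
layer_configs = [f"config_{i}" for i in range(8)]
concepts = [f"concept_{i}" for i in range(4)]
prompts = [f"prompt_{i}" for i in range(8)]

# Assumed grid: the multiplier values mentioned in the text.
multipliers = [0.05, 0.15, 0.35, 0.55, 0.75]

# One run per combination of layer configuration, multiplier, concept, and prompt.
runs = [
    {"layers": lc, "multiplier": m, "concept": c, "prompt": p}
    for lc, m, c, p in product(layer_configs, multipliers, concepts, prompts)
]

total_runs = len(runs)
settings = {(r["layers"], r["multiplier"]) for r in runs}
runs_per_setting = total_runs // len(settings)
```

Under these assumptions the grid contains 40 distinct layer-and-multiplier settings, each judged on 32 outputs (4 concepts times 8 prompts), for 1,280 generations in total. Per-setting figures such as a degeneracy percentage are therefore averages over a fairly small sample.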

The configurations ranged from sparse (5 evenly spaced global attention layers) to dense (contiguous blocks of 12-19 layers), targeting different network depths. The team also tested steering all 38 candidate layers simultaneously. Their hypothesis was that sparse global layers would provide maximal steering influence with minimal risk.

The experimental results validated this hypothesis. The sparse 5-layer global configuration maintained coherence across the entire multiplier range. At multiplier 0.75—the most aggressive value tested—it produced zero degenerate outputs (defined as responses so broken they’re unusable: repetition loops, gibberish, or complete topic failure) with a coherence score of 0.858. By contrast, the dense late 12-layer configuration at multiplier 0.55 achieved only 0.113 coherence with 83% degenerate outputs.

Late layers proved catastrophically sensitive. The dense 19-layer mid-to-late configuration at multiplier 0.55 resulted in 100% degenerate outputs. Steering all 38 candidate layers at multiplier 0.35 produced 73% degeneracy. The later in the network steering is applied, the faster output quality collapses. This aligns with architectural understanding: late layers are responsible for token generation, and perturbing them directly breaks the sampling process.

The degeneration cliff is steep and layer-dependent. The dense 12-layer mid-network configuration transitions from 0% degenerate at multiplier 0.15 to 13% at 0.75—a gradual slope that’s uncomfortable but usable. However, the dense 13-layer mid-to-late configuration jumps from 0% at 0.15 to 57% at 0.55, representing a massive cliff. The more layers being steered, especially late layers, the faster the margin for error disappears.
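One way to turn judged outputs into an operating limit is sketched below: compute the degenerate rate per setting, then take the largest tested multiplier that stays within a tolerance. This is an illustrative sketch, not the team's analysis code. The record format, the configuration labels, and the 15% tolerance are hypothetical; the four rates in `reported` are the figures quoted above.

```python
from collections import defaultdict


def degenerate_rates(records):
    """Fraction of degenerate outputs for each (config, multiplier) setting."""
    counts = defaultdict(lambda: [0, 0])  # setting -> [degenerate count, total count]
    for rec in records:
        key = (rec["config"], rec["multiplier"])
        counts[key][0] += int(rec["degenerate"])
        counts[key][1] += 1
    return {key: bad / total for key, (bad, total) in counts.items()}


def max_safe_multiplier(rates, config, tolerance):
    """Largest tested multiplier reached before any rate exceeds the tolerance."""
    safe = None
    tested = sorted((m, rate) for (cfg, m), rate in rates.items() if cfg == config)
    for multiplier, rate in tested:
        if rate > tolerance:
            break
        safe = multiplier
    return safe


# Toy judged outputs (hypothetical): one of four is flagged degenerate.
toy_records = [
    {"config": "toy", "multiplier": 0.5, "degenerate": True},
    {"config": "toy", "multiplier": 0.5, "degenerate": False},
    {"config": "toy", "multiplier": 0.5, "degenerate": False},
    {"config": "toy", "multiplier": 0.5, "degenerate": False},
]
toy_rates = degenerate_rates(toy_records)

# Degenerate rates quoted in the text; labels are hypothetical shorthand.
reported = {
    ("dense_mid_12", 0.15): 0.0,
    ("dense_mid_12", 0.75): 0.13,
    ("dense_mid_late_13", 0.15): 0.0,
    ("dense_mid_late_13", 0.55): 0.57,
}
mid_limit = max_safe_multiplier(reported, "dense_mid_12", tolerance=0.15)
late_limit = max_safe_multiplier(reported, "dense_mid_late_13", tolerance=0.15)
```

With this tolerance the mid-network block remains usable up to 0.75, while the mid-to-late block is only usable at 0.15, which is the narrowing margin the paragraph describes.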

In these experiments, global attention layers outperformed nearby local attention layers. Steering mid-to-late global layers produces thematic reframing where the model genuinely weaves concepts into reasoning, rather than surface-level word substitution. Global attention at these depths provides a wider receptive field for concept integration, helping models absorb and express steered concepts without degenerating. Local attention layers at the same depth performed measurably worse because they only see a 1,024-token sliding window rather than full sequence context.

For the tested concepts and prompts, layer 41 emerged as the best single-layer target, positioned at approximately 66% network depth. It’s late enough to have abstract semantic information but early enough that perturbations don’t immediately break token generation. Layer 41 demonstrated the widest operating range before hitting degeneration.

Based on these findings, the team selected the sparse 5-layer global configuration targeting layers 23, 29, 35, 41, and 47. They appropriately acknowledge these results come from a relatively small sweep and represent a practical default rather than a universal law of steering behavior. This scientific humility is important in LLMOps contexts where generalization claims often outpace empirical validation.

Architectural Insights and Model Differences

The migration from Qwen to Gemma revealed fundamental differences in how models organize internal representations. Qwen’s 28 layers are uniform with standard self-attention throughout and semantic concepts concentrated in a narrow mid-layer band. Gemma’s 62 layers are structurally heterogeneous, alternating between local and global attention with concept encoding distributed across a much wider range. Despite tackling the same task with the same steering approach, the models require completely different targeting strategies. This finding has significant implications for LLMOps: architecture strongly affects where and how inference-time interventions are effective, and strategies optimized for one model architecture may fail completely on another.

The team also addressed the effective strength problem. Different concepts have different natural magnitudes in representation space, meaning a “mild” steering strength for one concept might be catastrophic for another. The steering vector’s magnitude depends on how concepts are distributed across the model’s activations. In their experience, abstract concepts like “absurdism” tend to produce smaller vectors than concrete ones like “marine biology,” so the same multiplier produces different effective strengths. To handle this in production, they expose three presets (low, medium, strong) with multipliers calibrated through testing across a range of concept types. The default is tuned so concepts are present and clear without degrading reasoning. Lower presets keep concepts subtle for light thematic influence, while the strong preset pushes closer to the coherence boundary where the model’s concept obsession becomes aggressive but output quality begins to trade off.
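The effective strength problem can be made concrete with a short sketch: the perturbation actually added to the activations is the multiplier times the vector's norm, so vectors of different sizes respond differently to the same multiplier. The normalization helper shown here is a hypothetical alternative for illustration only; per the text, Ramp instead ships presets calibrated through testing. The vectors are toy values.

```python
import math


def l2_norm(vec):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in vec))


def effective_strength(vector, multiplier):
    """Size of the perturbation actually added to the activations."""
    return multiplier * l2_norm(vector)


def multiplier_for_target(vector, target_strength):
    """Hypothetical normalization: the multiplier that yields a fixed perturbation size."""
    return target_strength / l2_norm(vector)


# Toy vectors (hypothetical): the abstract concept has the smaller norm.
abstract_vec = [3.0, 4.0]    # norm 5
concrete_vec = [6.0, 8.0]    # norm 10

same_multiplier = 0.5
abstract_strength = effective_strength(abstract_vec, same_multiplier)
concrete_strength = effective_strength(concrete_vec, same_multiplier)

# Multipliers that would equalize the perturbation size at a target of 2.0.
abstract_multiplier = multiplier_for_target(abstract_vec, 2.0)
concrete_multiplier = multiplier_for_target(concrete_vec, 2.0)
```

In this toy case the same multiplier perturbs the concrete concept twice as hard as the abstract one, which is why a single threshold calibrated on one concept does not transfer cleanly to others.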

Production Deployment and Cost Optimization

Deploying Gemma 27B in production presented significant infrastructure challenges. As an experimental project rather than revenue-generating product, Ramp couldn’t justify keeping GPU fleets warm and idle around the clock. The primary challenge was cold starts: loading Gemma’s weights into GPU memory from scratch takes 60-120 seconds before the first token, creating an unacceptable user experience.

The team’s solution leverages Modal’s snapshot capability, snapshotting both container memory and GPU memory with Gemma already loaded. When a cold function spins up, it restores from this snapshot, and the only remaining work is loading the user’s specific steering vector. This optimization reduces cold starts from 60-120 seconds to 5-12 seconds, roughly an order-of-magnitude improvement that makes the interactive experience viable. It also shows the value of understanding the deployment platform well enough to design around fundamental constraints without requiring always-on infrastructure.

The training pipeline for generating steering vectors runs on Modal’s serverless GPU infrastructure. Users submit concepts, the system extracts steering vectors, and steered models become available for conversational interaction in short order. This architecture separates the compute-intensive vector generation from the inference serving, allowing efficient resource utilization.

Interpretability and Future Implications

The broader significance of this work lies in its interpretability implications for production LLM systems. As more organizations fine-tune models for domain-specific use cases, the standard approach to failures—adding training data or adjusting hyperparameters—remains essentially blind. Interpretability techniques like activation steering let teams peer inside models to understand which representations do useful work and where leverage points exist.

The team positions this work within the broader context of advancing interpretability research, referencing Anthropic’s work on activation oracles and applied interpretability from startups like Goodfire AI. They argue that organizations building interpretability muscle now will be better positioned as LLMs shift from black boxes to systems we understand and engineer. Steer serves as Ramp’s platform for building that muscle.

From a critical LLMOps perspective, this case study demonstrates both the promise and challenges of advanced inference-time interventions. The technique works and produces interesting results, but requires extensive experimentation to calibrate properly, behaves differently across model architectures, and demands concept-specific tuning. The engineering work to make it production-viable—cold start optimization, serverless deployment, systematic evaluation—is substantial. Organizations considering similar approaches should expect significant upfront investment in experimentation and infrastructure optimization, with the understanding that insights may not generalize across model families or even model sizes within the same family.

The systematic evaluation approach, with an LLM judge scoring 1,280 generations, represents mature LLMOps practice: treating model behavior as something to be measured quantitatively rather than assessed subjectively. However, relying on LLM judges rather than human evaluation raises questions about evaluation validity that the case study does not address. The practical realities of cost-constrained deployment (snapshot-based cold start optimization, serverless infrastructure) demonstrate creative engineering within constraints rather than unlimited-budget solutions, making the lessons more transferable to other organizations with similar constraints.
