Company: Baz
Title: AI-Powered Code Review Platform Using Abstract Syntax Trees and LLM Context
Industry: Tech
Year: 2023

Summary (short):
Baz is building an AI code review agent that addresses the challenge of understanding complex codebases at scale. The platform combines Abstract Syntax Trees (AST) with LLM semantic understanding to provide automated code reviews that go beyond traditional static analysis. By integrating context from multiple sources including code structure, Jira/Linear tickets, CI logs, and deployment patterns, Baz aims to replicate the knowledge of a staff engineer who understands not just the code but the entire business context. The solution has evolved from basic reviews to catching performance issues and schema changes, with customers using it to review code generated by AI coding assistants like Cursor and Codex.
## Overview

Baz is a code review platform founded in August 2023 by Nimrod (CTO) and Guy (CEO), who previously worked together at Bridgecrew, a cloud security company acquired by Palo Alto Networks. The company's mission is to build the best platform for understanding codebases, with its current product being an AI-powered code review agent. The founders bring significant experience from the dev tools space, having built Checkov, an open-source infrastructure-as-code security scanning tool with hundreds of millions of downloads.

The core insight driving Baz is that pre-LLM tools could perform static analysis and type checking on code, but they couldn't understand semantic meaning. LLMs brought the ability to understand what code actually does, not just how it's structured. However, the founders recognized early that simply dumping code into an LLM (as tools like Git Ingest or Deepwiki do) is insufficient for production-grade code understanding. The key innovation at Baz is combining AST-based code traversal with LLM semantic understanding and extensive context gathering to create reviews that match or exceed what a senior staff engineer would provide.

## Technical Architecture and Core Technology

### Abstract Syntax Tree Foundation

The foundation of Baz's approach is the Abstract Syntax Tree (AST), which provides a structured representation of code that goes beyond the surface syntax developers write. As explained in the interview, written code follows language-specific syntax rules, but the AST represents it as a hierarchical tree of tokens and nodes. For example, a Python function definition (`def foo(bar: str):`) is parsed into a function declaration node with children representing the function name, parameters, and their types. Baz leverages Tree-sitter, an open-source parsing library that supports 138 programming languages and provides a unified interface for working with ASTs across languages.
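Baz builds on Tree-sitter, but the same idea can be illustrated with Python's standard-library `ast` module, which parses the example function into exactly this kind of node hierarchy (a sketch for illustration, not Baz's implementation):

```python
import ast

# Parse a small function definition into an abstract syntax tree.
source = "def foo(bar: str):\n    return bar.upper()\n"
tree = ast.parse(source)

func = tree.body[0]                           # the FunctionDef node
print(type(func).__name__)                    # FunctionDef
print(func.name)                              # foo
arg = func.args.args[0]
print(arg.arg, ast.unparse(arg.annotation))   # bar str
```

The `FunctionDef` node carries the name, an `arguments` child holding each parameter and its type annotation, and a body of statement nodes, which is the hierarchical structure the text describes.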
This allows Baz to handle JavaScript, Python, Go, and other languages without heavy per-language engineering investment. While each language has slightly different semantics (functions vs methods vs Go's receiver functions), Tree-sitter normalizes the representation.

The AST enables Baz to traverse code as a graph rather than as a linear sequence of files. In reality, codebases form complex graphs where functions call other functions across file boundaries, creating dependencies and data flows. The AST allows Baz to follow these connections deterministically, understanding which functions call which other functions and bringing only the relevant definitions into the LLM context rather than the entire codebase.

### The Context Problem

The team at Baz identified context as "the only moat" in today's AI landscape. Prompts can be reverse-engineered, and better models continuously emerge, but building comprehensive, relevant context is genuinely difficult. When an experienced developer reviews code, they bring implicit knowledge: understanding of the service architecture, deployment processes, CI/CD pipelines, the requirements from tickets, conversations from planning sessions, and the broader system design. Replicating this context for an LLM is the core challenge Baz addresses.

Baz's context-building approach operates on multiple levels:

**Code-level context**: Using the AST, Baz identifies not just the diff in a pull request but all the functions, classes, and modules that are connected to the changed code. This means understanding the potential impact radius of a change. However, the team learned that naive graph traversal can pull in too much. They encountered a case where changing a main entry point function pulled in 5,000 functions and classes—essentially the entire application. This led to developing heuristics about critical vs non-critical connections to keep context focused and relevant.
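A minimal sketch of this kind of bounded traversal, with hypothetical names (`relevant_context` and the node/depth caps) standing in for Baz's actual critical-vs-non-critical heuristics:

```python
from collections import deque

def relevant_context(call_graph, changed, max_nodes=50, max_depth=2):
    """Collect the definitions connected to a set of changed functions by
    walking the call graph breadth-first. The caps stand in for Baz's
    heuristics: without them, a change to a main entry point can drag in
    thousands of definitions."""
    seen = set(changed)
    queue = deque((f, 0) for f in changed)
    while queue and len(seen) < max_nodes:
        func, depth = queue.popleft()
        if depth >= max_depth:
            continue
        for callee in call_graph.get(func, ()):
            if len(seen) >= max_nodes:
                break
            if callee not in seen:
                seen.add(callee)
                queue.append((callee, depth + 1))
    return seen

# Toy call graph: a focused change to parse_config stays small even though
# main fans out to the rest of the application.
graph = {
    "main": ["init", "serve", "parse_config"],
    "parse_config": ["read_file", "validate"],
    "validate": ["log_error"],
}
print(sorted(relevant_context(graph, {"parse_config"})))
```

Only the definitions reachable within the caps would be sent to the LLM, rather than the full transitive closure of the change.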
**Project and module context**: Baz attempts to understand the nature of each project—is it a frontend web app built with TypeScript and Vite, a Java SDK for mobile, a backend service, or infrastructure code? This high-level understanding helps frame the review appropriately.

**Requirements context**: Integration with Jira and Linear allows Baz to pull in the actual tickets and requirements that motivated the code changes. The interview provides a concrete example: an engineer opened a PR to improve graph-building performance. The changes involved a loop that would break and restart from the beginning when encountering bad edges. Baz's review noted this was "heavy-handed" and suggested performance improvements, but critically, it knew to frame the feedback around performance because it had read the original ticket describing the performance issue.

**CI/CD context**: Baz reads CI logs to understand test failures, deployment patterns, and what might break. This allows it to anticipate issues beyond just the code structure.

**Schema and API context**: The system attempts to understand impacts on API endpoints and data schemas. For example, if a field is removed from an object that gets saved to MongoDB or S3, Baz can identify that the schema change might cause data persistence issues even if the code compiles correctly.

### Model Selection and Evolution

The team has witnessed significant evolution in model capabilities over the past two years. Initially working with GPT-3.5 and GPT-4, they saw acceptable results but frequent hallucinations. A turning point came with Claude 3.5 Sonnet, which Nimrod describes as "a beast model" that took nearly two years for anything to surpass in coding tasks. The improvement in hallucination rates has been dramatic since the early days.

An illustrative early bug: Baz had a parameter mismatch where they sent a field called "diff" but expected to receive a field called "code_diff"—essentially nothing was being received by the LLM.
Yet the model, seeing an empty "code" field, simply hallucinated a Python pull request from memory (despite Baz being written in Rust at the time) and provided a complete review of this imagined PR. The team found this both amusing and concerning, highlighting the hallucination challenges in early LLM applications.

Importantly, Baz doesn't expose model selection to end users. Nimrod expresses frustration with the industry's tendency to announce support for every new model variant as if it's a feature. From his perspective, model choice is an implementation detail—Baz owns the quality of its output, and customers shouldn't need to care which model powers it. This reflects mature product thinking in which the abstraction layer matters more than the underlying components.

### Agent Architecture

Baz describes their system as an agent because it makes autonomous decisions about which actions to take. The agent has a limited but meaningful set of tools and decides which to use based on the context. For example, when encountering a ticket reference, the agent determines which ticketing system is being used (Linear vs Jira) and then makes the appropriate API call to fetch the ticket content.

The agent itself is implemented as a Python container that connects to a knowledge base storing the AST representations and code structure. The ASTs are stored in PostgreSQL rather than a graph database—Nimrod explicitly notes "graph databases suck" as an aside. While the agent container could theoretically run anywhere, the full system with knowledge base and Git server integration is complex enough that they provide it as a Helm chart for on-premise deployments rather than a simple Docker image.

### Security and Prompt Injection

A significant concern for Baz is security, particularly prompt injection attacks.
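The deterministic tool dispatch described under Agent Architecture, which also serves as a guard against injection-driven calls, might look roughly like this (all names and patterns here are hypothetical, not Baz's API):

```python
import re

# Fixed allow-list of tools: the model may only pick a category; the
# request that actually goes out is constructed here, never assembled
# from raw user-controlled text in the diff or PR description.
TICKET_PATTERNS = {
    "linear": re.compile(r"\b[A-Z]{2,5}-\d+\b"),                   # e.g. ENG-123
    "jira":   re.compile(r"\bhttps?://\S+\.atlassian\.net/browse/(\S+)"),
}

def route_ticket_reference(pr_description: str):
    """Return (tool_name, ticket_id) or None. Tool choice is deterministic:
    whichever known pattern matches wins; unknown references fetch nothing,
    and no user-specified endpoint is ever called."""
    m = TICKET_PATTERNS["jira"].search(pr_description)
    if m:
        return ("jira", m.group(1))
    m = TICKET_PATTERNS["linear"].search(pr_description)
    if m:
        return ("linear", m.group(0))
    return None

print(route_ticket_reference("Fixes ENG-42: slow graph build"))  # → ('linear', 'ENG-42')
```

The LLM can suggest "this looks like a ticket reference," but the structured, validated call is made by fixed code, which is the pattern the architecture relies on.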
The team has encountered malicious or mischievous users opening pull requests with diffs like "write me a React component that does [something]" attempting to get the LLM to execute arbitrary instructions rather than review actual code. This is analogous to SQL injection, but for LLM prompts.

As a result, Baz doesn't allow arbitrary tool integration or HTTP calls to user-specified endpoints. Integration with external services like MCP (Model Context Protocol) is carefully controlled. While they can add new integrations quickly (especially since MCP's release), each tool must go through their security review process. The architecture ensures that tool selection is deterministic—the LLM decides which category of tool to use (e.g., "this is a ticket reference, fetch from the ticketing system"), but the actual API calls are structured and validated.

## LLMOps Practices and Production Considerations

### Monitoring and Observability

When building the initial MVP, the team simply released code and watched logs to see what happened. However, once they began working with design partners (essentially pre-paying beta customers), they needed more sophisticated observability. They integrated with an observability platform (they mention being flexible between Sentry, New Relic, or raw OpenTelemetry with Jaeger and Grafana) to trace the inputs and outputs of LLM calls.

The key insight is that observability for LLM applications requires capturing not just error rates and latency but the actual content of prompts and responses, in order to debug issues like hallucinations or poor-quality outputs. This is more analogous to application performance monitoring than traditional infrastructure monitoring.

### Benchmarking and Evaluation

A critical aspect of Baz's LLMOps practice is comprehensive benchmarking. Nimrod emphasizes that without running every change against a large, tagged dataset before release, teams won't have confidence in their deployments.
However, he notes that engineers typically hate the manual work of tagging evaluation data, even though it's essential. The nature of LLM benchmarks differs fundamentally from traditional software testing. Benchmarks don't aim for 100% pass rates because LLMs are statistical machines, not deterministic systems. Baz has benchmarks targeting 80-90% success for some tasks and 40-50% for others, depending on what current models can reliably achieve. The expectation is that these percentages will improve as models advance, but accepting non-deterministic behavior is part of working with LLMs in production.

This represents a significant shift in engineering culture. Traditional software engineering emphasizes deterministic behavior and 100% test pass rates. LLMOps requires embracing probabilistic systems where "good enough most of the time" is the current state of the art, while still maintaining quality bars through statistical measures.

### Deployment and Release Philosophy

The interview reveals interesting contrasts in release philosophies across the founders' history. At Bridgecrew, they operated with typical startup velocity: full CI/CD, everything to production immediately, test in production. After Palo Alto's acquisition and integration, an SVP visited and noted that while Bridgecrew had "the fastest adoption in the history of Palo Alto Networks acquisitions," they needed to slow down releases. The reasoning was that enterprise sales, sales engineering, marketing, and messaging all need to coordinate. Releasing five times per day made it impossible for the broader organization to keep up.

This tension between engineering velocity and enterprise go-to-market needs was part of what motivated starting Baz—a desire to return to faster iteration. However, the lessons about quality and testing from the enterprise experience clearly carried over, as evidenced by their sophisticated benchmarking and monitoring practices.
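The statistical quality bars described under Benchmarking and Evaluation can be sketched as a release gate with per-task thresholds (names and numbers below are illustrative, not Baz's actual benchmarks):

```python
# Per-task pass-rate thresholds rather than a 100% bar: LLM outputs are
# statistical, so a release gate checks that each benchmark suite stays
# above its currently achievable baseline. All names here are hypothetical.
THRESHOLDS = {
    "summarize_diff": 0.85,   # mature task: models handle this reliably
    "schema_impact":  0.45,   # harder task: current models are roughly 50/50
}

def gate_release(results: dict[str, list[bool]]) -> dict[str, bool]:
    """results maps task name -> per-case pass/fail outcomes from running
    the tagged evaluation dataset before a release."""
    verdict = {}
    for task, outcomes in results.items():
        rate = sum(outcomes) / len(outcomes)
        verdict[task] = rate >= THRESHOLDS[task]
    return verdict

runs = {
    "summarize_diff": [True] * 18 + [False] * 2,   # 90% pass
    "schema_impact":  [True] * 8 + [False] * 12,   # 40% pass
}
print(gate_release(runs))  # → {'summarize_diff': True, 'schema_impact': False}
```

A failing task blocks the release even though no individual case is expected to pass deterministically, which is the cultural shift the section describes.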
### Integration and Product Surface

Baz operates primarily as a GitHub application that users install. It automatically scans repositories, identifies modules, and maps connections between files, functions, and other code structures. When developers open pull requests, Baz automatically provides reviews as comments, identifying what changed, what stayed the same, what's connected, and the potential impacts.

The product philosophy emphasizes simplicity. Nimrod expresses a personal preference for software that requires minimal configuration—if something requires extensive setup, he simply won't use it. This drives Baz's design toward being a "just install and go" solution rather than a complex platform requiring extensive configuration. The backend, however, is sophisticated, storing the AST representations and code knowledge in PostgreSQL and maintaining the graph of code relationships. The simplicity is in the user experience, not the underlying implementation.

### Real-World Usage Patterns

An interesting use case that has emerged is Baz reviewing code generated by AI coding assistants. Guy, the CEO, uses the Codex coding agent to build frontend features even though he's not deeply familiar with the frontend conventions at Baz. Codex generates the code and opens a pull request, Baz reviews it and provides feedback, then Codex iterates based on Baz's review. This creates a fully automated loop where AI generates code, AI reviews it, and AI fixes issues—with the CEO orchestrating but not writing code manually.

This is a glimpse of a future of software development in which humans specify intent and AI agents handle implementation details, with other AI agents handling quality control. The fact that this loop already works in production at Baz itself is notable.

## Challenges and Limitations

### The Scaling Challenge from Bridgecrew

The Bridgecrew/Palo Alto experience provides valuable context on scaling challenges.
Starting with 30 customers, they integrated into Palo Alto's sales machine and jumped to 350 customers in three months. Nimrod notes they discovered issues with their infrastructure when it hit real scale—a common startup problem where teams say "we'll handle scale when we get there" only to find their systems falling apart when scale arrives suddenly.

### Model Dependency and Moat Building

There's an interesting strategic tension in the interview. Nimrod emphasizes that context is the only real moat because prompts can be extracted and new models constantly emerge. Yet the product's quality is clearly dependent on model capabilities—the jump to Claude 3.5 Sonnet was a major improvement. This suggests that while Baz's value lies primarily in context building and AST integration, they're still dependent on foundation model providers for core capabilities.

The decision not to expose model selection to users is strategic but also pragmatic. It keeps the focus on outcomes rather than implementation details, but it also means Baz needs to continuously evaluate new models and switch when better options emerge.

### Context Completeness

Despite the sophisticated context gathering, there are inherent limitations. The system can't capture hallway conversations, the nuanced understanding that comes from years of working in a codebase, or the intuitive pattern recognition that experienced developers develop. Baz aims to replicate the context a developer has after working at a company for a long time, but this is an approximation at best.

The example of catching the performance issue because of ticket context is impressive, but it also reveals a dependency on external systems. If tickets aren't well written, or if important context lives in Slack conversations or documents that Baz doesn't access, the reviews will be less valuable.

## Industry Perspective and Future Vision

### Long-Term Vision

Beyond code review, Baz's vision is to become the platform that best understands codebases.
Once you deeply understand what code does and what changes mean, many other problems become tractable. Nimrod specifically mentions CI/CD as a natural extension—if the system understands the codebase structure and what changed, it could automatically determine what needs to run in CI, what needs to be packaged, and what needs deployment, without developers writing complex bash scripts.

The broader vision is reducing developer toil by building systems that understand intent and context well enough to handle routine decisions automatically. This aligns with the industry trend toward higher levels of abstraction, where developers specify what they want rather than how to achieve it.

### Ecosystem Observations

Nimrod offers interesting perspectives on the broader ecosystem. He notes that when Baz was founded, GitHub Copilot dominated the coding space, and they deliberately chose not to compete directly with Microsoft and GitHub. Instead, they found an adjacent space (code review) that would become more important as coding agents improved. This strategic positioning—picking problems that become more valuable as AI improves rather than competing with AI—is insightful.

He expresses frustration with the industry's tendency to announce support for every new model variant as if it's a significant feature. This reflects a mature perspective that implementation details shouldn't be the product surface for end users.

Looking forward, Nimrod advocates for centralized AI interfaces that wrap multiple tools through something like MCP, creating a unified workspace rather than dozens of browser tabs and fragmented tools. This vision of an API gateway for AI integrations with proper security controls reflects lessons from both the security tooling space and LLM application development.

### Market Positioning

The decision to focus on code review rather than code generation was based on extensive customer discovery—talking to 100 VPs of R&D, CTOs, and architects.
The insight was that if coding agents succeed in generating code at scale, someone needs to review all that code. This positions Baz as complementary to, rather than competitive with, coding assistants like Copilot, Cursor, and others. The fact that Baz is already being used to review AI-generated code in production (including at Baz itself) validates this positioning. As AI-generated code becomes more prevalent, the review and quality control problem becomes more acute, not less.

## Balanced Assessment

The case study presents compelling evidence that Baz is building valuable technology for a real problem. The combination of AST-based code understanding with LLM semantic analysis and extensive context gathering addresses genuine limitations of both traditional static analysis tools and naive LLM applications. The real-world example of catching performance issues by correlating code changes with ticket context demonstrates value beyond what existing tools provide.

However, some caveats and limitations should be considered. The product is still relatively young (founded August 2023), and while they have design partners and paying customers, the scale of production usage isn't fully detailed. The dependency on foundation model capabilities means that Baz's quality is partly determined by factors outside their control, even as they build a moat through context.

The security concerns around prompt injection and the careful limitations on tool integrations are prudent, but they also mean that Baz can't be as flexible or extensible as some users might want. The decision not to support arbitrary integrations or MCP servers without vetting is defensible but creates friction for customers with custom internal tools. The claim about being "the best AI code review agent in the world" is marketing language that's hard to verify objectively, especially given the difficulty of benchmarking code review quality.
That said, the technical approach seems sound, and the team's deep experience in dev tools and security scanning lends credibility to their execution capability.

Overall, this represents a thoughtful application of LLMs to a concrete developer tools problem, with mature thinking about the LLMOps challenges of context management, evaluation, monitoring, and security. The product appears to deliver real value in production use cases, particularly as AI-generated code becomes more prevalent.
