Multi-Agent AI Systems: Moving From Demos to Production

2026 is the year multi-agent AI systems finally move from impressive demos to real production workloads. Agent control planes, orchestration dashboards, and observability tooling are becoming first-class infrastructure. Gartner predicts that by 2028, 80% of enterprise workplace applications will embed AI agents — and the foundation for that future is being laid right now.

The single-agent paradigm — one LLM doing everything — is hitting its limits. Complex tasks demand specialization, coordination, and resilience. That is exactly what multi-agent systems deliver. But building them for production is a fundamentally different challenge than building a demo. This article covers the architecture, implementation, and hard-won lessons of shipping multi-agent systems at scale.

What Is a Multi-Agent System?

A multi-agent system (MAS) is an architecture where multiple AI agents — each with specialized capabilities — work together to accomplish tasks that no single agent could handle alone. Each agent has a defined role, a focused set of tools, and a specific area of expertise.

Think of the microservices analogy. Just as monolithic applications were broken into focused services communicating over APIs, monolithic agents are being decomposed into specialized agents communicating over protocols. A single "do everything" agent is like a monolithic application: it works for simple cases but becomes brittle, hard to debug, and impossible to scale.

A multi-agent system typically consists of:

  • Specialized agents — each focused on a narrow domain (planning, coding, reviewing, testing)
  • A communication protocol — how agents share information and pass work
  • An orchestration layer — something that decides which agent runs when
  • Shared state — a common memory or context that agents can read from and write to
  • Error handling — strategies for when individual agents fail

The key insight is that smaller, focused agents outperform large, general-purpose ones. A code review agent with a carefully tuned system prompt and a narrow set of tools will catch more bugs than a general agent asked to "review this code." Specialization improves quality, reduces hallucination, and makes debugging tractable.

Architectural Patterns

There are four primary architectural patterns for multi-agent systems, each suited to different problem types. Understanding these patterns is essential for choosing the right approach for your use case.

1. Sequential Pipeline

Agents pass work in a strict order, like an assembly line. Each agent processes the output of the previous one and passes its result to the next. This is the simplest pattern and works well when tasks have a clear linear flow.

sequential-pipeline.ts
// Sequential Pipeline: Planner → Coder → Reviewer → Tester
interface AgentResult {
  content: string;
  metadata: Record<string, unknown>;
  status: 'success' | 'failure' | 'needs_revision';
}

interface PipelineAgent {
  name: string;
  process(input: AgentResult): Promise<AgentResult>;
}

class SequentialPipeline {
  private agents: PipelineAgent[];

  constructor(agents: PipelineAgent[]) {
    this.agents = agents;
  }

  async execute(initialInput: string): Promise<AgentResult> {
    let result: AgentResult = {
      content: initialInput,
      metadata: {},
      status: 'success',
    };

    for (const agent of this.agents) {
      console.log(`  Pipeline stage: ${agent.name}`);
      result = await agent.process(result);

      if (result.status === 'failure') {
        console.error(`  Pipeline failed at ${agent.name}`);
        break;
      }
    }

    return result;
  }
}

// Usage
const pipeline = new SequentialPipeline([
  new PlannerAgent(),   // Decomposes the task into steps
  new CoderAgent(),     // Writes the implementation
  new ReviewerAgent(),  // Reviews for bugs and style
  new TesterAgent(),    // Generates and runs tests
]);

const result = await pipeline.execute(
  "Add a rate limiter middleware to the Express API"
);

The sequential pipeline is ideal for code generation workflows, document processing, and any task where each stage clearly depends on the previous one. The downside is that a failure in any stage blocks the entire pipeline.

2. Supervisor/Worker Pattern

One orchestrator agent delegates tasks to specialized worker agents. The supervisor decides which agents to invoke, in what order, and how to combine their results. This pattern offers more flexibility than a strict pipeline.

supervisor-worker.ts
// Supervisor/Worker: One orchestrator delegates to specialists
interface WorkerAgent {
  name: string;
  capabilities: string[];
  execute(task: string, context: SharedContext): Promise<AgentResult>;
}

interface SharedContext {
  goal: string;
  history: Array<{ agent: string; result: AgentResult }>;
  artifacts: Map<string, string>;
}

class SupervisorAgent {
  private workers: WorkerAgent[];
  private llm: LLMClient;

  constructor(workers: WorkerAgent[], llm: LLMClient) {
    this.workers = workers;
    this.llm = llm;
  }

  async solve(goal: string, maxRounds: number = 10): Promise<string> {
    const context: SharedContext = {
      goal,
      history: [],
      artifacts: new Map(),
    };

    for (let round = 0; round < maxRounds; round++) {
      // Ask the supervisor LLM which worker to invoke next
      const decision = await this.llm.decide({
        goal,
        availableWorkers: this.workers.map(w => ({
          name: w.name,
          capabilities: w.capabilities,
        })),
        history: context.history,
      });

      if (decision.action === 'finish') {
        return decision.finalAnswer;
      }

      // Dispatch to the chosen worker
      const worker = this.workers.find(w => w.name === decision.workerName);
      if (!worker) throw new Error(`Unknown worker: ${decision.workerName}`);

      console.log(`  Supervisor delegating to: ${worker.name}`);
      const result = await worker.execute(decision.task, context);
      context.history.push({ agent: worker.name, result });

      // Store any artifacts the worker produced
      if (result.metadata.artifact) {
        context.artifacts.set(
          result.metadata.artifactName as string,
          result.content
        );
      }
    }

    throw new Error('Supervisor exceeded max rounds');
  }
}

The supervisor pattern is the most common in production systems because it provides centralized control, clear delegation, and straightforward debugging. The supervisor acts as a single point of coordination, making it easy to add logging, rate limiting, and error handling.
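Because every delegation flows through the supervisor, cross-cutting controls can live in one place. A minimal sketch (in Python, with a hypothetical per-worker call budget and hypothetical names) of a guard enforced at that single choke point:

```python
from collections import Counter

class DispatchGuard:
    """Per-worker call budget checked at the supervisor's dispatch point.
    Illustrative only: the limit and the RuntimeError policy are assumptions."""

    def __init__(self, max_calls_per_worker: int = 5):
        self.max_calls = max_calls_per_worker
        self.calls: Counter[str] = Counter()

    def check(self, worker_name: str) -> None:
        # Count this dispatch and refuse once the budget is spent
        self.calls[worker_name] += 1
        if self.calls[worker_name] > self.max_calls:
            raise RuntimeError(
                f"{worker_name} exceeded {self.max_calls} calls this workflow"
            )
```

The supervisor would call `guard.check(worker.name)` just before `worker.execute(...)`; the same hook is a natural place for logging and token accounting.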

3. Peer-to-Peer Pattern

Agents negotiate and collaborate as equals, without a central orchestrator. Each agent can initiate communication with any other agent. This pattern is powerful for scenarios where no single agent has a complete picture of the problem.

peer_to_peer.py
# Peer-to-Peer: Agents negotiate as equals
from dataclasses import dataclass, field
from typing import Callable
import asyncio

@dataclass
class Message:
    sender: str
    receiver: str
    content: str
    msg_type: str  # "request", "response", "broadcast"

class PeerAgent:
    def __init__(self, name: str, capabilities: list[str], handler: Callable):
        self.name = name
        self.capabilities = capabilities
        self.handler = handler
        self.inbox: asyncio.Queue[Message] = asyncio.Queue()
        self.peers: dict[str, "PeerAgent"] = {}

    def register_peer(self, peer: "PeerAgent"):
        self.peers[peer.name] = peer

    async def send(self, receiver: str, content: str, msg_type: str = "request"):
        msg = Message(self.name, receiver, content, msg_type)
        await self.peers[receiver].inbox.put(msg)

    async def broadcast(self, content: str):
        for peer_name, peer in self.peers.items():
            msg = Message(self.name, peer_name, content, "broadcast")
            await peer.inbox.put(msg)

    async def run(self):
        """Main loop: process messages from inbox."""
        while True:
            msg = await self.inbox.get()
            response = await self.handler(self, msg)
            if response and msg.msg_type == "request":
                await self.send(msg.sender, response, "response")

# Example: design review with peer agents
async def architect_handler(agent: PeerAgent, msg: Message) -> str | None:
    if "review architecture" in msg.content:
        return "Architecture looks solid. Consider adding a cache layer."
    return None

async def security_handler(agent: PeerAgent, msg: Message) -> str | None:
    if "security review" in msg.content:
        return "Found 2 issues: SQL injection risk in query builder, missing rate limit."
    return None

architect = PeerAgent("architect", ["design", "architecture"], architect_handler)
security = PeerAgent("security", ["security", "compliance"], security_handler)
architect.register_peer(security)
security.register_peer(architect)

Peer-to-peer systems excel in scenarios like collaborative design reviews, debate-style reasoning, and consensus-building. The downside is complexity: without a coordinator, it is harder to ensure convergence and prevent infinite loops.

4. Hierarchical Pattern

Multi-level agent trees for complex task decomposition. A top-level agent breaks a problem into subproblems, each delegated to a mid-level agent, which may further decompose and delegate. This mirrors how large organizations structure their work.

hierarchical.ts
// Hierarchical: Multi-level agent trees
interface HierarchicalAgent {
  name: string;
  level: number;
  children: HierarchicalAgent[];
  execute(task: string): Promise<AgentResult>;
}

class ManagerAgent implements HierarchicalAgent {
  name: string;
  level: number;
  children: HierarchicalAgent[];
  private llm: LLMClient;

  constructor(
    name: string,
    level: number,
    children: HierarchicalAgent[],
    llm: LLMClient
  ) {
    this.name = name;
    this.level = level;
    this.children = children;
    this.llm = llm;
  }

  async execute(task: string): Promise<AgentResult> {
    // Decompose task into subtasks for children
    const subtasks = await this.llm.decompose(task, this.children.map(c => c.name));

    const results: AgentResult[] = [];
    for (const subtask of subtasks) {
      const child = this.children.find(c => c.name === subtask.assignee);
      if (!child) throw new Error(`Unknown assignee: ${subtask.assignee}`);
      console.log(`  [${this.name}] Delegating to ${child.name}: ${subtask.description}`);
      const result = await child.execute(subtask.description);
      results.push(result);
    }

    // Synthesize child results
    const synthesis = await this.llm.synthesize(task, results);
    return { content: synthesis, metadata: { level: this.level }, status: 'success' };
  }
}

// Build the hierarchy (llm is a shared LLMClient instance)
const system = new ManagerAgent("CTO", 0, [
  new ManagerAgent("Backend Lead", 1, [
    new LeafAgent("API Developer"),
    new LeafAgent("Database Engineer"),
  ], llm),
  new ManagerAgent("Frontend Lead", 1, [
    new LeafAgent("UI Developer"),
    new LeafAgent("UX Specialist"),
  ], llm),
  new ManagerAgent("QA Lead", 1, [
    new LeafAgent("Test Engineer"),
    new LeafAgent("Performance Tester"),
  ], llm),
], llm);

await system.execute("Build a user dashboard with real-time analytics");

Hierarchical systems are ideal for large-scale project planning, enterprise workflows, and any problem that naturally decomposes into a tree of subproblems. The tradeoff is latency — deep hierarchies mean many sequential LLM calls.
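To see why depth hurts, count the calls. Assuming the dispatch pattern sketched above (every manager makes one decompose and one synthesize call and dispatches children one at a time, every leaf agent makes a single call), a back-of-the-envelope estimator for a uniform tree:

```python
def sequential_llm_calls(branching: int, manager_levels: int) -> int:
    """Total LLM calls for a uniform agent tree dispatched strictly sequentially.
    Assumes: each manager makes one decompose + one synthesize call,
    each leaf agent makes exactly one call."""
    managers = sum(branching ** level for level in range((manager_levels)))
    leaves = branching ** manager_levels
    return 2 * managers + leaves
```

A tree with branching 3 and two manager levels already makes 17 sequential calls; at two seconds per call, that is 34 seconds before the user sees anything, which is why wide-and-shallow hierarchies with parallel dispatch are usually preferable.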

Building a Multi-Agent System

Let us walk through building a production multi-agent system step by step. We will build a code review pipeline that takes a pull request and produces a comprehensive review.

Step 1: Define Agent Roles and Capabilities

Start by clearly defining what each agent does, what tools it has access to, and what it produces. This is the most important design decision — get this wrong and the whole system falls apart.

agent-definitions.ts
// Define clear agent contracts
interface AgentDefinition {
  name: string;
  role: string;
  systemPrompt: string;
  tools: ToolDefinition[];
  inputSchema: z.ZodSchema;
  outputSchema: z.ZodSchema;
}

const agents: AgentDefinition[] = [
  {
    name: "diff-analyzer",
    role: "Analyze the PR diff and categorize changes",
    systemPrompt: `You are a code diff analyst. Your job is to:
      1. Parse the git diff
      2. Categorize each change (new feature, bug fix, refactor, config)
      3. Identify files with the highest risk
      4. Output a structured analysis`,
    tools: [readFileTool, gitDiffTool],
    inputSchema: z.object({ prNumber: z.number() }),
    outputSchema: z.object({
      files: z.array(z.object({
        path: z.string(),
        changeType: z.enum(["added", "modified", "deleted"]),
        riskLevel: z.enum(["low", "medium", "high"]),
        summary: z.string(),
      })),
    }),
  },
  {
    name: "bug-detector",
    role: "Find potential bugs and logic errors",
    systemPrompt: `You are a bug detection specialist. Analyze code changes for:
      - Null/undefined handling issues
      - Off-by-one errors
      - Race conditions
      - Missing error handling
      - Type mismatches
      Report each issue with severity, file, line, and explanation.`,
    tools: [readFileTool, searchCodeTool],
    inputSchema: z.object({ files: z.array(z.string()), diff: z.string() }),
    outputSchema: z.object({
      issues: z.array(z.object({
        severity: z.enum(["critical", "warning", "info"]),
        file: z.string(),
        line: z.number(),
        description: z.string(),
        suggestion: z.string(),
      })),
    }),
  },
  {
    name: "security-scanner",
    role: "Identify security vulnerabilities",
    systemPrompt: `You are a security specialist. Scan for:
      - SQL injection, XSS, CSRF vulnerabilities
      - Hardcoded secrets or credentials
      - Insecure dependencies
      - Missing input validation
      - Authentication/authorization gaps`,
    tools: [readFileTool, dependencyCheckTool],
    inputSchema: z.object({ files: z.array(z.string()) }),
    outputSchema: z.object({
      vulnerabilities: z.array(z.object({
        severity: z.enum(["critical", "high", "medium", "low"]),
        type: z.string(),
        file: z.string(),
        description: z.string(),
        remediation: z.string(),
      })),
    }),
  },
  {
    name: "review-synthesizer",
    role: "Combine all findings into a final review",
    systemPrompt: `You are a senior engineer writing a PR review. Synthesize
      findings from the analysis, bug detection, and security scan into a
      clear, actionable review. Prioritize critical issues first.`,
    tools: [],
    inputSchema: z.object({
      analysis: z.unknown(),
      bugs: z.unknown(),
      security: z.unknown(),
    }),
    outputSchema: z.object({
      summary: z.string(),
      verdict: z.enum(["approve", "request_changes", "comment"]),
      comments: z.array(z.object({
        file: z.string(),
        line: z.number(),
        body: z.string(),
      })),
    }),
  },
];

Step 2: Design the Communication Protocol

Agents need a structured way to exchange information. Define clear message types and schemas. This prevents the "telephone game" problem where information degrades as it passes between agents.

message-bus.ts
// Structured inter-agent communication
interface AgentMessage {
  id: string;
  from: string;
  to: string;
  type: 'task' | 'result' | 'error' | 'clarification';
  payload: unknown;
  timestamp: number;
  traceId: string;  // For distributed tracing
}

class MessageBus {
  private handlers: Map<string, (msg: AgentMessage) => Promise<void>> = new Map();
  private messageLog: AgentMessage[] = [];

  register(agentName: string, handler: (msg: AgentMessage) => Promise<void>) {
    this.handlers.set(agentName, handler);
  }

  async send(message: AgentMessage): Promise<void> {
    this.messageLog.push(message);
    console.log(`  [${message.from} → ${message.to}] ${message.type}`);

    const handler = this.handlers.get(message.to);
    if (!handler) throw new Error(`No handler for agent: ${message.to}`);
    await handler(message);
  }

  getTrace(traceId: string): AgentMessage[] {
    return this.messageLog.filter(m => m.traceId === traceId);
  }
}

Step 3: Build the Orchestrator

The orchestrator manages the lifecycle of the multi-agent workflow. It decides when to invoke each agent, handles retries, and manages the overall execution flow.

orchestrator.ts
// Production orchestrator with retries and timeouts
import Anthropic from "@anthropic-ai/sdk";

interface OrchestratorConfig {
  maxRetries: number;
  timeoutMs: number;
  tokenBudget: number;
}

class ReviewOrchestrator {
  private client: Anthropic;
  private config: OrchestratorConfig;
  private tokenUsage: number = 0;

  constructor(config: OrchestratorConfig) {
    this.client = new Anthropic();
    this.config = config;
  }

  async reviewPR(prNumber: number): Promise<ReviewResult> {
    const traceId = crypto.randomUUID();
    console.log(`Starting review for PR #${prNumber} [trace: ${traceId}]`);

    // Stage 1: Analyze the diff
    const analysis = await this.runAgent("diff-analyzer", {
      prNumber,
    }, traceId);

    // Stage 2: Run bug detection and security scan in parallel
    const [bugs, security] = await Promise.all([
      this.runAgent("bug-detector", {
        files: analysis.files.map(f => f.path),
        diff: analysis.rawDiff,
      }, traceId),
      this.runAgent("security-scanner", {
        files: analysis.files.map(f => f.path),
      }, traceId),
    ]);

    // Stage 3: Synthesize into final review
    const review = await this.runAgent("review-synthesizer", {
      analysis,
      bugs,
      security,
    }, traceId);

    console.log(`Review complete. Tokens used: ${this.tokenUsage}`);
    return review;
  }

  private async runAgent(
    agentName: string,
    input: unknown,
    traceId: string,
    attempt: number = 0
  ): Promise<unknown> {
    if (this.tokenUsage > this.config.tokenBudget) {
      throw new Error(`Token budget exceeded: ${this.tokenUsage}/${this.config.tokenBudget}`);
    }

    try {
      const result = await Promise.race([
        this.executeAgent(agentName, input, traceId),
        this.timeout(this.config.timeoutMs),
      ]);
      return result;
    } catch (error) {
      if (attempt < this.config.maxRetries) {
        console.warn(`  Retrying ${agentName} (attempt ${attempt + 1})`);
        return this.runAgent(agentName, input, traceId, attempt + 1);
      }
      throw error;
    }
  }

  private timeout(ms: number): Promise<never> {
    return new Promise((_, reject) =>
      setTimeout(() => reject(new Error("Agent timeout")), ms)
    );
  }
}

Step 4: Implement Shared Memory and State

Agents need access to shared context — the current state of the workflow, artifacts produced by other agents, and decisions already made. A well-designed shared memory layer prevents redundant work and ensures consistency.

workflow-memory.ts
// Shared state management for multi-agent workflows
interface WorkflowState {
  id: string;
  status: 'running' | 'completed' | 'failed';
  agents: Record<string, AgentState>;
  artifacts: Map<string, Artifact>;
  context: Record<string, unknown>;
}

interface AgentState {
  status: 'pending' | 'running' | 'completed' | 'failed';
  startedAt?: number;
  completedAt?: number;
  tokensUsed: number;
  result?: unknown;
  error?: string;
}

interface Artifact {
  key: string;
  producedBy: string;
  content: string;
  contentType: string;
  createdAt: number;
}

class WorkflowMemory {
  private state: WorkflowState;

  constructor(workflowId: string) {
    this.state = {
      id: workflowId,
      status: 'running',
      agents: {},
      artifacts: new Map(),
      context: {},
    };
  }

  // Agents write artifacts for other agents to consume
  setArtifact(agentName: string, key: string, content: string, contentType: string) {
    this.state.artifacts.set(key, {
      key,
      producedBy: agentName,
      content,
      contentType,
      createdAt: Date.now(),
    });
  }

  getArtifact(key: string): Artifact | undefined {
    return this.state.artifacts.get(key);
  }

  // Shared context that any agent can read or update
  setContext(key: string, value: unknown) {
    this.state.context[key] = value;
  }

  getContext(key: string): unknown {
    return this.state.context[key];
  }

  // Track agent lifecycle
  markAgentStarted(name: string) {
    this.state.agents[name] = {
      status: 'running',
      startedAt: Date.now(),
      tokensUsed: 0,
    };
  }

  markAgentCompleted(name: string, result: unknown, tokensUsed: number) {
    this.state.agents[name] = {
      ...this.state.agents[name],
      status: 'completed',
      completedAt: Date.now(),
      tokensUsed,
      result,
    };
  }

  // Snapshot for observability
  getSnapshot(): WorkflowState {
    return structuredClone(this.state);
  }
}

Step 5: Add Error Handling and Fallbacks

Production multi-agent systems must handle failures gracefully. Individual agents will fail — the system must continue. Implement circuit breakers, fallback agents, and degraded-mode operation.

error-handling.ts
// Production error handling for multi-agent systems
class CircuitBreaker {
  private failures: number = 0;
  private lastFailure: number = 0;
  private state: 'closed' | 'open' | 'half-open' = 'closed';

  constructor(
    private threshold: number = 3,
    private resetTimeMs: number = 60_000
  ) {}

  async execute<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'open') {
      if (Date.now() - this.lastFailure > this.resetTimeMs) {
        this.state = 'half-open';
      } else {
        throw new Error('Circuit breaker is open');
      }
    }

    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }

  private onSuccess() {
    this.failures = 0;
    this.state = 'closed';
  }

  private onFailure() {
    this.failures++;
    this.lastFailure = Date.now();
    if (this.failures >= this.threshold) {
      this.state = 'open';
    }
  }
}

// Fallback strategy: if the primary agent fails, use a simpler fallback
class AgentWithFallback {
  constructor(
    private primary: AgentDefinition,
    private fallback: AgentDefinition,
    private breaker: CircuitBreaker = new CircuitBreaker()
  ) {}

  async execute(input: unknown): Promise<AgentResult> {
    try {
      return await this.breaker.execute(() =>
        runAgent(this.primary, input)
      );
    } catch (error) {
      console.warn(
        `Primary agent ${this.primary.name} failed, using fallback ${this.fallback.name}`
      );
      return runAgent(this.fallback, input);
    }
  }
}

Production Challenges

Building a multi-agent demo is straightforward. Shipping one to production is a different beast entirely. Here are the challenges that will define your success or failure.

Cost Management

The math is brutal: N agents times M calls per agent times cost per call. A four-agent pipeline where each agent makes five LLM calls costs twenty times what a single LLM call costs. Multiply that by request volume and costs can spiral quickly. A system handling 1,000 requests per day at $0.05 per multi-agent run is $1,500 per month — and that is a modest example.
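That arithmetic is worth encoding, even crudely, before you ship. A sketch of the estimate above, using a purely hypothetical per-call price:

```python
def cost_per_run(agents: int, calls_per_agent: int, cost_per_call: float) -> float:
    """N agents x M calls per agent x cost per call."""
    return agents * calls_per_agent * cost_per_call

def monthly_cost(requests_per_day: float, run_cost: float, days: int = 30) -> float:
    """Scale the per-run cost by request volume."""
    return requests_per_day * run_cost * days

# The example from the text: 4 agents x 5 calls each, at an assumed $0.0025/call
run = cost_per_run(agents=4, calls_per_agent=5, cost_per_call=0.0025)  # $0.05 per run
print(f"${monthly_cost(1000, run):,.0f} per month")                    # $1,500 per month
```

Wiring a live version of this into the orchestrator's token accounting turns a surprise invoice into an alert.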

Latency

Sequential agent chains multiply latency. If each agent takes two seconds and you have four agents in sequence, that is eight seconds minimum. Users will not wait eight seconds for most interactions. Parallelizing where possible and using streaming are essential strategies.
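The payoff from parallelizing independent agents is easy to demonstrate with stub calls (a hypothetical 50 ms sleep stands in for a real agent invocation):

```python
import asyncio

async def agent_call(name: str, latency: float = 0.05) -> str:
    """Stand-in for an LLM-backed agent call (latency is illustrative)."""
    await asyncio.sleep(latency)
    return f"{name}: done"

async def run_sequential(names: list[str]) -> list[str]:
    # One agent after another: total latency is the sum of all calls
    return [await agent_call(n) for n in names]

async def run_parallel(names: list[str]) -> list[str]:
    # Independent agents run concurrently: total latency is roughly the slowest call
    return list(await asyncio.gather(*(agent_call(n) for n in names)))
```

Four sequential stub calls take about four times as long as the same four run through `asyncio.gather`; only genuinely independent stages (like the bug and security scans in the orchestrator above) can be parallelized this way.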

Error Cascading

When Agent A produces slightly wrong output, Agent B builds on that wrong output, and Agent C amplifies the error further. By the time you reach the final agent, the result can be completely wrong. This "cascading hallucination" problem is one of the hardest challenges in multi-agent systems.
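The cheapest defense is to validate every hand-off, not just the final answer. The agent definitions earlier in this article do this with Zod output schemas; the same idea in plain Python, with a hypothetical required-field check:

```python
def validate_handoff(agent_name: str, output: dict, required_keys: set[str]) -> dict:
    """Fail fast at the stage boundary instead of letting a malformed result
    propagate. A missing field here is a cheap, local error; three agents
    later it is a confidently wrong final answer."""
    missing = required_keys - output.keys()
    if missing:
        raise ValueError(f"{agent_name} output missing fields: {sorted(missing)}")
    return output
```

Real systems pair structural checks like this with semantic ones (e.g. a cheap verifier model spot-checking claims) since a well-formed output can still be wrong.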

Debugging Agent Conversations

When a multi-agent system produces a wrong answer, where did it go wrong? Was it the planner that misunderstood the task? The coder that introduced a bug? The reviewer that missed it? Tracing through multi-agent conversations to find the root cause is significantly harder than debugging a single agent.

Testing Multi-Agent Interactions

Unit testing individual agents is straightforward. Testing their interactions — the emergent behavior of the system as a whole — is much harder. You need integration tests that verify the entire pipeline produces correct results, and you need to test edge cases where agents disagree or produce conflicting outputs.
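One pragmatic approach is to integration-test the orchestration itself with deterministic stub agents, so the wiring is verified even though LLM outputs are not reproducible. A minimal sketch (the stage names and string format are hypothetical):

```python
def run_pipeline(stages, task: str) -> str:
    """Minimal sequential pipeline: each stage is a plain function."""
    result = task
    for stage in stages:
        result = stage(result)
    return result

# Deterministic stubs stand in for LLM-backed agents so the test is repeatable
def stub_planner(task: str) -> str:
    return f"plan({task})"

def stub_coder(plan: str) -> str:
    return f"code({plan})"

def stub_reviewer(code: str) -> str:
    # Interaction check: the reviewer should only ever see coder output
    assert code.startswith("code("), "reviewer received input that skipped the coder"
    return f"review({code})"
```

`run_pipeline([stub_planner, stub_coder, stub_reviewer], task)` yields `review(code(plan(task)))`, and putting the reviewer first trips the interaction assertion, which is exactly the class of wiring bug unit tests miss.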

Frameworks for Multi-Agent Systems

Several frameworks have emerged to simplify building multi-agent systems. Each takes a different approach to orchestration, communication, and agent definition.

LangGraph

LangGraph models multi-agent workflows as directed graphs. Nodes are agents or functions, edges define the flow of control. It supports cycles, branching, and conditional routing — making it ideal for complex workflows with feedback loops.

langgraph_review.py
# LangGraph: Multi-agent workflow as a graph
from langgraph.graph import StateGraph, START, END
from typing import TypedDict

class ReviewState(TypedDict):
    pr_diff: str
    analysis: str
    bugs: list[str]
    security_issues: list[str]
    final_review: str

def analyze_diff(state: ReviewState) -> ReviewState:
    """Agent 1: Analyze the diff and categorize changes."""
    analysis = llm.invoke(
        f"Analyze this PR diff and categorize the changes:\n{state['pr_diff']}"
    )
    return {"analysis": analysis.content}

def detect_bugs(state: ReviewState) -> ReviewState:
    """Agent 2: Find potential bugs."""
    bugs = llm.invoke(
        f"Find bugs in this code analysis:\n{state['analysis']}"
    )
    return {"bugs": [bugs.content]}

def scan_security(state: ReviewState) -> ReviewState:
    """Agent 3: Security scan."""
    issues = llm.invoke(
        f"Find security vulnerabilities:\n{state['analysis']}"
    )
    return {"security_issues": [issues.content]}

def synthesize_review(state: ReviewState) -> ReviewState:
    """Agent 4: Combine into final review."""
    review = llm.invoke(
        f"""Synthesize a PR review from:
        Analysis: {state['analysis']}
        Bugs: {state['bugs']}
        Security: {state['security_issues']}"""
    )
    return {"final_review": review.content}

def route_after_analysis(state: ReviewState) -> list[str]:
    """Run bug detection and security scan in parallel."""
    return ["detect_bugs", "scan_security"]

# Build the graph
graph = StateGraph(ReviewState)
graph.add_node("analyze_diff", analyze_diff)
graph.add_node("detect_bugs", detect_bugs)
graph.add_node("scan_security", scan_security)
graph.add_node("synthesize_review", synthesize_review)

graph.add_edge(START, "analyze_diff")
graph.add_conditional_edges("analyze_diff", route_after_analysis)
graph.add_edge("detect_bugs", "synthesize_review")
graph.add_edge("scan_security", "synthesize_review")
graph.add_edge("synthesize_review", END)

app = graph.compile()
result = app.invoke({"pr_diff": diff_content})

CrewAI

CrewAI models agents as team members with roles, goals, and backstories. It emphasizes the social metaphor — agents collaborate like a team of specialists. This makes it intuitive for workflows that mirror human team structures.

crewai_review.py
# CrewAI: Agents as team members
from crewai import Agent, Task, Crew, Process

analyst = Agent(
    role="Senior Code Analyst",
    goal="Thoroughly analyze code changes and identify patterns",
    backstory="You have 15 years of experience reviewing code at top tech companies.",
    verbose=True,
    llm="claude-sonnet-4-20250514",
)

security_expert = Agent(
    role="Application Security Engineer",
    goal="Find security vulnerabilities before they reach production",
    backstory="Former penetration tester turned security engineer. You have seen every attack vector.",
    verbose=True,
    llm="claude-sonnet-4-20250514",
)

reviewer = Agent(
    role="Tech Lead",
    goal="Make the final review decision based on all findings",
    backstory="Staff engineer who balances velocity with quality. You know when to block and when to approve.",
    verbose=True,
    llm="claude-sonnet-4-20250514",
)

analyze_task = Task(
    description="Analyze the PR diff: {pr_diff}",
    expected_output="Structured analysis of all changes with risk levels",
    agent=analyst,
)

security_task = Task(
    description="Scan for security vulnerabilities in the changes",
    expected_output="List of security issues with severity and remediation",
    agent=security_expert,
)

review_task = Task(
    description="Write the final PR review combining all findings",
    expected_output="Complete PR review with verdict and inline comments",
    agent=reviewer,
)

crew = Crew(
    agents=[analyst, security_expert, reviewer],
    tasks=[analyze_task, security_task, review_task],
    process=Process.sequential,
    verbose=True,
)

result = crew.kickoff(inputs={"pr_diff": diff_content})

AutoGen

Microsoft AutoGen focuses on conversational multi-agent systems. Agents talk to each other in a chat-like format, negotiating and refining their work through dialogue. This is powerful for tasks where iterative refinement through debate produces better results.

autogen_review.py
# AutoGen: Conversational multi-agent collaboration
from autogen import AssistantAgent, UserProxyAgent, GroupChat, GroupChatManager

coder = AssistantAgent(
    name="Coder",
    system_message="""You are an expert programmer. Write clean, well-tested code.
    When given a task, implement it and explain your design decisions.""",
    llm_config={"model": "claude-sonnet-4-20250514"},
)

reviewer = AssistantAgent(
    name="Reviewer",
    system_message="""You are a code reviewer. Examine code for bugs, style issues,
    and architectural problems. Be thorough but constructive.""",
    llm_config={"model": "claude-sonnet-4-20250514"},
)

tester = AssistantAgent(
    name="Tester",
    system_message="""You write comprehensive test cases. Cover edge cases,
    error conditions, and performance scenarios.""",
    llm_config={"model": "claude-sonnet-4-20250514"},
)

executor = UserProxyAgent(
    name="Executor",
    human_input_mode="NEVER",
    code_execution_config={"work_dir": "workspace"},
)

group_chat = GroupChat(
    agents=[coder, reviewer, tester, executor],
    messages=[],
    max_round=12,
)

manager = GroupChatManager(groupchat=group_chat)
executor.initiate_chat(
    manager,
    message="Build a rate limiter middleware for Express.js with sliding window algorithm"
)

Claude Agent SDK

Anthropic's Claude Agent SDK provides a streamlined approach to building agents with built-in tool use, safety features, and structured outputs. It is designed for production use with a focus on reliability and cost control.

claude-agent-sdk.ts
// Claude Agent SDK: Production-focused agent building
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic();

// Define a specialized agent with scoped tools
async function createReviewAgent(
  role: string,
  systemPrompt: string,
  tools: Anthropic.Tool[]
) {
  return async function execute(task: string): Promise<string> {
    const messages: Anthropic.MessageParam[] = [
      { role: "user", content: task },
    ];

    let totalTokens = 0;

    // Agent loop: call the model, run any requested tools, and repeat
    // until the model stops asking for tools.
    while (true) {
      const response = await anthropic.messages.create({
        model: "claude-sonnet-4-20250514",
        max_tokens: 4096,
        system: systemPrompt,
        tools,
        messages,
      });

      totalTokens += response.usage.input_tokens + response.usage.output_tokens;

      if (response.stop_reason === "tool_use") {
        const toolUses = response.content.filter(
          (b): b is Anthropic.ToolUseBlock => b.type === "tool_use"
        );
        messages.push({ role: "assistant", content: response.content });

        // executeTool is the application's own tool dispatcher, defined elsewhere.
        const toolResults = await Promise.all(
          toolUses.map(async (tu) => ({
            type: "tool_result" as const,
            tool_use_id: tu.id,
            content: await executeTool(tu.name, tu.input),
          }))
        );

        messages.push({ role: "user", content: toolResults });
        continue;
      }

      // end_turn, max_tokens, or any other stop reason: return the text we
      // have instead of looping forever on an unhandled stop reason.
      const text = response.content.find(
        (b): b is Anthropic.TextBlock => b.type === "text"
      );
      console.log(`  [${role}] completed (${totalTokens} tokens)`);
      return text?.text ?? "";
    }
  };
}

// Orchestrate multiple agents (readFileTool, gitLogTool, and searchTool
// are tool definitions declared elsewhere)
async function multiAgentReview(prDiff: string) {
  const analyzer = await createReviewAgent(
    "Analyzer",
    "You analyze code diffs and categorize changes by risk level.",
    [readFileTool, gitLogTool]
  );

  const bugHunter = await createReviewAgent(
    "Bug Hunter",
    "You find bugs, logic errors, and edge cases in code.",
    [readFileTool, searchTool]
  );

  const analysis = await analyzer(`Analyze this diff:\n${prDiff}`);

  const bugs = await bugHunter(
    `Given this analysis:\n${analysis}\nFind all potential bugs.`
  );

  return { analysis, bugs };
}

Cost Optimization Strategies

Cost is the number one concern when deploying multi-agent systems to production. Here are proven strategies for keeping costs under control without sacrificing quality.

Model Routing

Not every agent needs the most powerful model. Use smaller, cheaper models for simpler agents and reserve the frontier model for the hardest tasks. A diff analyzer can run on a fast model, while a synthesizer that weighs all the findings might need the most capable one.

model-routing.ts
// Model routing: match model capability to task complexity
const MODEL_TIERS = {
  simple: "claude-haiku-4-20250514",     // Fast, cheap — formatting, routing, classification
  standard: "claude-sonnet-4-20250514",  // Balanced — most agent tasks
  complex: "claude-opus-4-20250514",     // Most capable — complex reasoning, synthesis
} as const;

interface AgentConfig {
  name: string;
  modelTier: keyof typeof MODEL_TIERS;
  maxTokens: number;
  tokenBudget: number;
}

const agentConfigs: AgentConfig[] = [
  { name: "router",           modelTier: "simple",   maxTokens: 256,  tokenBudget: 500 },
  { name: "diff-analyzer",    modelTier: "simple",   maxTokens: 1024, tokenBudget: 2000 },
  { name: "bug-detector",     modelTier: "standard", maxTokens: 2048, tokenBudget: 5000 },
  { name: "security-scanner", modelTier: "standard", maxTokens: 2048, tokenBudget: 5000 },
  { name: "synthesizer",      modelTier: "complex",  maxTokens: 4096, tokenBudget: 8000 },
];

function getModelForAgent(agentName: string): string {
  const config = agentConfigs.find(c => c.name === agentName);
  return MODEL_TIERS[config?.modelTier ?? "standard"];
}

Caching

Many agent calls produce identical results for identical inputs. An exact-match cache that stores results keyed by a hash of the input can dramatically reduce costs, especially for agents that perform static analysis or classification. (A true semantic cache, which matches similar rather than identical inputs, can raise hit rates further but adds complexity.)

agent-cache.ts
// Exact-match result cache for agent calls, keyed by input hash
import { createHash } from "crypto";

class AgentCache {
  private cache: Map<string, { result: unknown; timestamp: number }> = new Map();
  private ttlMs: number;

  constructor(ttlMs: number = 3600_000) {  // 1 hour default
    this.ttlMs = ttlMs;
  }

  private hashInput(agentName: string, input: unknown): string {
    const content = JSON.stringify({ agentName, input });
    return createHash("sha256").update(content).digest("hex");
  }

  get(agentName: string, input: unknown): unknown | null {
    const hash = this.hashInput(agentName, input);
    const entry = this.cache.get(hash);

    if (!entry) return null;
    if (Date.now() - entry.timestamp > this.ttlMs) {
      this.cache.delete(hash);
      return null;
    }

    console.log(`  Cache hit for ${agentName}`);
    return entry.result;
  }

  set(agentName: string, input: unknown, result: unknown): void {
    const hash = this.hashInput(agentName, input);
    this.cache.set(hash, { result, timestamp: Date.now() });
  }
}

Parallel Execution

Whenever agents do not depend on each other, run them in parallel. This does not save on token costs, but it dramatically reduces latency — which often matters more for user experience.
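As a sketch of the fan-out (the agents here are stand-in async functions, not real model calls), `Promise.allSettled` runs independent agents concurrently and keeps one failure from discarding the others' results:

```typescript
// Run independent agents concurrently; allSettled keeps one agent's
// failure from discarding the other agents' results.
type Agent = (input: string) => Promise<string>;

async function runParallel(
  agents: Record<string, Agent>,
  input: string
): Promise<Record<string, string | null>> {
  const names = Object.keys(agents);
  const settled = await Promise.allSettled(names.map((n) => agents[n](input)));

  const results: Record<string, string | null> = {};
  settled.forEach((outcome, i) => {
    // A failed agent yields null; the caller decides how to degrade.
    results[names[i]] = outcome.status === "fulfilled" ? outcome.value : null;
  });
  return results;
}

// Stand-in agents to illustrate the pattern (no real LLM calls).
const demoAgents: Record<string, Agent> = {
  security: async (d) => `security: ok (${d.length} chars)`,
  style: async () => "style: ok",
  bugs: async () => { throw new Error("model timeout"); },
};
```

With independent agents, wall-clock latency is roughly that of the slowest agent rather than the sum of all of them.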

Early Termination

If an early agent determines the task is trivial or impossible, skip the remaining agents. A router agent that classifies the complexity of a request can short-circuit simple cases to a single fast agent instead of running the full pipeline.
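A minimal sketch of the short-circuit, assuming the router, the single fast agent, and the full pipeline are passed in as plain async functions (all names hypothetical):

```typescript
// Early termination: a cheap router classifies the request first and
// short-circuits the full pipeline for trivial or impossible cases.
type Complexity = "trivial" | "standard" | "impossible";

async function handleRequest(
  request: string,
  route: (r: string) => Promise<Complexity>,    // cheap router agent
  quickAgent: (r: string) => Promise<string>,   // single fast agent
  fullPipeline: (r: string) => Promise<string>  // expensive multi-agent path
): Promise<string> {
  const complexity = await route(request);

  if (complexity === "impossible") {
    // Fail fast rather than burning tokens across the whole pipeline.
    return "Unable to process this request.";
  }
  if (complexity === "trivial") {
    return quickAgent(request);
  }
  return fullPipeline(request);
}
```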

Token Budgets Per Agent

Set a maximum token budget for each agent. If an agent exceeds its budget, terminate it and use whatever partial result is available. This prevents runaway agents from consuming your entire budget on a single request.
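One way to sketch the enforcement: wrap the agent's loop so it stops once cumulative spend crosses the budget (`step` here stands in for one model call of a hypothetical agent loop):

```typescript
// Token budget enforcement: stop an agent once its cumulative token
// spend exceeds its budget and surface whatever partial output exists.
interface StepResult {
  text: string;       // text produced by this model call
  tokensUsed: number; // tokens consumed by this model call
  done: boolean;      // true when the agent considers itself finished
}

async function runWithBudget(
  step: () => Promise<StepResult>,
  tokenBudget: number
): Promise<{ output: string; tokens: number; truncated: boolean }> {
  let output = "";
  let tokens = 0;

  while (true) {
    const r = await step();
    output += r.text;
    tokens += r.tokensUsed;

    if (r.done) return { output, tokens, truncated: false };
    if (tokens >= tokenBudget) {
      // Budget exhausted: terminate and return the partial result.
      return { output, tokens, truncated: true };
    }
  }
}
```

The `truncated` flag lets downstream agents decide whether a partial result is usable or should trigger a fallback.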

Observability and Debugging

You cannot improve what you cannot measure. Multi-agent systems require purpose-built observability tooling that goes beyond standard application monitoring.

Tracing Agent Conversations

Every message between agents should be logged with a trace ID that links the entire workflow. This lets you reconstruct exactly what happened when something goes wrong.

agent-tracer.ts
// Multi-agent observability layer
interface AgentSpan {
  traceId: string;
  spanId: string;
  parentSpanId?: string;
  agentName: string;
  startTime: number;
  endTime?: number;
  input: unknown;
  output?: unknown;
  tokensUsed: number;
  model: string;
  toolCalls: ToolCallRecord[];
  status: 'success' | 'failure' | 'timeout';
  error?: string;
}

interface ToolCallRecord {
  tool: string;
  input: unknown;
  output: unknown;
  durationMs: number;
}

class AgentTracer {
  private spans: AgentSpan[] = [];

  startSpan(traceId: string, agentName: string, input: unknown, parentSpanId?: string): string {
    const spanId = crypto.randomUUID();
    this.spans.push({
      traceId,
      spanId,
      parentSpanId,
      agentName,
      startTime: Date.now(),
      input,
      tokensUsed: 0,
      model: '',
      toolCalls: [],
      status: 'success',
    });
    return spanId;
  }

  endSpan(spanId: string, output: unknown, tokensUsed: number, status: 'success' | 'failure') {
    const span = this.spans.find(s => s.spanId === spanId);
    if (span) {
      span.endTime = Date.now();
      span.output = output;
      span.tokensUsed = tokensUsed;
      span.status = status;
    }
  }

  // Generate a dashboard-ready summary
  getTraceSummary(traceId: string) {
    const traceSpans = this.spans.filter(s => s.traceId === traceId);
    return {
      totalDuration: Math.max(...traceSpans.map(s => s.endTime ?? Date.now())) -
                     Math.min(...traceSpans.map(s => s.startTime)),
      totalTokens: traceSpans.reduce((sum, s) => sum + s.tokensUsed, 0),
      agentCount: traceSpans.length,
      failedAgents: traceSpans.filter(s => s.status === 'failure').map(s => s.agentName),
      timeline: traceSpans.map(s => ({
        agent: s.agentName,
        duration: (s.endTime ?? Date.now()) - s.startTime,
        tokens: s.tokensUsed,
        status: s.status,
      })),
    };
  }
}

Key Metrics to Track

For each agent in your system, measure and alert on the following:

  • Success rate — percentage of invocations that produce valid output
  • Latency — p50, p95, and p99 execution time per agent
  • Token usage — average and maximum tokens consumed per invocation
  • Error rate — how often each agent fails, and the failure modes
  • Cost per request — total dollar cost of the full multi-agent pipeline
  • Retry rate — how often agents need to be retried, indicating instability
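The latency percentiles above can be computed directly from recorded span durations; a minimal nearest-rank sketch:

```typescript
// Percentile latency from span durations using the nearest-rank method:
// the smallest value with at least p% of the samples at or below it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) return 0;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

function latencySummary(durationsMs: number[]) {
  return {
    p50: percentile(durationsMs, 50),
    p95: percentile(durationsMs, 95),
    p99: percentile(durationsMs, 99),
  };
}
```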

Dashboards

Build dashboards that show the health of your multi-agent system at a glance. Group metrics by agent, by workflow, and by time period. Alert on anomalies — a sudden spike in token usage for one agent usually means its prompt is hitting an edge case that causes verbose output.

Real-World Use Cases

Multi-agent systems are not theoretical. They are running in production today across a range of industries and applications.

Code Review Pipelines

One of the most mature use cases. A diff analyzer agent categorizes changes, a bug detector looks for issues, a security scanner checks for vulnerabilities, and a synthesizer writes the final review. Companies using this pattern report catching 40% more bugs than single-model reviews while reducing human review time by 60%.

Customer Support Escalation

A front-line agent handles common questions using a knowledge base. When it detects a complex issue, it escalates to a specialist agent — billing, technical, or compliance. If the specialist cannot resolve it, the system prepares a detailed summary and routes to a human agent with full context. This reduces average handle time and ensures customers get the right expertise.

Data Processing Workflows

Extract-transform-load pipelines where each stage is handled by a specialized agent. An extractor agent parses raw documents (PDFs, emails, spreadsheets). A normalizer agent standardizes formats and units. A validator agent checks data quality. A loader agent writes to the target system. Each agent can be optimized and tested independently.
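A sketch of the staging logic with stand-in stages (no real document parsing here); a failure reports exactly which stage broke:

```typescript
// Staged pipeline where each stage is a specialized agent; the output
// of one stage feeds the next, and failures name the broken stage.
type Stage = { name: string; run: (input: unknown) => Promise<unknown> };

async function runPipeline(
  input: unknown,
  stages: Stage[]
): Promise<{ output: unknown; completedStages: string[] }> {
  let current = input;
  const completedStages: string[] = [];

  for (const stage of stages) {
    try {
      current = await stage.run(current);
      completedStages.push(stage.name);
    } catch (err) {
      throw new Error(`Pipeline failed at stage "${stage.name}": ${err}`);
    }
  }
  return { output: current, completedStages };
}
```

Because each stage has a single responsibility, stages can be tested and swapped independently, which is exactly the property the ETL pattern relies on.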

Content Creation Pipelines

A researcher agent gathers information from multiple sources. A writer agent produces a first draft. An editor agent checks for accuracy, tone, and style. A fact-checker agent verifies claims against trusted sources. The result is content that is both fast to produce and reliable — far better than what a single agent can achieve.

Getting Started: A Practical Checklist

If you are ready to build a multi-agent system for production, here is a practical checklist to guide your implementation.

  1. Start with two agents — do not over-engineer from day one. A supervisor and one worker is enough to validate the pattern.
  2. Define clear input/output schemas for every agent using a tool like Zod or Pydantic. Unstructured agent communication is the leading cause of cascading failures.
  3. Implement token budgets from the start. It is much harder to add cost controls retroactively.
  4. Build observability first, not last. You will spend more time debugging agent interactions than writing agent prompts.
  5. Test with recorded traces. Capture real multi-agent conversations and replay them in tests to verify behavior.
  6. Use model routing from the beginning. Default to the cheapest model that works and upgrade only where quality demands it.
  7. Plan for graceful degradation. Every agent should have a fallback — even if the fallback is returning a reasonable default.
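Item 2 can be sketched dependency-free to show the principle; in practice Zod or Pydantic replaces the hand-rolled checks (the `BugReport` shape is hypothetical):

```typescript
// Validate agent output at the boundary so malformed output fails
// loudly and locally instead of cascading into downstream agents.
interface BugReport {
  file: string;
  line: number;
  severity: "low" | "medium" | "high";
}

function parseBugReport(raw: unknown): BugReport {
  const r = (raw ?? {}) as { file?: unknown; line?: unknown; severity?: unknown };
  const severities = ["low", "medium", "high"];
  const valid =
    typeof r.file === "string" &&
    typeof r.line === "number" && Number.isInteger(r.line) && r.line > 0 &&
    typeof r.severity === "string" && severities.includes(r.severity);

  if (!valid) {
    throw new Error(`Invalid bug report from agent: ${JSON.stringify(raw)}`);
  }
  return {
    file: r.file as string,
    line: r.line as number,
    severity: r.severity as BugReport["severity"],
  };
}
```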

Conclusion

Multi-agent AI systems represent the next evolution of AI-powered software. Just as microservices replaced monoliths and containers replaced virtual machines, specialized agents working in concert will replace monolithic AI systems.

The patterns are proven. The frameworks are maturing. The economics work for high-value workflows. What remains is the hard engineering work of building systems that are reliable, observable, and cost-effective at scale.

The developers and teams who master multi-agent orchestration today will have a significant competitive advantage as AI agents become the standard building block of modern software. The question is no longer whether multi-agent systems will move into production — it is whether your team will be ready when they do.

The future of AI is not a single brilliant agent. It is a system of specialized agents, each doing one thing exceptionally well, orchestrated into something greater than the sum of its parts.

Related Posts

Running Local LLMs in 2026: The Complete Hardware and Setup Guide

Local LLMs have gone from hobby to production-ready. Save $300-500/month in API costs with a setup that takes 10 minutes. Here is everything you need to know.

AI Writes the Code Now. What Is Left for Software Engineers?

With 51,000+ tech layoffs in 2026 and AI writing production code, the future of software engineering is being redefined. Here is what actually matters now.

I Audited Vibe-Coded Applications: Here Are the Security Nightmares I Found

Vibe coding — accepting AI-generated code without review — has a 24.7% security flaw rate and 2.74x more vulnerabilities. Here is what I found when I looked under the hood.