Chapter 10: The Coordinator Pattern — Multi-Agent Enterprise Orchestration
Learning Objectives:
- Understand the “coordinator-worker” architecture of the Coordinator pattern and its design motivations
- Master the complete workflow of multi-agent collaboration: from requirements analysis to delivery verification
- Gain a deep understanding of task allocation, fault recovery, and the Scratchpad collaboration space mechanisms
- Be able to compare the Coordinator pattern with the Fork pattern and make the correct choice for given scenarios
When a single agent cannot handle complex engineering tasks, Claude Code provides the Coordinator pattern — a centralized multi-agent orchestration solution. Unlike the Fork pattern’s peer-to-peer parallelism, the Coordinator pattern employs a “coordinator-worker” architecture, where a dedicated coordinator manages the lifecycle and task allocation of multiple parallel workers.
This is like how a construction site operates: the project manager (Coordinator) doesn’t need to lay bricks, run wiring, or install pipes personally, but they need to know which workers (Workers) are good at what, which tasks can run in parallel, which have dependencies, and how to coordinate shared resources (Scratchpad). When a worker encounters a problem, the project manager needs to decide whether to reassign the task or adjust the overall plan.
This chapter will dive deep into the source code design, revealing the design philosophy behind this enterprise-grade orchestration pattern.
10.1 Coordinator Architecture
The coordinatorMode Core Module
The core code of the Coordinator pattern resides in the coordinator module. Although this module is only about 370 lines long, it defines the entire interaction model for multi-agent collaboration. The conciseness of the code is not accidental — the coordinator’s responsibility is “orchestration” rather than “execution,” and it needs to stay lean to avoid becoming a performance bottleneck or single point of failure in the system.
The module’s entry point is the isCoordinatorMode() function, which reveals the Coordinator pattern’s dual gating mechanism:
flowchart TD subgraph CompileTime["Compile Time: Feature Gate"] CG["Does the code include Coordinator functionality?"] CG -->|"No"| CN["Related code is not compiled into the binary"] CG -->|"Yes"| RT end subgraph Runtime["Runtime: Environment Variable"] RT["Is CLAUDE_CODE_COORDINATOR_MODE<br/>set?"] RT -->|"No"| NN["Use normal mode (may activate Fork)"] RT -->|"Yes"| YY["Coordinator mode activated"] end style CompileTime fill:#e8f4fd,stroke:#2196F3,color:#1565C0 style Runtime fill:#fff3e0,stroke:#f39c12,color:#e67e22 style YY fill:#2ecc71,stroke:#27ae60,color:#fff style NN fill:#bdc3c7,stroke:#7f8c8d,color:#333
- Feature gate: Determines at compile time whether to include the feature’s code
- Environment variable: CLAUDE_CODE_COORDINATOR_MODE controls activation at runtime
Design Insight: Why use dual gating instead of a single switch?
The feature gate is a compile-time optimization — deployments that don’t need Coordinator functionality (such as lightweight SDK embedding scenarios) can completely exclude the related code, reducing binary size and attack surface. The environment variable is a runtime control — even in builds that include the feature, it must be explicitly enabled. This “compile-time exclusion + runtime explicit enable” pattern is common in enterprise software, satisfying both flexibility requirements and the principle of least privilege.
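The dual gating described above can be sketched as follows. This is a minimal Python illustration, not the actual source: the constant COORDINATOR_FEATURE_COMPILED stands in for the compile-time feature gate, and the set of values treated as "enabled" is an assumption.

```python
import os

# Hypothetical stand-in for the compile-time feature gate. In a real build
# the coordinator code would be excluded entirely when the gate is off.
COORDINATOR_FEATURE_COMPILED = True

ENV_VAR = "CLAUDE_CODE_COORDINATOR_MODE"


def is_truthy(value):
    """Treat common 'on' spellings as enabled (an assumed convention)."""
    return value is not None and value.strip().lower() in {"1", "true", "yes", "on"}


def is_coordinator_mode(env=None):
    """Both gates must pass: feature compiled in AND env var explicitly set."""
    env = os.environ if env is None else env
    if not COORDINATOR_FEATURE_COMPILED:
        return False
    return is_truthy(env.get(ENV_VAR))
```

Passing the environment as a parameter keeps the check easy to test; by default it reads the process environment.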
Activation Conditions and Mutual Exclusion
The Coordinator pattern interacts with other patterns in multiple places. First is its mutual exclusion with the Fork pattern: when both satisfy their conditions simultaneously, the Coordinator pattern takes priority. This is because the Coordinator already has its own task delegation model and doesn’t need the Fork pattern’s implicit parallelism capabilities.
flowchart TD A{"isCoordinatorMode?"} -->|"Yes"| C["Coordinator Mode<br/>(Fork disabled)"] A -->|"No"| B{"isForkModeEnabled?"} B -->|"Yes"| D["Fork Mode"] B -->|"No"| E["Normal Mode<br/>(synchronous sub-agents)"] style C fill:#2ecc71,stroke:#27ae60,color:#fff style D fill:#3498db,stroke:#2471a3,color:#fff style E fill:#bdc3c7,stroke:#7f8c8d,color:#333
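The priority order in the flowchart can be expressed as a small selection function. This is an illustrative sketch; the Mode enum and select_mode name are hypothetical and only mirror the decision order described in the text.

```python
from enum import Enum


class Mode(Enum):
    COORDINATOR = "coordinator"
    FORK = "fork"
    NORMAL = "normal"


def select_mode(coordinator_enabled: bool, fork_enabled: bool) -> Mode:
    """Coordinator wins when both are enabled; Fork is only considered otherwise."""
    if coordinator_enabled:
        return Mode.COORDINATOR  # Fork is disabled in this branch
    if fork_enabled:
        return Mode.FORK
    return Mode.NORMAL  # synchronous sub-agents
```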
Cross-Reference: The Fork pattern’s cache sharing mechanism is discussed in detail in Chapter 9. The core difference between the two is: Fork is “centerless parallelism” (all sub-agents are equal and share the same context), while Coordinator is “centered orchestration” (the coordinator controls the global view, and workers only see the tasks assigned to them).
The Coordinator pattern also affects the agent registry. When Coordinator mode is activated, the built-in agent registration function no longer returns the normal list of built-in agents but instead introduces Coordinator-specific worker agent definitions through lazy loading. This lazy loading approach is intentional, aimed at avoiding circular dependencies between the coordinator module and the tool module.
Mode Matching for Session Recovery
The matchSessionMode() function handles mode consistency during session recovery: when the current mode doesn’t match the session’s recorded mode, the system automatically flips the environment variable to match the session’s mode. This ensures that when a user recovers a session created in Coordinator mode, the system automatically activates Coordinator mode, even if the current startup configuration doesn’t have the environment variable set.
This design solves a practical problem: a user might set the Coordinator environment variable during one launch and create a session, but forget to set it when recovering that session later. Without automatic mode matching, the session would be restored in the wrong mode, leading to inconsistent behavior or even errors.
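A rough sketch of this mode-matching behavior is shown below. It assumes the session records its mode as a plain string ("coordinator" or "normal") and that the variable is set to "1" when enabled; both are assumptions made for illustration, not details confirmed by the source.

```python
ENV_VAR = "CLAUDE_CODE_COORDINATOR_MODE"


def match_session_mode(session_mode, env):
    """Flip the env var so the running mode matches the recorded session mode.

    `env` is a mutable mapping (for example os.environ).
    Returns True if the environment was changed, False if it already matched.
    """
    currently_coordinator = env.get(ENV_VAR) == "1"
    wants_coordinator = session_mode == "coordinator"
    if currently_coordinator == wants_coordinator:
        return False
    if wants_coordinator:
        env[ENV_VAR] = "1"
    else:
        env.pop(ENV_VAR, None)
    return True
```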
The Coordinator’s System Prompt
The Coordinator’s role is defined in the system prompt generation function — a carefully crafted system prompt that specifies the complete behavioral norms for the coordinator. Key points include:
Role Definition: The coordinator is not an executor but an orchestrator. It directly answers simple questions and delegates complex tasks to workers.
Tool Set: The coordinator has only four core tools — the Agent tool for spawning workers, the TaskStop tool for stopping workers, the SendMessage tool for sending messages to workers, and the structured output tool.
graph LR subgraph Coordinator["Coordinator -- Management Authority"] T1["Agent Tool<br/>Create and assign tasks to workers"] T2["TaskStop Tool<br/>Stop running workers"] T3["SendMessage Tool<br/>Send messages to workers"] T4["Structured Output Tool<br/>Output structured results"] end NOTE["Note: The coordinator does not have Read/Write/Edit/Bash or other execution tools<br/>It cannot directly modify code or files<br/>All actual work must be done indirectly through workers"] style T1 fill:#e74c3c,stroke:#c0392b,color:#fff style T2 fill:#e74c3c,stroke:#c0392b,color:#fff style T3 fill:#e74c3c,stroke:#c0392b,color:#fff style T4 fill:#e74c3c,stroke:#c0392b,color:#fff style NOTE fill:#fff3e0,stroke:#f39c12,color:#333
Key Constraints: The system prompt explicitly prohibits the coordinator from “using one worker to inspect another worker,” “using a worker to simply report file contents,” or “predicting or fabricating agent results.” These constraints ensure that the coordinator manages all communication directly, preventing overly long information-passing chains.
Anti-Pattern Warning: Why is “worker inspecting worker” prohibited?
Allowing Worker A to inspect Worker B’s results creates an “information chain”: Worker B completes its task → Worker A reads B’s results → Worker A reports to the coordinator. This chain has two serious problems:
Information decay: Each transmission loses details. Like a game of telephone, information after multiple rounds of relay may differ greatly from the original result.
Debugging difficulty: When the final result is wrong, you need to trace back layer by layer to find which link caused the problem.
The correct pattern is: the coordinator directly receives each worker’s results, understands them itself, and then writes the next set of instructions.
10.2 Worker Tool Allocation
INTERNAL_WORKER_TOOLS
Under the Coordinator pattern, workers’ tool allocation is controlled through two sets. The internal worker tools set defines the tools that workers should not see — these are the coordinator’s exclusive tools, including team creation, team deletion, message sending, and structured output.
This forms a clear boundary of authority:
graph LR subgraph Coordinator["Coordinator -- Management Authority"] C1["Agent Tool"] C2["TaskStop Tool"] C3["SendMessage Tool"] C4["Structured Output"] end subgraph Worker["Worker -- Execution Authority"] W1["Read / Write"] W2["Edit / Bash"] W3["Grep / Glob"] W4["WebSearch"] W5["Skill / MCP"] end Coordinator x-- Management vs Execution --x Worker style C1 fill:#e74c3c,stroke:#c0392b,color:#fff style C2 fill:#e74c3c,stroke:#c0392b,color:#fff style C3 fill:#e74c3c,stroke:#c0392b,color:#fff style C4 fill:#e74c3c,stroke:#c0392b,color:#fff style W1 fill:#3498db,stroke:#2471a3,color:#fff style W2 fill:#3498db,stroke:#2471a3,color:#fff style W3 fill:#3498db,stroke:#2471a3,color:#fff style W4 fill:#3498db,stroke:#2471a3,color:#fff style W5 fill:#3498db,stroke:#2471a3,color:#fff
Simple Mode and Full Mode Tool Sets
The getCoordinatorUserContext() function returns different tool descriptions based on the mode:
- Simple Mode: Workers only have Bash, Read, and Edit tools, suitable for resource-constrained environments
- Full Mode: Workers have all whitelisted tools except internal tools, including Read, Write, Edit, Bash, Grep, Glob, WebSearch, WebFetch, NotebookEdit, Skill, ToolSearch, and more
| Mode | Tool Set | Applicable Scenarios |
|---|---|---|
| Simple | Bash, Read, Edit | CI/CD environments, resource-constrained containers, quick validation |
| Full | All whitelisted tools | Local development, full IDE integration, complex refactoring |
In Full mode, workers can also use MCP tools and Skill tools. The system prompt informs the coordinator of the available worker tool list through user context.
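The two tool sets can be pictured as a filter over a whitelist. The sketch below is illustrative only: the tool name strings in INTERNAL_WORKER_TOOLS are hypothetical placeholders for the coordinator-exclusive tools, and worker_tools is not a function from the source.

```python
# Hypothetical names for the coordinator-exclusive tools hidden from workers.
INTERNAL_WORKER_TOOLS = frozenset(
    {"TeamCreate", "TeamDelete", "SendMessage", "StructuredOutput"}
)

# Simple mode exposes a fixed, minimal set.
SIMPLE_MODE_TOOLS = ("Bash", "Read", "Edit")


def worker_tools(whitelist, simple_mode):
    """Return the tools a worker may see in the given mode."""
    if simple_mode:
        return list(SIMPLE_MODE_TOOLS)
    # Full mode: everything on the whitelist except internal tools.
    return [tool for tool in whitelist if tool not in INTERNAL_WORKER_TOOLS]
```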
Practical Scenario: When to Use Simple Mode?
Simple mode is suitable for the following scenarios:
- Automated tasks in CI/CD pipelines: Build servers don’t need web search or file discovery
- Quick fix tasks: Simple fixes that only require reading, editing, and running tests
- Security-restricted environments: Minimizing the tool set reduces potential security risks
- Resource-constrained containers: Reduces tool initialization overhead
Independent Assembly of the Tool Pool
Workers’ tool pools are assembled independently from the parent level, ensuring that workers always get the complete tool set, unaffected by parent-level tool restrictions. Workers default to the acceptEdits permission mode (automatically accept file edits), unless the agent definition specifies another mode.
This design decision reflects an important principle: workers are executors and should not be hindered by permission issues. If a worker needed user confirmation every time it edited a file, the advantages of multi-agent collaboration would be completely negated. Of course, this requires the coordinator to assign tasks correctly — if given the wrong modification task, a worker will execute it without hesitation.
Best Practice: Pre-confirm Task Scope Under Coordinator Mode
Since workers automatically accept edits, users should review the overall plan before any worker starts modifying files. It’s recommended to add an instruction in CLAUDE.md requiring the coordinator to present the complete task allocation plan before starting the Implementation phase.
10.3 Team Management
TeamCreateTool / TeamDeleteTool
The Coordinator pattern shares team infrastructure with Agent Teams (multi-agent swarms). The team creation tool is responsible for creating teams, and its core process includes: checking whether already in a team (a leader can only manage one team), generating a unique team name, creating a team file (containing team name, leader ID, session ID, member list, etc.), then writing the team file, updating global state, and setting up the task list.
flowchart TD A["Check if already in a team"] -->|"Yes"| ERR["Error: Leader can only manage one team"] A -->|"No"| B["Generate unique team name<br/>Format: team-{random-ID}"] B --> C["Create team file<br/>Contains name, leader, member list"] C --> D["Update global state<br/>Register team to global manager"] D --> E["Set up task list<br/>Initialize task tracking structure"] style ERR fill:#e74c3c,stroke:#c0392b,color:#fff style A fill:#3498db,stroke:#2471a3,color:#fff style B fill:#2ecc71,stroke:#27ae60,color:#fff style C fill:#f39c12,stroke:#d68910,color:#fff style D fill:#9b59b6,stroke:#7d3c98,color:#fff style E fill:#1abc9c,stroke:#16a085,color:#fff
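The creation steps in the flowchart can be sketched as below. This is a simplified illustration under stated assumptions: the state dictionary, the team.json file name, and the field names are hypothetical, and only the order of steps follows the text.

```python
import json
import secrets
from pathlib import Path


class TeamError(Exception):
    """Raised when a team cannot be created."""


def create_team(state, teams_dir, leader_id, session_id):
    # Step 1: a leader can only manage one team.
    if state.get("current_team") is not None:
        raise TeamError("leader can only manage one team")
    # Step 2: unique team name in the form team-{random-ID}.
    name = f"team-{secrets.token_hex(4)}"
    # Step 3: create and write the team file.
    team_file = Path(teams_dir) / name / "team.json"
    team_file.parent.mkdir(parents=True)
    team_file.write_text(json.dumps({
        "name": name,
        "leader_id": leader_id,
        "session_id": session_id,
        "members": [],
    }))
    # Step 4: update global state. Step 5: initialize the task list.
    state["current_team"] = name
    state["tasks"] = []
    return name
```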
The team deletion tool handles cleanup: it first checks whether there are still active members, only allowing cleanup after all members have completed their work, then clears the team directory, worktree, and team context.
Safety Guarantees for Team Deletion
Team deletion is not a simple “delete everything” operation but a process with multiple safety checks:
- Active member check: If workers are still running, deletion is refused and the list of still-running members is returned
- Resource cleanup order: Clean team directory (Scratchpad, etc.) first, then worktree, then team context
- Error tolerance: Failure to clean up a single resource does not prevent cleanup of other resources
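The three safety checks above can be sketched as a single function. The data shapes here (a team dictionary with member statuses, and a list of labeled cleanup callables) are assumptions chosen for illustration.

```python
def delete_team(team, cleanup_steps):
    """Delete a team only when no member is still running.

    `cleanup_steps` is an ordered list of (label, callable) pairs, for example
    team directory first, then worktree, then team context.
    """
    active = [m["name"] for m in team["members"] if m["status"] == "running"]
    if active:
        # Active member check: refuse and report who is still running.
        return {"deleted": False, "active_members": active, "errors": []}
    errors = []
    for label, step in cleanup_steps:
        try:
            step()
        except Exception as exc:
            # Error tolerance: record the failure and keep cleaning.
            errors.append(f"{label}: {exc}")
    return {"deleted": True, "active_members": [], "errors": errors}
```

Collecting errors instead of raising keeps one failed cleanup step from leaving the remaining resources behind.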
Anti-Pattern Warning: Do Not Delete a Team Before Work Is Complete
If the coordinator forcefully deletes a team while workers are still running, the workers will lose their association with the team, which may lead to:
- Worker task notifications failing to reach the coordinator
- Scratchpad files being deleted while workers are still using them, so that reads return empty data
- Worktrees being cleaned up, causing workers’ file modifications to be lost
This is why team deletion requires the precondition that “all members have completed.”
SendMessageTool Message Passing
The message sending tool is the core communication channel for team collaboration. It supports several message types, including close requests, close responses, and plan approval responses.
Addressing modes for message passing:
| Addressing Mode | Format | Use Case | Communication Scope |
|---|---|---|---|
| Point-to-point | to: "agent-name" | Send specific instructions to a designated worker | Single worker |
| Broadcast | to: "*" | Publish public information to all workers | All workers |
| UDS | to: "uds:<socket-path>" | Cross-process communication (different CLI instances) | Cross-process |
| Bridge | to: "bridge:<session-id>" | Cross-session/cross-machine communication | Cross-session/remote |
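The four addressing formats in the table can be told apart by a simple prefix check. The function below is a sketch of that idea; parse_address and its return shape are hypothetical.

```python
def parse_address(to):
    """Classify a `to` value into (mode, target) following the table above."""
    if to == "*":
        return ("broadcast", None)
    if to.startswith("uds:"):
        return ("uds", to[len("uds:"):])        # socket path
    if to.startswith("bridge:"):
        return ("bridge", to[len("bridge:"):])  # session id
    return ("point-to-point", to)               # plain agent name
```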
graph TD COORD["Coordinator"] WA["Worker A<br/>(Research)"] WB["Worker B<br/>(Implementation)"] WC["Worker C<br/>(Verification)"] COORD -->|"Point-to-point or broadcast"| WA COORD -->|"Point-to-point or broadcast"| WB COORD -->|"Point-to-point or broadcast"| WC WA -->|"Task notification (automatic)"| COORD WB -->|"Task notification (automatic)"| COORD WC -->|"Task notification (automatic)"| COORD WA -.-x|"No direct communication"| WB WB -.-x|"No direct communication"| WC style COORD fill:#e74c3c,stroke:#c0392b,color:#fff style WA fill:#3498db,stroke:#2471a3,color:#fff style WB fill:#2ecc71,stroke:#27ae60,color:#fff style WC fill:#f39c12,stroke:#d68910,color:#fff
The message sending tool’s intelligent routing mechanism is particularly noteworthy: when the coordinator sends a message to a running worker, the system queues the message for the next round of tool invocation; when sending to a stopped worker, the system automatically resumes that worker and delivers the message as a new prompt. This “stop-resume” pattern allows the coordinator to efficiently manage workers’ lifecycles.
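The routing rule described above (queue for a running worker, resume a stopped one) can be sketched with a small in-memory model. The Worker class and its fields are hypothetical and exist only to make the two branches concrete.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    name: str
    running: bool
    queued: list = field(default_factory=list)   # delivered at next tool round
    prompts: list = field(default_factory=list)  # prompts the worker was started with


def deliver(worker, message):
    """Queue the message if the worker is running, otherwise resume it."""
    if worker.running:
        worker.queued.append(message)
        return "queued"
    # Stopped worker: resume it and hand over the message as a new prompt.
    worker.running = True
    worker.prompts.append(message)
    return "resumed"
```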
Design Insight: The Economy of the Stop-Resume Pattern
Workers are not “always online.” A research worker stops after completing its investigation, releasing API connections and memory resources. But the coordinator may later need to give this worker an additional task (e.g., “For module X you investigated earlier, look deeper into the dependency relationships”). The stop-resume pattern allows workers to be reactivated when needed without rebuilding context from scratch. This both saves resources and preserves the previous analysis state.
10.4 Collaboration Space
Scratchpad Collaboration Space Design
The Coordinator pattern introduces the Scratchpad concept: a shared temporary file space across workers. When the Scratchpad feature is enabled, a description is appended to the coordinator’s system prompt, informing it that workers can freely read and write to this directory without permission prompts, and suggesting its use for durable cross-worker knowledge storage.
The Scratchpad’s physical location is in a session-specific subdirectory under the project’s temporary directory, with the path format /tmp/claude-{uid}/{sanitized-cwd}/{sessionId}/scratchpad/. Each session has an independent scratchpad directory.
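The path format above can be assembled as shown in this sketch. The sanitization rule is an assumption (the source does not say how the working directory is sanitized), so sanitize_cwd is a hypothetical helper.

```python
import re
from pathlib import PurePosixPath


def sanitize_cwd(cwd):
    """Hypothetical sanitizer: replace anything outside [A-Za-z0-9._-] with '-'."""
    return re.sub(r"[^A-Za-z0-9._-]", "-", cwd.strip("/"))


def scratchpad_dir(uid, cwd, session_id):
    """Build /tmp/claude-{uid}/{sanitized-cwd}/{sessionId}/scratchpad."""
    return (
        PurePosixPath("/tmp")
        / f"claude-{uid}"
        / sanitize_cwd(cwd)
        / session_id
        / "scratchpad"
    )
```

Because the session ID is part of the path, two sessions in the same project never share a scratchpad directory.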
The Scratchpad’s design principles are:
- No permission prompts: Workers can freely read and write to the scratchpad directory without user confirmation
- Persistent cross-worker knowledge: One worker can write findings to the scratchpad, and another worker can read them
- Session isolation: Each session’s scratchpad is independent, avoiding cross-session contamination
- Structural freedom: The system does not prescribe a file structure; workers organize it as needed
In the Coordinator’s system prompt, the scratchpad is described as “durable cross-worker knowledge,” hinting at its core purpose: serving as shared memory between workers.
Typical Scratchpad Usage Patterns
flowchart TD subgraph P1["Phase 1: Research (Parallel)"] WA["Worker A -> /scratchpad/api-analysis.md<br/>'Found 3 REST endpoints, 2 require OAuth2'"] WB["Worker B -> /scratchpad/db-schema.md<br/>'User table has 12 fields, password is bcrypt hash'"] WC["Worker C -> /scratchpad/deps-analysis.md<br/>'Project uses Express 4.x, no existing OAuth library'"] end subgraph P2["Phase 2: Synthesis (Coordinator)"] SYN["Coordinator reads all scratchpad files<br/>-> Writes implementation spec<br/>-> /scratchpad/implementation-spec.md"] end subgraph P3["Phase 3: Implementation (Worker)"] IMPL["Worker D reads implementation-spec.md<br/>-> Modifies code per spec<br/>-> /scratchpad/changes-made.md"] end subgraph P4["Phase 4: Verification"] VER["Worker E reads changes-made.md<br/>-> Verifies modifications are complete and correct<br/>-> /scratchpad/verification-results.md"] end P1 --> P2 --> P3 --> P4 style P1 fill:#e8f4fd,stroke:#2196F3 style P2 fill:#fff3e0,stroke:#f39c12 style P3 fill:#e8f8e8,stroke:#2ecc71 style P4 fill:#fce8e8,stroke:#e74c3c
Why not let workers pass messages to each other directly?
Message passing is ephemeral — once consumed, it’s gone. The Scratchpad, on the other hand, is persistent and can be read repeatedly by any number of workers. During the research phase, Worker A’s findings are needed not only by the coordinator but also potentially by subsequent verification workers. With message passing, the coordinator would need to forward the findings to every worker that needs them; with Scratchpad, workers can read on demand.
Additionally, the Scratchpad naturally supports “incremental building” — Worker A writes a foundational analysis, and Worker B can append supplementary information on top of it, without needing to consolidate everything into a single message.
The Coordinator’s Task Workflow
The Coordinator system prompt defines four phases of the standard task workflow:
| Phase | Executor | Purpose | Typical Output |
|---|---|---|---|
| Research | Workers (parallel) | Investigate the codebase, discover files, understand the problem | Scratchpad analysis documents |
| Synthesis | Coordinator | Read findings, understand the problem, write implementation spec | Implementation specification document |
| Implementation | Workers | Make precise modifications according to the spec | Code changes |
| Verification | Workers | Test whether modifications are correct | Test results and issue list |
flowchart TD subgraph R["Phase 1: Research"] W1["Worker 1: API Analysis"] W2["Worker 2: Database Analysis"] W3["Worker 3: Dependency Analysis"] end S["Scratchpad<br/>(Shared findings space)"] subgraph SY["Phase 2: Synthesis"] CO["Coordinator reads all findings<br/>Writes spec with specific file paths and modification instructions"] end subgraph I["Phase 3: Implementation"] WI["Worker implements per spec<br/>Note: file conflict control"] end subgraph V["Phase 4: Verification"] WV["Worker verifies correctness of modifications"] end W1 --> S W2 --> S W3 --> S S --> CO CO --> WI WI --> WV style R fill:#e8f4fd,stroke:#2196F3 style S fill:#fff9c4,stroke:#f9a825 style SY fill:#fff3e0,stroke:#f39c12 style I fill:#e8f5e9,stroke:#2ecc71 style V fill:#fce4ec,stroke:#e74c3c
The key constraint of this workflow is that the coordinator must understand the research findings before it can write the implementation spec. The system prompt uses strong language to emphasize:
“Never write ‘based on your findings’ or ‘based on the research.’ These phrases delegate understanding to the worker instead of doing it yourself. You never hand off understanding to another worker.”
This means the coordinator cannot simply forward a worker’s findings to another worker — it must first digest those findings, then write an implementation spec that includes specific file paths, line numbers, and modification instructions.
Design Philosophy: Why Can’t “Understanding” Be Delegated?
This is one of the most fundamental design principles of the Coordinator pattern. In traditional master-slave architectures, the master node can simply forward a slave node’s results to another slave node. But in an AI agent system, this forwarding leads to severe context loss:
- Format inconsistency: Different workers produce output in different formats; direct forwarding causes the recipient to be unable to understand
- Coexistence of information redundancy and gaps: Worker A’s report may contain a large amount of irrelevant detail while missing critical information
- Lack of global perspective: Each worker only sees the portion it investigated and cannot make globally optimal decisions
Requiring the coordinator to “digest” all findings before writing the spec ensures that workers in the implementation phase receive unified, precise, and contextualized instructions.
Concurrency Strategy
The Coordinator system prompt defines a clear concurrency strategy:
| Task Type | Concurrency Strategy | Reason |
|---|---|---|
| Read-only tasks (Research) | Free parallelism | No conflicts will arise |
| Write-heavy tasks (Implementation) | One at a time per file set | Prevent file write conflicts |
| Verification | Sometimes parallel with implementation | Safe when operating on different file regions |
--- config: gantt: leftPadding: 180 --- gantt title Concurrency Strategy Example (OAuth2 Integration Task) dateFormat X axisFormat %s section Research (All Parallel) Worker A: API Analysis :a1, 0, 4 Worker B: Database Analysis :a2, 0, 4 Worker C: Auth Analysis :a3, 0, 4 section Implementation (Exclusive Files) Worker D: route/auth.ts :b1, 4, 6 Worker E: db/migration.ts :b2, 6, 9 section Verification (Parallel Verification) Worker F: Test auth :c1, 9, 12 Worker G: Test DB :c2, 9, 12
The system prompt also encourages the coordinator to “fan out” — initiating multiple parallel worker calls in a single message. This leverages Claude’s parallel tool invocation capability, allowing multiple workers to start simultaneously.
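The "one at a time per file set" rule can be illustrated with a greedy grouping of tasks into parallel waves. This sketch models only write-set exclusivity; it ignores phase ordering (research before implementation), and plan_waves is a hypothetical helper rather than anything from the source.

```python
def plan_waves(tasks):
    """Group tasks into waves that can run in parallel.

    `tasks` is a list of (name, files_written). Read-only tasks pass an empty
    list and never conflict. A task joins the first wave whose claimed files
    do not overlap with its own write set.
    """
    waves = []  # each entry: (task_names, claimed_files)
    for name, writes in tasks:
        writes = set(writes)
        for names, claimed in waves:
            if not (writes & claimed):
                names.append(name)
                claimed |= writes  # in-place update of the wave's claimed set
                break
        else:
            waves.append(([name], set(writes)))
    return [names for names, _ in waves]
```

For example, two tasks that both write route/auth.ts end up in different waves, while a task writing db/migration.ts can share a wave with either of them.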
Task Notification Protocol
When workers complete, results are delivered to the coordinator in <task-notification> XML format:
<task-notification>
<task-id>{agentId}</task-id>
<status>completed|failed|killed</status>
<summary>{human-readable status summary}</summary>
<result>{agent's final text response}</result>
<usage>
<total_tokens>N</total_tokens>
<tool_uses>N</tool_uses>
<duration_ms>N</duration_ms>
</usage>
</task-notification>
This format is designed to be embedded in user-role messages. The coordinator identifies <task-notification> tags to distinguish genuine user messages from workers’ result reports. This design choice means the coordinator needs explicit instructions to differentiate message types, which is why the system prompt repeatedly emphasizes “Worker results are internal signals, not conversation partners.”
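To make the notification structure concrete, the sketch below parses it with Python's standard xml.etree.ElementTree module. This is only an illustration of the format: in the system described here the model itself recognizes the tags, and parse_task_notification is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

OPEN_TAG = "<task-notification>"
CLOSE_TAG = "</task-notification>"


def parse_task_notification(text):
    """Return the notification as a dict, or None for a genuine user message."""
    start = text.find(OPEN_TAG)
    end = text.find(CLOSE_TAG)
    if start == -1 or end == -1:
        return None
    root = ET.fromstring(text[start:end + len(CLOSE_TAG)])
    usage = root.find("usage")
    return {
        "task_id": root.findtext("task-id"),
        "status": root.findtext("status"),
        "summary": root.findtext("summary"),
        "result": root.findtext("result"),
        # Usage counters are numeric in the format shown above.
        "usage": {child.tag: int(child.text) for child in usage}
        if usage is not None else {},
    }
```

Note that a strict XML parser fails if the result text contains unescaped characters such as < or &, so a production parser would need more lenient handling.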
Why XML format instead of JSON?
XML tags have better recognizability in LLM contexts than JSON. Claude models parse XML tags very reliably (related to extensive XML usage in training data), while JSON requires strict quote and comma matching. Additionally, explicit tag names like <task-notification> allow the model to quickly identify message types through simple pattern matching without full JSON parsing.
10.5 Complete Case Study: From Requirements to Delivery
Let’s understand the Coordinator pattern’s end-to-end workflow through a complete case study.
Scenario: Adding a User Notification System to a Web Application
Requirement Description: “Add a notification system to our Express.js application that supports both email and in-app notification methods. Users should be able to manage notification preferences from the settings page.”
Phase 1: Research
The coordinator dispatches three research workers simultaneously:
Coordinator Decision:
"This task involves three dimensions: backend API, database schema, and frontend UI.
Launching three parallel researchers to investigate each."
Worker "api-researcher":
Task -> "Investigate the existing Express.js route structure, middleware chain, and API versioning strategy"
Output -> /scratchpad/api-analysis.md
Content -> "Found routes under src/routes/, using Express Router,
middleware chain includes auth, rateLimit, validate.
API versioning managed via URL prefix /api/v1/."
Worker "db-researcher":
Task -> "Investigate database schema, ORM models, and migration strategy"
Output -> /scratchpad/db-analysis.md
Content -> "Using Prisma ORM, schema in prisma/schema.prisma.
Existing User model has id, email, name fields.
Migrations managed via prisma migrate."
Worker "frontend-researcher":
Task -> "Investigate frontend framework, component structure, and state management approach"
Output -> /scratchpad/frontend-analysis.md
Content -> "Using React + TypeScript, components in src/components/.
State management uses Zustand. Settings page in SettingsPage.tsx.
UI component library uses shadcn/ui."
Phase 2: Synthesis
Coordinator Behavior:
1. Read all three scratchpad files
2. Understand the overall architecture and write implementation spec
3. Write spec to /scratchpad/implementation-spec.md
Spec Contents:
- Database: Add Notification and NotificationPreference models
- API: Add 5 endpoints (GET/POST/PUT/DELETE notifications + PUT preference settings)
- Service layer: Add NotificationService for email and in-app notifications
- Frontend: Modify SettingsPage to add notification preference component
- Dependencies: Need to add nodemailer package
Phase 3: Implementation
Coordinator Decision:
"Based on the spec, database migration and service layer implementation
can safely execute in parallel since they operate on different files.
Frontend implementation must wait until the backend is complete."
Worker "db-implementer":
Task -> Modify Prisma schema per scratchpad spec and create migration
Files -> prisma/schema.prisma (exclusive)
Worker "service-implementer": <- Parallel with db-implementer
Task -> Implement notification service layer and API routes
Files -> src/services/notificationService.ts, src/routes/notifications.ts
(Wait for both above to complete)
Worker "frontend-implementer":
Task -> Implement frontend notification preference component
Files -> src/components/NotificationSettings.tsx, modify SettingsPage.tsx
Phase 4: Verification
Worker "verifier":
Task -> Verify the complete functionality of the notification system
Checklist:
- Is the Prisma migration correct?
- Do the API endpoints conform to RESTful conventions?
- Does the frontend component correctly call the API?
- Is any error handling missing?
- Are the default notification preference values reasonable?
10.6 Fault Recovery and Partial Completion
Handling Strategies for Worker Failures
In multi-agent collaboration, worker failures are the norm rather than the exception. The Coordinator needs to be capable of handling the following failure scenarios:
| Failure Type | Manifestation | Coordinator’s Response Strategy |
|---|---|---|
| Tool execution failure | A Bash command returns a non-zero exit code | Analyze failure cause, retry or adjust strategy |
| Model output truncation | maxTurns exhausted, task incomplete | Evaluate completed portion, decide whether to reassign |
| MCP connection dropped | External tool unavailable | Degrade to a strategy that doesn’t depend on that tool |
| Context too long | Conversation history exceeds token limit | Compress context or split task into smaller sub-tasks |
flowchart TD A["Worker reports failed status"] --> B["Analyze failure cause"] B --> C{"Failure type?"} C -->|"Retryable"| D["Create new worker to retry"] C -->|"Needs adjustment"| E["Modify task parameters and retry"] C -->|"Fatal error"| F["Mark entire task as failed"] style A fill:#e74c3c,stroke:#c0392b,color:#fff style D fill:#2ecc71,stroke:#27ae60,color:#fff style E fill:#f39c12,stroke:#d68910,color:#fff style F fill:#c0392b,stroke:#922b21,color:#fff
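The decision flow can be sketched as a lookup from failure type to action. The mapping of each failure type to a branch, the string keys, and the retry cap are all assumptions made for illustration; the source only describes the three branches.

```python
from enum import Enum


class Action(Enum):
    RETRY_WITH_NEW_WORKER = "retry"
    ADJUST_AND_RETRY = "adjust"
    FAIL_TASK = "fail"


# Assumed classification of the failure types from the table above.
FAILURE_POLICY = {
    "tool_execution_failure": Action.RETRY_WITH_NEW_WORKER,
    "max_turns_exhausted": Action.ADJUST_AND_RETRY,
    "mcp_connection_dropped": Action.ADJUST_AND_RETRY,
    "context_too_long": Action.ADJUST_AND_RETRY,
}


def decide(failure_type, attempts, max_attempts=3):
    """Pick the next action; unknown failures and exhausted retries are fatal."""
    if attempts >= max_attempts:
        return Action.FAIL_TASK
    return FAILURE_POLICY.get(failure_type, Action.FAIL_TASK)
```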
Handling Partial Completion
When a research-phase worker only completes a partial investigation (e.g., analyzed the API layer but not the database layer), the coordinator faces a choice: proceed with available information, or wait for a complete analysis.
The guiding principle in the system prompt is: the coordinator should make full use of completed work rather than waiting for perfect information. If Worker A completed 80% of the code investigation, the coordinator should write the implementation spec based on that 80% of information, flagging uncertain portions in the spec so the implementation worker can conduct supplementary investigation when encountering uncovered areas.
Best Practice: Use “Confidence Annotations” in the Scratchpad
It’s recommended that the coordinator use confidence annotations in the implementation spec, for example:
- [HIGH] Confirmed file paths and function signatures are correct
- [MEDIUM] Inferred dependency relationships that need verification during implementation
- [LOW] Areas not fully investigated; additional research needed before implementation
These annotations let implementation workers know which parts can be executed directly and which need confirmation first.
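If the spec follows the annotation style above, a worker-side helper could group items by confidence before acting on them. The sketch below assumes one annotation per line in the "[LEVEL] text" form; group_by_confidence is a hypothetical helper.

```python
import re

# Matches lines such as "- [HIGH] Confirmed file paths ..."
ANNOTATION = re.compile(r"^\s*-?\s*\[(HIGH|MEDIUM|LOW)\]\s*(.+)$")


def group_by_confidence(spec_text):
    """Collect annotated spec lines into HIGH / MEDIUM / LOW buckets."""
    groups = {"HIGH": [], "MEDIUM": [], "LOW": []}
    for line in spec_text.splitlines():
        match = ANNOTATION.match(line)
        if match:
            groups[match.group(1)].append(match.group(2).strip())
    return groups
```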
10.7 Coordinator Pattern vs. Fork Pattern Comparison
The two patterns represent different parallelism strategies, and choosing the correct one is critical for task success.
| Dimension | Coordinator Pattern | Fork Pattern |
|---|---|---|
| Architecture Model | Centralized (coordinator-worker) | Decentralized (peer parallelism) |
| Context Sharing | Workers only see assigned tasks | All sub-agents inherit the full parent context |
| Communication Method | Coordinator relays, Scratchpad sharing | No direct communication, each operates independently |
| Task Allocation | Explicit allocation, precise control | Implicit parallelism, each executes independent sub-tasks |
| Result Aggregation | Coordinator synthesizes all results | Primary agent collects as needed |
| Cache Efficiency | No shared cache prefix | Byte-level sharing, efficient caching |
| Applicable Scenarios | Coordinated, complex multi-step tasks | Independent parallel investigation/search tasks |
| Fault Recovery | Coordinator can reassign tasks | Primary agent decides after receiving failure notification |
| Resource Overhead | Higher (coordinator is persistent) | Lower (shared cache) |
| Mental Model | Construction site (project manager + workers) | Scout team (multiple independent scouts) |
```mermaid
flowchart TD
    A{"Does the task require coordinating<br/>multiple steps?"}
    A -->|"No"| B{"Do you need cache-shared<br/>parallel execution?"}
    A -->|"Yes"| C{"Do you need knowledge sharing<br/>between workers?"}
    B -->|"Yes"| D["Fork Mode"]
    B -->|"No"| E["Normal synchronous sub-agent"]
    C -->|"Yes"| F["Coordinator Mode"]
    C -->|"No"| G["Consider Fork Mode<br/>(if tasks are independent)"]
    style D fill:#3498db,stroke:#2471a3,color:#fff
    style E fill:#bdc3c7,stroke:#7f8c8d,color:#333
    style F fill:#2ecc71,stroke:#27ae60,color:#fff
    style G fill:#f39c12,stroke:#d68910,color:#fff
```
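The decision flow above can also be read as a small function. The sketch below is only an illustration of that flow: `TaskProfile`, `Recommendation`, and `recommendMode` are hypothetical names, not part of Claude Code, and the three boolean fields simply mirror the three questions in the flowchart.

```typescript
// Hypothetical sketch of the decision flow; not an API of Claude Code.
interface TaskProfile {
  needsMultiStepCoordination: boolean;
  needsCacheSharedParallelism: boolean;
  needsKnowledgeSharingBetweenWorkers: boolean;
}

type Recommendation =
  | "coordinator"
  | "fork"
  | "consider-fork" // only if the sub-tasks are independent
  | "synchronous-sub-agent";

function recommendMode(task: TaskProfile): Recommendation {
  if (task.needsMultiStepCoordination) {
    // Multi-step work: the deciding factor is whether workers must share knowledge.
    return task.needsKnowledgeSharingBetweenWorkers ? "coordinator" : "consider-fork";
  }
  // No coordination needed: choose between cache-shared parallelism and a plain sub-agent.
  return task.needsCacheSharedParallelism ? "fork" : "synchronous-sub-agent";
}

// Example: a multi-step refactoring where workers build on each other's findings.
const refactoring: TaskProfile = {
  needsMultiStepCoordination: true,
  needsCacheSharedParallelism: false,
  needsKnowledgeSharingBetweenWorkers: true,
};
console.log(recommendMode(refactoring));
```

Note that the "consider-fork" branch is a suggestion rather than a verdict: as the flowchart says, it applies only when the sub-tasks are independent of each other.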
Cross-Reference: The detailed mechanism of the Fork pattern is covered in Chapter 9. Pay special attention to the mutual exclusion between the two — when Coordinator mode is activated, Fork mode is automatically disabled.
Hands-On Exercises
Exercise 1: Design a Multi-Worker Workflow
Suppose you have a large-scale refactoring task: splitting a monolithic Express.js application into microservices. Design the workflow under the Coordinator pattern:
- Research Phase: Plan how many Research workers you need and which module each should investigate.
  - Hint: Consider five dimensions: routes, database, middleware, configuration, and tests.
- Synthesis Phase: How should the coordinator synthesize the findings and write the microservice splitting specification?
  - Hint: Consider service boundaries, shared databases, the API Gateway, etc.
- Implementation Phase: How should workers be allocated to avoid file conflicts?
  - Hint: Allocate by service boundary; files for the same service should be handled by the same worker.
- Verification Phase: What is the verification strategy?
  - Hint: Each microservice is verified independently, followed by integration testing.
Exercise 2: Analyze Scratchpad Security Boundaries
Consider the following questions about Scratchpad security design:
- What does the Scratchpad directory’s permission setting (0o700) mean?
- Why isn’t the Scratchpad placed inside the project directory?
- What happens if two workers write to the same Scratchpad file simultaneously?
- How should you design Scratchpad file naming conventions to avoid conflicts?
Extended Thinking: If you were to implement a “Scratchpad version control” feature (similar to Git), what metadata would need to be recorded? How would this change workers’ write behavior?
Exercise 3: Compare Coordinator Pattern vs. Fork Pattern
Based on the content of this chapter and Chapter 9, fill in the table below and provide reasoning for each dimension’s choice:
| Dimension | Coordinator Pattern | Fork Pattern |
|---|---|---|
| Architecture Model | ? | ? |
| Context Sharing | ? | ? |
| Applicable Scenarios | ? | ? |
| Communication Method | ? | ? |
| Cache Efficiency | ? | ? |
| Fault Recovery | ? | ? |
| Resource Overhead | ? | ? |
Exercise 4: Design a Fault Recovery Strategy
Suppose in a Coordinator workflow, a worker in the Implementation phase fails while modifying the database schema (migration script execution error). Design:
- How the Coordinator detects this failure
- How the Coordinator decides whether to retry or adjust the strategy
- How already partially completed modifications are handled
- How other workers executing in parallel are affected
Exercise 5: Simulate a Complete Coordinator Workflow
Choose a project you’re familiar with and design a Coordinator workflow for the following task:
Task: “Add internationalization (i18n) support to the project, supporting both Chinese and English languages.”
Requirements:
- List the investigation directions needed for the Research phase
- Write an outline of the implementation spec for the Synthesis phase
- Design the worker allocation for the Implementation phase
- Plan the verification checklist for the Verification phase
Key Takeaways
- The Coordinator pattern employs a “coordinator-worker” architecture, where the coordinator only manages task allocation and result synthesis without directly performing implementation work. This layered design enables complex engineering tasks to be systematically decomposed and processed in parallel.
- The dual gating mechanism (feature gate + environment variable) and mutual exclusion with the Fork pattern ensure clarity in mode selection. matchSessionMode() guarantees pattern consistency during session recovery.
- Tool isolation strategy: The coordinator has only four core orchestration tools, while workers have the full development toolset but exclude team management tools. Independent assembly of the tool pool ensures workers are not affected by parent-level restrictions. Simple mode and Full mode adapt to different resource environments.
- SendMessage’s intelligent routing supports point-to-point, broadcast, cross-process, and cross-session communication, and can automatically resume stopped workers, enabling “stop-continue” workflows.
- The Scratchpad collaboration space provides a shared directory that workers can use without permission prompts. It is isolated per session and free-form in structure. It serves as the bridge for persistent knowledge transfer between workers, compensating for the limitation that workers cannot communicate directly with each other.
- The four-phase workflow (Research → Synthesis → Implementation → Verification) provides a structured task execution pattern. The core constraint is that the coordinator must digest research findings before writing the implementation spec; delegating understanding is not allowed.
- Fault recovery is a built-in capability of the Coordinator pattern. Through the status field in task notifications, partial results in the Scratchpad, and the coordinator’s reassignment capability, the system can gracefully handle worker failures and partial completion.
- Pattern selection: Coordinator is suitable for coordinated, complex multi-step tasks, while Fork is suitable for independent parallel search tasks. The two are mutually exclusive and cannot be used simultaneously. Understanding the strengths and limitations of each is key to making the right choice.