Most failures are signals to try a different path
What You’ll Learn
- Why most agent failures are recoverable
- How to classify errors: retry-able, path-change, and fatal
- How to implement automatic retry with escalating strategies
The Problem
Agents fail. A file is locked. A command times out. An API returns a 429. Without recovery, every minor failure kills the session.
The Solution
Classify failures and apply escalating recovery strategies:
Error types & strategies:
+------------------+  +------------------+  +------------------+
| Retry-able       |  | Path-change      |  | Fatal            |
| (timeout, 429)   |  | (file not found) |  | (permission)     |
| Wait and retry   |  | Alternate tool   |  | Ask user/skip    |
+------------------+  +------------------+  +------------------+
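One way to express the three categories in code is a small classifier that maps an exception to a category. This is a minimal sketch, not the repository's implementation: `ErrorClass`, `classify_error`, and `RateLimitError` (a stand-in for an API 429 response) are hypothetical names introduced for illustration, and treating unrecognized errors as fatal is an assumption.

```python
from enum import Enum


class ErrorClass(Enum):
    RETRYABLE = "retry-able"      # wait and retry
    PATH_CHANGE = "path-change"   # try an alternate tool or path
    FATAL = "fatal"               # ask the user or skip


class RateLimitError(Exception):
    """Hypothetical stand-in for an API 429 response."""


def classify_error(exc: BaseException) -> ErrorClass:
    """Map an exception to one of the three recovery categories."""
    if isinstance(exc, (TimeoutError, RateLimitError)):
        return ErrorClass.RETRYABLE
    if isinstance(exc, FileNotFoundError):
        return ErrorClass.PATH_CHANGE
    # PermissionError and anything unrecognized fall through to fatal.
    # That default is an assumption of this sketch.
    return ErrorClass.FATAL
```

Keeping classification separate from the retry loop means a new error type can be added in one place without touching the recovery logic.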
How It Works
- Wrap tool execution in try/except that classifies errors.
- Track retry counts per tool. Escalate strategy after N retries.
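import time

MAX_RETRIES = 3  # assumed value; the original does not define it

def execute_with_recovery(tool_name, handler, **kwargs):
    attempt = 0
    while attempt < MAX_RETRIES:
        try:
            return handler(**kwargs)
        except TimeoutError:
            # Retry-able: exponential backoff, capped at 30 seconds
            attempt += 1
            if attempt < MAX_RETRIES:  # no point sleeping before giving up
                time.sleep(min(2 ** attempt, 30))
        except FileNotFoundError:
            # Path-change: suggest_alternatives is assumed to be defined elsewhere
            return f"File not found. Try: {suggest_alternatives(kwargs)}"
        except PermissionError:
            # Fatal: stop and report
            return "FATAL: Permission denied. Cannot continue."
    return "Failed after all retries"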
- Track state explicitly. The system must know whether it is continuing normally, retrying a failed call, or in recovery.
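The per-tool retry counts and the continue/retry/recover state described above could be tracked together. The following is a sketch under stated assumptions rather than the code in `s11_error_recovery.py`: `AgentState`, `RecoveryTracker`, and `escalate_after` are hypothetical names, and reading "in recovery" as "retries exhausted, switch strategy" is one possible interpretation.

```python
from collections import Counter
from enum import Enum


class AgentState(Enum):
    CONTINUING = "continuing"   # last tool call succeeded
    RETRYING = "retrying"       # the same tool will be tried again
    RECOVERING = "recovering"   # retries exhausted; switch strategy


class RecoveryTracker:
    """Tracks consecutive failures per tool and the resulting state."""

    def __init__(self, escalate_after: int = 3):
        self.escalate_after = escalate_after  # hypothetical threshold N
        self.failures = Counter()
        self.state = AgentState.CONTINUING

    def record_success(self, tool_name: str) -> AgentState:
        # A success resets that tool's failure count.
        self.failures[tool_name] = 0
        self.state = AgentState.CONTINUING
        return self.state

    def record_failure(self, tool_name: str) -> AgentState:
        self.failures[tool_name] += 1
        if self.failures[tool_name] < self.escalate_after:
            self.state = AgentState.RETRYING
        else:
            self.state = AgentState.RECOVERING
        return self.state


tracker = RecoveryTracker(escalate_after=2)
first = tracker.record_failure("bash")       # first failure: retry
second = tracker.record_failure("bash")      # second failure: escalate
other = tracker.record_failure("read_file")  # counts are kept per tool
```

Counting per tool rather than globally keeps one flaky tool from exhausting the retry budget of the others.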
What Changed From s10
| Component | Before (s10) | After (s11) |
|---|---|---|
| Error handling | Crash on failure | Classified recovery |
| Retries | None | Exponential backoff |
| State tracking | None | Continue/retry/recover |
Try It
cd learn-claude-code
python agents/s11_error_recovery.py
- Read a file that doesn't exist (should suggest alternatives)
- Run a command that will time out (should retry)
- Try to write to a protected path (should report fatal)
Key Takeaway
Most failures aren't true task failures; they're signals to try a different path. Classify, retry, escalate, and know what state you're in.