Most failures are signals to try a different path

What You’ll Learn

  • Why most agent failures are recoverable
  • How to classify errors: retry-able, path-change, and fatal
  • How to implement automatic retry with escalating strategies

The Problem

Agents fail. A file is locked. A command times out. An API returns a 429. Without recovery, every minor failure kills the session.

The Solution

Classify failures and apply escalating recovery strategies:

Error types & strategies:
+------------------+     +------------------+     +------------------+
| Retry-able       |     | Path-change      |     | Fatal            |
| (timeout, 429)   |     | (file not found) |     | (permission)     |
| Wait and retry   |     | Alternate tool   |     | Ask user/skip    |
+------------------+     +------------------+     +------------------+

How It Works

  1. Wrap tool execution in try/except that classifies errors.

  2. Track retry counts per tool. Escalate strategy after N retries.

def execute_with_recovery(tool_name, handler, **kwargs):
    attempt = 0
    while attempt < MAX_RETRIES:
        try:
            return handler(**kwargs)
        except TimeoutError:
            attempt += 1
            wait = min(2 ** attempt, 30)  # exponential backoff
            time.sleep(wait)
        except FileNotFoundError:
            return f"File not found. Try: {suggest_alternatives(kwargs)}"
        except PermissionError:
            return "FATAL: Permission denied. Cannot continue."
    return "Failed after all retries"
  1. The system must know whether it’s continuing, retrying, or in recovery.

What Changed From s10

ComponentBefore (s10)After (s11)
Error handlingCrash on failureClassified recovery
RetriesNoneExponential backoff
State trackingNoneContinue/retry/recover

Try It

cd learn-claude-code
python agents/s11_error_recovery.py
  1. Read a file that doesn't exist (should suggest alternatives)
  2. Run a command that will timeout (should retry)
  3. Try to write to a protected path (should report fatal)

Key Takeaway

Most failures aren’t true task failure — they’re signals to try a different path. Classify, retry, escalate, and know what state you’re in.