Error Recovery | Learn Agent

Most failures are signals to try a different path

What You’ll Learn

Why most agent failures are recoverable
How to classify errors: retry-able, path-change, and fatal
How to implement automatic retry with escalating strategies

The Problem

Agents fail. A file is locked. A command times out. An API returns a 429. Without recovery, every minor failure kills the session.

The Solution

Classify failures and apply escalating recovery strategies:

Error types & strategies:
+------------------+     +------------------+     +------------------+
| Retry-able       |     | Path-change      |     | Fatal            |
| (timeout, 429)   |     | (file not found) |     | (permission)     |
| Wait and retry   |     | Alternate tool   |     | Ask user/skip    |
+------------------+     +------------------+     +------------------+
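
The three boxes above can be sketched as a small classifier. This is a minimal illustration, not the lesson's actual code: `ErrorClass`, `classify_error`, and `RateLimitError` are hypothetical names, and treating unknown errors as fatal is an assumption. One detail worth knowing if commands run through `subprocess`: `subprocess.TimeoutExpired` is not a subclass of the built-in `TimeoutError`, so it has to be matched separately.

```python
import subprocess
from enum import Enum


class ErrorClass(Enum):
    RETRYABLE = "retryable"      # wait and retry
    PATH_CHANGE = "path_change"  # switch to an alternate tool or path
    FATAL = "fatal"              # ask the user or skip


class RateLimitError(Exception):
    """Hypothetical stand-in for an HTTP 429 raised by an API client."""


def classify_error(exc: BaseException) -> ErrorClass:
    # subprocess.TimeoutExpired does not inherit from TimeoutError,
    # so it is listed explicitly next to it.
    if isinstance(exc, (TimeoutError, subprocess.TimeoutExpired, RateLimitError)):
        return ErrorClass.RETRYABLE
    if isinstance(exc, FileNotFoundError):
        return ErrorClass.PATH_CHANGE
    if isinstance(exc, PermissionError):
        return ErrorClass.FATAL
    # Assumption: unknown errors are treated as fatal rather than retried blindly.
    return ErrorClass.FATAL
```

Keeping classification in one function means a new failure mode only needs one new branch, while the retry loop itself stays unchanged.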

How It Works

Wrap tool execution in try/except that classifies errors.
Track retry counts per tool. Escalate strategy after N retries.

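import time

MAX_RETRIES = 3  # assumed value; tune per tool


def suggest_alternatives(kwargs):
    # Placeholder: a real helper would propose nearby paths or another tool.
    return f"check the arguments {kwargs} or search for a similar file"


def execute_with_recovery(tool_name, handler, **kwargs):
    attempt = 0
    while attempt < MAX_RETRIES:
        try:
            return handler(**kwargs)
        except TimeoutError:
            # Retry-able: back off exponentially, capped at 30 seconds.
            attempt += 1
            if attempt < MAX_RETRIES:  # no point sleeping after the last try
                time.sleep(min(2 ** attempt, 30))
        except FileNotFoundError:
            # Path-change: return a hint so the model can try another route.
            return f"File not found. Try: {suggest_alternatives(kwargs)}"
        except PermissionError:
            # Fatal: retrying will not help.
            return "FATAL: Permission denied. Cannot continue."
    return f"{tool_name} failed after {MAX_RETRIES} attempts"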
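
Tracking retry counts per tool and escalating after N retries could look roughly like the sketch below. The names `retry_counts`, `next_strategy`, `record_success`, and the threshold `ESCALATE_AFTER` are all hypothetical, and the three-step ladder (retry, then switch tool, then ask the user) is one plausible reading of "escalate", not the lesson's confirmed behavior.

```python
from collections import defaultdict

ESCALATE_AFTER = 2  # assumed threshold: the "N" in "after N retries"

retry_counts = defaultdict(int)  # tool name -> consecutive failures


def next_strategy(tool_name: str) -> str:
    """Record one failure for tool_name and return the strategy to use next."""
    retry_counts[tool_name] += 1
    failures = retry_counts[tool_name]
    if failures <= ESCALATE_AFTER:
        return "retry"
    if failures <= 2 * ESCALATE_AFTER:
        return "switch_tool"
    return "ask_user"


def record_success(tool_name: str) -> None:
    # A success resets the counter so old failures do not trigger escalation.
    retry_counts.pop(tool_name, None)
```

Counting per tool rather than globally keeps one flaky tool from pushing every other tool into escalation.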

The system must know whether it’s continuing, retrying, or in recovery.
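
One way to make that state explicit is a small enum that the agent loop updates after each tool call. This is a sketch under assumptions: `LoopState` and `next_state` are hypothetical names, the error-class strings are illustrative, and "recovery" is taken here to mean any failure that is not a plain retry.

```python
from enum import Enum, auto
from typing import Optional


class LoopState(Enum):
    CONTINUING = auto()  # last tool call succeeded
    RETRYING = auto()    # same call will be attempted again
    RECOVERING = auto()  # changing path or waiting on the user


def next_state(error_class: Optional[str]) -> LoopState:
    """Map the outcome of the last tool call to the loop's next state."""
    if error_class is None:
        return LoopState.CONTINUING
    if error_class == "retryable":
        return LoopState.RETRYING
    # Assumption: path-change and fatal errors both count as recovery.
    return LoopState.RECOVERING
```

With the state stored in one place, logging and prompts can report it directly instead of inferring it from scattered flags.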

What Changed From s10

Component	Before (s10)	After (s11)
Error handling	Crash on failure	Classified recovery
Retries	None	Exponential backoff
State tracking	None	Continue/retry/recover

Try It

cd learn-claude-code
python agents/s11_error_recovery.py

Read a file that doesn't exist (should suggest alternatives)
Run a command that will time out (should retry)
Try to write to a protected path (should report fatal)

Key Takeaway

Most failures aren’t true task failures; they’re signals to try a different path. Classify, retry, escalate, and know what state you’re in.