Describe the bug
Mesa-LLM lacks comprehensive error handling and recovery mechanisms, making it unreliable for production use. The system can crash unexpectedly when encountering API failures, tool execution errors, or edge cases, leaving simulations in inconsistent states without recovery options.
Expected behavior
- API failures should be handled gracefully without crashing simulations
- Tool execution errors should be isolated and recoverable
- Agent state should remain consistent after failures
- All external inputs should be validated and sanitized
- Comprehensive error messages should provide debugging context
- Circuit breaker should prevent cascade failures (see the sketch after this list)
- Checkpoint/restart functionality should enable recovery from failures
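As an illustration of the circuit-breaker expectation, here is a minimal sketch; the CircuitBreaker class, thresholds, and RuntimeError below are illustrative placeholders, not existing Mesa-LLM API:

import time

class CircuitBreaker:
    """Minimal circuit breaker: after max_failures consecutive errors,
    reject calls for reset_after seconds instead of hammering the API."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        # While open, fail fast so one bad dependency cannot cascade
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: recent failures, call rejected")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # a success closes the circuit again
        return result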
To Reproduce
1. LLM API Failure Scenario
# Steps to reproduce API failure handling issues:
from mesa_llm import ModuleLLM

# Create an LLM instance pointing at a model that does not exist
llm = ModuleLLM(model="invalid-model")

# Try to generate - the call raises a raw provider exception with no
# retry, fallback, or recovery, so an unguarded call crashes the simulation
try:
    response = llm.generate("test prompt")
except Exception as e:
    print(f"Unhandled exception propagates to the caller: {e}")
2. Tool Execution Failure Scenario
# Steps to reproduce tool execution isolation issues:
from mesa_llm.tools import ToolManager, tool  # assuming @tool is importable alongside ToolManager

# Create a tool manager and register a tool that always fails
tm = ToolManager()

@tool
def faulty_tool(arg1):
    raise ValueError("Tool execution failed")

# Execute the tool - the exception escapes the manager and takes the
# whole agent down with it instead of being isolated to this one call
try:
    result = tm.call("faulty_tool", {"arg1": "test"})
except Exception as e:
    print(f"Tool failure crashes the entire system: {e}")
3. Agent State Inconsistency Scenario
# Steps to reproduce state consistency issues:
from mesa_llm import LLMAgent

# Create an agent
agent = LLMAgent(unique_id="test", model="gpt-4")

# Simulate a partial failure during a step
try:
    # If this fails mid-execution, agent state becomes inconsistent
    agent.step()
except Exception as e:
    print(f"Agent left in inconsistent state: {e}")
    # No recovery mechanism available
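A checkpoint/rollback sketch, assuming the agent's mutable state lives in deep-copyable instance attributes (step_with_checkpoint is a hypothetical helper, not existing Mesa-LLM API):

import copy

def step_with_checkpoint(agent):
    """Snapshot agent state before the step and roll back on failure,
    so a partial step never leaves the agent half-updated."""
    snapshot = copy.deepcopy(agent.__dict__)  # assumes deep-copyable instance state
    try:
        agent.step()
    except Exception:
        agent.__dict__.clear()
        agent.__dict__.update(snapshot)  # restore the pre-step state
        raise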
4. Input Validation Issues
# Steps to reproduce input validation problems:
from mesa_llm import ModuleLLM
from mesa_llm.tools import ToolManager

llm = ModuleLLM(model="gpt-4")

# No input sanitization - potential injection attacks
malicious_prompt = "test\x00\x01\x02prompt with control chars"
response = llm.generate(malicious_prompt)  # No validation

# Invalid tool parameters are not validated; this crashes instead of
# being handled gracefully
tm = ToolManager()
tm.call("some_tool", {"invalid_param": None})
Additional context
Current Code Issues
Missing Error Handling (module_llm.py lines 144-146):
response = completion(**completion_kwargs)
return response # No error handling, validation, or recovery
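A hedged sketch of what those lines could become; completion and completion_kwargs come from the fragment above, while the exception handling and response check are assumptions about the fix, not existing code:

try:
    response = completion(**completion_kwargs)
except Exception as e:  # placeholder for the provider's real error types
    raise RuntimeError(f"LLM completion failed (kwargs: {sorted(completion_kwargs)})") from e
if not getattr(response, "choices", None):  # minimal response validation
    raise ValueError("empty or malformed completion response")
return response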
Tool Execution Without Isolation (tool_manager.py):
# Tool failures can crash entire agent execution
result = tool_function(**kwargs)
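A library-side containment sketch for that dispatch point (names mirror the fragment above; the structured error result is an assumption about the fix, not current behavior):

try:
    result = tool_function(**kwargs)
except Exception as e:
    # Contain the failure to this one call: hand back a machine-readable
    # error instead of letting the exception escape the agent loop
    result = {"error": f"{type(e).__name__}: {e}",
              "tool": getattr(tool_function, "__name__", "unknown")}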
Impact Assessment
- Simulations crash unexpectedly without recovery options
- Data corruption possible with invalid LLM responses
- Debugging difficulties due to insufficient error context
- Production unsuitability due to reliability concerns
- User experience issues with cryptic error messages
Current Workarounds
Users currently have to:
- Manually wrap all LLM calls in try/except blocks
- Implement their own tool isolation mechanisms
- Create custom state validation logic
- Add input sanitization manually
- Build their own recovery mechanisms
Severity: Critical
This issue prevents Mesa-LLM from being used in production environments where reliability is essential. The lack of error handling makes the system unsuitable for:
- Long-running simulations
- Production deployments
- Multi-agent systems
- Real-time applications
- Enterprise use cases
Related Issues
- Performance degradation during API outages
- Memory leaks from failed operations
- Inconsistent agent behavior after errors
- Difficulty debugging distributed agent failures
Acceptance Criteria for Fix
- Each reproduction scenario above completes without crashing the simulation
- API failures are retried and surfaced as actionable errors rather than raw provider exceptions
- A failed tool call is isolated to that call and reported back to the agent as a structured error
- Agent state can be checkpointed before a step and restored after a failure
- Prompts and tool parameters are validated and sanitized, with clear error messages on rejection