A Python tool for parsing, normalizing, and clustering log messages using fingerprinting. It automatically groups similar error logs together and tracks them as incidents, making it easier to identify recurring problems in your logs.
This is a personal learning project to explore log processing, fingerprinting, and AI-powered root cause analysis.
This project helps manage large volumes of logs by automatically detecting patterns. It processes raw log messages, normalizes them by removing variable elements (like UUIDs, IP addresses, timestamps), and groups similar logs together using fingerprinting. The system then tracks these groups as incidents and can optionally analyze them with an LLM to determine root causes.
The pipeline works as follows:
- Parse logs - Converts JSON or text logs into structured data
- Normalize - Removes variable elements (UUIDs, emails, IPs, hex values, timestamps, quoted strings)
- Fingerprint - Creates a SHA-256 hash based on normalized content
- Store - Persists incidents in SQLite for tracking
- Analyze - Optionally sends to an LLM (DeepSeek R1 via Bedrock) for root cause analysis
Without normalization, logs that represent the same underlying error would be treated as different because they contain different IDs, timestamps, or addresses. For example:
User 550e8400-e29b-41d4-a716-446655440000 failed to login from 192.168.1.1
User 6ba7b810-9dad-11d1-80b4-00c04fd430c8 failed to login from 10.0.0.1
After normalization, both become:
User <uuid> failed to login from <ipv4>
This allows the system to correctly identify these as the same error pattern.
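The normalization step above can be sketched with two regular expressions. This is a minimal illustration rather than the project's actual normalizer: the pattern definitions and the `normalize` function name are assumptions, and only UUIDs and IPv4 addresses are handled here.

```python
import re

# Illustrative patterns only; the project's real regexes may differ.
UUID_RE = re.compile(
    r"\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\b"
)
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def normalize(message: str) -> str:
    """Replace UUIDs and IPv4 addresses with stable placeholders."""
    # UUIDs are replaced first so their digit groups are never seen by later patterns.
    message = UUID_RE.sub("<uuid>", message)
    return IPV4_RE.sub("<ipv4>", message)


first = normalize(
    "User 550e8400-e29b-41d4-a716-446655440000 failed to login from 192.168.1.1"
)
second = normalize(
    "User 6ba7b810-9dad-11d1-80b4-00c04fd430c8 failed to login from 10.0.0.1"
)
print(first == second)
```

Both inputs collapse to the same string, which is what lets the later fingerprinting step group them.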
The system uses SHA-256 hashes to create fingerprints from normalized log content. There are two fingerprinting methods:
- create_fingerprint() - Uses only the log message and level
- create_fingerprint_with_stack_trace() - Also includes the stack trace when available
The stack trace version is optimized to focus on relevant information:
- Only the first 5 lines of stack traces are considered (lines beyond this are typically framework code)
- For string stack traces, only the first 100 characters are used
- Both the message and stack trace are normalized before hashing
The fingerprint includes the log level, so identical messages with different severity levels produce different fingerprints.
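The sketch below shows one way the two fingerprinting rules could fit together. It is not the project's code: the function names `fingerprint` and `fingerprint_with_stack_trace`, their signatures, the `|` separator, and the upper-casing of the level are all assumptions, and the inputs are expected to be normalized already. Only the SHA-256 hash, the 5-line limit, and the 100-character limit come from the description above.

```python
import hashlib

MAX_STACK_LINES = 5    # from the text: only the first 5 lines are considered
MAX_STACK_CHARS = 100  # from the text: string traces are cut to 100 characters


def fingerprint(level: str, normalized_message: str) -> str:
    """Hash the level and the normalized message (hypothetical signature)."""
    payload = f"{level.upper()}|{normalized_message}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def fingerprint_with_stack_trace(level, normalized_message, stack_trace=None):
    """Also hash a truncated stack trace, given as a list of lines or a string."""
    parts = [level.upper(), normalized_message]
    if isinstance(stack_trace, list):
        # Keep only the leading frames; deeper ones are assumed to be framework code.
        parts.extend(stack_trace[:MAX_STACK_LINES])
    elif isinstance(stack_trace, str):
        parts.append(stack_trace[:MAX_STACK_CHARS])
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


message = "User <uuid> failed to login from <ipv4>"
print(fingerprint("ERROR", message) != fingerprint("WARN", message))
```

Because the level is part of the hashed payload, the same message logged at ERROR and WARN yields two different fingerprints, and differences past the truncation limits do not affect the result.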
Fingerprinted logs are stored as incidents in SQLite. Each incident contains:
- id - Unique identifier (UUID7)
- source_id - Service that generated the log
- fingerprint - Hash used for grouping
- count - Number of occurrences
- status - Current state (NEW, PROCESSING, AWAIT_USER_ACTION, RESOLVED, etc.)
- root_cause - LLM analysis of what went wrong
- recommendations - Suggested actions from the LLM
- timestamps - first_seen, last_seen, and last_changed tracked separately
The database enforces uniqueness on (source_id, fingerprint) to prevent duplicate incidents for the same error from the same source.
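As an illustration of the uniqueness rule, here is a minimal schema sketch using Python's built-in `sqlite3` module. The table name `incidents`, the column types, the defaults, the `insert_incident` helper, and the sample values are assumptions made for this example; only the field names and the (source_id, fingerprint) constraint come from the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE incidents (
        id TEXT PRIMARY KEY,
        source_id TEXT NOT NULL,
        fingerprint TEXT NOT NULL,
        count INTEGER NOT NULL DEFAULT 1,
        status TEXT NOT NULL DEFAULT 'NEW',
        root_cause TEXT,
        recommendations TEXT,
        first_seen TEXT NOT NULL,
        last_seen TEXT NOT NULL,
        last_changed TEXT NOT NULL,
        UNIQUE (source_id, fingerprint)
    )
    """
)


def insert_incident(conn, incident_id, source_id, fingerprint, seen_at):
    """Insert a new incident; all three timestamps start at the same value."""
    conn.execute(
        "INSERT INTO incidents "
        "(id, source_id, fingerprint, first_seen, last_seen, last_changed) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (incident_id, source_id, fingerprint, seen_at, seen_at, seen_at),
    )


# Made-up sample values, for illustration only.
insert_incident(conn, "id-1", "auth-service", "abc123", "2024-01-01T00:00:00Z")
try:
    # Same source and fingerprint: the UNIQUE constraint rejects this row.
    insert_incident(conn, "id-2", "auth-service", "abc123", "2024-01-01T00:05:00Z")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Same fingerprint from a different source is a separate incident.
insert_incident(conn, "id-3", "billing-service", "abc123", "2024-01-01T00:10:00Z")
```

The constraint is scoped per source, so two services that emit the same error pattern are tracked as two incidents.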
The main.py script runs the complete pipeline:
- Reads logs from the sample file
- Parses and fingerprints each log
- Checks if an incident already exists (by source_id + fingerprint)
- If it exists, increments the occurrence count
- If it doesn't exist, creates a new incident and sends it to an LLM for analysis
- Updates the incident with LLM findings
pdm run python main.py
Note: The LLM analysis uses AWS Bedrock with the DeepSeek R1 model (us.deepseek.r1-v1:0). You will need AWS credentials configured and access to this model. Without it, the fingerprinting and clustering will still work, but LLM analysis will fail.
Bedrock is not strictly required, since the instructor library supports various LLM providers.
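The check-then-create flow that main.py is described as following can be sketched without a database or an LLM. Everything here is a stand-in: `process_log`, the in-memory `store` dictionary, and the `fake_analyze` callback are hypothetical names used to show that analysis runs only when an incident is first created.

```python
from typing import Callable


def process_log(
    store: dict, source_id: str, fingerprint: str, analyze: Callable[[str], str]
) -> dict:
    """Increment an existing incident, or create one and analyze it once."""
    key = (source_id, fingerprint)
    incident = store.get(key)
    if incident is not None:
        # Known pattern: just count the new occurrence.
        incident["count"] += 1
        return incident
    # New pattern: create the incident, then record the analysis result.
    incident = {"count": 1, "status": "NEW", "root_cause": None}
    store[key] = incident
    incident["root_cause"] = analyze(fingerprint)
    return incident


calls = []


def fake_analyze(fingerprint: str) -> str:
    """Stand-in for the LLM call; records how often it is invoked."""
    calls.append(fingerprint)
    return "stub root cause"


store = {}
process_log(store, "auth-service", "abc123", fake_analyze)
process_log(store, "auth-service", "abc123", fake_analyze)
print(store[("auth-service", "abc123")]["count"], len(calls))
```

Passing the analyzer in as a callback keeps the clustering logic usable when no LLM provider is configured.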
The parser expects CloudWatch-style JSON logs but is flexible enough to handle various formats. It recognizes common field names:
Timestamps: timestamp, @timestamp, time, ts
Severity: level, severity, log_level, logLevel
Message: message, msg, log, event
Tracing: request_id, req_id, trace_id, correlation_id, requestId
Errors: errorMessage, error_message, errorType, error_type
Service: sourceId, source_id, service, serviceName
If JSON parsing fails, the system falls back to regex-based text parsing to extract timestamps and log levels from plain text logs.
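A rough sketch of alias lookup with a text fallback is shown below. The alias tuples are taken from the list above; the `parse_log` function, the `FIELD_ALIASES` mapping, the output keys, and the fallback regex (an ISO-like timestamp followed by a level) are assumptions, since the project's actual patterns are not shown here.

```python
import json
import re

FIELD_ALIASES = {
    "timestamp": ("timestamp", "@timestamp", "time", "ts"),
    "level": ("level", "severity", "log_level", "logLevel"),
    "message": ("message", "msg", "log", "event"),
    "source_id": ("sourceId", "source_id", "service", "serviceName"),
}

# Hypothetical fallback pattern: ISO-like timestamp, then a level, then the rest.
TEXT_RE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}\S*)\s+"
    r"(?P<level>DEBUG|INFO|WARN|WARNING|ERROR|CRITICAL|FATAL)\s+"
    r"(?P<message>.*)$"
)


def parse_log(raw: str) -> dict:
    """Parse a JSON log line, falling back to regex-based text parsing."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        data = None
    if isinstance(data, dict):
        # Take the first alias present in the record for each canonical field.
        return {
            field: next((data[alias] for alias in aliases if alias in data), None)
            for field, aliases in FIELD_ALIASES.items()
        }
    match = TEXT_RE.match(raw.strip())
    if match:
        return {**match.groupdict(), "source_id": None}
    # Nothing recognized: keep the raw line as the message.
    return {"timestamp": None, "level": None, "message": raw, "source_id": None}


from_json = parse_log(
    '{"msg": "boom", "severity": "ERROR", "ts": "2024-01-01T12:00:00Z", "service": "auth"}'
)
from_text = parse_log("2024-01-01T12:00:00Z ERROR Connection refused")
```

Both calls return the same four keys, so later stages do not need to know which path produced the record.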
The normalizer replaces the following patterns with placeholders:
- UUIDs → <uuid>
- Email addresses → <email>
- IPv4 addresses → <ipv4>
- IPv6 addresses → <ipv6>
- Hexadecimal values → <hex>
- Double-quoted strings → <str1>
- Single-quoted strings → <str2>
Known limitations:
- The IPv6 regex does not fully handle compressed notation (e.g., 2001::1). It matches partial addresses.
- The text parser does not yet extract key=value style parameters.
- Other statuses (RESOLVED, FIXED, etc.) are not yet implemented.
Install dependencies using PDM:
pdm install
Requires Python 3.14. For full LLM analysis functionality, you will need AWS credentials and access to AWS Bedrock with the DeepSeek R1 model (or any other model of your choice).