"Where's Waldo?" for Cybersecurity: fleet-wide anomaly detection powered by unsupervised machine learning.
Created with Claude.ai but supervised by a human (me apparently).
IronSift is a Rust-based security analyzer that finds anomalous machines in a fleet by comparing their process (and optionally file access) behavior. It does not rely on attack signatures or threat feeds: it learns what is "normal" from your own data and flags machines that stand out.
- Fleet mode (default): You feed process logs from many machines (CSV, JSON, or JSONL). IronSift builds a behavioral profile per machine, turns them into vectors (TF-IDF), and runs DBSCAN clustering. Machines that end up alone (noise) or in a small minority cluster are reported as anomalies, with severity and risk factors (entropy, suspicious paths, unexpected root, etc.).
- Temporal mode: For a single machine, you can compare two or more snapshots over time. IronSift reports new processes, new or modified files, and new IP connections between snapshots; no clustering is involved.
- File mode (--files): File access logs per host are turned into file profiles (counts per FileSignature, plus per-path mtime and metadata). Fleet analysis combines TF-IDF + DBSCAN (same pattern as process mode: noise / minority cluster) with explicit cross-host rules: mtime vs fleet median, owner/group/size baselines on comparable paths, fleet-relative access outliers (e.g. root UID on a path is flagged only where most peers use non-root; there are no blanket "root read" alerts), rare signatures seen on a single host, and configurable mtime-vs-access heuristics. Rows matching file_excluded_* regexes are never merged into profiles.
Input can come from CSV, JSON, or JSONL (one JSON object per line; each file can be one machine). Output is a console report and an optional JSON forensic report for integration with other tools.
Inputs: process logs (CSV / JSON / JSONL), file access logs, or temporal snapshots. Each input follows one of three paths:

- Fleet analysis (process logs): group by machine_id, resolve parents, compute entropy and paths → one profile per machine (process counts) → TF-IDF matrix (machines × features) → DBSCAN (noise = outlier, small cluster = minority).
- File analysis (--files): group by machine_id, collect per-path mtime and metadata → one file profile per machine (file + mtime + owner/group/size) → TF-IDF plus mtime and metadata fleet checks → DBSCAN plus fleet rules (mtime, metadata, FLEET OUTLIER, rare).
- Temporal (same machine over time): build one snapshot per time point → diff snapshots (new processes, new/modified files, new IPs).

Outputs: console report (anomalies, severity, process/file risk factors) plus optional JSON export.
In short: Fleet and file modes turn many machines into profiles, then use TF-IDF + DBSCAN so that hosts landing in noise or a small cluster stand out from the dense majority. File mode also flags hosts using fleet baselines (mtime, owner/group/size) and path-level minority patterns (root vs non-root, permissions, recent-mtime signal), not a per-row rule that "every root access is suspicious." Temporal mode skips clustering and diffs snapshots of one machine.
use ironsift::{build_profiles_simple, analyze_fleet, DetectionConfig};
fn main() {
let config = DetectionConfig::default();
// Just provide (machine_id, process_name, parent_name) - PIDs handled automatically!
let processes = vec![
("server1".to_string(), "nginx".to_string(), "systemd".to_string()),
("server1".to_string(), "worker".to_string(), "nginx".to_string()),
("server2".to_string(), "miner".to_string(), "systemd".to_string()), // Anomaly
];
let profiles = build_profiles_simple(processes, &config);
let report = analyze_fleet(&profiles, &config).unwrap();
report.print();
}

use ironsift::{ProcessBuilder, ProcessEntry, build_profiles, analyze_fleet, DetectionConfig};
fn main() {
let config = DetectionConfig::default();
let mut builder = ProcessBuilder::new();
// Simple method
builder.add_process("server1", "nginx", "systemd");
// Or fluent API with full control
builder.add(
ProcessEntry::new("server1".to_string(), "worker".to_string())
.parent("nginx")
.uid(33)
.path("/usr/sbin/nginx")
.args("worker process")
);
// NEW: Automatic command line parsing!
builder.add_command("server2", "/usr/bin/postgres -D /var/lib/postgresql/data", Some("systemd"));
// NEW: Bare commands (no full path) work too!
builder.add_command("server3", "ls /etc/", Some("bash"));
// NEW: JSON log parsing!
builder.add_json(r#"{"host": "server4", "cmd": "nginx", "uid": 33}"#);
let profiles = build_profiles(builder.build(), &config);
let report = analyze_fleet(&profiles, &config).unwrap();
report.print();
}

use ironsift::{RawLogEntry, build_profiles, analyze_fleet, DetectionConfig};
fn main() {
let config = DetectionConfig::default();
let entries = vec![
RawLogEntry {
machine_id: "server1".to_string(),
pid: 1, ppid: 0,
name: "systemd".to_string(),
uid: 0,
path: "/usr/lib/systemd/systemd".to_string(),
args: "--system".to_string(),
timestamp: None,
},
// ... more entries
];
let profiles = build_profiles(entries, &config);
let report = analyze_fleet(&profiles, &config).unwrap();
report.print();
}

See EXAMPLES.md for complete usage examples.
Compare multiple snapshots of the same machine over time to spot new processes, new or modified files, and new IP connections, without fleet-wide clustering.
| Concept | Description |
|---|---|
| MachineSnapshot | One point-in-time view: processes + file accesses + connections for a single machine |
| TemporalDiff | Diff between two snapshots: new_processes, new_files, modified_files (mtime), new_connections |
| RawConnectionEntry | Connection log: machine_id, remote_ip, optional local_ip, remote_port, process_name, timestamp |
Example: build a baseline snapshot (e.g. Monday 10:00), then a current snapshot (Monday 14:00); compare_temporal(&baseline, &current) yields new processes, files, and IPs.
use ironsift::{build_machine_snapshot, compare_temporal, compare_temporal_series,
DetectionConfig, RawLogEntry, RawFileEntry, RawConnectionEntry};
let config = DetectionConfig::default();
let baseline = build_machine_snapshot("server1", "2024-01-01T10:00Z",
process_entries_t1, file_entries_t1, connection_entries_t1, &config);
let current = build_machine_snapshot("server1", "2024-01-01T14:00Z",
process_entries_t2, file_entries_t2, connection_entries_t2, &config);
let diff = compare_temporal(&baseline, &current);
// diff.new_processes, diff.new_files, diff.modified_files, diff.new_connections
// Or compare a series of snapshots (T1 vs T2, T2 vs T3, ...)
let diffs = compare_temporal_series(&[snap1, snap2, snap3]);

Run the demo: cargo run --example temporal
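As a rough illustration of what a snapshot diff computes, the sketch below applies plain set difference to process names and remote IPs. The Snapshot and Diff types and the snapshot, diff_snapshots, and diff_series functions are hypothetical stand-ins written for this example; they are not IronSift's MachineSnapshot, TemporalDiff, or compare_temporal, whose internals are not shown here.

```rust
use std::collections::BTreeSet;

/// Hypothetical, reduced snapshot: only process names and remote IPs.
#[derive(Debug, Clone, Default)]
pub struct Snapshot {
    pub processes: BTreeSet<String>,
    pub remote_ips: BTreeSet<String>,
}

/// Hypothetical diff: items present in `current` but not in `baseline`.
#[derive(Debug, PartialEq, Eq)]
pub struct Diff {
    pub new_processes: Vec<String>,
    pub new_connections: Vec<String>,
}

/// Convenience constructor for the example.
pub fn snapshot(processes: &[&str], remote_ips: &[&str]) -> Snapshot {
    Snapshot {
        processes: processes.iter().map(|s| s.to_string()).collect(),
        remote_ips: remote_ips.iter().map(|s| s.to_string()).collect(),
    }
}

/// Set difference in both dimensions; results come back in sorted order.
pub fn diff_snapshots(baseline: &Snapshot, current: &Snapshot) -> Diff {
    Diff {
        new_processes: current
            .processes
            .difference(&baseline.processes)
            .cloned()
            .collect(),
        new_connections: current
            .remote_ips
            .difference(&baseline.remote_ips)
            .cloned()
            .collect(),
    }
}

/// Pairwise diffs over a series: T1 vs T2, T2 vs T3, and so on.
pub fn diff_series(snapshots: &[Snapshot]) -> Vec<Diff> {
    snapshots
        .windows(2)
        .map(|pair| diff_snapshots(&pair[0], &pair[1]))
        .collect()
}
```

A series of N snapshots yields N-1 diffs, which mirrors the "T1 vs T2, T2 vs T3" behavior described for compare_temporal_series.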
- File fleet: Fleet-relative FLEET OUTLIER signals (root/permissions/recent-mtime per path vs majority), ingest file_excluded_*, configurable file_recent_mtime, stricter recent-mtime heuristic to cut false positives.
- Process/file profiles: Hot strings deduplicated via interning (Arc<str> on signatures and file maps).
- Expanded tests for file fleet and exclusions.
- Enhanced Detailed Console Output - Rich reporting with attack categorization
- Automatic Command Line Parsing - Handles bare commands (ls /etc/) and full paths
- Native JSON Log Parsing - Docker, Kubernetes, CloudWatch, Elasticsearch support
- Comprehensive documentation (15+ guides)
- Broad test coverage
- Three flexible APIs (Simple, Builder, Direct)
- Automatic PID/PPID resolution
- Reorganized project structure (CLI separated)
- Extensive documentation
- Core DBSCAN clustering
- TF-IDF feature engineering
- Anomaly detection
- Basic reporting
IronSift accepts data in various formats - choose what works for your logs:
builder.add_command("server1", "/usr/bin/nginx -c /etc/nginx.conf", Some("systemd"));
// Automatically extracts: name="nginx", path="/usr/bin/nginx", args="-c /etc/nginx.conf"

// Common in ps output, shell commands
builder.add_command("server1", "ls /etc/", Some("bash"));
builder.add_command("server1", "grep error app.log", Some("bash"));
// Works too: name="ls", path="ls", args="/etc/"

// Single JSON entry
builder.add_json(r#"{"host": "server1", "cmd": "/usr/bin/nginx", "uid": 33}"#);
// Batch (JSON array or NDJSON)
builder.add_json_batch(r#"[
{"container": "web-1", "command": "nginx", "userid": 33},
{"node": "worker-1", "cmd": "python3 app.py", "uid": 1000}
]"#);

Supported JSON key names:
- Machine: machine_id, hostname, host, server, node, container, pod
- Command: command, cmd, cmdline, commandline
- User: uid, user_id, userid
See JSON_PARSING.md and COMMAND_PARSING.md for complete documentation.
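To make the name / path / args split concrete, here is a minimal sketch of one way such a command line could be broken apart. ParsedCommand and parse_command are hypothetical names introduced for this example, and the rule used (split at the first whitespace, take the last path component as the name) is an assumption that reproduces the two results quoted above; the real parser is documented in COMMAND_PARSING.md and may handle quoting and other cases differently.

```rust
/// Hypothetical result type; field names follow the comments in the examples above.
#[derive(Debug, PartialEq, Eq)]
pub struct ParsedCommand {
    pub name: String,
    pub path: String,
    pub args: String,
}

/// Split a raw command line into executable path, short name, and arguments.
/// Assumption: no shell quoting or escaping is handled in this sketch.
pub fn parse_command(cmdline: &str) -> Option<ParsedCommand> {
    let trimmed = cmdline.trim();
    if trimmed.is_empty() {
        return None;
    }
    // Everything before the first whitespace is the executable; the rest is args.
    let (path, args) = match trimmed.split_once(char::is_whitespace) {
        Some((exe, rest)) => (exe, rest.trim_start()),
        None => (trimmed, ""),
    };
    // The name is the last path component; a bare command is its own name.
    let name = path.rsplit('/').next().unwrap_or(path);
    Some(ParsedCommand {
        name: name.to_string(),
        path: path.to_string(),
        args: args.to_string(),
    })
}
```

With this rule, a bare command such as ls keeps path equal to name, which matches the behavior described for add_command.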
| Feature | Description |
|---|---|
| Multivariate Analysis | Analyzes 6 dimensions: Process Name, Parent (auto-resolved), UID, Path, Entropy, Path Risk |
| PID Awareness | Automatically resolves parent processes from PID/PPID relationships |
| Unsupervised Learning | Zero-config detection: no signature database required |
| Scale Invariant | Works on 10 logs or 10 million logs |
| Minority Cluster Detection | Identifies coordinated attacks (botnets, APTs) |
| High Entropy Detection | Flags obfuscated commands and encoded payloads |
| Suspicious Path Analysis | Detects execution from /tmp, /dev/shm, hidden directories |
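The high-entropy check can be illustrated with the standard Shannon entropy formula, H = -Σ p·log2(p), computed over the symbols of an argument string. The sketch below is an assumption about the general approach, not IronSift's code: shannon_entropy and is_high_entropy are hypothetical names, and counting bytes (rather than Unicode characters or tokens) is a choice made for this example. The 4.5 threshold in the usage note is the default entropy_threshold from the configuration.

```rust
use std::collections::HashMap;

/// Shannon entropy of a string in bits per byte.
/// Returns 0.0 for the empty string.
pub fn shannon_entropy(s: &str) -> f64 {
    let bytes = s.as_bytes();
    if bytes.is_empty() {
        return 0.0;
    }
    // Count how often each byte value occurs.
    let mut counts: HashMap<u8, usize> = HashMap::new();
    for &b in bytes {
        *counts.entry(b).or_insert(0) += 1;
    }
    let len = bytes.len() as f64;
    // H = -sum(p * log2(p)) over observed byte values.
    counts
        .values()
        .map(|&c| {
            let p = c as f64 / len;
            -p * p.log2()
        })
        .sum()
}

/// Hypothetical helper: true when the arguments exceed the configured threshold.
pub fn is_high_entropy(args: &str, threshold: f64) -> bool {
    shannon_entropy(args) > threshold
}
```

Short, repetitive arguments such as --system stay well below 4.5 bits, while long strings drawing on many distinct symbols (typical of encoded payloads) can exceed it.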
Logs can include path, mtime, permissions, owner, group, and size (CSV columns or JSON/JSONL fields). Profiles aggregate per FileSignature (path + uid + flags + optional metadata). IronSift compares hosts using several independent signals:
| Signal | What it does |
|---|---|
| DBSCAN (clustering-only) | TF-IDF over unique file signatures × normalized counts per host, then DBSCAN. Noise (cluster_id = none) or membership in a non-largest cluster can mark a host as anomalous even if no text feature lines are attached: the signal is purely geometric distance from the main blob. |
| MTIME anomaly | Same path on ≥3 hosts: flags machines whose mtime is >24 hours from the fleet median for that path (MTIME ANOMALY). |
| Metadata anomaly | On comparable paths (/etc/…, */bin/*, */sbin/*, /usr/bin/…, /usr/sbin/…, /var/log/…), with ≥3 hosts and a majority value appearing ≥2 times, hosts that disagree on owner, group, or size are flagged (METADATA ANOMALY). |
| FLEET OUTLIER (path minorities) | For paths seen on ≥3 hosts with a strict majority (majority count ≥2): flags hosts in the minority class for root vs non-root access to that path, world-writable vs not, group-writable (only under paths containing /etc or /tmp), and recent mtime vs access (see file_recent_mtime in config). Avoids fleet-wide false positives when everyone behaves the same. |
| Rare file access | A full signature appears on exactly one machine in the fleet (Rare file access: …). |
| Ingest exclusions | file_excluded_path_regexes / file_excluded_filename_regexes: matching rows are not merged into profiles (enforced in the merge path). |
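The MTIME anomaly rule can be sketched as a median comparison for a single path. In the example below, mtime_outliers is a hypothetical function, mtimes are assumed to be Unix timestamps in seconds, and the median of an even-sized fleet is taken as the mean of the two middle values; none of these details are confirmed by the table above, which only states the ≥3 hosts and >24 hours conditions.

```rust
/// 24 hours in seconds, the deviation limit stated for MTIME ANOMALY.
const DAY_SECS: i64 = 24 * 60 * 60;

/// For one path, return the hosts whose mtime is more than 24 hours away
/// from the fleet median. Input: (host, mtime as Unix seconds) pairs.
/// Paths seen on fewer than 3 hosts are skipped.
pub fn mtime_outliers(mtimes: &[(&str, i64)]) -> Vec<String> {
    if mtimes.len() < 3 {
        return Vec::new();
    }
    let mut sorted: Vec<i64> = mtimes.iter().map(|&(_, t)| t).collect();
    sorted.sort_unstable();
    let mid = sorted.len() / 2;
    // Assumption: even-sized fleets use the mean of the two middle values.
    let median = if sorted.len() % 2 == 1 {
        sorted[mid]
    } else {
        (sorted[mid - 1] + sorted[mid]) / 2
    };
    mtimes
        .iter()
        .filter(|&&(_, t)| (t - median).abs() > DAY_SECS)
        .map(|&(host, _)| host.to_string())
        .collect()
}
```

Using the median rather than the mean keeps a single tampered file from shifting the baseline it is compared against.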
Per-signature helpers (e.g. in FileSignature::risk_factors) still describe suspicious path, system dir, root, etc. for local explanations; the fleet report does not treat "root read" or "system directory" as automatic anomalies without a fleet-relative or rare-signature signal above.
Config: file_recent_mtime tunes clock skew, time windows, and volatile path prefixes for the recent-mtime heuristic. Metadata comparison stays scoped so /home/… variation does not dominate.
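The majority-based metadata comparison can be pictured as a vote per path. The sketch below is a simplified assumption of how such a rule might look: metadata_outliers is a hypothetical function, it handles one attribute (for example owner) at a time, and it treats a tie for first place as "no majority", which is a choice made here and not something the description above specifies.

```rust
use std::collections::HashMap;

/// For one path and one attribute (e.g. owner), return hosts that disagree
/// with the fleet majority. Input: (host, attribute value) pairs.
/// Requires at least 3 hosts and a majority value seen at least twice.
pub fn metadata_outliers(values: &[(&str, &str)]) -> Vec<String> {
    if values.len() < 3 {
        return Vec::new();
    }
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for &(_, v) in values {
        *counts.entry(v).or_insert(0) += 1;
    }
    // Rank values by count (descending), then by name for determinism.
    let mut ranked: Vec<(&str, usize)> = counts.into_iter().collect();
    ranked.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(b.0)));
    let (majority, top) = ranked[0];
    // Assumption: a tie for first place means there is no usable baseline.
    let tied = ranked.len() > 1 && ranked[1].1 == top;
    if top < 2 || tied {
        return Vec::new();
    }
    values
        .iter()
        .filter(|&&(_, v)| v != majority)
        .map(|&(host, _)| host.to_string())
        .collect()
}
```

Because the baseline comes from the fleet itself, a path where every host has an unusual owner produces no alert; only the disagreeing hosts are reported.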
IronSift can identify:
- Cryptominers: Unusual processes with high CPU, suspicious paths
- Web Shells: PHP/Python processes with high-entropy eval() payloads
- Privilege Escalation: Normal processes suddenly running as root (UID 0)
- Lateral Movement: Unusual SSH/SCP activity with anomalous targets
- Rootkits: Processes masquerading as system services
- APT Campaigns: Small clusters of compromised machines with identical malware
- Rust 1.70+ (rustup recommended)
- 4GB+ RAM for large datasets
cd ironsift
cargo build --release

IronSift now includes a web server binary with REST API and a browser UI for:
- ingesting process/file datasets
- attaching tags for comparison cohorts
- running suspicious host detection (IronSift + optional AnoMark)
- visualizing fleet risk using a honeycomb-style grid
Run:
cargo run --bin ironsift-server

Then open http://localhost:8080.
API endpoints:
- GET /api/health
- GET/POST /api/datasets
- POST /api/datasets/purge (delete all datasets and ingested events)
- POST /api/datasets/upload (multipart file, optional query: name, tags)
- POST /api/datasets/:id/tags
- GET/POST /api/runs
- GET/POST /api/run-config (manage default detection config used by new runs)
- GET /api/runs/:id
- GET /api/runs/:id/detections
- GET /api/fleet/honeycomb?run_id=<id>&min_score=<0..1>&severity=<LOW|MEDIUM|HIGH|CRITICAL>
- POST /api/pipeline/auto (one-shot full automation)
- GET/POST /api/anomark/config (set model/columns from UI or API)
- GET /api/anomark/availability (which model files exist: platform config + saved trainings; used by "Runs & Findings" to enable AnoMark)
- POST /api/anomark/train (train AnoMark model from uploaded/local training data)
- GET/POST /api/sigma-zero/config (sigmazero / sigma_zero crate in-process, rules directory, field map)
- POST /api/sigma-zero/check (run Sigma rules on selected process datasets or a server log path; exports JSONL with process_name/command_line and evaluates in memory)
AnoMark is included on a run when enable_anomark=true and a model file exists. Optionally set anomark_train_id to a saved training id to use that model.bin instead of the platform model_path. The Web UI Runs & Findings section loads GET /api/anomark/availability, pre-enables the checkbox when any model is on disk, and offers a model picker.
Configure environment variables:
- ANOMARK_MODEL_PATH (required to activate scoring)
- ANOMARK_BIN (only if you use external tools; the server uses the anomark crate in-process for training and scoring)
- ANOMARK_COLUMN (default: command)
- ANOMARK_MACHINE_FIELD (default: machine_id)
For osquery alignment, datasets can be tagged with the schema profile osquery-5.22.1.
Run server:
cargo run --bin ironsift-server

Then call:
curl -sS -X POST http://localhost:8080/api/pipeline/auto \
-H "Content-Type: application/json" \
-d '{
"directory": "./data",
"baseline_tag": "baseline",
"candidate_tag": "candidate",
"enable_anomark": true
}'

Run the shell helper script (requires server already running):
scripts/auto_pipeline.sh --dir ./data

With AnoMark:
ANOMARK_MODEL_PATH=./models/process_model.bin \
scripts/auto_pipeline.sh --dir ./data --enable-anomark

Note: on first run, server startup can take time because Rust dependencies compile.
Start the server manually (for example cargo run --bin ironsift-server) before running the script.
The script waits for /api/health and exits with guidance if server is not reachable.
If the directory has no logs yet, the script exits cleanly and tells you to use the UI to upload logs.
If no server is detected quickly (5s check), the script starts ironsift-server in foreground mode (no background/detached process).
You can now configure AnoMark directly in the Web UI:
- model path and column settings (AnoMark runs in-process, no anomark-rs binary)
- model path
- scoring column and machine field
- train a model from a training dataset path
Equivalent API example:
curl -sS -X POST http://localhost:8080/api/anomark/config \
-H "Content-Type: application/json" \
-d '{
"model_path": "./models/process_model.bin",
"column": "command",
"machine_field": "machine_id"
}'

Train example. You can omit output_model_path: the model is always written to .ironsift-platform/anomark-trains/<train_id>/model.bin and is listed for download; set output_model_path only to copy the same file to another path on disk.
curl -sS -X POST http://localhost:8080/api/anomark/train \
-H "Content-Type: application/json" \
-d '{
"training_path": "./data/train.jsonl",
"column": "command",
"order": 4,
"output_model_path": "./models/process_model.bin"
}'

Train from already ingested datasets (no raw file path required; output_model_path can be ""):
curl -sS -X POST http://localhost:8080/api/anomark/train \
-H "Content-Type: application/json" \
-d '{
"dataset_ids": ["<dataset-id-1>", "<dataset-id-2>"],
"tags": ["baseline", "prod-clean"],
"column": "command",
"order": 4,
"output_model_path": ""
}'

List past trainings: GET /api/anomark/trains. Download: GET /api/anomark/trains/<id>/model and GET /api/anomark/trains/<id>/training-data.
Sigma (sigmazero): the server links the sigmazero library (sigma_zero crate) at build time; there is no sigma-zero binary. Set rules_dir to a folder of .yml Sigma rules, then run checks from the Sigma tab or via API. IronSift converts process datasets to JSONL with process_name, command_line, machine_id, etc. Use field_map in config or the request if your rules use other field names (e.g. Windows Image → process_name).
curl -sS -X POST http://localhost:8080/api/sigma-zero/config \
-H "Content-Type: application/json" \
-d '{
"rules_dir": "/path/to/sigma-rules",
"field_map": "",
"workers": 8
}'
curl -sS -X POST http://localhost:8080/api/sigma-zero/check \
-H "Content-Type: application/json" \
-d '{
"dataset_ids": ["<id>"],
"tags": [],
"log_path": "",
"filter_tags": [],
"filter_levels": ["high", "critical"]
}'

In the UI, AnoMark Training now supports:
- training file path, or
- dataset IDs, or
- tags (it selects matching process datasets)
- optional on-disk output_model_path (not required; use the "Saved AnoMark trainings" table to download the model and training JSONL)
The Runs tab now supports:
- dataset picker checkboxes (plus optional manual ID input)
- baseline/candidate tag filters for run creation
- readable findings table (machine, severity, score, detectors, top reason)
- selecting a run and propagating it automatically to Findings (including the hex fleet view)
- editable default run config as full DetectionConfig JSON (all fields exposed)
Behavior:
- Imports all .csv, .json, .jsonl files from the directory.
- Auto-detects dataset kind (process/file) from schema/header heuristics.
- Applies tags automatically (files containing baseline in the name get baseline tag).
- Runs IronSift detection and optional in-process AnoMark (anomark crate).
- Stores run output and exposes a honeycomb-ready fleet map via API.
curl -sS -X POST "http://localhost:8080/api/datasets/upload?name=mylogs&tags=candidate,prod" \
-F "file=@./events.jsonl"

You can provide multiple files in one call:
curl -sS -X POST "http://localhost:8080/api/datasets/upload?tags=candidate" \
-F "file=@./proc_a.csv" \
-F "file=@./proc_b.jsonl" \
-F "file=@./files.json"

All imported data is persisted into a SQLite database using the platform schema:
- .ironsift-platform/events.db
- tables: datasets, dataset_tags, process_events, file_events

On Run Findings (not a separate tab), the Web UI renders a hexagonal honeycomb map:
- each hex cell = one machine
- color encodes risk score/severity
- tooltip shows host, severity, score
- API filters supported: min_score for thresholding, severity for severity-only views

Example:
curl -sS "http://localhost:8080/api/fleet/honeycomb?run_id=<RUN_ID>&min_score=0.7&severity=HIGH"

Create a realistic dataset with 100 machines and embedded attack scenarios:
cargo run --release --bin generator

Output: large_dataset.csv (100,000 logs with 10 compromised machines)
The generated data includes:
- Realistic PID/PPID relationships
- systemd as PID 1 on each machine
- Normal processes as children of systemd
- Attack processes with proper parent relationships
For file datasets, run cargo run --release --bin generator -- --files. The sample CSV includes mtime and metadata scenarios (fleet baseline owner/group/size on common paths, with a few hosts intentionally diverging) so ironsift --files exercises MTIME ANOMALY and METADATA ANOMALY lines in the report.
Analyze the fleet and display results:
cargo run --release --bin ironsift

Sample Output:
================================================================================
IRONSIFT ANALYSIS REPORT
================================================================================
Fleet Size: 100 machines
Detection Sensitivity: High
--- Configuration ---
DBSCAN Tolerance: 0.35
Entropy Threshold: 4.5
Minority Cluster Ratio: 10%
--- Cluster Distribution ---
Cluster 0: 90 machines (90.0%)
Noise (Outliers): 10 machines (10.0%)
================================================================================
Status: ANOMALIES DETECTED
================================================================================
Suspicious Machines: 10
CRITICAL (3):
These machines are isolated outliers - likely compromised
machine_013 (Distance: 1.500)
├─ Cluster: Noise (isolated outlier)
├─ Total processes: 150
├─ Suspicious processes: 50
├─ Rare processes (< 5% of fleet):
│  • kworker (path: /tmp/.X11-unix/kworker)
│  • systemd (path: /var/tmp/.cache/systemd)
├─ Suspicious processes detected:
│
│  kworker (count: 30)
│  Parent: systemd
│  Path: /tmp/.X11-unix/kworker
│  UID: 0 (root)
│  Risk factors:
│    High entropy arguments (possible obfuscation)
│    Suspicious execution path: /tmp/.X11-unix/kworker
│    Running as root (UID 0)
│    Executing from temporary directory
└─ Activity period: 2024-01-01 10:00:00 to 2024-01-07 15:30:00
HIGH (4):
Strong deviation from baseline - investigate immediately
machine_042 (Distance: 0.823)
├─ Suspicious processes: 15
└─ Unusual: php-fpm (high entropy eval payloads)
...
--- Detected Attack Patterns ---
Cryptomining (3 machines): machine_013, machine_027, machine_065
Web Shells (2 machines): machine_042, machine_088
Privilege Escalation (4 machines): machine_019, machine_051, ...
Suspicious Execution Paths (5 machines): machine_013, machine_027, ...
================================================================================
Recommended Actions:
1. Review flagged machines and investigate anomalous processes
2. Check process execution paths and command arguments
3. Verify parent-child process relationships
4. Cross-reference with network logs and file access logs
5. Export detailed report: cargo run --bin ironsift -- --export-json
================================================================================
See OUTPUT_EXAMPLES.md for complete output examples.
Generate a detailed JSON report for incident response:
cargo run --release --bin ironsift -- --export-json

Output: forensic_report.json
For use by other tools or in scripts:
| Option | Effect |
|---|---|
| -q, --quiet | Minimal output: one-line summary only (e.g. CLEAN or ANOMALIES: 5 (Critical: 2, High: 1, …)). Progress and config are suppressed. |
| --export-json - | Write the JSON report to stdout (nothing else on stdout). Use 2>/dev/null to hide progress on stderr. |
| Progress messages | Loading/config/progress lines are sent to stderr so stdout can be piped or parsed. |
Examples:
# One-line result for scripting
ironsift -q --input data.csv
# JSON only on stdout (e.g. pipe to jq or another tool)
ironsift --export-json - --input data.csv 2>/dev/null | jq '.anomalies_detected'
# Quiet + export to file
ironsift -q --export-json report.json --input data.csv

ironsift [OPTIONS]
Options:
--config <file> Load configuration from JSON file
--export-json Export detailed forensic report
--tolerance <value> Override DBSCAN tolerance (default: from config, 0.35)
--help Show help message

On first run, IronSift creates ironsift_config.json. Important keys:
{
"entropy_threshold": 4.5,
"minority_cluster_ratio": 0.10,
"dbscan_tolerance": 0.35,
"dbscan_min_samples": 2,
"normalize_features": true,
"suspicious_path_patterns": [
"/tmp/",
"/dev/shm/",
"/var/tmp/",
"/home/[^/]+/\\.[^/]+",
"^\\./",
"/(?:bin|sbin|usr/bin|usr/sbin)/\\.[^/]+"
],
"file_excluded_path_regexes": [],
"file_excluded_filename_regexes": [],
"file_recent_mtime": {
"clock_skew_minutes": 5,
"max_hours_critical_paths": 12,
"max_hours_system_elevated": 6,
"max_hours_suspicious_only": 3,
"volatile_path_prefixes": [
"/var/log/",
"/var/cache/",
"/var/lib/dpkg/",
"/var/lib/apt/",
"/var/tmp/",
"/tmp/",
"/run/",
"/proc/",
"/sys/",
"/dev/"
]
}
}

- file_excluded_path_regexes / file_excluded_filename_regexes: Rust regexes; matching file-log rows are dropped before profiling (e.g. ^/proc/, ^/var/cache/).
- file_recent_mtime: Controls the mtime vs access-time signal used in profiles and in FLEET OUTLIER comparisons (volatile prefixes, tiered hour limits).

| Parameter | Effect | Recommended Range |
|---|---|---|
| dbscan_tolerance | Detection sensitivity (process & file TF-IDF clustering) | Default 0.35; lower (e.g. 0.03-0.10) = stricter, higher = looser |
| minority_cluster_ratio | Botnet detection threshold | 0.05 - 0.15 |
| entropy_threshold | Obfuscation detection | 3.5 (sensitive) - 5.5 (strict) |
| file_recent_mtime.* | Strictness of "recent mtime near access" (file mode) | Adjust max_hours_* / volatile_path_prefixes if too noisy or too quiet |

Example: Increase sensitivity for high-security environments:
cargo run --bin ironsift -- --tolerance 0.03

| Level | Score | Meaning | Action |
|---|---|---|---|
| Critical | > 1.0 | Isolated outlier, likely compromised | Immediate isolation |
| High | 0.6-1.0 | Strong deviation, investigate ASAP | Priority investigation |
| Medium | 0.3-0.6 | Moderate anomaly, worth reviewing | Schedule review |
| Low | 0.0-0.3 | Minor deviation, may be benign | Monitor |
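The score bands in the severity table translate directly into a small mapping function. The sketch below is illustrative only: Severity and severity_for are hypothetical names, and because the table's ranges share their endpoints (0.3 and 0.6 each appear in two rows), assigning a boundary score to the higher band is an assumption made here.

```rust
/// Hypothetical severity levels, ordered from least to most severe.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Severity {
    Low,
    Medium,
    High,
    Critical,
}

/// Map a distance score to a severity band.
/// Assumption: boundary values 0.3 and 0.6 fall into the higher band,
/// and exactly 1.0 is still High because Critical is "> 1.0".
pub fn severity_for(score: f64) -> Severity {
    if score > 1.0 {
        Severity::Critical
    } else if score >= 0.6 {
        Severity::High
    } else if score >= 0.3 {
        Severity::Medium
    } else {
        Severity::Low
    }
}
```

Applied to the sample report, a distance of 1.500 maps to Critical and 0.823 maps to High, consistent with the sections those machines appear in.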
The JSON export includes:
{
"report_timestamp": "2024-12-10T15:30:00Z",
"fleet_size": 100,
"anomalies_detected": 10,
"config": { ... },
"investigation_targets": [
{
"machine_id": "machine_013",
"severity": "Critical",
"distance_score": 1.5,
"suspicious_processes": [
{
"name": "kworker",
"path": "/tmp/.X11-unix/kworker",
"parent": "systemd",
"risk_factors": [
"High entropy arguments (possible obfuscation)",
"Suspicious execution path: /tmp/.X11-unix/kworker",
"Running as root (UID 0)"
]
}
]
}
]
}

Run the comprehensive test suite:
cargo test

To check that the generator output is correctly analyzed by the CLI (catches regressions in ingestion or reporting):
./scripts/test_generator_ironsift.sh

This script builds release, generates process and file datasets, runs ironsift (and ironsift --files) on them, and verifies that anomalies are reported, including at least one MTIME ANOMALY and one METADATA ANOMALY line in the file report. Run from the repo root.
- Shannon entropy calculation
- Suspicious path detection
- Clean fleet (no false positives)
- Single outlier detection
- Minority cluster detection (botnet scenario)
- Process risk factor analysis
- PID/PPID parent resolution
- Unknown parent handling
- File fleet: DBSCAN + mtime/metadata baselines + FLEET OUTLIER path minorities + rare signatures + regex exclusions + file_recent_mtime; JSONL/CSV streaming loaders
IronSift pipeline:

- Raw input: CSV / JSON process logs or file access logs are parsed.
- Profile building: group by machine_id, resolve PPIDs to parent names, apply whitelist / path filters, then build profiles.
- Analysis: TF-IDF vectorization (rare = signal), L2 normalization, DBSCAN clustering.
- Cluster ids: noise = outlier, small cluster = minority, large cluster = baseline.
- Anomaly scoring and severity (Critical to Low), with feature reasons. Process: entropy, path, root… File: mtime, metadata, FLEET OUTLIER, rare.
- Output: console report (print) and forensic_report.json (export).

| Stage | Process mode (default) | File mode (--files) |
|---|---|---|
| Raw entry | RawLogEntry: machine_id, pid, ppid, name, path, args, uid, timestamp | RawFileEntry: machine_id, path, uid, timestamp, mtime; optional permissions, owner, group, size |
| Signature | ProcessSignature: name + parent + uid + path; is_suspicious_path, entropy | FileSignature: path + uid + metadata; is_suspicious_path, permissions, owner, group, size; mtime flags |
| Profile | MachineProfile (counts per process) | MachineFileProfile (counts per file signature + latest mtime + owner/group/size per path) |
| Analysis | analyze_fleet | analyze_files_fleet |

Both modes produce an AnalysisReport (anomalies, severity).
- PID Resolution: Automatically maps PPID to parent process names
- TF-IDF Weighting: Boosts rare processes, reduces noise from common ones
- L2 Normalization: Ensures distance metrics work correctly across varied fleet sizes
- DBSCAN: Density-based clustering that naturally identifies outliers
- Shannon Entropy: Measures randomness in command arguments (detects obfuscation)
- File fleet baselines: Median mtime and majority owner/group/size per path (on comparable paths); path-level binary minorities (root, writable flags, recent-mtime pattern) when ≥3 hosts and a clear majority; rare signatures (single-host FileSignature)
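The TF-IDF weighting and L2 normalization steps listed above can be sketched with the standard library alone. This is an illustration under stated assumptions, not IronSift's implementation: Profile, profile_from, and tfidf_l2 are hypothetical names, term frequency is count divided by the machine's total, and the IDF variant used is ln(N / df) + 1, one of several common formulas.

```rust
use std::collections::BTreeMap;

/// Hypothetical profile: feature (e.g. process signature) -> raw count.
pub type Profile = BTreeMap<String, f64>;

/// Convenience constructor for the example.
pub fn profile_from(items: &[(&str, f64)]) -> Profile {
    items.iter().map(|&(k, c)| (k.to_string(), c)).collect()
}

/// Turn raw counts into TF-IDF weights, then scale each vector to unit L2 norm.
/// Assumption: idf = ln(N / df) + 1, so fleet-wide features keep a small weight.
pub fn tfidf_l2(profiles: &[Profile]) -> Vec<BTreeMap<String, f64>> {
    let n = profiles.len() as f64;
    // Document frequency: on how many machines does each feature appear?
    let mut df: BTreeMap<&str, f64> = BTreeMap::new();
    for p in profiles {
        for k in p.keys() {
            *df.entry(k.as_str()).or_insert(0.0) += 1.0;
        }
    }
    profiles
        .iter()
        .map(|p| {
            let total: f64 = p.values().sum();
            let mut v: BTreeMap<String, f64> = p
                .iter()
                .map(|(k, &c)| {
                    let tf = if total > 0.0 { c / total } else { 0.0 };
                    let idf = (n / df[k.as_str()]).ln() + 1.0;
                    (k.clone(), tf * idf)
                })
                .collect();
            // L2 normalization so distances do not depend on log volume.
            let norm = v.values().map(|x| x * x).sum::<f64>().sqrt();
            if norm > 0.0 {
                for x in v.values_mut() {
                    *x /= norm;
                }
            }
            v
        })
        .collect()
}
```

After normalization every machine vector has length 1, so a machine that logs ten times more events than its peers is not pushed away from them by volume alone.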
IronSift treats each machine as a vector in N-dimensional feature space:
- Normal machines cluster tightly (distance β 0)
- Compromised machines drift away due to:
- Rare processes not seen elsewhere
- Unusual execution paths
- High-entropy obfuscated commands
- Privilege escalation patterns
- Abnormal parent-child relationships
- File mode: DBSCAN distance, rare file signatures, mtime far from fleet median, metadata disagreements, or minority access pattern on a path vs most peers (not "every root read is bad")
Feature space (simplified 2D view):
- Normal machines form a tight, dense cluster.
- An isolated point far from that cluster is NOISE: CRITICAL, likely compromised.
- A small cluster away from the majority is a minority: HIGH, botnet / APT pattern.
DBSCAN: density-based clustering
- Points in dense regions → same cluster (baseline).
- Points in sparse regions → "noise" = anomaly.
- Small clusters → minority = coordinated deviance.
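The three DBSCAN outcomes described above (dense cluster, small cluster, noise) can be reproduced with a compact textbook implementation. The sketch below is not IronSift's clustering code: Label and dbscan are hypothetical names, Euclidean distance is an assumption (the metric actually used is not stated here), and a point counts itself among its neighbors. The eps and min_samples arguments correspond conceptually to dbscan_tolerance and dbscan_min_samples in the configuration.

```rust
/// Cluster assignment for one point.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Label {
    Noise,
    Cluster(usize),
}

fn euclidean(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f64>().sqrt()
}

/// Indices of all points within `eps` of point `i` (including `i` itself).
fn neighbors(points: &[Vec<f64>], i: usize, eps: f64) -> Vec<usize> {
    (0..points.len())
        .filter(|&j| euclidean(&points[i], &points[j]) <= eps)
        .collect()
}

/// Textbook DBSCAN: O(n^2) neighbor search, fine for a small illustration.
pub fn dbscan(points: &[Vec<f64>], eps: f64, min_samples: usize) -> Vec<Label> {
    let mut labels: Vec<Option<Label>> = vec![None; points.len()];
    let mut next_cluster = 0;
    for i in 0..points.len() {
        if labels[i].is_some() {
            continue;
        }
        let seeds = neighbors(points, i, eps);
        if seeds.len() < min_samples {
            labels[i] = Some(Label::Noise); // may later become a border point
            continue;
        }
        let cluster = Label::Cluster(next_cluster);
        next_cluster += 1;
        labels[i] = Some(cluster);
        let mut queue = seeds;
        while let Some(j) = queue.pop() {
            match labels[j] {
                Some(Label::Cluster(_)) => continue,
                Some(Label::Noise) => {
                    labels[j] = Some(cluster); // border point: join, do not expand
                    continue;
                }
                None => {}
            }
            labels[j] = Some(cluster);
            let reachable = neighbors(points, j, eps);
            if reachable.len() >= min_samples {
                queue.extend(reachable); // core point: keep growing the cluster
            }
        }
    }
    labels.into_iter().map(|l| l.unwrap_or(Label::Noise)).collect()
}
```

Lowering eps shrinks every neighborhood, so more machines fall out of the main cluster; that is why a smaller tolerance makes detection stricter.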
Fleet: 100 web servers running nginx, postgres, node
Anomaly: Machine #42 suddenly has:
php-fpm (PID 5432, PPID 108 [apache2]) → eval(base64_decode('aGVsbG8gd29ybGQ='))
IronSift Analysis:
Raw log (machine_42: pid 5432, ppid 108, name php-fpm, args eval(base64…)) → Resolution (PPID 108 → apache2, parent resolved) → TF-IDF (rare process, 1/100, IDF boost 100×; Machine #42 vector differs from baseline) → DBSCAN (#42 sits outside the main cluster, distance ≈ 1.2, outlier) → HIGH severity anomaly
- Resolves parent: PPID 108 → apache2
- Computes TF-IDF: This exact process appears on 1/100 machines
- IDF boost: 100× signal amplification for this rare event
- DBSCAN: Machine #42 is 1.2 units away from main cluster
- Result: HIGH severity anomaly detected
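The first step of this walkthrough, mapping a PPID to a parent process name, amounts to a lookup keyed by machine and PID. The sketch below shows that idea with hypothetical names (Proc, resolve_parents); the fallback string "unknown" for a missing parent is an assumption made for this example, and the real RawLogEntry carries more fields than are used here.

```rust
use std::collections::HashMap;

/// Hypothetical, reduced log entry with only the fields needed for resolution.
#[derive(Debug, Clone)]
pub struct Proc {
    pub machine_id: String,
    pub pid: u32,
    pub ppid: u32,
    pub name: String,
}

/// Return (process name, parent name) pairs in input order.
/// PIDs are only unique per machine, so the lookup key is (machine_id, pid).
pub fn resolve_parents(entries: &[Proc]) -> Vec<(String, String)> {
    let by_pid: HashMap<(&str, u32), &str> = entries
        .iter()
        .map(|e| ((e.machine_id.as_str(), e.pid), e.name.as_str()))
        .collect();
    entries
        .iter()
        .map(|e| {
            // Assumption: a PPID with no matching entry is reported as "unknown".
            let parent = by_pid
                .get(&(e.machine_id.as_str(), e.ppid))
                .copied()
                .unwrap_or("unknown");
            (e.name.clone(), parent.to_string())
        })
        .collect()
}
```

Resolving names instead of keeping raw PIDs matters for clustering: PID numbers differ on every host, while "php-fpm under apache2" is comparable across the fleet.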
Benchmarks on a 4-core CPU:
| Fleet Size | Logs | Processing Time | Memory |
|---|---|---|---|
| 100 machines | 100K | 0.8s | 45 MB |
| 1,000 machines | 1M | 6.2s | 320 MB |
| 10,000 machines | 10M | 58s | 2.8 GB |
With parallel processing enabled (Rayon)
# Daily cron job (crontab entries must stay on a single line)
0 2 * * * cd /opt/ironsift && ./ingest_logs.sh && cargo run --release --bin ironsift -- --export-json && ./alert_soc.sh forensic_report.json

# Quick triage after breach detection
cargo run --release --bin ironsift -- --tolerance 0.03 --export-json

# Test detection against custom malware
./inject_attack.sh && cargo run --bin ironsift

Stay secure. Sift the iron from the ore.
