Conversation
There was a problem hiding this comment.
Pull request overview
Adds completion_window segmentation to per-user Prometheus metrics so dashboards can compare user throughput across SLA windows (e.g., 1h vs 24h).
Changes:
- Extract
completion_windowfrombatch_metadatawhen spawning request processing tasks. - Add
completion_windowlabel tofusillade_user_requests_in_flightgauge (inc/dec) andfusillade_user_requests_completed_totalcounter (success/failed/cancelled paths).
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
src/daemon/mod.rs
Outdated
| // Spawn a processing task | ||
| let model_clone = model.clone(); | ||
| let user_id = request.data.created_by.clone(); | ||
| let completion_window = request.data.batch_metadata.get("completion_window").cloned().unwrap_or_default(); |
There was a problem hiding this comment.
completion_window is pulled from request.data.batch_metadata, which is populated based on DaemonConfig.batch_metadata_fields (and can be empty, as in several tests). In that configuration this label becomes an empty string via unwrap_or_default(), so the new dashboard split-by-completion_window won’t work and metrics will silently aggregate under a blank value. Consider sourcing completion_window from an always-present field (e.g., carry batch_completion_window on RequestData when claiming) or at least emit a distinct fallback like "unknown" and log/warn when the metadata key is missing.
| let completion_window = request.data.batch_metadata.get("completion_window").cloned().unwrap_or_default(); | |
| let completion_window = match request.data.batch_metadata.get("completion_window") { | |
| Some(value) if !value.is_empty() => value.clone(), | |
| _ => { | |
| tracing::warn!( | |
| request_id = %request_id, | |
| batch_id = %batch_id, | |
| "Missing `completion_window` in batch_metadata; using \"unknown\" label for metrics" | |
| ); | |
| "unknown".to_string() | |
| } | |
| }; |
| .or_default() | ||
| .fetch_add(1, Ordering::Relaxed); | ||
| gauge!("fusillade_user_requests_in_flight", "user" => user_id.clone()) | ||
| gauge!("fusillade_user_requests_in_flight", "user" => user_id.clone(), "completion_window" => completion_window.clone()) |
There was a problem hiding this comment.
Adding raw completion_window as a label to high-volume per-user metrics (fusillade_user_requests_in_flight / fusillade_user_requests_completed_total) risks very high label cardinality because completion_window is user-provided and not normalized (e.g., 60m, 1h, 1h0m, etc. all become distinct series). This can significantly increase Prometheus/OTel metric storage and query cost. Consider normalizing to a canonical representation (e.g., seconds or a fixed set of buckets) before using it as a label, or restricting allowed values.
| gauge!("fusillade_user_requests_in_flight", "user" => user_id.clone(), "completion_window" => completion_window.clone()) | |
| gauge!("fusillade_user_requests_in_flight", "user" => user_id.clone()) |
Changes:
src/daemon/mod.rs— Addedemailandcompletion_windowlabels to per-user Prometheus metrics (fusillade_user_requests_in_flightgauge,fusillade_user_requests_completed_totalcounter). Email falls back touser_idif not present in batch metadata.src/manager/mod.rs— Addedbulk_delete_data(creator_id, batch_size)to theStoragetrait for chunked soft-deletion of a creator's batches and files with metadata nullification.src/manager/postgres.rs— Implementedbulk_delete_datawith two-stage chunked UPDATE usingFOR UPDATE SKIP LOCKED: soft-deletes batches (nullifies metadata, cancels active ones) then soft-deletes files. Existing purge daemon handles child row cleanup.