Research sync: reconciliation-first pipeline for gov/acc data updates #1

@zhiganov

Description

Context

gov/acc Phase 1 research uses a single Harmonica session (hst_a51081812ed9) with 40 participants. The research output includes:

  • An interactive HTML dashboard (gov-acc-research/index.html) with 7 tabs (D3/Plotly)
  • A Quartz wiki (gov-acc-site/) with 11 problem pages, 34 solution pages, and index/overview pages

When new participants complete interviews, their data needs to be incorporated into both the dashboard and wiki. This currently requires manual data extraction, metric recalculation, and propagation across 50+ files.

This issue captures lessons from a painful manual update session in which an incorrect participant count (41 instead of 40) was propagated across the entire dataset, requiring three rounds of corrections (41→39→40) because the source of truth (the Harmonica session) wasn't consulted first.

Problem

The current workflow is error-prone:

  1. Export JSON from Harmonica (or read via MCP)
  2. Manually identify new participants by comparing against existing data
  3. Map each participant's messages to problems/solutions
  4. Recalculate breadth, depth, urgency scores
  5. Update dashboard data arrays (6 separate data structures that must stay in sync)
  6. Update 11 problem wiki pages (callouts, evidence text, participant lists)
  7. Update 34 solution wiki pages (mention counts, "out of X" references)
  8. Update 3 index/overview pages (tables, rankings, narrative)
  9. Verify consistency across all files

Errors compound: one wrong number propagates everywhere, and corrections require touching every file again.

Design Requirements

1. Harmonica session is the source of truth

The sync pipeline must start by fetching live session data (get_session + get_responses), not relying on local exports or manual counts. Key metadata to extract:

  • Total participants (engaged vs completed)
  • Per-participant: display name, message count, completion status
  • Per-participant: which messages map to which problems/solutions
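A minimal extraction sketch, assuming a hypothetical `client` object wrapping `get_session` and `get_responses` and an illustrative response shape (the real Harmonica API may differ):

```python
from dataclasses import dataclass, field

@dataclass
class Participant:
    name: str
    message_count: int = 0
    completed: bool = False
    problems: set = field(default_factory=set)  # problem IDs this participant raised

def extract_session(client, session_id):
    """Pull live session data; never rely on local exports or manual counts."""
    session = client.get_session(session_id)
    participants = {}
    for r in client.get_responses(session_id):
        p = participants.setdefault(r["participant"], Participant(name=r["participant"]))
        p.message_count += 1
        p.completed = p.completed or r.get("completed", False)
        p.problems.update(r.get("problem_ids", []))
    return {"session": session, "participants": participants}
```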

2. Reconciliation before transformation

Before writing any output files, the pipeline must produce a reconciliation report:

  • New participants not yet in the output data
  • Existing participants whose data has changed
  • Participants in the output that don't match any session participant (stale/removed)
  • Flagged entries: relayed voices like "X (via Y)", low-message-count participants
  • Count summary: "X completed, Y engaged, Z new since last sync"

This report requires human review and confirmation before proceeding.
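A reconciliation report of this shape reduces to a set diff. A sketch under a simplified assumption that both sides map display name to message count:

```python
import re

def reconcile(session, output):
    """Diff live session data against existing output before any writes.
    Both args map display name -> message count (a simplified shape)."""
    new = sorted(set(session) - set(output))
    stale = sorted(set(output) - set(session))
    changed = sorted(n for n in set(session) & set(output)
                     if session[n] != output[n])
    # Flag relayed voices like "X (via Y)" and likely bounced sessions (< 3 messages)
    flagged = sorted(n for n in session
                     if re.search(r"\(via .+\)", n) or session[n] < 3)
    return {"new": new, "changed": changed, "stale": stale, "flagged": flagged}
```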

3. Participant validation

Automatically flag:

  • Participants in output data with no matching Harmonica session entry
  • Entries with "(via ...)" pattern — relayed views, not direct participants
  • Duplicate display names or very similar names (fuzzy match)
  • Participants with very few messages (< 3) — may be bounced sessions
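For the fuzzy-match check, the standard library's `difflib.SequenceMatcher` is one option; the 0.85 threshold below is an illustrative starting point, not a tuned value:

```python
from difflib import SequenceMatcher
from itertools import combinations

def flag_similar_names(names, threshold=0.85):
    """Flag display-name pairs that look like duplicates (fuzzy match)."""
    return [(a, b) for a, b in combinations(sorted(names), 2)
            if SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold]
```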

4. Derived metrics, not hardcoded

All aggregate metrics should be computed from the participant-to-problem mapping, not manually entered:

  • breadth = count of participants who raised a problem
  • depth = average messages per participant for that problem
  • maxBreadth = max breadth across all problems
  • maxDepth = max depth across all problems
  • urgencyScore = (breadth/maxBreadth) * 0.6 + (depth/maxDepth) * 0.4
  • severityLevel = derived from urgencyScore thresholds

If the source data changes (participant added/removed), everything recalculates automatically. No manual number entry.
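The derivations above can be computed in one pass from the participant-to-problem mapping. A sketch; the `severityLevel` thresholds are illustrative placeholders, since the issue doesn't specify them:

```python
def compute_metrics(problem_participants, messages):
    """Derive aggregate metrics from the participant-to-problem mapping.
    problem_participants: problem id -> set of participant names (non-empty)
    messages: (problem id, participant name) -> message count
    severityLevel thresholds below are illustrative placeholders."""
    breadth = {p: len(ps) for p, ps in problem_participants.items()}
    depth = {p: sum(messages.get((p, n), 0) for n in ps) / len(ps)
             for p, ps in problem_participants.items()}
    max_breadth, max_depth = max(breadth.values()), max(depth.values())
    metrics = {}
    for p in problem_participants:
        urgency = (breadth[p] / max_breadth) * 0.6 + (depth[p] / max_depth) * 0.4
        level = ("critical" if urgency >= 0.75 else
                 "high" if urgency >= 0.5 else
                 "moderate" if urgency >= 0.25 else "low")
        metrics[p] = {"breadth": breadth[p], "depth": depth[p],
                      "urgencyScore": round(urgency, 3), "severityLevel": level}
    return metrics
```

Because every number is derived, adding or removing a participant only changes the inputs; the rest recalculates.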

5. Separate extraction from propagation

The pipeline should have distinct phases:

  1. Extract — Pull data from Harmonica API
  2. Reconcile — Compare against existing output, show diff
  3. Confirm — Human reviews and approves
  4. Transform — Compute all derived metrics
  5. Propagate — Write to all output files (dashboard, wiki pages, indexes)
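One way to keep the phases distinct is to wire them as injected callables, so extraction and transformation stay independent of propagation. A sketch with hypothetical phase functions:

```python
def run_sync(extract, reconcile, confirm, transform, propagate):
    """Phase-ordering sketch; each phase is an injected callable."""
    data = extract()                  # 1. Extract: pull from the Harmonica API
    report = reconcile(data)          # 2. Reconcile: diff against existing output
    if not confirm(report):           # 3. Confirm: human reviews and approves
        return None                   #    abort before any file is written
    metrics = transform(data)         # 4. Transform: compute derived metrics
    return propagate(data, metrics)   # 5. Propagate: write dashboard + wiki files
```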

6. Template-driven output

Use harmonica-sync's existing Mustache template system to generate:

  • Problem wiki pages (with computed breadth, depth, urgency, participant lists)
  • Solution wiki pages (with mention counts, problem mappings)
  • Index pages (with ranked tables)
  • Dashboard data (JSON-like arrays that get embedded in HTML)

This ensures consistency — all files are generated from the same source data through templates, not manually edited.
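To illustrate the shape only (this is a variables-only stand-in, not harmonica-sync's actual Mustache engine, and the template fields are assumptions, not its real schema):

```python
import re

# Illustrative problem-page template; field names are assumptions.
PROBLEM_TEMPLATE = """\
# {{title}}

> [!info] Raised by {{breadth}} of {{total}} participants (urgency {{urgencyScore}})
"""

def render(template, context):
    """Substitute {{name}} placeholders from context (no Mustache sections)."""
    return re.sub(r"\{\{(\w+)\}\}",
                  lambda m: str(context.get(m.group(1), "")), template)
```

A full Mustache engine adds sections and partials for participant lists and ranked tables, but the principle is the same: every page is a pure function of the source data.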

Relationship to existing harmonica-sync

The current harmonica-sync syncs individual sessions to markdown files (1 session → 1 file). The gov/acc workflow is different:

  • 1 session → many files (1 session produces 50+ output files)
  • Incremental updates (new participants added to existing output, not new files)
  • Computed metrics (not just formatting session data, but deriving research metrics)
  • Multi-format output (markdown wiki pages + HTML dashboard data)

This could be:

  • A new --mode research flag on harmonica-sync
  • A separate harmonica-sync-research tool
  • A gov-acc-specific config template that uses harmonica-sync's template engine

Acceptance Criteria

  • Pipeline fetches live session data from Harmonica API
  • Reconciliation report shows new/changed/removed participants before any writes
  • Human confirmation required before propagating changes
  • All metrics (breadth, depth, urgency, severity) are computed, not hardcoded
  • Adding/removing a participant automatically recalculates all downstream numbers
  • Output covers both dashboard (HTML data arrays) and wiki (markdown pages)
  • Template-driven: output format is configurable, not baked into code
