Skip to content

feat(feeds): AI Incident Database feed — AI-threat-landscape vertical#157

Merged
rinjanianalytics merged 1 commit into
masterfrom
feat/ai-incident-database-feed
Jun 17, 2026
Merged

feat(feeds): AI Incident Database feed — AI-threat-landscape vertical#157
rinjanianalytics merged 1 commit into
masterfrom
feat/ai-incident-database-feed

Conversation

@rinjanianalytics

Copy link
Copy Markdown
Owner

What

Builds #2 of the free-source roadmap: the AI Incident Database (incidentdatabase.ai) feed — real-world AI harm/failure incidents, the live "what's actually going wrong with deployed AI" signal that complements the static MITRE ATLAS technique taxonomy.

Key design call: a dedicated ai_incidents table, NOT atlas_case_studies

ATLAS case studies are ~30 curated incidents mapped to AML techniques; AID is ~1500 raw incidents with no technique mapping. Dumping AID into atlas_case_studies would distort the ATLAS coverage heatmap (which counts case studies per technique) and conflate two distinct sources. AI incidents get their own domain table — mirroring how telco got network_elements/fraud_schemes and on-chain got wallets. (Migration 0067; iocs.type-style no-constraint, new table is IF NOT EXISTS.)

Source path: snapshot, not API

AID's GraphQL endpoint is origin-locked (Forbidden — restricted to web browsers), so the official programmatic/research path is the published MongoDB snapshot — a public R2 bucket of mongodump backups (~94 MB tar.bz2). Each archive carries a clean top-level incidents.csv. The connector stream-extracts only that file: peak memory stays ~CSV-sized (1 MB) even though the tar decompresses to hundreds of MB.

Deps added (worker): unbzip2-stream + tar-stream + csv-parse (+ type shims for the two untyped stream libs, one copy per tsconfig program).

Surface

  • Upsert on natural key incident_id; derived tags (always ai-incident + alleged developer/deployer slugs) so the AI vertical contributes a movers signal like IOC tags.
  • GET /v1/ai-incidents (filter: q, since, limit)
  • GET /v1/ai-incidents/stats → total + monthly timeline + top developers (the "incidents over time" trend)
  • Registered as aiid; scheduled weekly (Mon 02:15 UTC — AID adds a handful/week, snapshot is large).

Verification (boundary-tested, per the #125/#130 lesson)

Ran the full pipeline end-to-end against a local DB, not just unit tests:

  • download → bz2 → tar → incidents.csv → parse (1517 rows) → map → real drizzle upsert (the excluded.* set, date column, jsonb arrays) → stats
  • idempotent: re-upsert held the count at 1517
  • stats signal is real: top alleged developers deepfake-technology-developers (381), openai (158), google (88)
  • gateway tsc (strict gate) + dockerfile-deps guard + api tests (15-feed registry) all green

Deploy notes

  • Has a migration (0067_ai_incidents.sql) — unlike OFAC. Deploy must run db:apply after the image rebuild.
  • Dashboard surface (an AI-incidents page / overview trend) is the natural follow-up, like the telco/onchain pages.

🤖 Generated with Claude Code

@rinjanianalytics rinjanianalytics force-pushed the feat/ai-incident-database-feed branch from 22a97d2 to c208d7d Compare June 17, 2026 02:23
…ical

#2 of the free-source roadmap. Ingests real-world AI harm/failure incidents
from the AI Incident Database (incidentdatabase.ai) — the live "what's
actually going wrong with deployed AI" signal that complements the static
MITRE ATLAS technique taxonomy.

Dedicated `ai_incidents` table (migration 0067), deliberately NOT
atlas_case_studies: ATLAS case studies are ~30 curated incidents mapped to
AML techniques; AID is ~1500 raw incidents with no technique mapping —
mixing them would distort the ATLAS coverage view. AI incidents are their
own domain entity, mirroring telco (network_elements/fraud_schemes) and
on-chain (wallets).

Source: AID's GraphQL API. It gates non-browser callers ("restricted to web
browsers") but allows a same-site `Origin` + a browser `User-Agent` — the
data is openly licensed for research, the gate is anti-abuse. The connector
(apps/worker/src/feeds/ai-incidents.ts) pages incidents(pagination,sort) —
~8 small requests for the ~1.5k corpus — so it runs daily. The API is richer
than the CSV snapshot: alleged-party relations carry both an `entity_id`
slug (clean tags) and a human `name` (display).

Upsert on natural key incident_id; derived `tags` (always `ai-incident` +
developer/deployer slugs) so the AI vertical contributes a movers signal
like IOC tags. Read route GET /v1/ai-incidents + /ai-incidents/stats (total
+ monthly timeline + top developers — the "incidents over time" trend).
Registered as `aiid`; scheduled daily (02:15 UTC).

Verified end-to-end against the live API + local DB: paged 1525 incidents →
map → drizzle upsert (0 failed), names resolve ("OpenAI", "Google
DeepMind"); idempotent; gateway tsc + api tests (15-feed registry) green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@rinjanianalytics rinjanianalytics force-pushed the feat/ai-incident-database-feed branch from c208d7d to 444d99b Compare June 17, 2026 02:37
@rinjanianalytics rinjanianalytics merged commit d6cd931 into master Jun 17, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant