feat(feeds): AI Incident Database feed — AI-threat-landscape vertical#157
Merged
…ical #2 of the free-source roadmap. Ingests real-world AI harm/failure incidents from the AI Incident Database (incidentdatabase.ai): the live "what's actually going wrong with deployed AI" signal that complements the static MITRE ATLAS technique taxonomy.

Dedicated `ai_incidents` table (migration 0067), deliberately NOT `atlas_case_studies`: ATLAS case studies are ~30 curated incidents mapped to AML techniques; AID is ~1500 raw incidents with no technique mapping, so mixing them would distort the ATLAS coverage view. AI incidents are their own domain entity, mirroring telco (`network_elements`/`fraud_schemes`) and on-chain (`wallets`).

Source: AID's GraphQL API. It gates non-browser callers ("restricted to web browsers") but allows a same-site `Origin` + a browser `User-Agent`; the data is openly licensed for research, and the gate is anti-abuse. The connector (apps/worker/src/feeds/ai-incidents.ts) pages incidents(pagination,sort), ~8 small requests for the ~1.5k corpus, so it runs daily. The API is richer than the CSV snapshot: alleged-party relations carry both an `entity_id` slug (clean tags) and a human `name` (display).

Upsert on natural key `incident_id`; derived `tags` (always `ai-incident` + developer/deployer slugs) so the AI vertical contributes a movers signal like IOC tags. Read route GET /v1/ai-incidents + /ai-incidents/stats (total + monthly timeline + top developers: the "incidents over time" trend). Registered as `aiid`; scheduled daily (02:15 UTC).

Verified end-to-end against the live API + local DB: paged 1525 incidents → map → drizzle upsert (0 failed), names resolve ("OpenAI", "Google DeepMind"); idempotent; gateway tsc + api tests (15-feed registry) green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
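The tag derivation described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the connector's actual code: the `RawIncident` and `AllegedParty` shapes, the `developers`/`deployers` field names, and the `deriveTags`/`toRow` helpers are all hypothetical. Only `incident_id`, `entity_id`, `name`, and the `ai-incident` tag come from the description above.

```typescript
// Hypothetical shapes for illustration; the real GraphQL field names may differ.
interface AllegedParty {
  entity_id: string; // slug used for tags, e.g. "openai"
  name: string;      // human-readable display name, e.g. "OpenAI"
}

interface RawIncident {
  incident_id: number;
  title: string;
  date: string; // ISO date string, e.g. "2024-01-31"
  developers: AllegedParty[];
  deployers: AllegedParty[];
}

interface IncidentRow {
  incidentId: number; // natural key used for the upsert
  title: string;
  date: string;
  developerNames: string[];
  deployerNames: string[];
  tags: string[];
}

// Always emits "ai-incident" first, then each distinct developer/deployer slug
// in first-seen order. Empty slugs are skipped.
function deriveTags(incident: RawIncident): string[] {
  const tags = new Set<string>(["ai-incident"]);
  const parties = incident.developers.concat(incident.deployers);
  for (const party of parties) {
    const slug = party.entity_id.trim().toLowerCase();
    if (slug.length > 0) {
      tags.add(slug);
    }
  }
  return Array.from(tags);
}

// Maps one raw incident to the row shape that would be upserted.
function toRow(incident: RawIncident): IncidentRow {
  return {
    incidentId: incident.incident_id,
    title: incident.title,
    date: incident.date,
    developerNames: incident.developers.map((p) => p.name),
    deployerNames: incident.deployers.map((p) => p.name),
    tags: deriveTags(incident),
  };
}
```

Keeping the slug for tags and the name for display means a party that appears as both developer and deployer contributes a single tag while still showing its readable name in both lists.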
What
Builds #2 of the free-source roadmap: the AI Incident Database (incidentdatabase.ai) feed — real-world AI harm/failure incidents, the live "what's actually going wrong with deployed AI" signal that complements the static MITRE ATLAS technique taxonomy.
Key design call: a dedicated `ai_incidents` table, NOT `atlas_case_studies`

ATLAS case studies are ~30 curated incidents mapped to AML techniques; AID is ~1500 raw incidents with no technique mapping. Dumping AID into `atlas_case_studies` would distort the ATLAS coverage heatmap (which counts case studies per technique) and conflate two distinct sources. AI incidents get their own domain table, mirroring how telco got `network_elements`/`fraud_schemes` and on-chain got `wallets`. (Migration `0067`; `iocs.type`-style no-constraint, new table is `IF NOT EXISTS`.)

Source path: snapshot, not API

AID's GraphQL endpoint is origin-locked (`Forbidden — restricted to web browsers`), so the official programmatic/research path is the published MongoDB snapshot: a public R2 bucket of `mongodump` backups (~94 MB `tar.bz2`). Each archive carries a clean top-level `incidents.csv`. The connector stream-extracts only that file: peak memory stays ~CSV-sized (1 MB) even though the tar decompresses to hundreds of MB.

Deps added (worker): `unbzip2-stream` + `tar-stream` + `csv-parse` (+ type shims for the two untyped stream libs, one copy per tsconfig program).

Surface

- Upsert on natural key `incident_id`; derived `tags` (always `ai-incident` + alleged developer/deployer slugs) so the AI vertical contributes a movers signal like IOC tags.
- `GET /v1/ai-incidents` (filter: q, since, limit)
- `GET /v1/ai-incidents/stats` → total + monthly timeline + top developers (the "incidents over time" trend)
- Registered as `aiid`; scheduled weekly (Mon 02:15 UTC; AID adds a handful/week, snapshot is large).

Verification (boundary-tested, per the #125/#130 lesson)

Ran the full pipeline end-to-end against a local DB, not just unit tests:

- `incidents.csv` → parse (1517 rows) → map → real drizzle upsert (the `excluded.*` set, `date` column, jsonb arrays) → stats
- `deepfake-technology-developers` (381), `openai` (158), `google` (88)
- `tsc` (strict gate) + dockerfile-deps guard + api tests (15-feed registry) all green

Deploy notes

- Migration (`0067_ai_incidents.sql`), unlike OFAC. Deploy must run `db:apply` after the image rebuild.

🤖 Generated with Claude Code
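For reference, the shape of the `/v1/ai-incidents/stats` response (total, monthly timeline, top developers) can be illustrated with an in-memory aggregation. This is a sketch under assumptions: the `IncidentLike` and `Stats` types, the `computeStats` name, and the `topN` parameter are hypothetical, and the real route may well aggregate in SQL instead.

```typescript
// Hypothetical input shape: one entry per incident, with developer slugs.
interface IncidentLike {
  date: string;         // ISO date string, e.g. "2023-01-15"
  developers: string[]; // developer slugs, e.g. ["openai"]
}

interface Stats {
  total: number;
  timeline: { month: string; count: number }[];     // ascending by month
  topDevelopers: { slug: string; count: number }[]; // descending by count
}

function computeStats(incidents: IncidentLike[], topN: number = 10): Stats {
  const byMonth = new Map<string, number>();
  const byDeveloper = new Map<string, number>();

  for (const incident of incidents) {
    // "YYYY-MM-DD" -> "YYYY-MM" bucket for the monthly timeline.
    const month = incident.date.slice(0, 7);
    byMonth.set(month, (byMonth.get(month) || 0) + 1);

    // Count each developer at most once per incident.
    const distinct = Array.from(new Set(incident.developers));
    for (const slug of distinct) {
      byDeveloper.set(slug, (byDeveloper.get(slug) || 0) + 1);
    }
  }

  const timeline = Array.from(byMonth.entries())
    .map(([month, count]) => ({ month, count }))
    .sort((a, b) => (a.month < b.month ? -1 : a.month > b.month ? 1 : 0));

  // Ties on count are broken alphabetically so the output is deterministic.
  const topDevelopers = Array.from(byDeveloper.entries())
    .map(([slug, count]) => ({ slug, count }))
    .sort((a, b) => b.count - a.count || (a.slug < b.slug ? -1 : a.slug > b.slug ? 1 : 0))
    .slice(0, topN);

  return { total: incidents.length, timeline, topDevelopers };
}
```

Months with no incidents are simply absent from `timeline` in this sketch; a chart consumer would need to fill those gaps with zeros if a continuous axis is wanted.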