
Threat Model

Scope

safedata targets risks in training dataset construction and preprocessing pipelines for language-model-like systems.

Attacker taxonomy

  1. Script kiddie: low-skill actor injecting obvious toxic/PII strings or basic trigger tokens.

  2. Opportunistic insider: can alter labeling or introduce subtle policy-violating examples during annotation.

  3. Organized abuse actor: coordinates poisoning or evasion campaigns across multiple ingestion sources.

  4. Advanced persistent adversary: long-horizon attacker with infrastructure access and the ability to stage multi-step data integrity attacks.

  5. Nation-state-grade actor: can compromise supply chain or storage layers and execute stealthy poisoning with domain expertise.

Attack surfaces

  • Collection: web crawlers, dataset marketplaces, user uploads, scraped forums.
  • Labeling: annotator interfaces, weak supervision pipelines, pseudo-labeling systems.
  • Storage and transport: object stores, dataset versioning systems, ETL queues, feature stores.

Attack types

Availability attacks

Goal: degrade dataset utility by flooding ingestion with low-quality or adversarial noise. Examples: duplicate spam, malformed payloads, synthetic nonsense bursts.
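Duplicate-spam floods are often the easiest availability attack to catch, because many copies collapse onto one fingerprint after normalization. A minimal sketch in Python (assuming a text pipeline; the function name and threshold are illustrative, not part of safedata's API):

```python
import hashlib
from collections import Counter

def near_exact_duplicates(texts, threshold=3):
    """Flag texts whose normalized form repeats `threshold` or more times.

    A coarse availability filter: duplicate spam collapses onto a single
    hash once case and whitespace are normalized away.
    """
    def fingerprint(text):
        normalized = " ".join(text.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    counts = Counter(fingerprint(t) for t in texts)
    return [t for t in texts if counts[fingerprint(t)] >= threshold]

batch = ["Buy now!!", "buy   NOW!!", "Buy now!!", "a real sentence"]
print(near_exact_duplicates(batch))  # the three "buy now" variants
```

Exact hashing only catches near-identical copies; bursts of paraphrased or synthetic nonsense need fuzzier techniques such as MinHash or embedding clustering.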

Integrity attacks

Goal: alter model behavior through poisoning/backdoors. Examples: trigger-based backdoors, label flips, targeted representational manipulation.
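One common signature of trigger-based backdoors is a rare token that co-occurs almost exclusively with a single label. A heuristic sketch (thresholds and the function name are illustrative assumptions, not safedata's actual detector):

```python
from collections import Counter, defaultdict

def suspicious_trigger_tokens(examples, min_count=3, purity=0.9):
    """Flag tokens that are rare overall yet almost always paired with
    one label -- a simple heuristic for trigger-based backdoors.

    `examples` is a list of (text, label) pairs.
    """
    token_labels = defaultdict(Counter)
    for text, label in examples:
        for token in set(text.lower().split()):
            token_labels[token][label] += 1

    flagged = {}
    for token, labels in token_labels.items():
        total = sum(labels.values())
        top_label, top_count = labels.most_common(1)[0]
        if total >= min_count and top_count / total >= purity:
            flagged[token] = top_label
    return flagged
```

On a poisoned sentiment batch where the nonsense token "cf" is injected into positive examples, the scan surfaces it while leaving ordinary vocabulary alone. Purely statistical filters like this miss label flips on clean text and representational attacks, which need model-side defenses.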

Privacy attacks

Goal: expose memorized personal/sensitive information. Examples: insertion of direct identifiers, contextual PII snippets, credential leaks.
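Direct identifiers are the pattern-matchable end of this spectrum. A minimal regex scan for a few identifier classes (the patterns are deliberately simplified; a production scanner would use a vetted PII library rather than these hand-rolled expressions):

```python
import re

# Illustrative patterns only; real PII detection needs far broader coverage.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def scan_pii(text):
    """Return {category: [matches]} for direct identifiers found in `text`."""
    hits = {}
    for name, pattern in PII_PATTERNS.items():
        found = pattern.findall(text)
        if found:
            hits[name] = found
    return hits
```

Contextual PII ("my neighbor on Elm Street, the dentist") carries no regex-matchable token at all, which is why calibration-driven classifiers sit alongside pattern rules.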

Defensive assumptions

  • Defends against: moderate-scale text-domain attacks discoverable via pattern matching, clustering, and calibration-driven classifiers.
  • Partially defends against: adaptive attackers reusing simple obfuscation (character swaps, homoglyphs).
  • Does not defend against: end-to-end compromise of trusted compute, stealth poisoning in latent spaces without detectable activation anomalies, or high-confidence factual misinformation beyond classifier patterns.
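The "partially defends" tier rests on canonicalizing text before matching, so that character swaps and homoglyphs fold back onto the patterns above. A sketch using Unicode NFKC plus a small hand-picked homoglyph map (the map is illustrative; attackers draw from a much larger confusables set):

```python
import unicodedata

# Tiny illustrative homoglyph map (Cyrillic lookalikes -> Latin).
HOMOGLYPHS = str.maketrans({
    "\u0430": "a",  # Cyrillic a
    "\u0435": "e",  # Cyrillic e
    "\u043e": "o",  # Cyrillic o
    "\u0456": "i",  # Ukrainian i
    "\u0455": "s",  # Cyrillic dze
})

def normalize_for_matching(text):
    """Canonicalize text so simple obfuscation (compatibility forms,
    zero-width splits, common homoglyphs) is undone before matching."""
    text = unicodedata.normalize("NFKC", text)
    text = text.translate(HOMOGLYPHS)
    # Zero-width characters are often used to split trigger strings.
    return "".join(ch for ch in text if ch not in "\u200b\u200c\u200d\ufeff")
```

This is exactly why the defense is only partial: an adaptive attacker who moves beyond table-driven substitutions (e.g. semantic paraphrase) leaves nothing for normalization to undo.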

Alignment connection

Training data is part of alignment. If harmful patterns survive data curation, downstream alignment objectives become underdetermined or contradictory.

  • Toxic or hateful data can conflict with harmlessness objectives.
  • PII leakage conflicts with privacy and governance constraints.
  • Biased data can violate fairness and equal-treatment principles.
  • Poisoned data can create hidden behavioral policies outside intended constitutions.

safedata contributes a measurable front-end control layer, but it is one component in a broader alignment and governance stack.