Attribution Cascade: Small Edits Causing Massive False Human Attribution #589

@vaibhavmeena1

Problem Statement

Editing 1-2 lines within AI-generated code blocks results in 10+ lines being incorrectly attributed as human-authored during normal edit-commit workflows.

Impact

  • Attribution amplification: 1 line edited → 10+ lines marked human
  • Inflated human contribution metrics: Small fixes to AI code disproportionately inflate human credit
  • Structured text most affected: Markdown lists, READMEs, and repeated patterns show largest cascades
  • Unreliable metrics: Attribution data cannot be trusted for measuring AI vs human contributions

Reproduction Steps

  1. Generate multi-line AI content (e.g., markdown list, function with parameters)
  2. Commit the AI-generated content
  3. Edit 1-2 lines within the committed block (e.g., fix typo, change parameter value)
  4. Check git-ai status --json

Expected: ~2 human additions
Observed: 11+ human additions

Reproduction Example

# After AI generates and commits 50 lines
❯ git-ai status --json
{"stats": {"human_additions": 0, "ai_additions": 50, "ai_accepted": 50}}

# After editing 1-2 lines in that block
❯ git-ai status --json
{"stats": {"human_additions": 11, "ai_additions": 39, "ai_accepted": 39}}

Result: at least 9 AI-generated lines are falsely attributed to the human author after a 1-2 line edit.

Root Cause Analysis

REPLACE operations are unconditionally split into DELETE+INSERT pairs, losing the semantic connection that these are edits of existing content.

Level 1: Line-Level Diff (Correct)

src/authorship/imara_diff_utils.rs (lines 293-299):

if old_hunk_len > 0 && new_hunk_len > 0 {
    // Replace: both old and new have content
    ops.push(DiffOp::Replace {
        old_index: hunk_old_start,
        old_len: old_hunk_len,
        new_index: hunk_new_start,
        new_len: new_hunk_len,
    });
}

✅ Diff engine correctly identifies edits as REPLACE operations.

Level 2: Byte-Level Conversion (Attribution Breaks Here)

src/authorship/attribution_tracker.rs - append_range_diffs() (lines 1507-1514):

fn append_range_diffs(...) {
    // ...
    if !old_slice.is_empty() {
        diffs.push(ByteDiff::new(ByteDiffOp::Delete, old_slice.as_bytes()));
    }
    if !new_slice.is_empty() {
        diffs.push(ByteDiff::new(ByteDiffOp::Insert, new_slice.as_bytes()));
    }
}

❌ REPLACE operations unconditionally split into DELETE + INSERT.

Level 2b: Token-Level Processing

src/authorship/attribution_tracker.rs - build_token_aligned_diffs() (lines 1656-1680):

DiffOp::Replace { old_index, old_len, new_index, new_len } => {
    if old_len > 0 {
        diffs.push(ByteDiff::new(
            ByteDiffOp::Delete,
            &old_content.as_bytes()[old_start_pos..old_end_pos],
        ));
    }
    
    if new_len > 0 {
        diffs.push(ByteDiff::new(
            ByteDiffOp::Insert,
            &new_content.as_bytes()[new_start_pos..new_end_pos],
        ));
        substantive_ranges.push((new_start_pos, new_end_pos));  // ← Marked as NEW content
    }
}

❌ INSERT ranges marked as substantive_new_ranges, treated as brand new content.

Level 3: Attribution Assignment (Cascade Trigger)

src/authorship/attribution_tracker.rs - transform_attributions() (lines 1015-1020):

let (author_id, attribution_ts) = if contains_newline {
    (current_author.to_string(), ts)
} else if is_substantive_insert {
    (current_author.to_string(), ts)  // ← current_author = Human
} else {
    // ... formatting fallback
    (current_author.to_string(), ts)
};

❌ Every INSERT from a split REPLACE attributed to current_author (human during manual edits).

Observed Behavior

Edit 1-2 lines in a 10-line AI block:

1. Myers diff may widen the Replace hunk in structured text, causing multiple lines to appear modified.

2. Byte-level conversion splits it
   → ByteDiffOp::Delete (10 lines) + ByteDiffOp::Insert (10 lines)

3. Attribution sees INSERT of 10 lines
   → Marked as substantive_new_ranges → All 10 lines attributed to human

Result: 1-2 line edit → 10 lines marked human
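The cascade above can be reproduced with a small model. This is a simplified sketch, not the actual git-ai code: it works on lines rather than byte ranges, assumes the hunk has the same line count on both sides, and the names `Author`, `ReplaceHunk`, `attribute_split_replace`, and `human_lines` are hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Author {
    Ai,
    Human,
}

/// Hypothetical, line-based stand-in for the new side of a Replace hunk.
struct ReplaceHunk {
    new_index: usize,
    new_len: usize,
}

/// Models the reported behaviour: after the Delete+Insert split, every
/// inserted line is attributed to `current`, whether or not it changed.
/// Assumes the hunk has the same line count on both sides.
fn attribute_split_replace(attrs: &[Author], hunk: &ReplaceHunk, current: Author) -> Vec<Author> {
    let mut out = attrs.to_vec();
    for a in out.iter_mut().skip(hunk.new_index).take(hunk.new_len) {
        *a = current;
    }
    out
}

/// Counts how many lines are currently attributed to the human author.
fn human_lines(attrs: &[Author]) -> usize {
    attrs.iter().filter(|a| **a == Author::Human).count()
}
```

With a 10-line AI block, a hunk that the diff widened to all 10 lines yields 10 human lines, while a hunk covering only the edited line yields 1. The attribution error scales with the hunk width, not with the size of the edit.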

Cascade Amplifiers

Myers diff loses alignment with:

  • Similar/repeated lines (lists, bullets, patterns)
  • Structural edits (adding/removing list items)
  • Indentation changes
  • Multi-line constructs (functions, classes)

Triggering Patterns

Markdown Lists

- Item 1
- Item 2  
- Item 3  # Edit this line
- Item 4
- Item 5

# Result: All 5 items re-attributed to human

Function Parameters

def process_data(
    input_file,
    output_file,
    verbose=False,
    timeout=30  # Change this to 60
):

# Result: Entire function signature → human

README Sections

## Features
- Fast performance  # Edit this
- Easy to use
- Secure by default

# Result: Entire features list → human

Evidence

✅ REPLACE operations correctly identified by imara-diff
✅ Unconditional REPLACE → DELETE+INSERT split in byte-level processing
✅ INSERT ranges marked as substantive_new_ranges
✅ No similarity check between deleted and inserted content
✅ All INSERTs attributed to current_author during manual edits
✅ REPLACE of N lines = all N lines attributed to human, including lines the edit did not change
✅ Structural edits in markdown/lists produce larger REPLACE hunks

Root Design Issue

The system assumes INSERT = newly authored content, but INSERT operations include:

  1. Truly new content (should be human-attributed)
  2. Edited existing AI content (should preserve AI attribution) ← no mechanism to distinguish
  3. Moved content (handled by move detection)

Potential Fix Directions

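The system currently lacks a mechanism to distinguish edited existing content from truly new insertions during REPLACE handling. Any fix needs to compare the deleted and inserted sides of a REPLACE (for example, with a per-line similarity check) so that lines the edit did not change keep their existing attribution.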
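One possible direction, sketched under assumptions: inside a Replace hunk, align the old and new lines with a longest-common-subsequence pass, let matched lines inherit their previous attribution, and attribute only unmatched lines to the current author. This is not the git-ai implementation. It is line-based rather than byte-based, uses exact line equality instead of a fuzzy similarity score, and `Author` and `reattribute_replace` are hypothetical names.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Author {
    Ai,
    Human,
}

/// Re-attributes the new side of a Replace hunk (hypothetical helper).
/// `old_lines` and `old_attrs` describe the old side of the hunk;
/// `new_lines` is the new side. Lines that survive unchanged, according
/// to an LCS alignment, keep their old attribution. All other lines are
/// attributed to `current`.
fn reattribute_replace(
    old_lines: &[&str],
    old_attrs: &[Author],
    new_lines: &[&str],
    current: Author,
) -> Vec<Author> {
    assert_eq!(old_lines.len(), old_attrs.len());
    let (n, m) = (old_lines.len(), new_lines.len());

    // lcs[i][j] = LCS length of old_lines[i..] and new_lines[j..]
    let mut lcs = vec![vec![0usize; m + 1]; n + 1];
    for i in (0..n).rev() {
        for j in (0..m).rev() {
            lcs[i][j] = if old_lines[i] == new_lines[j] {
                lcs[i + 1][j + 1] + 1
            } else {
                lcs[i + 1][j].max(lcs[i][j + 1])
            };
        }
    }

    // Walk the table: matched lines inherit, unmatched lines stay `current`.
    let mut out = vec![current; m];
    let (mut i, mut j) = (0, 0);
    while i < n && j < m {
        if old_lines[i] == new_lines[j] {
            out[j] = old_attrs[i];
            i += 1;
            j += 1;
        } else if lcs[i + 1][j] >= lcs[i][j + 1] {
            i += 1; // old line was deleted or changed
        } else {
            j += 1; // new line has no counterpart on the old side
        }
    }
    out
}
```

Applied to the five-item markdown list from the triggering patterns, with only the third item edited, this sketch attributes one line to the human and leaves the other four as AI, which matches the expected count in the reproduction steps. The LCS table is quadratic in hunk size, so a real fix would likely need a cap on hunk length or a cheaper alignment.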

Workaround

None available. Editing AI-generated blocks inherently triggers the cascade.

Environment

  • Version: git-ai v1.1.4
  • Platform: macOS
  • Affects: All manual edits to AI-generated multi-line blocks
  • Severity: High - attribution metrics unreliable
  • Frequency: Reproducible on every edit

Analysis based on: Code review of src/authorship/attribution_tracker.rs and src/authorship/imara_diff_utils.rs confirming unconditional REPLACE → DELETE+INSERT splitting.
