Description
Attribution Cascade: Small Edits Causing Massive False Human Attribution
Problem Statement
Editing 1-2 lines within AI-generated code blocks results in 10+ lines being incorrectly attributed as human-authored during normal edit-commit workflows.
Impact
- Attribution amplification: 1 line edited → 10+ lines marked human
- Inflated human contribution metrics: Small fixes to AI code disproportionately inflate human credit
- Structured text most affected: Markdown lists, READMEs, and repeated patterns show largest cascades
- Unreliable metrics: Attribution data cannot be trusted for measuring AI vs human contributions
Reproduction Steps
- Generate multi-line AI content (e.g., markdown list, function with parameters)
- Commit the AI-generated content
- Edit 1-2 lines within the committed block (e.g., fix typo, change parameter value)
- Check `git-ai status --json`

Expected: ~2 human additions
Observed: 11+ human additions
Reproduction Example
```shell
# After AI generates and commits 50 lines
❯ git-ai status --json
{"stats": {"human_additions": 0, "ai_additions": 50, "ai_accepted": 50}}

# After editing 1-2 lines in that block
❯ git-ai status --json
{"stats": {"human_additions": 11, "ai_additions": 39, "ai_accepted": 39}}
```

Result: 9+ AI-generated lines falsely attributed to human from a 1-2 line edit.
Root Cause Analysis
REPLACE operations are unconditionally split into DELETE+INSERT pairs, losing the semantic connection that these are edits of existing content.
Level 1: Line-Level Diff (Correct)
src/authorship/imara_diff_utils.rs (lines 293-299):
```rust
if old_hunk_len > 0 && new_hunk_len > 0 {
    // Replace: both old and new have content
    ops.push(DiffOp::Replace {
        old_index: hunk_old_start,
        old_len: old_hunk_len,
        new_index: hunk_new_start,
        new_len: new_hunk_len,
    });
}
```

✅ The diff engine correctly identifies edits as REPLACE operations.
Level 2: Byte-Level Conversion (Attribution Breaks Here)
src/authorship/attribution_tracker.rs - append_range_diffs() (lines 1507-1514):
```rust
fn append_range_diffs(...) {
    // ...
    if !old_slice.is_empty() {
        diffs.push(ByteDiff::new(ByteDiffOp::Delete, old_slice.as_bytes()));
    }
    if !new_slice.is_empty() {
        diffs.push(ByteDiff::new(ByteDiffOp::Insert, new_slice.as_bytes()));
    }
}
```

❌ REPLACE operations are unconditionally split into DELETE + INSERT.
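The split can be illustrated with a self-contained sketch. The `ByteDiff`/`ByteDiffOp` types below are hypothetical stand-ins mirroring the reported shape, not git-ai's actual definitions:

```rust
// Minimal sketch of the unconditional REPLACE -> DELETE+INSERT split.
// Types are hypothetical stand-ins for git-ai's ByteDiff/ByteDiffOp.

#[derive(Debug, PartialEq)]
enum ByteDiffOp {
    Delete,
    Insert,
}

#[derive(Debug, PartialEq)]
struct ByteDiff<'a> {
    op: ByteDiffOp,
    bytes: &'a [u8],
}

// A REPLACE of `old` by `new` is emitted as two unrelated ops; the
// information that `new` is an *edit of* `old` is discarded here.
fn append_range_diffs<'a>(diffs: &mut Vec<ByteDiff<'a>>, old: &'a str, new: &'a str) {
    if !old.is_empty() {
        diffs.push(ByteDiff { op: ByteDiffOp::Delete, bytes: old.as_bytes() });
    }
    if !new.is_empty() {
        diffs.push(ByteDiff { op: ByteDiffOp::Insert, bytes: new.as_bytes() });
    }
}

fn main() {
    let mut diffs = Vec::new();
    // A one-character edit inside a multi-line block arrives as a
    // single REPLACE hunk covering all of it...
    append_range_diffs(&mut diffs, "- Item 1\n- Item 2\n", "- Item 1\n- Item 2!\n");
    // ...and leaves as a full DELETE plus a full INSERT.
    assert_eq!(diffs.len(), 2);
    assert_eq!(diffs[0].op, ByteDiffOp::Delete);
    assert_eq!(diffs[1].op, ByteDiffOp::Insert);
}
```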
Level 2b: Token-Level Processing
src/authorship/attribution_tracker.rs - build_token_aligned_diffs() (lines 1656-1680):
```rust
DiffOp::Replace { old_index, old_len, new_index, new_len } => {
    if old_len > 0 {
        diffs.push(ByteDiff::new(
            ByteDiffOp::Delete,
            &old_content.as_bytes()[old_start_pos..old_end_pos],
        ));
    }
    if new_len > 0 {
        diffs.push(ByteDiff::new(
            ByteDiffOp::Insert,
            &new_content.as_bytes()[new_start_pos..new_end_pos],
        ));
        substantive_ranges.push((new_start_pos, new_end_pos)); // ← Marked as NEW content
    }
}
```

❌ INSERT ranges are marked as substantive_new_ranges and treated as brand-new content.
Level 3: Attribution Assignment (Cascade Trigger)
src/authorship/attribution_tracker.rs - transform_attributions() (lines 1015-1020):
```rust
let (author_id, attribution_ts) = if contains_newline {
    (current_author.to_string(), ts)
} else if is_substantive_insert {
    (current_author.to_string(), ts) // ← current_author = Human
} else {
    // ... formatting fallback
    (current_author.to_string(), ts)
};
```

❌ Every INSERT from a split REPLACE is attributed to current_author (the human during manual edits).
Observed Behavior
Edit 1-2 lines in a 10-line AI block:
1. Myers diff may widen the Replace hunk in structured text, causing multiple lines to appear modified.
2. Byte-level conversion splits it → ByteDiffOp::Delete (10 lines) + ByteDiffOp::Insert (10 lines)
3. Attribution sees an INSERT of 10 lines → marked as substantive_new_ranges → all 10 lines attributed to the human

Result: a 1-2 line edit → 10 lines marked human
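The steps above can be simulated end to end with a toy per-line author table. This is a hypothetical simplification in Rust (stdlib only), not git-ai's actual attribution pass:

```rust
// Toy model of the cascade: a per-line author table updated from
// byte-diff-style ops, the way the report describes.

#[derive(Clone, Debug, PartialEq)]
enum Author {
    Ai,
    Human,
}

// Apply a split REPLACE (delete the old block, insert the new block):
// every inserted line is credited to `current_author`, with no check
// of how similar the new content is to what it replaced.
fn apply_split_replace(
    authors: &mut Vec<Author>,
    start: usize,
    old_len: usize,
    new_len: usize,
    current_author: Author,
) {
    let _ = authors.splice(
        start..start + old_len,
        std::iter::repeat(current_author).take(new_len),
    );
}

fn main() {
    // 10 lines, all AI-authored after the initial commit.
    let mut authors = vec![Author::Ai; 10];
    // A human edits 1 line, but hunk widening turns it into a
    // REPLACE of all 10 lines, split into DELETE(10) + INSERT(10).
    apply_split_replace(&mut authors, 0, 10, 10, Author::Human);
    let human = authors.iter().filter(|a| **a == Author::Human).count();
    println!("human-attributed lines: {human}"); // all 10, from a 1-line edit
    assert_eq!(human, 10);
}
```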
Cascade Amplifiers
Myers diff loses alignment with:
- Similar/repeated lines (lists, bullets, patterns)
- Structural edits (adding/removing list items)
- Indentation changes
- Multi-line constructs (functions, classes)
Triggering Patterns
Markdown Lists
```markdown
- Item 1
- Item 2
- Item 3   # Edit this line
- Item 4
- Item 5
# Result: All 5 items re-attributed to human
```

Function Parameters
```python
def process_data(
    input_file,
    output_file,
    verbose=False,
    timeout=30  # Change this to 60
):
    ...
# Result: Entire function signature → human
```

README Sections
```markdown
## Features
- Fast performance   # Edit this
- Easy to use
- Secure by default
# Result: Entire features list → human
```

Evidence
✅ REPLACE operations correctly identified by imara-diff
✅ Unconditional REPLACE → DELETE+INSERT split in byte-level processing
✅ INSERT ranges marked as substantive_new_ranges
✅ No similarity check between deleted and inserted content
✅ All INSERTs attributed to current_author during manual edits
✅ REPLACE of N lines = N lines falsely attributed
✅ Structural edits in markdown/lists produce larger REPLACE hunks
Root Design Issue
The system assumes INSERT = newly authored content, but INSERT operations include:
- Truly new content (should be human-attributed)
- Edited existing AI content (should preserve AI attribution) ← no mechanism to distinguish
- Moved content (handled by move detection)
Potential Fix Directions
The system currently lacks a mechanism to distinguish edited existing content from truly new insertions during REPLACE handling. Possible directions: a similarity check between the deleted and inserted slices before splitting a REPLACE, or character/token-level alignment within REPLACE hunks so unchanged spans keep their original attribution.
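One direction, sketched under assumptions: gate the DELETE+INSERT split on a similarity check between the old and new slices, so a mostly-unchanged REPLACE is treated as an edit of existing content. The threshold and function names below are hypothetical, not git-ai's API:

```rust
// Sketch of a similarity gate for REPLACE handling: if the old and
// new slices share most of their lines, the hunk is an edit of
// existing content rather than brand-new authorship.

use std::collections::HashSet;

// Fraction of old lines that survive unchanged into the new slice.
fn line_overlap(old: &str, new: &str) -> f64 {
    let old_lines: Vec<&str> = old.lines().collect();
    if old_lines.is_empty() {
        return 0.0;
    }
    let new_lines: HashSet<&str> = new.lines().collect();
    let kept = old_lines.iter().filter(|l| new_lines.contains(*l)).count();
    kept as f64 / old_lines.len() as f64
}

// Decide whether a REPLACE should preserve the prior attribution.
fn is_edit_of_existing(old: &str, new: &str) -> bool {
    line_overlap(old, new) >= 0.5 // hypothetical threshold
}

fn main() {
    let old = "- Item 1\n- Item 2\n- Item 3\n- Item 4\n- Item 5\n";
    let new = "- Item 1\n- Item 2\n- Item 3 (fixed)\n- Item 4\n- Item 5\n";
    // 4 of 5 lines unchanged: this REPLACE is an edit, not new content.
    assert!(is_edit_of_existing(old, new));
    // A wholesale rewrite still counts as new authorship.
    assert!(!is_edit_of_existing(old, "Completely different text\n"));
}
```

A character- or token-level alignment inside the REPLACE hunk would be finer-grained, but even a coarse line-overlap gate separates "edited existing content" from "wholesale rewrite".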
Workaround
None available. Editing AI-generated blocks inherently triggers the cascade.
Environment
- Version: git-ai v1.1.4
- Platform: macOS
- Affects: All manual edits to AI-generated multi-line blocks
- Severity: High - attribution metrics unreliable
- Frequency: Reproducible on every edit
Analysis based on: Code review of src/authorship/attribution_tracker.rs and src/authorship/imara_diff_utils.rs confirming unconditional REPLACE → DELETE+INSERT splitting.