
feat: tree-sitter scoring robustness and edge cases#235

Open
dive2tech wants to merge 2 commits into entrius:test from dive2tech:feat/tree-sitter-robustness-edge-cases

Conversation

@dive2tech

Summary

Improves robustness and edge-case handling in the tree-sitter token scoring pipeline so that invalid or malformed content is handled predictably instead of crashing validators.

Changes

  • MAX_AST_DEPTH — New constant to cap recursion when walking the AST and avoid stack overflow on very deep or pathological input.

  • Content normalization and safe encoding/decoding — Helpers handle None, non-str, empty/whitespace, and invalid UTF-8 (using errors='replace') without raising:

    • _normalize_content — Returns None for invalid or empty content; otherwise stripped string.
    • _safe_encode_content — UTF-8 encode with replace for invalid codepoints.
    • _safe_decode_node_text — Decode node bytes with replace for malformed UTF-8.
    • _safe_content_byte_size — Byte length for size checks; 0 on error or non-str.
  • parse_code — Uses normalization and safe encode; returns None for invalid or unsupported content instead of raising.

  • collect_node_signatures — Depth-limited walk (default MAX_AST_DEPTH); decodes node text with _safe_decode_node_text so bad UTF-8 in source does not crash.

  • score_tree_diff — Normalizes old/new content before parsing so empty or whitespace-only content is treated consistently.

  • calculate_token_score_from_file_changes — Normalizes content from FileContentPair; if new content is empty or whitespace-only after normalization, skips with skipped-empty (score 0); uses _safe_content_byte_size for the size check so invalid UTF-8 does not raise.

Tests

  • tests/validator/test_tree_sitter_scoring.py (35 tests) — Covers:
    • _normalize_content, _safe_encode_content, _safe_decode_node_text, _safe_content_byte_size
    • parse_code (None, empty, non-str, invalid UTF-8, unknown language)
    • collect_node_signatures (depth limit)
    • score_tree_diff (None/empty/invalid UTF-8)
    • calculate_token_score_from_file_changes (empty and whitespace-only new content → skipped-empty)

Testing

  • pytest tests/validator/test_tree_sitter_scoring.py
  • pytest tests/validator/test_token_scoring_integration.py

dive2tech and others added 2 commits February 25, 2026 12:44
- Add MAX_AST_DEPTH constant to limit recursion and avoid stack overflow
- _normalize_content, _safe_encode_content, _safe_decode_node_text, _safe_content_byte_size for safe handling of None, non-str, empty, invalid UTF-8
- parse_code: normalize and safe encode; handle invalid content without raising
- collect_node_signatures: depth-limited walk, safe node text decode
- score_tree_diff: normalize old/new content before parsing
- calculate_token_score_from_file_changes: skip empty/whitespace new content (skipped-empty), safe byte size check
- Add tests/validator/test_tree_sitter_scoring.py with 35 edge-case and robustness tests

Co-authored-by: Cursor <cursoragent@cursor.com>
@anderdc
Collaborator

anderdc commented Mar 2, 2026

this to me seems to be overly defensive coding for issues that haven't happened

Can you give an example of MAX_AST_DEPTH causing a stack overflow? is there a PR that has had this happen? what does some code look like that would cause this?

for the other issues like the normalization can you give some PR examples and token output where the token score is not consistent for same changes?

I mostly need some proof that this is worth merging instead of finding niche non-bugs

@dive2tech
Author

dive2tech commented Mar 2, 2026

Hi, @anderdc
Thanks for your feedback:

1. MAX_AST_DEPTH and stack overflow

There isn’t a specific gittensor PR where this has happened. This was added as a defensive measure.
Stack overflow (a Python RecursionError) can occur when:

  • collect_node_signatures walks the AST recursively via walk_node, so recursion depth tracks AST depth.
  • The AST is very deep (e.g. many levels of nesting).
  • Python’s recursion limit (default 1000) is exceeded.

For Python source, a pathological case would need roughly 1000+ levels of nesting. Real PRs rarely reach that, so this is unlikely in normal repos.
Conclusion: I can’t point to a real PR where this happened; it’s a theoretical risk for pathological or malicious input. If that doesn’t justify the added complexity, we can remove the MAX_AST_DEPTH change.
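To make the theoretical risk concrete: a naive recursive walk over a tree deeper than Python's recursion limit raises RecursionError, while a depth cap in the spirit of MAX_AST_DEPTH completes. The walker below is a hypothetical stand-in for walk_node (plain dicts instead of tree-sitter nodes), not the actual code:

```python
import sys

# Hypothetical naive recursive walker (illustrative, not the PR's code).
def walk(node, visit):
    visit(node)
    for child in node["children"]:
        walk(child, visit)

# Build a pathological chain deeper than the interpreter's recursion limit.
def make_chain(depth):
    node = {"children": []}
    for _ in range(depth):
        node = {"children": [node]}
    return node

deep = make_chain(sys.getrecursionlimit() + 100)

try:
    walk(deep, lambda n: None)
    crashed = False
except RecursionError:
    crashed = True
# crashed is True: an unbounded recursive walk blows the recursion limit.

# A depth-capped walk simply stops descending past the limit instead of raising.
def walk_limited(node, visit, depth=0, max_depth=500):
    if depth > max_depth:
        return
    visit(node)
    for child in node["children"]:
        walk_limited(child, visit, depth + 1, max_depth)

walk_limited(deep, lambda n: None)  # completes without RecursionError
```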

2. Normalization / invalid UTF-8 — concrete bug

The UTF-8 handling does fix a real crash:

# This raises UnicodeEncodeError in Python 3:
content = "def foo(): pass \udc80"  # Invalid lone surrogate
content.encode('utf-8')  # UnicodeEncodeError

Unpaired surrogates can appear in files fetched from GitHub (encoding issues, copy-paste, etc.). Without safe encoding, parse_code will raise and break scoring.
The main benefit is avoiding crashes rather than fixing scoring inconsistency.
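The errors='replace' behavior this relies on can be shown with the same surrogate example:

```python
# Invalid lone surrogate embedded in otherwise valid source text.
content = "def foo(): pass \udc80"

# Strict UTF-8 encoding raises, which is the crash being fixed.
try:
    content.encode("utf-8")
    raised = False
except UnicodeEncodeError:
    raised = True
# raised is True

# errors='replace' substitutes '?' for the unencodable codepoint and never raises.
safe = content.encode("utf-8", errors="replace")
# safe == b"def foo(): pass ?"
```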

So if you'd like, I can drop MAX_AST_DEPTH (no concrete incident) and keep the safe UTF-8 encoding/decoding in parse_code and node text handling (avoiding UnicodeEncodeError/UnicodeDecodeError on invalid UTF-8).
