
feat: tree-sitter scoring robustness and edge cases#235

Open
dive2tech wants to merge 2 commits into entrius:test from dive2tech:feat/tree-sitter-robustness-edge-cases

Conversation

@dive2tech

Summary

Improves robustness and edge-case handling in the tree-sitter token scoring pipeline so that invalid or malformed content is handled predictably instead of crashing validators.

Changes

  • MAX_AST_DEPTH — New constant to cap recursion when walking the AST and avoid stack overflow on very deep or pathological input.

  • Content normalization and safe encoding/decoding — Helpers handle None, non-str, empty/whitespace, and invalid UTF-8 (using errors='replace') without raising:

    • _normalize_content — Returns None for invalid or empty content; otherwise stripped string.
    • _safe_encode_content — UTF-8 encode with replace for invalid codepoints.
    • _safe_decode_node_text — Decode node bytes with replace for malformed UTF-8.
    • _safe_content_byte_size — Byte length for size checks; 0 on error or non-str.
  • parse_code — Uses normalization and safe encode; returns None for invalid or unsupported content instead of raising.

  • collect_node_signatures — Depth-limited walk (default MAX_AST_DEPTH); decodes node text with _safe_decode_node_text so bad UTF-8 in source does not crash.

  • score_tree_diff — Normalizes old/new content before parsing so empty or whitespace-only content is treated consistently.

  • calculate_token_score_from_file_changes — Normalizes content from FileContentPair; if new content is empty or whitespace-only after normalization, skips with skipped-empty (score 0); uses _safe_content_byte_size for the size check so invalid UTF-8 does not raise.

Tests

  • tests/validator/test_tree_sitter_scoring.py (35 tests) — Covers:
    • _normalize_content, _safe_encode_content, _safe_decode_node_text, _safe_content_byte_size
    • parse_code (None, empty, non-str, invalid UTF-8, unknown language)
    • collect_node_signatures (depth limit)
    • score_tree_diff (None/empty/invalid UTF-8)
    • calculate_token_score_from_file_changes (empty and whitespace-only new content → skipped-empty)

Testing

  • pytest tests/validator/test_tree_sitter_scoring.py
  • pytest tests/validator/test_token_scoring_integration.py

dive2tech and others added 2 commits February 25, 2026 12:44
- Add MAX_AST_DEPTH constant to limit recursion and avoid stack overflow
- _normalize_content, _safe_encode_content, _safe_decode_node_text, _safe_content_byte_size for safe handling of None, non-str, empty, invalid UTF-8
- parse_code: normalize and safe encode; handle invalid content without raising
- collect_node_signatures: depth-limited walk, safe node text decode
- score_tree_diff: normalize old/new content before parsing
- calculate_token_score_from_file_changes: skip empty/whitespace new content (skipped-empty), safe byte size check
- Add tests/validator/test_tree_sitter_scoring.py with 35 edge-case and robustness tests

Co-authored-by: Cursor <cursoragent@cursor.com>
@anderdc
Collaborator

anderdc commented Mar 2, 2026

this to me seems to be overly defensive coding for issues that haven't happened

Can you give an example of MAX_AST_DEPTH causing a stack overflow? is there a PR that has had this happen? what does some code look like that would cause this?

for the other issues like the normalization can you give some PR examples and token output where the token score is not consistent for same changes?

I mostly need some proof that this is worth merging instead of finding niche non-bugs

@dive2tech
Author

dive2tech commented Mar 2, 2026

Hi, @anderdc
Thanks for your feedback:

1. MAX_AST_DEPTH and stack overflow

There isn’t a specific gittensor PR where this has happened. This was added as a defensive measure.
Stack overflow (a Python RecursionError) can occur when:

  • collect_node_signatures walks the AST recursively via walk_node, so recursion depth tracks AST depth.
  • The AST is very deep (e.g. many levels of nesting).
  • Python’s recursion limit (default 1000) is exceeded.

For Python source, a pathological case would need roughly 1000+ levels of nesting. Real PRs rarely reach that, so this is unlikely in normal repos.
Conclusion: I can’t point to a real PR where this happened; it’s a theoretical risk for pathological or malicious input. If that doesn’t justify the added complexity, we can remove the MAX_AST_DEPTH change.
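To make the theoretical risk concrete: a naive recursive walk over a tree deeper than Python's recursion limit raises RecursionError, while a depth cap in the spirit of MAX_AST_DEPTH completes. The walker below is a hypothetical stand-in for walk_node (plain dicts instead of tree-sitter nodes), not the actual code:

```python
import sys

# Hypothetical naive recursive walker (illustrative, not the PR's code).
def walk(node, visit):
    visit(node)
    for child in node["children"]:
        walk(child, visit)

# Build a pathological chain deeper than the interpreter's recursion limit.
def make_chain(depth):
    node = {"children": []}
    for _ in range(depth):
        node = {"children": [node]}
    return node

deep = make_chain(sys.getrecursionlimit() + 100)

try:
    walk(deep, lambda n: None)
    crashed = False
except RecursionError:
    crashed = True
# crashed is True: an unbounded recursive walk blows the recursion limit.

# A depth-capped walk simply stops descending past the limit instead of raising.
def walk_limited(node, visit, depth=0, max_depth=500):
    if depth > max_depth:
        return
    visit(node)
    for child in node["children"]:
        walk_limited(child, visit, depth + 1, max_depth)

walk_limited(deep, lambda n: None)  # completes without RecursionError
```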

2. Normalization / invalid UTF-8 — concrete bug

The UTF-8 handling does fix a real crash:

# This raises UnicodeEncodeError in Python 3:
content = "def foo(): pass \udc80"  # Invalid lone surrogate
content.encode('utf-8')  # UnicodeEncodeError

Unpaired surrogates can appear in files fetched from GitHub (encoding issues, copy-paste, etc.). Without safe encoding, parse_code will raise and break scoring.
The main benefit is avoiding crashes rather than fixing scoring inconsistency.
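The errors='replace' behavior this relies on can be shown with the same surrogate example:

```python
# Invalid lone surrogate embedded in otherwise valid source text.
content = "def foo(): pass \udc80"

# Strict UTF-8 encoding raises, which is the crash being fixed.
try:
    content.encode("utf-8")
    raised = False
except UnicodeEncodeError:
    raised = True
# raised is True

# errors='replace' substitutes '?' for the unencodable codepoint and never raises.
safe = content.encode("utf-8", errors="replace")
# safe == b"def foo(): pass ?"
```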

So if you'd like, I can drop MAX_AST_DEPTH (no concrete incident) and keep the safe UTF-8 encoding/decoding in parse_code and node text handling (avoiding UnicodeEncodeError/UnicodeDecodeError on invalid UTF-8).
