feat: tree-sitter scoring robustness and edge cases #235
dive2tech wants to merge 2 commits into entrius:test from
- Add MAX_AST_DEPTH constant to limit recursion and avoid stack overflow
- _normalize_content, _safe_encode_content, _safe_decode_node_text, _safe_content_byte_size for safe handling of None, non-str, empty, invalid UTF-8
- parse_code: normalize and safe encode; handle invalid content without raising
- collect_node_signatures: depth-limited walk, safe node text decode
- score_tree_diff: normalize old/new content before parsing
- calculate_token_score_from_file_changes: skip empty/whitespace new content (skipped-empty), safe byte size check
- Add tests/validator/test_tree_sitter_scoring.py with 35 edge-case and robustness tests

Co-authored-by: Cursor <cursoragent@cursor.com>
To me this seems to be overly defensive coding for issues that haven't happened. Can you give an example of
For the other issues, like the normalization, can you give some PR examples and token output where the token score is not consistent for the same changes? I mostly need some proof that this is worth merging instead of finding niche non-bugs.
Hi, @anderdc 1.
Summary
Improves robustness and edge-case handling in the tree-sitter token scoring pipeline so invalid or malformed content does not crash validators and is handled in a predictable way.
Changes
MAX_AST_DEPTH — New constant to cap recursion when walking the AST and avoid stack overflow on very deep or pathological input.
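For illustration only, a depth cap of this kind might look like the sketch below. It is not the code from this PR: `Node`, `collect_types`, and `build_chain` are hypothetical stand-ins for a tree-sitter node and the real walk, and the value 200 is an assumed limit chosen to stay well under Python's default recursion limit.

```python
from dataclasses import dataclass, field

# Assumed value for illustration; the PR text does not state the real limit.
MAX_AST_DEPTH = 200


@dataclass
class Node:
    """Hypothetical stand-in for a tree-sitter node (type plus children)."""
    type: str
    children: list["Node"] = field(default_factory=list)


def collect_types(node, depth=0, max_depth=MAX_AST_DEPTH, out=None):
    """Collect node types in pre-order, ignoring anything deeper than max_depth."""
    if out is None:
        out = []
    if depth > max_depth:
        return out  # stop descending instead of risking RecursionError
    out.append(node.type)
    for child in node.children:
        collect_types(child, depth + 1, max_depth, out)
    return out


def build_chain(length):
    """Build a pathological tree: one chain of `length` nested nodes."""
    root = Node("n0")
    current = root
    for i in range(1, length):
        child = Node(f"n{i}")
        current.children.append(child)
        current = child
    return root
```

With a cap in place, a chain far deeper than the interpreter's recursion limit is truncated rather than crashing the walk; the trade-off is that nodes below the cap contribute nothing to the result.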
Content normalization and safe encoding/decoding: Helpers handle None, non-str, empty/whitespace, and invalid UTF-8 (using errors='replace') without raising:
- _normalize_content: Returns None for invalid or empty content; otherwise the stripped string.
- _safe_encode_content: UTF-8 encode with replace for invalid codepoints.
- _safe_decode_node_text: Decode node bytes with replace for malformed UTF-8.
- _safe_content_byte_size: Byte length for size checks; 0 on error or non-str.
parse_code: Uses normalization and safe encode; returns None for invalid or unsupported content instead of raising.
collect_node_signatures: Depth-limited walk (default MAX_AST_DEPTH); decodes node text with _safe_decode_node_text so bad UTF-8 in source does not crash.
score_tree_diff: Normalizes old/new content before parsing so empty or whitespace-only content is treated consistently.
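As a rough sketch of the behaviour described for the helpers above (not the PR's actual implementation), the functions below follow the stated contracts using only the standard `errors="replace"` handler of `str.encode` and `bytes.decode`. The un-prefixed names are hypothetical, chosen to avoid implying these are the real bodies.

```python
from typing import Optional


def normalize_content(content: object) -> Optional[str]:
    """Return stripped text, or None for None, non-str, empty or whitespace-only input."""
    if not isinstance(content, str):
        return None
    stripped = content.strip()
    return stripped or None


def safe_encode(content: str) -> bytes:
    """Encode to UTF-8; unencodable code points (e.g. lone surrogates) become b'?'."""
    return content.encode("utf-8", errors="replace")


def safe_decode(raw: bytes) -> str:
    """Decode UTF-8; malformed byte sequences become U+FFFD."""
    return raw.decode("utf-8", errors="replace")


def safe_byte_size(content: object) -> int:
    """UTF-8 byte length of content; 0 for non-str input."""
    if not isinstance(content, str):
        return 0
    return len(safe_encode(content))
```

For example, `normalize_content("   \n")` returns `None`, and `safe_decode(b"ok\xff")` returns `"ok"` followed by the replacement character rather than raising `UnicodeDecodeError`.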
calculate_token_score_from_file_changes: Normalizes content from FileContentPair; if new content is empty or whitespace-only after normalization, skips with skipped-empty (score 0); uses _safe_content_byte_size for the size check so invalid UTF-8 does not raise.
Tests
- _normalize_content, _safe_encode_content, _safe_decode_node_text, _safe_content_byte_size
- parse_code (None, empty, non-str, invalid UTF-8, unknown language)
- collect_node_signatures (depth limit)
- score_tree_diff (None/empty/invalid UTF-8)
- calculate_token_score_from_file_changes (empty and whitespace-only new content → skipped-empty)
Testing