fix(scanner): remove C7/C4 HIGH false positives found validating 8 real Python projects#46
Merged
Merged
Conversation
Validating the scanner against 8 real Python projects (encode/httpx,
encode/starlette, pallets/flask, fastapi/fastapi, encode/django-rest-framework,
aio-libs/aiohttp, sanic-org/sanic, pallets/werkzeug) surfaced two HIGH false
positives. HIGH blocks commits, so these are the worst kind.
C7 (self-compare): `assert x == x` is now exempt when the same test also runs a
discriminating or membership check on the same operand - `assert x != y`,
`assert not x == y`, or `assert x in {x}`. That pair is a deliberate
__eq__/__hash__ test (reflexive AND distinguishing), not a tautology. A lone
`assert x == x` still fires. Seen in aio-libs/aiohttp and encode/starlette.
C4 (forgotten test): a `test*`-named function is no longer flagged when it is a
web route handler / WSGI app (@app.get/@app.post/@Request.application/
@click.command) or when it is referenced - called, awaited, scheduled via
asyncio.create_task, or passed as a callback. A referenced function runs, so it
is not forgotten. Only a nested `test*` with a check in its own body that is
never referenced, or a top-level test-shaped function never called, still fires.
Seen in fastapi, werkzeug, sanic, flask, aiohttp.
Re-scan after the fix: 0 HIGH across all 8 projects (was 16+9+14+7+... before).
Each fix carries a fires-on-bad and a stays-clean regression test. reference.md
gains the C7/C4/C18 look-alike notes; CHANGELOG records both fixes. Bundled
scanner copy kept byte-identical (drift check passes).
Closes #39
Closes #40
Closes #41
…+ LLM pass) Scanner: 8 real projects (encode/httpx, encode/starlette, pallets/flask, fastapi/fastapi, encode/django-rest-framework, aio-libs/aiohttp, sanic-org/sanic, pallets/werkzeug), ~47 HIGH false positives in C7/C4 fixed, 0 HIGH on re-scan. Semantic pass: first labeled benchmark (24 Python cases) run blind on a small model (Claude Haiku) scored precision 1.00, recall 0.70 (1.00 on clear-cut smells), F1 0.82 - the evidence behind "a small model is enough for a precision-first semantic pass". VALIDATION.md row moved from in-progress to completed; README "How falsegreen is validated" carries the figures. Raw benchmark data, the spreadsheet, and the working report stay local (.handoff/, gitignored); only the measured conclusions are published here.
Senior-architect pre-push review found two narrow holes where the new
exemptions were broader than intended:
C7: the eq-semantics exemption fired on ANY later `!=`/`in` mentioning the
operand. Now it requires a DISTINCT peer for `!=`/`not ==` (a constant like
None does not count) and a literal container that holds the operand for
membership (`x in {x}`, not `x in some_registry`). So `assert x == x` next to
`x != None` or `x in some_registry` stays C7.
C4: the top-level forgotten-test exemption used name_is_used over the whole
module, so an unrelated same-name local rebinding in another function wrongly
excused a real forgotten test. New name_used_at_module_level counts only a call
target `name(...)` anywhere (covers `asyncio.run(main())`) or a module-level
Load. Nested defs keep the broader scoped check (callbacks registered by bare
name still count).
Three regression tests added for the closed holes. 104 tests green, drift
identical, self-scan + ruff clean.
Owner
Author
|
Validation (pre-merge):
Merging. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Validating the scanner against 8 real Python projects (each >200 stars, >=500 tests) surfaced ~47 HIGH-confidence false positives in two rules. HIGH blocks commits, so these are the worst kind. This PR removes them, with regression tests, and records the measured numbers.
Projects (owner/repo):
encode/httpx,encode/starlette,pallets/flask,fastapi/fastapi,encode/django-rest-framework,aio-libs/aiohttp,sanic-org/sanic,pallets/werkzeug.The fixes
C7 (compares a value to itself). Was firing on deliberate
__eq__/__hash__tests. Nowassert x == xis exempt only when the same test runs a discriminating check on the same operand:x != peer/not x == peeragainst a distinct peer (a constant likeNonedoes not count), orx in {x}membership in a literal container that holds x (x in some_registrydoes not count). A loneassert x == xstill fires.C4 (forgotten/uncollected test). Was firing on
test*-named web route handlers (@app.get/@app.post/@Request.application/@click.command) and on local helper coroutines. New principle: a function that is referenced (called, awaited, scheduled viaasyncio.create_task, or passed as a callback) actually runs, so it is not forgotten. Only an undecorated, no-arg nestedtest*with a check in its own body that is never referenced, or a top-level test-shaped function never called, is flagged. The top-level reference check is scoped to module level so an unrelated same-name local elsewhere does not excuse a real forgotten test.Result
scan.pybyte-for-byte), self-scan clean, ruff clean.reference.mdgains C7/C4/C18 look-alike notes; CHANGELOG records the fixes; README + VALIDATION.md carry the measured numbers (scanner 47->0, and the semantic-pass benchmark: precision 1.00 / recall 0.70 on a small model).Closes #39
Closes #40
Closes #41