test: add minimal comparative evaluation packet#343

Merged
Stahl-G merged 1 commit into main from codex/v0114-minimal-comparative-eval on Jul 2, 2026

@Stahl-G Stahl-G commented Jul 2, 2026

Owner

Summary

  • add a public-safe v0.11.4 minimal comparative evaluation packet with three synthetic tasks, C0/C1 arms, frozen raw outputs, and raw reviewer observations
  • add scripts/check_minimal_comparative_eval.py and wire it into release consistency
  • keep the packet explicitly non-authoritative: no quality proof, no semantic truth proof, no speed claim, no delivery/release approval

Validation

  • python3 scripts/check_minimal_comparative_eval.py
  • python3 scripts/check_public_safety.py --path docs/evaluation-results/v0.11.4-minimal-comparative-evaluation
  • python3 scripts/check_release_consistency.py --no-tag
  • python3 -m pytest -q tests/test_minimal_comparative_eval.py tests/test_release_consistency.py tests/test_product_baseline.py tests/test_public_safety_scan.py
  • python3 -m pytest -q

@Stahl-G force-pushed the codex/v0114-minimal-comparative-eval branch from a4128e4 to 1100ebf on July 2, 2026 05:34
@Stahl-G marked this pull request as ready for review on July 2, 2026 05:41
@Stahl-G merged commit 5fefe75 into main on Jul 2, 2026 (13 checks passed)
@Stahl-G deleted the codex/v0114-minimal-comparative-eval branch on July 2, 2026 05:41

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1100ebf120

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

        return True
    if kind == "lark_token" and re.search(r"\b(?:oc|ou|on|om)_x+\b", line):
        return True
    if kind == "lark_token" and re.search(r"\bsha256\b", line, re.IGNORECASE):

P2: Keep sha256 allowlist from masking real tokens

When a real Lark-style token appears on any line that also contains sha256, this line-level allowlist suppresses every lark_token finding for that line. I checked the scanner path with check_public_safety.py --path on a temporary YAML line containing both a digest and cli_..., and the scan passed, so release/public safety checks can miss a leaked chat/file token placed next to a hash. Please restrict the exception to the hash candidate itself or to well-formed sha256 fields instead of allowing the entire line.
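
One way to scope the exception to the hash candidate itself is sketched below. This is a minimal illustration, not the code in check_public_safety.py: `TOKEN_RE`, `SHA256_FIELD_RE`, and `lark_token_findings` are hypothetical names, and the token pattern is an assumed stand-in for whatever the real `lark_token` detector matches. The idea is to drop a finding only when its span lies inside the digest of a well-formed sha256 field, so any other candidate on the same line is still reported.

```python
import re

# Hypothetical detector (assumption): Lark-style prefixed IDs plus long opaque
# strings. The real patterns in check_public_safety.py may differ.
TOKEN_RE = re.compile(
    r"\b(?:cli|oc|ou|on|om)_[A-Za-z0-9]{8,}\b|\b[A-Za-z0-9]{32,}\b"
)

# A well-formed sha256 field: the key, a separator, then exactly 64 hex digits.
SHA256_FIELD_RE = re.compile(
    r"\bsha256\b\s*[:=]\s*[\"']?(?P<digest>[0-9a-fA-F]{64})\b"
)


def lark_token_findings(line: str) -> list[str]:
    """Return token candidates on the line, skipping only sha256 digest spans."""
    digest_spans = [m.span("digest") for m in SHA256_FIELD_RE.finditer(line)]
    findings = []
    for match in TOKEN_RE.finditer(line):
        start, end = match.span()
        # Suppress the candidate only if it sits entirely inside a digest.
        inside_digest = any(s <= start and end <= e for s, e in digest_spans)
        if not inside_digest:
            findings.append(match.group(0))
    return findings


digest = "a" * 64  # synthetic digest, not a real hash
print(lark_token_findings(f"sha256: {digest}"))
print(lark_token_findings(f"sha256: {digest}  # cli_a1b2c3d4e5f6"))
```

With this shape, a line holding only a digest yields no findings, while a line holding a digest next to a synthetic `cli_`-prefixed value still reports that value. A 64-character string without a `sha256` key is also reported, since the exception applies to the field rather than to the line.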

Useful? React with 👍 / 👎.
