ci: add banned-files check to block secrets, binaries, and OS artifacts#674
ci: add banned-files check to block secrets, binaries, and OS artifacts#674WuKongAI-CMU wants to merge 1 commit intoNVIDIA:mainfrom
Conversation
Add a CI workflow that scans PR diffs for files that should never be committed: credential files (.env, *.pem, *.key, credentials.json), OS metadata (.DS_Store, Thumbs.db), and bytecode (*.pyc). The check runs only on changed files (not the full tree) and exempts test/fixtures/ and testdata/ paths to avoid false positives. Also adds missing patterns (service-account*.json, Thumbs.db, desktop.ini) to .gitignore for belt-and-suspenders coverage. Fixes NVIDIA#550 Signed-off-by: peteryuqin <peter.yuqin@gmail.com>
📝 WalkthroughWalkthroughThis PR introduces a GitHub Actions workflow that prevents commits of sensitive files (credentials, keys, certificates), binaries, and OS artifacts. It scans files changed in pull requests against a deny list and fails if matches are found. Additionally, Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/banned-files.yaml:
- Around line 27-48: Add explicit directory patterns 'node_modules/', 'dist/',
and '__pycache__/' to the BANNED_PATTERNS array and change the matching logic
that currently tests only the basename to test the full file path; specifically,
update the BANNED_PATTERNS declaration (variable name BANNED_PATTERNS) to
include those three entries and replace any basename-based check with a
path-based comparison (e.g., use the full filepath variable in a shell pattern
match or grep against "$file" rather than basename("$file")) so patterns like
node_modules/package/index.js are caught.
- Line 54: Update the git diff command assigned to CHANGED: include 'M' in the
--diff-filter (so use ACRM or AC R M combined) to ensure modified files are
scanned, and replace the HEAD reference with the explicit PR head SHA (${{
github.event.pull_request.head.sha }}) so the command uses the PR's head commit;
update the line with CHANGED=$(git diff --name-only --diff-filter=... "${{
github.event.pull_request.base.sha }}" "${{ github.event.pull_request.head.sha
}}") accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 8bfefea1-b225-4237-ad1e-4d763671a205
📒 Files selected for processing (2)
.github/workflows/banned-files.yaml.gitignore
| BANNED_PATTERNS=( | ||
| '.env' 'May contain API keys or secrets' | ||
| '.env.*' 'May contain API keys or secrets' | ||
| '.envrc' 'direnv secrets file' | ||
| '*.pem' 'Private key or certificate' | ||
| '*.key' 'Private key' | ||
| '*.p12' 'PKCS#12 keystore' | ||
| '*.pfx' 'PKCS#12 keystore' | ||
| '*.jks' 'Java keystore' | ||
| '*.keystore' 'Keystore file' | ||
| 'credentials.json' 'Cloud provider credentials' | ||
| 'service-account*.json' 'Cloud provider service account' | ||
| '.npmrc' 'May contain registry tokens' | ||
| '.netrc' 'May contain login credentials' | ||
| '*_rsa' 'SSH private key' | ||
| '*_ed25519' 'SSH private key' | ||
| '*_ecdsa' 'SSH private key' | ||
| '.DS_Store' 'macOS Finder metadata' | ||
| 'Thumbs.db' 'Windows Explorer metadata' | ||
| 'desktop.ini' 'Windows Explorer metadata' | ||
| '*.pyc' 'Python bytecode' | ||
| ) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify deny-list coverage and matching strategy
rg -n --type=yaml 'node_modules|dist|__pycache__|basename=\$\(basename "\$file"\)|\[\[ "\$basename" == \$pattern \]\]' .github/workflows/banned-files.yamlRepository: NVIDIA/NemoClaw
Length of output: 160
🏁 Script executed:
cat -n .github/workflows/banned-files.yamlRepository: NVIDIA/NemoClaw
Length of output: 4429
Directory build artifacts should be explicitly blocked in the deny list with full-path matching.
The BANNED_PATTERNS list is missing node_modules/, dist/, and __pycache__/ entries. Additionally, line 79 only checks the basename, which prevents path-based patterns from working. A directory artifact like node_modules/package/index.js would not be caught since the basename check only sees index.js.
Add the missing directory patterns and update the matching logic to check the full file path:
Suggested fix
BANNED_PATTERNS=(
'*.pyc' 'Python bytecode'
+ 'node_modules/*' 'Dependency vendor directory'
+ 'dist/*' 'Build artifact directory'
+ '__pycache__/*' 'Python cache directory'
) basename=$(basename "$file")
- if [[ "$basename" == $pattern ]]; then
+ if [[ "$file" == $pattern || "$basename" == $pattern ]]; then🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.github/workflows/banned-files.yaml around lines 27 - 48, Add explicit
directory patterns 'node_modules/', 'dist/', and '__pycache__/' to the
BANNED_PATTERNS array and change the matching logic that currently tests only
the basename to test the full file path; specifically, update the
BANNED_PATTERNS declaration (variable name BANNED_PATTERNS) to include those
three entries and replace any basename-based check with a path-based comparison
(e.g., use the full filepath variable in a shell pattern match or grep against
"$file" rather than basename("$file")) so patterns like
node_modules/package/index.js are caught.
| EXEMPT_DIRS="^(test/fixtures/|testdata/)" | ||
|
|
||
| # Get files changed in this PR | ||
| CHANGED=$(git diff --name-only --diff-filter=ACR "${{ github.event.pull_request.base.sha }}" HEAD) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify current diff-filter usage in workflow
rg -n --type=yaml 'git diff --name-only --diff-filter=' .github/workflows/banned-files.yamlRepository: NVIDIA/NemoClaw
Length of output: 170
🏁 Script executed:
cat -n .github/workflows/banned-files.yaml | head -80Repository: NVIDIA/NemoClaw
Length of output: 3596
🏁 Script executed:
cat -n .github/workflows/banned-files.yaml | tail -30Repository: NVIDIA/NemoClaw
Length of output: 1364
🏁 Script executed:
# Check if there are any banned files already tracked in the repository
git ls-files | grep -E '\.(env|pem|key|p12|pfx|jks|keystore)$|credentials\.json|service-account.*\.json|\.npmrc|\.netrc|_rsa|_ed25519|_ecdsa|\.DS_Store|Thumbs\.db|desktop\.ini|\.pyc' | head -20Repository: NVIDIA/NemoClaw
Length of output: 41
🏁 Script executed:
# Check the git diff behavior difference between HEAD and head.sha in the context
# Let's also verify how the checkout action handles the ref
git rev-parse HEAD
git log --oneline -1Repository: NVIDIA/NemoClaw
Length of output: 179
🏁 Script executed:
# Let's verify what the diff-filter options mean and check the actual behavior
# in a PR context. Let's also see if there are any comments or docs about this choice
git log -p --follow -S 'diff-filter=ACR' -- .github/workflows/banned-files.yaml | head -100Repository: NVIDIA/NemoClaw
Length of output: 41
🏁 Script executed:
# Let's check if there are any other workflow files or documentation that explain the design decision
find .github -type f -name "*.yaml" -o -name "*.yml" | xargs grep -l "diff-filter" 2>/dev/nullRepository: NVIDIA/NemoClaw
Length of output: 94
Include modified files in the scan set.
Line 54 uses --diff-filter=ACR, which excludes M (Modified). If a banned file were already tracked in the repository (either accidentally or from before this check existed), modifications to it would bypass this check. Include M in the filter to catch all changes, including modifications to banned files.
Additionally, replace HEAD with ${{ github.event.pull_request.head.sha }} for more explicit and reliable SHA references in the PR context.
Suggested fix
- CHANGED=$(git diff --name-only --diff-filter=ACR "${{ github.event.pull_request.base.sha }}" HEAD)
+ CHANGED=$(git diff --name-only --diff-filter=ACMR "${{ github.event.pull_request.base.sha }}" "${{ github.event.pull_request.head.sha }}")📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| CHANGED=$(git diff --name-only --diff-filter=ACR "${{ github.event.pull_request.base.sha }}" HEAD) | |
| CHANGED=$(git diff --name-only --diff-filter=ACMR "${{ github.event.pull_request.base.sha }}" "${{ github.event.pull_request.head.sha }}") |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.github/workflows/banned-files.yaml at line 54, Update the git diff command
assigned to CHANGED: include 'M' in the --diff-filter (so use ACRM or AC R M
combined) to ensure modified files are scanned, and replace the HEAD reference
with the explicit PR head SHA (${{
github.event.pull_request.head.sha }}) so the command uses the PR's head commit;
update the line with CHANGED=$(git diff --name-only --diff-filter=... "${{
github.event.pull_request.base.sha }}" "${{ github.event.pull_request.head.sha
}}") accordingly.
|
Thanks for putting this together — the banned-files idea is solid and the pattern list was really helpful! We ended up implementing this as a pre-commit hook in #673 instead of a GHA workflow. The advantage is that it catches banned files before they leave the developer's machine, rather than after they've already been pushed to GitHub. The hook uses We also picked up the Going to close this in favor of the pre-commit approach in #673, but thank you for flagging the gap! |
Summary
Adds a CI workflow that automatically rejects PRs containing files that should never be committed:
.env,.env.*,*.pem,*.key,*.p12,credentials.json,service-account*.json,.npmrc,.netrc, SSH private keys.DS_Store,Thumbs.db,desktop.ini*.pycThe check inspects only files changed in the PR (
git diff --name-only), not the full repo tree. Paths undertest/fixtures/andtestdata/are exempt to avoid blocking legitimate test data.Also adds missing patterns (
service-account*.json,Thumbs.db,desktop.ini) to.gitignore.Test plan
.env.local,.DS_Store,*.pem,credentials.jsontest/fixtures/test.pem)Fixes #550
Summary by CodeRabbit