CCA pre-filters (Java import, C arg-count, Go sub-package), Java CCA cycle detection & perf, RPM version guard, ~40 bug fixes, ~200 new tests #261

Open
tmihalac wants to merge 25 commits into RHEcosystemAppEng:main from tmihalac:CCA-Argument-Count-Pre-filter

Conversation

tmihalac commented Jun 23, 2026

Collaborator

Java CCA

  • Add _can_reference_class() import-based pre-filter (simple class name, wildcard import, same package, same artifact) to eliminate irrelevant uber-JAR candidates before expensive type resolution
  • Only filter third-party candidates; application code (root docs) always passes to avoid false negatives from polymorphic interface calls
  • Add DFS cycle detection guard in get_relevant_documents to prevent infinite loops from self-recursive or mutually recursive method calls
  • Switch logger.debug in __check_identifier_resolved_to_callee_function_package from f-strings to %s lazy formatting
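
The four visibility checks listed for the import-based pre-filter can be sketched roughly as follows. This is an illustration only, not the PR's `_can_reference_class`: the function name `can_reference_class`, its parameters, and the reading of the "simple class name" check as a whole-word search of the caller source are all assumptions.

```python
import re


def can_reference_class(caller_source: str, caller_package: str,
                        caller_artifact: str, declaring_fqcn: str,
                        declaring_artifact: str) -> bool:
    """Hypothetical sketch: could caller_source plausibly reference declaring_fqcn?"""
    package, _, simple_name = declaring_fqcn.rpartition(".")
    # 1. Simple class name appears as a whole word in the caller source
    #    (covers explicit imports and fully qualified uses).
    if re.search(rf"\b{re.escape(simple_name)}\b", caller_source):
        return True
    # 2. Wildcard import of the declaring package.
    if re.search(rf"^\s*import\s+{re.escape(package)}\.\*\s*;",
                 caller_source, re.MULTILINE):
        return True
    # 3. Same package: no import statement is needed.
    if package and package == caller_package:
        return True
    # 4. Same artifact (JAR): keep the candidate to stay conservative.
    return caller_artifact == declaring_artifact
```

A candidate is dropped only when all four checks fail, which keeps the filter conservative; per the description above, root (application) documents would bypass it entirely.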

C CCA

  • Add argument-count pre-filter to C parser's search_for_called_function to reject cross-package false positives where a same-named function has different arity (e.g. rsync read_byte(1) vs PostgreSQL read_byte(3))

Go CCA

  • Add Go sub-package awareness to Function Locator (_extract_go_subpackage, _go_subpackage_flow_control)
  • Add CCA sub-package filtering via resolve_subpackage_to_module in Go parser and tree fallback in chain_of_calls_retriever
  • Add Go sub-package enrichment from patches in intel_utils (extract_go_subpackages_from_patch)
  • Add Go sub-package prefix matching to Rule 8 so github.com/lib/foo/bar matches target github.com/lib/foo
  • Fix Go FL short_name dict collision: store list of packages per short name instead of overwriting
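
The sub-package prefix matching described for Rule 8 amounts to a path-boundary-aware prefix test. A minimal sketch, assuming a hypothetical helper name `matches_go_module` (the PR's actual rule code is not shown here):

```python
def matches_go_module(candidate: str, target: str) -> bool:
    """True if candidate is the target module or one of its sub-packages."""
    candidate = candidate.rstrip("/")
    target = target.rstrip("/")
    # Appending "/" prevents github.com/lib/foobar from matching github.com/lib/foo.
    return candidate == target or candidate.startswith(target + "/")
```

Comparing against `target + "/"` rather than the bare prefix is the design point: a plain `startswith(target)` would also accept sibling modules that merely share a name prefix.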

CCA / all ecosystems

  • Pre-index documents in chain_of_calls_retriever.__init__ (sort_docs, _root_docs, _source_path_index) for O(1) lookups in get_possible_docs instead of scanning all documents
  • Optimize _is_doc_excluded to compare source path (cheap) before page content (expensive)
  • Deduplicate parents list in non-Java CCA tree_dict to prevent duplicate entries from dependency tree builder
  • Add distinct "function not found" message when CCA returns empty call_hierarchy_list so the agent can distinguish a missing function from an unreachable one
  • Escape regex metacharacters in Function Caller Finder query builder to handle identifiers containing dots, brackets, and plus signs
  • Reorder direct_parents in __find_caller_function_dfs so root-level packages are searched before library packages, fixing nondeterministic JS transitive search timeouts
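
The metacharacter escaping mentioned above can be illustrated with `re.escape`. This is a sketch under assumptions: `build_call_pattern` is a hypothetical name, and the real Function Caller Finder query builder may assemble its pattern differently.

```python
import re


def build_call_pattern(identifier: str) -> re.Pattern:
    # Escape the identifier so '.', '[', ']' and '+' match literally,
    # then look for an opening parenthesis after optional whitespace.
    return re.compile(re.escape(identifier) + r"\s*\(")
```

Without the escape, an identifier such as `a.b` would also match `aXb(`, and one containing an unbalanced `[` would fail to compile.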

JS parser fixes

  • Cap get_function_name regex searches to first 2000 chars to prevent catastrophic backtracking on huge functions
  • Fix is_comment_line to handle *-prefixed JSDoc continuation lines vs generator *method() syntax
  • Fix _extract_class_name regex to support $ in JS identifiers (\w+[\w$]+)
  • Fix resolve_chain backreference [^\1] in string pattern (changed to .*?)
  • Remove unused is_multiline parameter from _parse_declarations
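
The `[^\1]` fix is easier to follow with a small demonstration. The sketch below assumes the parser uses Python's `re` module, where a numeric escape inside a character class is read as an octal character escape rather than a backreference; the two patterns are illustrative and are not copied from `resolve_chain`.

```python
import re

# Inside a character class, \1 is NOT a backreference in Python's re module;
# it is read as the octal escape for the character U+0001.
BROKEN = re.compile(r"""(['"])[^\1]*\1""")
# A lazy quantifier followed by the backreference stops at the first matching quote.
FIXED = re.compile(r"""(['"]).*?\1""")

line = "a = 'x' + 'y'"
broken_match = BROKEN.search(line).group(0)  # spans both string literals
fixed_match = FIXED.search(line).group(0)    # stops after the first literal
```

The broken pattern greedily runs to the last quote on the line, so two adjacent string literals are merged into one match.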

RPM checker

  • Emit TARGET_IN_VULNERABLE_RANGE in VulnerabilityIntel.format_for_prompt() so the L1 agent sees the field referenced by the version-based fallback rules
  • Add VERSION GUARD clause to Case B sys prompt: when target is in vulnerable range, a grep match alone is not sufficient to conclude PATCHED — the fix must be at the exact CVE location
  • Shorten cve_verify_vuln_package LLM response instruction to one sentence

Config scanner

  • Add _CONFIG_DIR_ALLOWED_EXTENSIONS allowlist for directory-matched files, filtering out .js/.css/.map/.java/.py from config collection
  • Add build/tool config patterns: pyproject.toml, setup.cfg, tox.ini, tsconfig.json, .eslintrc.json, Makefile, CMakeLists.txt, meson.build
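
An extension allowlist for directory-matched files could look roughly like this. The set contents and the helper name `is_config_candidate` are assumptions for illustration; the PR's `_CONFIG_DIR_ALLOWED_EXTENSIONS` may list different extensions.

```python
from pathlib import PurePosixPath

# Hypothetical allowlist; the actual _CONFIG_DIR_ALLOWED_EXTENSIONS may differ.
ALLOWED_EXTENSIONS = {".xml", ".yaml", ".yml", ".json", ".properties",
                      ".conf", ".ini", ".toml"}


def is_config_candidate(path: str) -> bool:
    """Accept a directory-matched file only if its extension is allowlisted."""
    suffix = PurePosixPath(path).suffix.lower()
    # Extensionless files in config directories are still accepted.
    return suffix == "" or suffix in ALLOWED_EXTENSIONS
```

An allowlist fails closed: a new asset type dropped into a config directory is ignored until someone explicitly adds its extension.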

Source code bug fixes (~40)

  • Fix dep_tree.py missing comma causing --pythonvenv_python string concatenation
  • Fix dep_tree.py C/C++ detect_ecosystem walk not using _WALK_EXCLUDE_DIRS
  • Fix dep_tree.py remove unused _ensure_venv method
  • Fix dep_tree.py deduplicate types- package candidates via dict.fromkeys
  • Fix c_segmenter_custom.py remove_comments stripping patterns inside string literals
  • Fix c_lang_function_parsers.py debug print statement left in production code
  • Fix golang_functions_parsers.py len(declaration_parts) == (2 or 3) always-true comparison
  • Fix golang_functions_parsers.py no-op re.search("") call
  • Fix golang_functions_parsers.py is_package_imported raw string split, missing quote stripping, and unescaped identifier in regex
  • Fix golang_functions_parsers.py is_same_package crash on empty input
  • Fix python_functions_parser.py is_same_package returning True for two empty strings
  • Fix python_segmenters_with_classes_methods.py annotating all methods with last class name and skipping async def methods
  • Fix source_code_git_loader.py safe.directory guard unnecessarily gated on clone_url
  • Fix brew_downloader.py returning path with zero downloads and extracting SRPM from cache path instead of target path
  • Fix configuration_scanner.py re.match allowing partial filename matches (fullmatch), max_results=0 returning 1 result, cache race condition on concurrent eviction, and missing docker-compose*.yaml pattern
  • Fix import_usage_analyzer.py empty short_name matching everything
  • Fix async_http_utils.py off-by-one in retry count, consumer errors caught by retry loop, retry_on_client_errors overridden by Retry-After check, negative sleep from past X-RateLimit-Reset, and missing @functools.wraps
  • Fix function_name_locator.py python_flow_control crash on non-function documents, Go versioned module short-name collision, and get_function_name ValueError exception handling
  • Fix git_commit_searcher.py _rank_results mutating confidence in-place
  • Fix git_repo_manager.py double-wrapping GitCommandError
  • Fix intel_utils.py parse_cpe checking split_cpe[5] instead of split_cpe[10] for system
  • Fix llm_engine_utils.py assert False in production code (replaced with RuntimeError)
  • Fix repo_resolver.py case-sensitivity bug in normalize_package_name dropping original case for mixed-case JSON keys
  • Fix serp_api_wrapper.py key index not reset after full rotation, dead max_retries field, and callers passing removed field
  • Fix csaf_generator.py GHSA description dropped when no pre-existing note and notes appended with text: None
  • Fix web_patch_fetcher.py missing asyncio import, _is_commit_url false positive on /c/ outside kernel.org, dropping Gitiles commit URLs, yarl double-encoding %5E%21, and rewrite _fetch_gitiles_patch from sync requests to async aiohttp
  • Fix prompting.py build_tool_descriptions missing FL, CONFIG, IUA, GREP entries
  • Fix java_functions_parsers.py _count_call_args treating < comparison as generic bracket
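
The `golang_functions_parsers.py` comparison fix in the list above is an instance of a common Python pitfall, shown here with made-up `declaration_parts` values. As written with parentheses, `(2 or 3)` evaluates to `2`, so the length-3 case is silently missed; without parentheses, `len(x) == 2 or 3` is always truthy. The membership test is the usual remedy, and it is an assumption that the PR uses the same form.

```python
parts_two = ["func", "Name"]
parts_three = ["func", "(r Recv)", "Name"]
parts_five = ["a", "b", "c", "d", "e"]
samples = (parts_two, parts_three, parts_five)

# `or` returns its first truthy operand, so (2 or 3) is just 2.
assert (2 or 3) == 2
parenthesized = [len(p) == (2 or 3) for p in samples]

# Without parentheses the expression is `(len(p) == 2) or 3`: always truthy.
unparenthesized = [bool(len(p) == 2 or 3) for p in samples]

# A membership test expresses the intended "2 or 3 parts" check.
membership = [len(p) in (2, 3) for p in samples]
```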

Build & infra

  • Parallelize source JAR extraction using ThreadPoolExecutor with cgroup-aware worker count
  • Add -Dmaven.artifact.threads=10 to all mvn dependency:copy-dependencies invocations in dep_tree.py
  • Set MAVEN_OPTS with -Dmaven.repo.local on shared PVC in on-cm-runner.yaml, on-pull-request.yaml, and exploit_iq_service.yaml
  • Add GOCACHE env var to on-cm-runner.yaml, on-pull-request.yaml, exploit_iq_service.yaml
  • Increase batch runner CPU request/limit from 1/2 cores to 3/3 cores
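
A cgroup-aware worker count can be derived from the cgroup v2 `cpu.max` file, which holds either `<quota> <period>` or `max <period>`. The sketch below is a guess at the general shape, not the PR's `_available_cpus`; the function names and the fallback policy are assumptions.

```python
import os


def cpus_from_cgroup(cpu_max_text: str, fallback: int) -> int:
    """Parse cgroup v2 cpu.max content ("<quota> <period>" or "max <period>")."""
    parts = cpu_max_text.split()
    if len(parts) != 2 or parts[0] == "max":
        return fallback  # no limit set, or unexpected format
    try:
        quota, period = int(parts[0]), int(parts[1])
    except ValueError:
        return fallback
    if quota <= 0 or period <= 0:
        return fallback
    # Round down, but never go below one worker.
    return max(1, quota // period)


def available_cpus(path: str = "/sys/fs/cgroup/cpu.max") -> int:
    """Return the container CPU limit, or os.cpu_count() when none is readable."""
    fallback = os.cpu_count() or 1
    try:
        with open(path, encoding="ascii") as fh:
            return cpus_from_cgroup(fh.read(), fallback)
    except OSError:
        return fallback
```

The result could then be passed as `max_workers` to `ThreadPoolExecutor`, so a pod limited to 3 cores does not spawn one thread per host core.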

Tests

  • Fix tautological assertions (disjunctive or, truthiness-only, conditional if-then-assert) and tests that reimplemented source logic instead of calling real functions
  • Consolidate 9 duplicate in-tree test files into canonical tests/ location
  • Add ~200 new tests: Java CCA (_can_reference_class, function_called_from_caller_body, __find_caller_function), JS parser/segmenter (backtracking cap, comment line, $ identifiers), C parser (argument-count filter, find_top_level_blocks), Go sub-package (FL, CCA, intel enrichment), RPM checker (format_for_prompt, VERSION GUARD), config scanner (allowlist, build patterns), and tools (FL, IUA, SerpAPI, git, async HTTP, web patch fetcher)

@vbelouso

vbelouso commented Jun 23, 2026

Collaborator

Snyk checks have passed. No issues have been found so far.

| Status | Scan Engine | Critical | High | Medium | Low | Total (0) |
|---|---|---|---|---|---|---|
| | Code Security | 0 | 0 | 0 | 0 | 0 issues |


@tmihalac

Collaborator Author

/test-heavy

3 similar comments
@tmihalac

Collaborator Author

/test-heavy

@tmihalac

Collaborator Author

/test-heavy

@tmihalac

Collaborator Author

/test-heavy

@tmihalac

Collaborator Author

/test vulnerability-analysis-on-pr

@tmihalac

Collaborator Author

/test-heavy

tmihalac force-pushed the CCA-Argument-Count-Pre-filter branch from 0636a88 to 759ac49 on June 28, 2026 05:47
@tmihalac

Collaborator Author

/test vulnerability-analysis-on-pr

2 similar comments
@tmihalac

Collaborator Author

/test vulnerability-analysis-on-pr

@tmihalac

Collaborator Author

/test vulnerability-analysis-on-pr

@tmihalac

Collaborator Author

/test-heavy

tmihalac added 14 commits June 28, 2026 13:19
- Replaced esprima-based JavaScript segmenter with tree-sitter for reliable
  parsing of modern JS syntax (optional chaining, nullish coalescing, top-level
  await)
  - Fixed JS function name extraction: keyword filtering, position-aware matching,
  redundant pattern removal, generator/TypeScript/anonymous-export support
  - Added build-artifact filtering (should_skip) that excludes app-level dist/,
  build/static/, .min.js while preserving node_modules/*/dist/ as legitimate
  third-party source
  - Added empty-name guards in CCA BFS to prevent documents with unextractable
  function names from entering call-chain analysis
  - Fixed _get_function_calls regex to detect calls through optional chaining
  (obj?.method())

Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
  formatting

  - Add _can_reference_class() to JavaChainOfCallsRetriever for 4-way
  import visibility check (simple class name, wildcard import, same
  package, same artifact)
  - Apply import pre-filter in _get_possible_docs via optional
  declaring_fqcn/callee_file_name/code_documents params
  - Pass declaring FQCN from __find_caller_function to
  _get_possible_docs to eliminate irrelevant uber-JAR candidates before
  expensive type resolution
  - Only filter third-party candidates; application code (root docs)
  always passes to avoid false negatives from polymorphic interface
  calls
  - Add DFS cycle detection guard in get_relevant_documents to prevent
  infinite loops from self-recursive or mutually recursive method calls
  - Switch logger.debug in
  __check_identifier_resolved_to_callee_function_package from f-strings
  to %s lazy formatting to avoid string construction when debug
  logging is disabled
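
The cycle guard described in this commit can be illustrated with a visited set on a toy call graph. This is a generic sketch, not the code in `get_relevant_documents`: the `reachable` function and the graph-as-dict representation are assumptions.

```python
def reachable(graph: dict[str, list[str]], start: str) -> list[str]:
    """Iterative DFS that visits each method once, even with recursive calls."""
    visited: set[str] = set()
    order: list[str] = []
    stack = [start]
    while stack:
        node = stack.pop()
        if node in visited:
            # Cycle guard: a self-recursive or mutually recursive call
            # leads back to a method that was already explored.
            continue
        visited.add(node)
        order.append(node)
        # Reverse so callees are visited in their listed order.
        stack.extend(reversed(graph.get(node, [])))
    return order
```

Without the `visited` check, a graph such as `a -> a` or `a -> b -> c -> a` would keep the traversal running indefinitely.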

Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
  escaping, and CCA empty-result guidance

  - Deduplicate parents list in non-Java CCA tree_dict to prevent
  duplicate entries from dependency tree builder
  - Add Go subpackage prefix matching to Rule 8 so
  "github.com/lib/foo/bar" matches target "github.com/lib/foo"
  - Add distinct "function not found" message when CCA returns empty
  call_hierarchy_list so agent distinguishes missing function from
  unreachable function
  - Escape regex metacharacters in Function Caller Finder query builder
  to handle identifiers containing dots, brackets, and plus sign

Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
  - Fix dep_tree.py missing comma causing "--pythonvenv_python" string
  concatenation
  - Fix dep_tree.py C/C++ detect_ecosystem walk not using
  _WALK_EXCLUDE_DIRS
  - Fix c_segmenter_custom.py remove_comments stripping patterns inside
  string literals
  - Fix c_lang_function_parsers.py debug print statement left in
  production code
  - Fix golang_functions_parsers.py `len(declaration_parts) == (2 or
  3)` always-true comparison
  - Fix golang_functions_parsers.py no-op `re.search("")` call
  - Fix golang_functions_parsers.py is_package_imported raw string
  split and missing quote stripping
  - Fix golang_functions_parsers.py is_same_package crash on empty
  input
  - Fix javascript_functions_parser.py is_comment_line missing block
  comment continuation (`*`)
  - Fix javascript_functions_parser.py _extract_class_name regex
  missing `$` in identifier class
  - Fix javascript_functions_parser.py _parse_declarations unused
  is_multiline parameter
  - Fix javascript_functions_parser.py backreference `[^\1]` in string
  pattern
  - Fix python_functions_parser.py is_same_package returning True for
  two empty strings
  - Fix python_segmenters_with_classes_methods.py annotating all
  methods with last class name
  - Fix python_segmenters_with_classes_methods.py skipping async def
  methods
  - Fix source_code_git_loader.py safe.directory guard unnecessarily
  gated on clone_url
  - Fix brew_downloader.py returning path even with zero downloads
  - Fix brew_downloader.py extracting SRPM from cache path instead of
  target path
  - Fix configuration_scanner.py re.match allowing partial filename
  matches
  - Fix configuration_scanner.py max_results=0 returning 1 result
  - Fix configuration_scanner.py cache race condition on repo_key check
  outside lock
  - Fix configuration_scanner.py missing docker-compose*.yaml pattern
  - Fix import_usage_analyzer.py empty short_name matching everything
  - Fix async_http_utils.py off-by-one in retry count (`<=` vs `<`)
  - Fix async_http_utils.py consumer errors caught by retry loop
  instead of propagating
  - Fix async_http_utils.py retry_on_client_errors overridden by
  Retry-After check
  - Fix async_http_utils.py negative sleep from X-RateLimit-Reset in
  the past
  - Fix async_http_utils.py missing @functools.wraps on retry_async
  wrapper
  - Fix function_name_locator.py python_flow_control crash on
  non-function documents
  - Fix function_name_locator.py Go versioned module short-name
  collision (v2, v3)
  - Fix git_commit_searcher.py _rank_results mutating confidence
  in-place
  - Fix git_repo_manager.py double-wrapping GitCommandError
  - Fix intel_utils.py parse_cpe checking split_cpe[5] instead of
  split_cpe[10] for system
  - Fix llm_engine_utils.py assert False in production code replaced
  with RuntimeError
  - Fix repo_resolver.py case-sensitivity bug in normalize_package_name
  - Fix serp_api_wrapper.py key index not reset after full rotation
  - Fix serp_api_wrapper.py dead max_retries field
  - Fix csaf_generator.py GHSA description dropped when no pre-existing
  note
  - Fix csaf_generator.py notes appended with text: None when
  summary/justification missing
  - Fix web_patch_fetcher.py missing asyncio import for TimeoutError
  catch
  - Fix web_patch_fetcher.py _is_commit_url false positive on /c/
  outside kernel.org
  - Fix web_patch_fetcher.py dropping Gitiles commit URLs from
  candidates
  - Fix prompting.py build_tool_descriptions missing FL, CONFIG, IUA,
  GREP entries
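
  The `re.match` versus `re.fullmatch` item in the list above is worth a short demonstration. The pattern and filename below are made up for illustration and are not taken from `configuration_scanner.py`.

```python
import re

pattern = r"docker-compose.*\.ya?ml"
name = "docker-compose.yaml.bak"

# re.match anchors only at the start, so a trailing suffix slips through.
partial = re.match(pattern, name) is not None
# re.fullmatch requires the whole filename to match the pattern.
full = re.fullmatch(pattern, name) is not None
# A genuine compose override file still matches in full.
override = re.fullmatch(pattern, "docker-compose.prod.yml") is not None
```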

  Test correctness fixes:
  - Replace tautological assertions (disjunctive or, truthiness-only,
  conditional if-then-assert)
  - Rewrite tests that reimplemented source logic instead of calling
  real functions
  - Fix mock searcher ignoring tantivy query parameter in IUA tests
  - Fix test_stub_only_triggers_pypi_fetch swallowing all exceptions
  via try/except pass
  - Fix test_clone_failure_cleans_temp_dir vacuously-passing assertion
  - Fix test_consumer_error_propagates using overly broad
  pytest.raises(Exception)
  - Fix test_optional_chaining_preservation asserting on input string
  not parsed output
  - Fix test_remove_comments_string_literal wrong docstring and
  tautological assertion
  - Fix test_key_rotation not verifying actual key sent in HTTP request
  - Fix test_all_tools_produce_7_descriptions omitting FL, CONFIG, IUA,
  GREP
  - Fix conditional assertion in git_commit_searcher silently passing
  on None
  - Fix test_third_party_docs weak assertion not verifying actual jar
  key
  - Fix test_llm_engine_utils disjunctive or assertion masking wrong
  return value

  Agent/pipeline coverage:
  - Add pre_process_node tests for ReachabilityAgent and
  CodeUnderstandingAgent
  - Add _postprocess_results exception handling tests
  - Add dispatch_question exception fallback and build_routing_prompt
  integration tests
  - Add Rule 8 vs Rule 9 priority interaction test
  - Add thought_node actions-is-None and observation_node
  truncation/pruning tests
  - Add _build_tool_guidance_for_ecosystem per-ecosystem filtering
  tests

  Java CCA coverage:
  - Add function_called_from_caller_body tests (24 cases)
  - Add extract_from_query, infer_class_name_and_package_name tests
  - Add is_java_fqcn, extract_maven_artifact, _is_doc_excluded tests
  - Add __find_caller_function and __find_initial_function direct tests

  JS parser/segmenter coverage:
  - Add search_for_called_function branch tests
  - Add is_valid, create_map_of_local_vars, is_exported_function tests
  - Add _get_tree caching, should_skip, nested class extraction tests

  C segmenter coverage:
  - Add find_top_level_blocks, remove_macro_blocks,
  extract_define_functions tests

  Go/Python/C parser coverage:
  - Add is_tree_key_match, get_function_name, is_package_imported edge
  case tests
  - Add Python utility method and class-without-parens tests
  - Add C get_package_names, filter_docs, document_imports_package
  tests

  Tools coverage:
  - Add FL stdlib_cache, flow_control, singleton isolation tests
  - Add config scanner cache eviction and concurrent access tests
  - Add IUA query verification and comment-line counting tests
  - Add git_commit_searcher _fetch_patch_via_http tests

  External integration coverage:
  - Add web patch fetcher parsing, Gitiles URL, commit extraction tests
  - Add async HTTP retry limit, raise_for_status, 500 boundary tests
  - Add SERP key exhaustion reset and error propagation tests
  - Add git_repo_manager clone, fetch, concurrency, host validation
  tests

  VEX/intel/version coverage:
  - Add unexpected justification_label, RPM+NVD range, version check
  error tests
  - Add package identifier utility method tests
  - Add _is_safe_url, identify() with intel=None tests

  LLM engine/checklist/prompting coverage:
  - Add preprocess_engine_input, postprocess_engine_output branch tests
  - Add build_no_vuln_packages_output justification tests
  - Add generate_checklist, build_tool_descriptions per-tool tests

  Remaining coverage:
  - Add _ensure_venv, determine_python_version,
  vulnerability_intel_sanitizer tests
  - Add source_classification, credential_client, transitive_detection
  tests
  - Rewrite cve_fetch_patches tests to call real _arun

  Test file consolidation:
  - Merge 18 deleted test files into consolidated per-domain test
  modules
  - Add 13 new focused test files for previously uncovered modules

Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
  - Fix serp_api_wrapper.py callers passing removed max_retries field
  (Pydantic ValidationError at runtime)
  - Fix web_patch_fetcher.py _fetch_gitiles_patch yarl double-encoding
  %5E%21 (use yarl.URL(encoded=True))
  - Fix java_functions_parsers.py _count_call_args treating <
  comparison as generic bracket (dual-comma fallback)
  - Fix repo_resolver.py normalize_package_name dropping original case
  for mixed-case JSON keys (NetworkManager)
  - Fix javascript_functions_parser.py is_comment_line classifying
  generator *method() as block comment
  - Fix golang_functions_parsers.py is_package_imported unescaped
  identifier in regex (add re.escape)
  - Fix configuration_scanner.py cache read outside lock causing
  KeyError on concurrent eviction (use .get())

  Convention fixes:
  - Fix test_java_cca.py _extract docstring referencing
  search_for_called_function
  - Fix test_go_parser.py docstrings referencing fix history instead of
  describing behavior

  Tests:
  - Add SerpAPI extra_forbidden validation test
  - Add Gitiles yarl.URL encoding preservation tests
  - Add _count_call_args unbalanced angle bracket tests (comparison,
  bit shift, ternary)
  - Add normalize_package_name mixed-case preservation tests
  - Add is_comment_line generator method vs block comment tests
  - Add is_package_imported regex escape and substring rejection tests
  - Add configuration scanner cache eviction safety tests

Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
  packages are searched before library packages
  - Fixes nondeterministic timeouts in JS transitive search
  (test_java_script_transitive_search_1 hung ~80% of runs)
  - Root cause: parent order from _get_parents was nondeterministic;
  when a library package (e.g. @cyclonedx/cyclonedx-library) was
  iterated first, DFS entered intra-package call chains (package lists
  itself as own parent) and explored hundreds of branches before
  finding the root caller
  - No search paths removed — only iteration order changed to
  prioritize root_project
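
  The reordering can be sketched as a sort that moves the root project to the front without dropping any parent. The helper name `order_parents` and the alphabetical tie-break (added here to make the order deterministic) are assumptions, not details taken from `__find_caller_function_dfs`.

```python
def order_parents(parents, root_project: str) -> list[str]:
    """Return all parents, root project first, the rest in a stable order."""
    # False sorts before True, so the root project comes first; the name is a
    # secondary key so the result does not depend on the input order.
    return sorted(parents, key=lambda name: (name != root_project, name))
```

  Because the key is fully determined by the names, the same result is produced even when the input is an unordered set.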

Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
  quality, misc fixes

  - Add argument-count pre-filter to C parser's
  search_for_called_function to reject
    cross-package false positives where a same-named function has
  different arity
    (e.g. rsync read_byte(1 param) vs PostgreSQL read_byte macro with 3
  args)
  - Add _count_c_declared_params and _count_call_site_args helpers for
  the pre-filter
  - Update search_for_called_function signature: callee_function is now
  a positional
    parameter (was keyword-only callee_function:Document = None)
  - Add 6 new test cases for argument-count filtering (match, mismatch,
  variadic,
    no-callee, void-param) and 12 tests for the counting helpers
  - Parallelize source JAR extraction using ThreadPoolExecutor with
  cgroup-aware
    worker count (_available_cpus helper reads /sys/fs/cgroup/cpu.max)
  - Add -Dmaven.artifact.threads=10 to all mvn
  dependency:copy-dependencies,
    clean install, and depgraph-maven-plugin invocations in dep_tree.py
  - Set MAVEN_OPTS with -Dmaven.repo.local on shared PVC in
    and exploit_iq_service.yaml for persistent Maven cache across runs
  - Add "AVOID UNANSWERABLE QUESTIONS" section to checklist prompt to
  prevent
    runtime-state questions that static analysis tools cannot answer
  - Shorten cve_verify_vuln_package LLM response instruction to one
  sentence
  - Update test_agent.py: _build_observation_context now takes
  critical_context list
    parameter; add test_crit_context_merged_into_knowledge; update
  pre_process_node
    assertions for critical_context field
  - Add CVE-2025-48734 comment on test_transitive_search_java_1
  - Remove section-separator comments from test_c_parser.py

Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
tmihalac force-pushed the CCA-Argument-Count-Pre-filter branch from 023d7a3 to bde7996 on June 28, 2026 20:37
Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
@tmihalac

Collaborator Author

/test vulnerability-analysis-on-pr

1 similar comment
@tmihalac

Collaborator Author

/test vulnerability-analysis-on-pr

@tmihalac

Collaborator Author

/test vulnerability-analysis-on-pr

1 similar comment
@tmihalac

Collaborator Author

/test vulnerability-analysis-on-pr

@tmihalac

Collaborator Author

/test-heavy

tmihalac added 4 commits June 29, 2026 10:13
Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
  GUARD

  - Emit TARGET_IN_VULNERABLE_RANGE (YES/NO) in
  VulnerabilityIntel.format_for_prompt()
    so the L1 agent can see the field referenced by the VERSION-BASED
  FALLBACK rules
  - Add VERSION GUARD clause to Case B sys prompt CONCLUSION section:
  when
    TARGET_IN_VULNERABLE_RANGE is YES, a grep match alone is not
  sufficient to
    conclude PATCHED — the fix must be at the exact CVE location, not a
  similar
    pre-existing check
  - Add VERSION GUARD phase to Case B thought instructions PHASE 3
  VERDICT:
    when fix pattern was found but target is in vulnerable range,
  verify the match
    is the exact CVE fix or conclude VULNERABLE (version-based)
  - Add comment explaining the CVE-2024-48957/libarchive triggering
  case for the
    Case B prompt changes
  - Add 21 tests in test_vulnerability_intel_format.py covering
  format_for_prompt()
    field emission (TARGET_IN_VULNERABLE_RANGE, downstream patch,
  vulnerable/fix
    patterns, bitness, ordering),
  select_upstream_prompt_and_instructions routing,
    and Case B prompt content assertions

Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
  build/tool config patterns

  - Add _CONFIG_DIR_ALLOWED_EXTENSIONS allowlist for files matched only
  by directory
  - Fixes Keycloak issue where .js/.css/.map files under resources/
  were collected
  - Add build/tool config patterns: pyproject.toml, setup.cfg, tox.ini,
  tsconfig.json,
    .eslintrc.json, Makefile, CMakeLists.txt, meson.build
  - Extensionless files in config dirs still accepted
  - Update test_collects_files_in_config_dir to use .xml instead of
  .txt
  - Add 10 new tests for allowlist, build tool patterns, and negative
  cases

Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
  improvements

  - Merge 9 in-tree test files into canonical tests/ location, delete
  in-tree copies:
    test_configuration_scanner, test_credential_client,
  test_import_usage_analyzer,
    test_javascript_functions_parser, test_source_code_git_loader,
    test_transitive_detection, test_version_check,
  test_vulnerability_intel_sanitizer,
    test_web_patch_fetcher
  - Config scanner: add _CONFIG_DIR_ALLOWED_EXTENSIONS allowlist for
  directory-matched
    files, filtering out .js/.css/.map/.java/.py etc. from config
  collection
  - Config scanner: add build/tool config patterns (pyproject.toml,
  setup.cfg, tox.ini,
    tsconfig.json, .eslintrc.json, Makefile, CMakeLists.txt,
  meson.build)
  - Add MAVEN_OPTS with persistent repo cache to on-pull-request.yaml
  - Add 30+ new tests across merged files covering allowlist
  enforcement, build tool
    patterns, dependency manifests negative cases, and unique tests
  from in-tree files

Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
@tmihalac

Collaborator Author

/test vulnerability-analysis-on-pr

@tmihalac

Collaborator Author

/test-heavy

Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
@tmihalac

Collaborator Author

/test-heavy

  uber-jar threshold

  - Add Go sub-package awareness to Function Locator
  (_extract_go_subpackage, _go_subpackage_flow_control)
  - Add CCA sub-package filtering via resolve_subpackage_to_module in
  Go parser and tree fallback in chain_of_calls_retriever
  - Add Go sub-package enrichment from patches in intel_utils
  (extract_go_subpackages_from_patch)
  - Pass candidate_packages to enrich_vulnerable_functions_from_patch
  in cve_agent
  - Add base resolve_subpackage_to_module to LangFunctionsParser
  (returns None for non-Go)
  - Add GOCACHE env var to on-cm-runner.yaml, on-pull-request.yaml,
  exploit_iq_service.yaml
  - Revert uber_jar_file_threshold from 1000 back to 600 in all config
  files
  - Add 38 tests for Go sub-package fixes (FL feedback, CCA filtering,
  intel enrichment)

Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
@tmihalac

Collaborator Author

/test-heavy

tmihalac added 2 commits June 30, 2026 18:13
Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
@tmihalac

Collaborator Author

/test vulnerability-analysis-on-pr

- Change build_short_go_package_name to store list of packages per short name instead of overwriting
- Search all matching packages in locate_functions via any() when short name resolves to multiple packages
- Add test for unrelated packages sharing the same short name (github.com/foo/util vs github.com/bar/util)
- Update existing short_name tests to expect list values and verify both versions preserved on collision
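
The list-per-short-name change can be sketched as follows. `build_short_name_index` is a hypothetical stand-in for `build_short_go_package_name`, and it ignores versioned module suffixes such as `/v2`, which the real code handles separately.

```python
from collections import defaultdict


def build_short_name_index(packages: list[str]) -> dict[str, list[str]]:
    """Map each short name (last path segment) to every package that uses it."""
    index: dict[str, list[str]] = defaultdict(list)
    for pkg in packages:
        short = pkg.rstrip("/").rsplit("/", 1)[-1]
        if pkg not in index[short]:
            # Append instead of overwriting so colliding packages are all kept.
            index[short].append(pkg)
    return dict(index)
```

A lookup then becomes a check over all candidates, for example `any(has_function(p) for p in index["util"])`, where `has_function` is whatever per-package search the caller already uses.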

Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
@tmihalac

Collaborator Author

/test-heavy

  - Remove C-L, C-H, C-M, A-H, B-M coverage-tracking labels from
  comments, docstrings, and section headers
  - Remove "Coverage gap tests:" and "Fix A/B/C:" prefixes from test
  section comments
  - Keep descriptive text, only strip the label identifiers
  - 22 files cleaned across tests/ and src/*/tests/

Signed-off-by: Theodor Mihalache <tmihalac@redhat.com>
@tmihalac

tmihalac commented Jul 1, 2026

Collaborator Author

/test-heavy

@tmihalac tmihalac requested a review from zvigrinberg July 1, 2026 08:37
tmihalac changed the title from "Add CCA import-based pre-filter, cycle detection, and log lazy formatting" to "CCA pre-filters (Java import, C arg-count, Go sub-package), JS tree-sitter rewrite, RPM version guard, ~40 bug fixes, ~200 new tests" on Jul 1, 2026
tmihalac changed the title from "CCA pre-filters (Java import, C arg-count, Go sub-package), JS tree-sitter rewrite, RPM version guard, ~40 bug fixes, ~200 new tests" to "CCA pre-filters (Java import, C arg-count, Go sub-package), CCA cycle detection & perf, RPM version guard, ~40 bug fixes, ~200 new tests" on Jul 1, 2026
tmihalac changed the title from "CCA pre-filters (Java import, C arg-count, Go sub-package), CCA cycle detection & perf, RPM version guard, ~40 bug fixes, ~200 new tests" to "CCA pre-filters (Java import, C arg-count, Go sub-package), Java CCA cycle detection & perf, RPM version guard, ~40 bug fixes, ~200 new tests" on Jul 1, 2026

@zvigrinberg zvigrinberg left a comment

Collaborator

Hi @tmihalac ,

Please see the comments below.

          return self.common_flow_control(function, package_docs)
      case Ecosystem.GO.value:
-         return self.common_flow_control(function, package_docs)
+         return self._go_subpackage_flow_control(function, package_docs, package)

Collaborator

Bug: module_path receives unresolved short name — sub-package disambiguation is dead code

The refactoring that introduced packages_to_search stopped reassigning package to the resolved full module path. This call:

return self._go_subpackage_flow_control(function, package_docs, package)

passes the original short name (e.g., 'protojson') as module_path. Inside _extract_go_subpackage, the check dir_path.startswith(norm_module + '/') never matches a short name against a full source path like google.golang.org/protobuf/encoding/protojson/.... Every function falls through to return module_path, collapsing all sub-packages into one bucket — so len(subpkg_to_funcs) <= 1 is always true and the multi-sub-package logic never triggers.

Fix: Pass the resolved module path(s) from packages_to_search:

return self._go_subpackage_flow_control(function, package_docs, packages_to_search[0])

Or iterate over all resolved paths if multiple are possible.
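
To make the failure mode concrete, here is a minimal sketch of the prefix check described above. `extract_subpackage` is a hypothetical stand-in for `_extract_go_subpackage`, reconstructed only from the behaviour described in this comment, and the file name `decode.go` is illustrative; the real helper may differ.

```python
import posixpath


def extract_subpackage(source_path: str, module_path: str) -> str:
    """Hypothetical stand-in: return the directory of source_path if it lies
    under module_path, otherwise fall back to module_path itself."""
    dir_path = posixpath.dirname(source_path)
    norm_module = module_path.rstrip("/")
    if dir_path == norm_module or dir_path.startswith(norm_module + "/"):
        return dir_path
    return module_path


# Illustrative source path; only the directory layout matters here.
path = "google.golang.org/protobuf/encoding/protojson/decode.go"

# Short name: the prefix never matches, so every doc lands in one bucket.
short = extract_subpackage(path, "protojson")
# Resolved module path: the sub-package directory is recovered.
resolved = extract_subpackage(path, "google.golang.org/protobuf")
```

With the short name every document maps to the same key, which is why the multi-sub-package branch would never be reached.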


def _count_call_site_args(function_body: str, func_name: str) -> int | None:
"""Count arguments at the first call site of func_name(...) in function_body.
Returns None if the call site cannot be parsed."""

Bug: Comma counting ignores string/char literals — causes false rejections

_count_call_site_args iterates characters tracking only parenthesis depth:

for ch in args_str:
    if ch == '(':
        depth += 1
    elif ch == ')':
        depth -= 1
    elif ch == ',' and depth == 0:
        count += 1

This doesn't skip commas inside string or character literals. A call like func("error: a,b", x) is counted as 3 args instead of 2, causing a spurious mismatch against the declared parameter count and rejecting a valid call chain.

This is a common pattern in C code (format strings, error messages like fprintf(stderr, "expected %d, got %d", a, b)).
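
One possible shape for a literal-aware counter is sketched below. `count_top_level_args` is a hypothetical helper, not the PR's function, and it assumes it receives only the text between the call's outer parentheses. It does not handle C comments or brace initializers.

```python
def count_top_level_args(args_str: str) -> int:
    """Count comma-separated arguments, skipping commas inside nested
    parentheses and inside string or character literals."""
    if not args_str.strip():
        return 0
    depth = 0
    count = 1
    quote = None  # the active quote character, or None outside literals
    i = 0
    while i < len(args_str):
        ch = args_str[i]
        if quote:
            if ch == "\\":
                i += 1  # skip the escaped character
            elif ch == quote:
                quote = None
        elif ch in ('"', "'"):
            quote = ch
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            count += 1
        i += 1
    return count
```

For the `fprintf` example above, `count_top_level_args('stderr, "expected %d, got %d", a, b')` counts four arguments, because the comma inside the format string is skipped.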

The Java counterpart _count_call_args has a related issue where < in comparison expressions (e.g., a < b) is treated as a generic angle-bracket opener.

)
result = list(close_matches)
result.append(guidance)
return result

Type mixing: Guidance string appended to function-name list

result = list(close_matches)
result.append(guidance)
return result

This appends a multi-line INFO string (e.g., "INFO: Matched functions exist in multiple sub-packages of 'module':\n subpkg1: func1\n...") to a list that otherwise contains only function names. The caller locate_functions returns this as "result": result, documented as [function_names].

While the LLM consumer may parse this gracefully, any programmatic downstream code that iterates over result treating every element as a valid function name will produce incorrect lookups or regex errors when it hits the guidance text (which contains spaces, colons, and newlines).

Suggestion: Return the guidance separately:

return {"functions": list(close_matches), "guidance": guidance}

Or log it instead of embedding it in the return value.

params["api_key"] = self._rotate_next_key()
else:
raise
self.__class__._serp_api_key_index = 0

Race condition: Resetting shared _serp_api_key_index outside the lock

self.__class__._serp_api_key_index = 0
raise Exception("All API keys exhausted")

_serp_api_key_index is a ClassVar shared across all instances. All other mutations (validate_serp_api_keys, _rotate_next_key) properly guard writes with _key_rotation_lock, but this reset happens outside the lock.

If a concurrent caller is mid-rotation (e.g., on key index 3 of 5), this unlocked reset to 0 causes it to re-try already-exhausted keys (getting 402/429 loops) or skip keys it hasn't tried.

Fix: Move inside the lock:

with self.__class__._key_rotation_lock:
    self.__class__._serp_api_key_index = 0
raise Exception("All API keys exhausted")

def get_possible_docs(self, function_name_to_search: str, package: str, exclusions: list[Document],
sources_location_packages: bool,
target_class_names: frozenset[str],
method_exclusions: dict) -> (list[Document], bool):

Nit: Pre-index doesn't achieve algorithmic speedup — still O(unique_paths) linear scan

candidates = [doc for path, docs in self._source_path_index.items()
              if package in path for doc in docs]

This iterates all unique source paths doing a substring match — same algorithmic complexity as the old linear scan over all documents. The reduction from N (total docs) to U (unique paths) is a constant-factor improvement when files yield many functions, but it's not the O(1) lookup the pre-indexing pattern suggests.

Not a blocker — the constant-factor improvement plus the reordered search_token check before expensive is_function/_is_doc_excluded calls is a net positive. But if this becomes a bottleneck, an inverted index on path segments (mapping each package name to matching paths) would give true O(1) lookup.
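
A rough sketch of such an inverted index follows. It assumes the package name is a single path segment, which is narrower than the current substring match, and it uses plain strings as placeholders for `Document` objects. `build_segment_index` is a hypothetical name.

```python
from collections import defaultdict


def build_segment_index(paths_to_docs: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map every path segment to the docs whose source path contains it."""
    index: dict[str, list[str]] = defaultdict(list)
    for path, docs in paths_to_docs.items():
        # set() so a segment repeated within one path adds its docs only once
        for segment in set(path.split("/")):
            index[segment].extend(docs)
    return index


# Placeholder data standing in for _source_path_index.
source_index = {
    "vendor/github.com/lib/foo/a.go": ["foo.A"],
    "vendor/github.com/lib/bar/b.go": ["bar.B"],
}
segment_index = build_segment_index(source_index)
candidates = segment_index.get("foo", [])  # single dict lookup, no scan
```

The trade-off is that multi-segment package names such as `github.com/lib/foo` would need a different key scheme (for example indexing path prefixes), so this only pays off if the lookup key is known to be one segment.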

Collaborator

A few questions before we merge the VERSION GUARD changes:

  1. What is the basis for this change?
    Can you provide a specific CVE case where the current behavior caused an incorrect verdict? Specifically:
    Without a concrete example demonstrating the problem, it's hard to evaluate whether this change is necessary or correctly scoped.

Did you run the modified prompts against actual RPM checker scenarios to verify:

In my experience, LLMs frequently ignore or misinterpret prompt instructions — especially conditional logic like "if X then require Y". Adding a VERSION GUARD clause sounds reasonable in theory, but:

The model may still conclude PATCHED on a grep match alone
The model may become overly conservative and mark everything VULNERABLE
The interaction between this guard and other prompt rules is unpredictable

Suggestion: Before merging:

  1. Find a real use case where this change is needed.
  2. Verify that the prompt actually works and that the LLM does not ignore it.
  3. Run a regression test. I have a dataset of 10 cases that I can send, to check that the changes do not introduce regressions.

@zvigrinberg zvigrinberg left a comment

Another cycle of review

Comment on lines +738 to +742
if stripped.startswith('*'):
rest = stripped[1:].lstrip()
if rest and (rest[0].isalnum() or rest[0] in ('$', '_', '[')):
return False
return True

High: is_comment_line misclassifies most JSDoc continuation lines as non-comment

A line like * Returns the cached value has rest='Returns...' and rest[0]='R' is alphanumeric, so the method returns False. This means the majority of JSDoc prose lines survive the comment filter. At line 207, unfiltered JSDoc prose is scanned for call patterns, producing false-positive caller-callee edges when prose mentions function names (e.g., * Delegates to parseJSON).

The intent was to distinguish *method() generator syntax from * JSDoc text, but the heuristic catches far too much.

Suggestion: Check whether the line is inside a /* ... */ block comment rather than inspecting the character after *. Or use a narrower heuristic — generator methods start with * immediately followed by an identifier without a space:

if stripped.startswith('*'):
    rest = stripped[1:]
    # Generator syntax: *methodName() — no space after *
    if rest and not rest[0].isspace():
        return False
    # JSDoc continuation: * some text — space after *
    return True
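
The first option (tracking whether a line is inside a block comment) could look roughly like the sketch below. `comment_line_flags` is a hypothetical helper, not the parser's method. It assumes block comments start at the beginning of a line, and it ignores `/*` inside string literals and any code that follows `*/` on the same line.

```python
def comment_line_flags(lines: list[str]) -> list[bool]:
    """Return one flag per line: True if the line is a // comment or lies
    inside (or opens) a /* ... */ block comment."""
    flags = []
    in_block = False
    for line in lines:
        stripped = line.strip()
        if in_block:
            flags.append(True)
            if "*/" in stripped:
                in_block = False
        elif stripped.startswith("/*"):
            flags.append(True)
            # Stay in block mode unless the comment closes on the same line.
            in_block = "*/" not in stripped[2:]
        else:
            flags.append(stripped.startswith("//"))
    return flags
```

With this approach a generator method such as `*items() { ... }` outside any block comment is kept as code, while ` * Delegates to parseJSON` inside a JSDoc block is filtered, without inspecting the character after `*`.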

Comment on lines 652 to 654
if matching and matching.group(0):
import_line = code_content[matching.start():]
import_package_line = import_line[:import_line.find(os.linesep)].strip()

High: is_package_imported regex adds trailing .*, matching identifier anywhere in import path

Old regex: import ['"].*{identifier}['"] — identifier at end of path.
New regex: import ['"].*{esc_id}.*['"] — identifier anywhere in path.

So identifier='json' now matches import "github.com/json-iterator/go" because json appears mid-path. This creates false-positive import matches and allows unrelated packages to pass the call-chain filter.

The re.escape(identifier) fix is correct and needed — but the trailing .* changes the matching semantics.

Suggestion: Keep the anchor at the end, just add the escape:

esc_id = re.escape(identifier)
matching = re.search(rf"import [\'\"].*{esc_id}[\'\"]", code_content)

Or if you need to match identifier as a path segment (not just at the end), use a more precise pattern:

matching = re.search(rf"import [\'\"].*[/]{esc_id}[\'\"]", code_content)
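
The difference between the two patterns can be checked in isolation. This is a minimal sketch using only the `re` module and the import line quoted in this comment; the `encoding/json` line is an added illustrative case.

```python
import re

identifier = "json"
esc_id = re.escape(identifier)
code_content = 'import "github.com/json-iterator/go"\n'

# Identifier anchored at the end of the import path (old semantics plus escaping).
anchored = re.search(rf"import ['\"].*{esc_id}['\"]", code_content)
# Identifier allowed anywhere in the path (new semantics with the trailing .*).
unanchored = re.search(rf"import ['\"].*{esc_id}.*['\"]", code_content)

# An import whose path really ends in the identifier still matches when anchored.
anchored_ok = re.search(rf"import ['\"].*{esc_id}['\"]", 'import "encoding/json"\n')
```

Here `anchored` is `None` while `unanchored` is a match object, which is the false positive described above.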

Collaborator

@zvigrinberg This comment addresses existing code, not newly added code, so we only need to adopt the newly required escaping of the identifier and keep the greedy quantifier * at the end as is.

Comment on lines +832 to +840
elif ch == '>' and depth_a > 0:
depth_a -= 1
elif ch == ',' and depth_p == 0 and depth_b == 0:
commas_without_angles += 1
if depth_a == 0:
commas_with_angles += 1
if depth_a == 0:
return commas_with_angles + 1
return commas_without_angles + 1

High: _count_call_args miscounts when </> are comparison operators balancing across a comma

For assertTrue(x < 10, y > 5):

  1. < increments depth_a to 1
  2. At the comma, depth_a != 0, so only commas_without_angles is incremented (not commas_with_angles)
  3. > decrements depth_a back to 0
  4. Since depth_a == 0, returns commas_with_angles + 1 = 1 instead of the correct 2

This causes the arg-count pre-filter to reject valid call sites wherever assertions, comparisons, or ternaries use </> across argument boundaries.

Suggestion: Use a two-pass approach — first try treating </> as angle brackets. If depth_a goes negative at any point (a > without a preceding <), restart treating all </> as operators and count only parenthesis/bracket depth:

# If depth_a ever goes negative, it means > was a comparison, not a bracket close.
# Fall back to ignoring angle brackets entirely.
if depth_a < 0:
    return _count_ignoring_angles(s, open_idx, close_idx)

pattern = re.compile(r'\b' + re.escape(func_name) + r'\s*\(')
m = pattern.search(function_body)
if not m:
return None

Medium: C arg-count pre-filter checks only the first call site — rejects valid matches when a same-named local function appears first

_count_call_site_args uses pattern.search(function_body) (first match only). The docstring confirms: "Count arguments at the first call site."

If a caller has init() (0 args, local helper) textually before init(ctx, cfg) (2 args, the real callee), re.search finds the wrong one. The arity mismatch causes return False, rejecting a valid caller-callee edge.

Suggestion: Use re.finditer and check ALL call sites — accept if ANY matches the declared param count:

def _count_call_site_args(function_body: str, func_name: str) -> list[int]:
    """Count arguments at all call sites of func_name(...) in function_body."""
    pattern = re.compile(r'\b' + re.escape(func_name) + r'\s*\(')
    counts = []
    for m in pattern.finditer(function_body):
        count = _count_args_from_match(function_body, m)
        if count is not None:
            counts.append(count)
    return counts

Then in the caller: if declared not in call_arg_counts: return False
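
For reference, a self-contained variant of the same idea is sketched below, with the per-match counting inlined instead of delegated to a helper. `call_site_arg_counts` is a hypothetical name, and the sketch deliberately ignores string literals (the separate issue raised for the comma counter), so it is only meant to show the all-call-sites check.

```python
import re


def call_site_arg_counts(function_body: str, func_name: str) -> list[int]:
    """Return the argument count of every call site of func_name(...)."""
    pattern = re.compile(r"\b" + re.escape(func_name) + r"\s*\(")
    counts = []
    for m in pattern.finditer(function_body):
        depth, commas, seen_arg = 1, 0, False
        for ch in function_body[m.end():]:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
                if depth == 0:
                    # Empty parentheses mean zero arguments.
                    counts.append(commas + 1 if seen_arg else 0)
                    break
            elif ch == "," and depth == 1:
                commas += 1
            if not ch.isspace():
                seen_arg = True
    return counts


body = "init();\nsetup();\ninit(ctx, cfg);"
counts = call_site_arg_counts(body, "init")
accepted = 2 in counts  # accept if ANY call site matches the declared arity
```

Call sites whose closing parenthesis is never found contribute nothing to the list, so an empty result should be treated as "cannot tell" rather than as a mismatch.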

Comment on lines +272 to +274
norm_module = module_path.rstrip("/")
if dir_path == norm_module or dir_path.startswith(norm_module + "/"):
return dir_path

Medium: _go_subpackage_flow_control doesn't catch ValueError from get_function_name, unlike python_flow_control fixed in the same PR

This PR adds try/except ValueError: continue to python_flow_control (line ~188 of the diff) but the new _go_subpackage_flow_control calls get_function_name unguarded:

function_name = self.lang_parser.get_function_name(doc)

Go's get_function_name raises ValueError on malformed function headers (line 416 in golang_functions_parsers.py: raise ValueError(f"Invalid function header")). A single malformed document crashes the entire locate_functions call.

Suggestion: Add the same guard:

for doc in package_docs:
    if self.lang_parser:
        try:
            function_name = self.lang_parser.get_function_name(doc)
        except ValueError:
            continue
        if function_name:
            func_to_docs[function_name].append(doc)

Comment on lines +494 to +495
if not path.endswith(".go"):
continue

Suggested change
if not path.endswith(".go"):
continue
go_func_parser = GoLanguageFunctionsParser()
extensions = go_func_parser.supported_files_extensions()
if not any(path.endswith(ext) for ext in extensions):
    continue


4 participants