29 changes: 15 additions & 14 deletions docs/BENCHMARKS.md

@@ -49,7 +49,7 @@ The ranking below is ordered by **practical verification value first**: broad ec
| 0c | Cross-cutting rootfs / shipped-artifact snapshot scans (non-parser reference) | 🟢 Verified | Debian base-image rootfs snapshot (3,267 files)<br>Fedora base-image rootfs snapshot (1,579 files)<br>official Alpine minirootfs snapshot (84 files) | These targets simultaneously exercise distro metadata, package DB/archive surfaces, package-adjacent files, and common-profile detection on unpacked system trees. They complement, but do not replace, the Debian, RPM, Alpine, Linux Distro, and Windows Update family rows. Debian `bookworm-slim`, Fedora 26 rootfs, and Alpine 3.23.3 minirootfs are now verified references for installed-package coverage, rootfs package-adjacent metadata, and common-profile text-detection triage. |
| 0d | Cross-cutting filesystem-scale native source-tree scans (non-parser reference) | ⚪ Planned | `torvalds/linux` (100k files)<br>`rust-lang/rust` (8k files) | Use this lane when traversal robustness matters more than parser breadth. `torvalds/linux` is the extreme large native-tree and sparse-manifest case with lots of COPYING/README-style text noise, while `rust-lang/rust` adds a mixed Cargo-plus-bootstrap native layout. Watch generated/build artifacts, vendored/bootstrap directories, and common-profile deltas that are really tree-shape issues rather than parser regressions. |
| 0e | Cross-cutting licensing-edge-case repository scans (non-parser reference) | 🟢 Verified | `nmap/nmap` (500–2k files)<br>`ffmpeg/ffmpeg` (10,200 files)<br>`mongodb/mongo` (11k files) | Use this lane when the main goal is license-classification accuracy rather than parser breadth. `nmap/nmap @ d9199d7` is now a verified reference: the important NPSL/source-available regressions were fixed, Provenant’s NPSL rebucketing on Nmap-owned reference-notice files was kept as the more correct result, and the remaining ScanCode-only license tail was reviewed down to weak translated-manpage GPL bare-word duplicates plus low-signal placeholder detections rather than stronger ScanCode-better findings. `mongodb/mongo @ d6877a33` is now also a verified reference: the package-value tail is reviewed down to richer namespace/PURL identity on Provenant’s side, and the remaining license tail stays in lockfile/SBOM rebucketing rather than SSPL/source-available misses. `ffmpeg/ffmpeg @ 056562a5` is now also a verified reference: the file-level Autotools `configure` identity was aligned with ScanCode, Provenant kept one extra top-level Autotools package as the better result, and the remaining license tail was reviewed down to weak `configure` variable-name / bare-word GPL hits rather than stronger ScanCode-better evidence. |
- | 1 | npm / yarn / pnpm (+ Bun) | 🟢 Verified | `npm/cli` (500–2k files)<br>`yarnpkg/berry` (500–2k files)<br>`vercel/next.js` (5k files)<br>`oven-sh/bun` (500–2k files)<br>`microsoft/vscode` (3k files) | Highest-value JS family. `npm/cli` is the npm-first manifest/lock/workspace reference, `yarnpkg/berry` covers modern Yarn metadata, `oven-sh/bun` covers Bun lockfile variants, and `vercel/next.js` plus `microsoft/vscode` add large TypeScript monorepo realism. Watch package-manager-specific lockfile and workspace-assembly mismatches before blaming generic README, vendored, or generated JS noise. |
+ | 1 | npm / yarn / pnpm (+ Bun) | 🟢 Verified | `npm/cli` (500–2k files)<br>`yarnpkg/berry` (500–2k files)<br>`vercel/next.js` (5k files)<br>`oven-sh/bun` (500–2k files)<br>`microsoft/vscode` (3k files) | Highest-value JS family. `npm/cli` is the npm-first manifest/lock/workspace reference, `yarnpkg/berry` covers modern Yarn metadata, `oven-sh/bun` covers Bun lockfile variants, and `vercel/next.js` plus `microsoft/vscode` add large TypeScript monorepo realism. `oven-sh/bun @ 700fc117` is now a verified Bun reference: legacy seven-field `bun.lockb` parsing was aligned with Bun's own loader expectations, Provenant kept much broader Bun/npm-family package extraction, and the remaining reviewed tail is down to one over-specific `BSD-2-Clause-Views` vs plain `BSD-2-Clause` rebucketing plus compare-normalization artifacts on otherwise matching raw `package.json` dependencies. Watch package-manager-specific lockfile and workspace-assembly mismatches before blaming generic README, vendored, or generated JS noise. |
| 2 | Python / PyPI | ⚪ Planned | `pandas-dev/pandas` (1.2k files)<br>`scipy/scipy` (1.3k files)<br>`django/django` (2.5k files)<br>`python-poetry/poetry` (500–2k files)<br>`astral-sh/uv` (500–2k files) | Broad Python family with both classic and modern metadata. `pandas-dev/pandas`, `scipy/scipy`, and `django/django` add realistic mixed source/doc/test trees, while `python-poetry/poetry` and `astral-sh/uv` cover Poetry- and uv-era lockfile/group behavior. Watch interactions between `pyproject.toml`, legacy setup metadata, extras/groups, and large doc/test subtrees that can dominate common-profile deltas. |
| 3 | Maven / Java | ⚪ Planned | `apache/maven` (500–2k files)<br>`apache/camel` (2k–10k files)<br>`spring-projects/spring-boot` (2k–10k files)<br>`apache/felix-dev` (2k–10k files) | High-value JVM lane. `apache/maven` is the clearest parent/module inheritance reference, `apache/camel` and `spring-projects/spring-boot` stress large nested multi-module builds, and `apache/felix-dev` adds OSGi plus `MANIFEST.MF` bundle metadata. Watch inherited metadata, nested-module aggregation, and bundle-manifest extraction rather than treating every Java delta as leaf-`pom.xml` parsing failure. |
| 4 | Go | ⚪ Planned | `containerd/containerd` (2k–10k files)<br>`go-gitea/gitea` (2k–10k files)<br>Go build-info sample binaries via local `--target-path` lane | Use both source and binary lanes here. `containerd/containerd` and `go-gitea/gitea` cover large real-world module graphs, while the local binary lane is the only way to verify embedded Go build info that repo scans cannot see. Watch nested modules, vendored trees, and source-versus-binary coverage gaps explicitly during compare review. |
23 changes: 17 additions & 6 deletions src/parsers/bun_lockb.rs
@@ -14,6 +14,8 @@ pub struct BunLockbParser;

const HEADER_BYTES: &[u8] = b"#!/usr/bin/env bun\nbun-lockfile-format-v0\n";
const SUPPORTED_FORMAT_VERSION: u32 = 2;
+ const FIELD_COUNT_WITHOUT_SCRIPTS: usize = 7;
+ const FIELD_COUNT_WITH_SCRIPTS: usize = 8;
const PACKAGE_FIELD_LENGTHS: [usize; 8] = [8, 8, 64, 8, 8, 88, 20, 48];
const DEPENDENCY_ENTRY_SIZE: usize = 26;

@@ -125,10 +127,10 @@ pub(crate) fn parse_bun_lockb(bytes: &[u8]) -> Result<PackageData, String> {
}

let field_count = cursor.read_u64()? as usize;
- if field_count != PACKAGE_FIELD_LENGTHS.len() {
+ if field_count != FIELD_COUNT_WITHOUT_SCRIPTS && field_count != FIELD_COUNT_WITH_SCRIPTS {
return Err(format!(
- "Unexpected bun.lockb package field count {}",
- field_count
+ "Unexpected bun.lockb package field count {} (supported: {} or {})",
+ field_count, FIELD_COUNT_WITHOUT_SCRIPTS, FIELD_COUNT_WITH_SCRIPTS
));
}

@@ -141,7 +143,7 @@ pub(crate) fn parse_bun_lockb(bytes: &[u8]) -> Result<PackageData, String> {
return Err("Invalid bun.lockb package section bounds".to_string());
}

- let mut packages = parse_packages(bytes, list_len, packages_begin, packages_end)?;
+ let mut packages = parse_packages(bytes, list_len, field_count, packages_begin, packages_end)?;
cursor.pos = packages_end;
let buffers = parse_buffers(bytes, &mut cursor, total_buffer_size)?;
materialize_packages(&mut packages, buffers.string_bytes)?;
@@ -152,6 +154,7 @@ pub(crate) fn parse_bun_lockb(bytes: &[u8]) -> Result<PackageData, String> {
fn parse_packages(
bytes: &[u8],
list_len: usize,
+ field_count: usize,
packages_begin: usize,
packages_end: usize,
) -> Result<Vec<BunLockbPackage>, String> {
@@ -175,7 +178,8 @@ fn parse_packages(
.get(packages_begin..packages_end)
.ok_or_else(|| "Invalid bun.lockb package region".to_string())?;

- let expected_size: usize = PACKAGE_FIELD_LENGTHS.iter().sum::<usize>() * list_len;
+ let expected_size: usize =
+ PACKAGE_FIELD_LENGTHS[..field_count].iter().sum::<usize>() * list_len;
if package_region.len() < expected_size {
return Err("bun.lockb package region is truncated".to_string());
}
@@ -213,7 +217,14 @@ fn parse_packages(
field_offset += 88;
}

- let _ = field_offset + 20 * list_len + 48 * list_len;
+ field_offset += 20 * list_len;
+ if field_count == FIELD_COUNT_WITH_SCRIPTS {
+ field_offset += 48 * list_len;
+ }
+
+ if field_offset != expected_size {
+ return Err("bun.lockb package region layout is malformed".to_string());
+ }

Ok(packages)
}
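
For review purposes, the size arithmetic behind the new seven- and eight-field handling can be sketched in isolation. This is a minimal sketch, not code from the PR: `expected_region_size` is a hypothetical helper name, the constants are copied from the hunks above, and the `checked_mul` overflow guard is an added assumption (the diff itself multiplies unchecked). It assumes, as the `field_offset` arithmetic suggests, that each field is stored as a column of `list_len` fixed-width entries.

```rust
// Per-package column widths in bytes, copied from the diff above.
const PACKAGE_FIELD_LENGTHS: [usize; 8] = [8, 8, 64, 8, 8, 88, 20, 48];
const FIELD_COUNT_WITHOUT_SCRIPTS: usize = 7;
const FIELD_COUNT_WITH_SCRIPTS: usize = 8;

/// Hypothetical helper (not part of the PR): returns the number of bytes the
/// package region must hold for `list_len` packages with `field_count` columns.
fn expected_region_size(field_count: usize, list_len: usize) -> Result<usize, String> {
    // Only the legacy seven-field and the current eight-field layouts are accepted.
    if field_count != FIELD_COUNT_WITHOUT_SCRIPTS && field_count != FIELD_COUNT_WITH_SCRIPTS {
        return Err(format!(
            "Unexpected bun.lockb package field count {} (supported: {} or {})",
            field_count, FIELD_COUNT_WITHOUT_SCRIPTS, FIELD_COUNT_WITH_SCRIPTS
        ));
    }

    // Sum only the widths of the columns that are actually present.
    let bytes_per_package: usize = PACKAGE_FIELD_LENGTHS[..field_count].iter().sum();

    // Assumption: guard the multiplication, since `list_len` comes from untrusted input.
    bytes_per_package
        .checked_mul(list_len)
        .ok_or_else(|| "bun.lockb package region size overflows usize".to_string())
}
```

With these widths a seven-field package occupies 204 bytes and an eight-field package 252 bytes, so a lockfile without the trailing 48-byte scripts column has a correspondingly smaller package region.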
30 changes: 30 additions & 0 deletions src/parsers/bun_lockb_golden_test.rs
@@ -1,10 +1,23 @@
#[cfg(all(test, feature = "golden-tests"))]
mod golden_tests {
+ use base64::Engine;
use std::path::PathBuf;

use crate::parsers::PackageParser;
use crate::parsers::bun_lockb::BunLockbParser;
use crate::parsers::golden_test_utils::compare_package_data_parser_only;
+ use tempfile::TempDir;

+ fn decode_legacy_no_scripts_fixture() -> Vec<u8> {
+ let fixture = PathBuf::from("testdata/bun/legacy/bun.lockb.v2-no-scripts.base64");
+ base64::engine::general_purpose::STANDARD
+ .decode(
+ std::fs::read_to_string(&fixture)
+ .expect("fixture should be readable")
+ .trim(),
+ )
+ .expect("fixture should decode")
+ }

#[test]
fn test_golden_bun_lockb_v2() {
@@ -18,4 +31,21 @@ mod golden_tests {
Err(e) => panic!("Golden test failed: {}", e),
}
}

+ #[test]
+ fn test_golden_bun_lockb_v2_without_scripts_field() {
+ let temp_dir = TempDir::new().expect("Failed to create temp dir");
+ let test_file = temp_dir.path().join("bun.lockb");
+ std::fs::write(&test_file, decode_legacy_no_scripts_fixture())
+ .expect("Failed to write decoded bun.lockb fixture");
+ let expected_file =
+ PathBuf::from("testdata/bun/golden/bun-lockb-v2-no-scripts-expected.json");
+
+ let package_data = BunLockbParser::extract_first_package(&test_file);
+
+ match compare_package_data_parser_only(&package_data, &expected_file) {
+ Ok(_) => (),
+ Err(e) => panic!("Golden test failed: {}", e),
+ }
+ }
}
33 changes: 33 additions & 0 deletions src/parsers/bun_lockb_test.rs
@@ -1,5 +1,7 @@
#[cfg(test)]
mod tests {
+ use base64::Engine;

use crate::models::{DatasourceId, PackageType};
use crate::parsers::bun_lockb::parse_bun_lockb;
use crate::parsers::{BunLockbParser, PackageParser};
@@ -20,6 +22,17 @@ mod tests {
(temp_dir, lock_path)
}

+ fn bun_lockb_v2_without_scripts_fixture() -> Vec<u8> {
+ let fixture = PathBuf::from("testdata/bun/legacy/bun.lockb.v2-no-scripts.base64");
+ base64::engine::general_purpose::STANDARD
+ .decode(
+ fs::read_to_string(&fixture)
+ .expect("fixture should be readable")
+ .trim(),
+ )
+ .expect("fixture should decode")
+ }

#[test]
fn test_is_match_bun_lockb_without_sibling_text_lock() {
assert!(BunLockbParser::is_match(&PathBuf::from(
@@ -142,6 +155,26 @@ mod tests {
assert_eq!(workspace_dep.is_direct, Some(true));
}

+ #[test]
+ fn test_parse_bun_lockb_v2_without_scripts_field() {
+ let bytes = bun_lockb_v2_without_scripts_fixture();
+ parse_bun_lockb(&bytes).expect("seven-field bun.lockb should parse");
+
+ let (_temp_dir, lock_path) = create_temp_bun_lockb(&bytes);
+ let package_data = BunLockbParser::extract_first_package(&lock_path);
+
+ assert_eq!(package_data.package_type, Some(PackageType::Npm));
+ assert_eq!(package_data.datasource_id, Some(DatasourceId::BunLockb));
+ assert_eq!(package_data.name.as_deref(), Some("bundle"));
+ assert!(package_data.version.is_none());
+ assert!(
+ package_data
+ .dependencies
+ .iter()
+ .any(|dep| dep.purl.as_deref() == Some("pkg:npm/bun-types@0.5.8"))
+ );
+ }

#[test]
fn test_invalid_bun_lockb_header_returns_default_package() {
let (_temp_dir, lock_path) = create_temp_bun_lockb(b"not-a-bun-lockb");