mstykow · Apr 9, 2026
diff --git a/docs/BENCHMARKS.md b/docs/BENCHMARKS.md
diff --git a/docs/implementation-plans/package-detection/PARSER_VERIFICATION_SCORECARD.md b/docs/implementation-plans/package-detection/PARSER_VERIFICATION_SCORECARD.md
@@ -49,7 +49,7 @@ The ranking below is ordered by **practical verification value first**: broad ec
 | 0c       | Cross-cutting rootfs / shipped-artifact snapshot scans (non-parser reference)  | 🟢 Verified | Debian base-image rootfs snapshot (3,267 files)<br>Fedora base-image rootfs snapshot (1,579 files)<br>official Alpine minirootfs snapshot (84 files)                             | These targets simultaneously exercise distro metadata, package DB/archive surfaces, package-adjacent files, and common-profile detection on unpacked system trees. They complement, but do not replace, the Debian, RPM, Alpine, Linux Distro, and Windows Update family rows. Debian `bookworm-slim`, Fedora 26 rootfs, and Alpine 3.23.3 minirootfs are now verified references for installed-package coverage, rootfs package-adjacent metadata, and common-profile text-detection triage.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
 | 0d       | Cross-cutting filesystem-scale native source-tree scans (non-parser reference) | ⚪ Planned  | `torvalds/linux` (100k files)<br>`rust-lang/rust` (8k files)                                                                                                                     | Use this lane when traversal robustness matters more than parser breadth. `torvalds/linux` is the extreme large native-tree and sparse-manifest case with lots of COPYING/README-style text noise, while `rust-lang/rust` adds a mixed Cargo-plus-bootstrap native layout. Watch generated/build artifacts, vendored/bootstrap directories, and common-profile deltas that are really tree-shape issues rather than parser regressions.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
 | 0e       | Cross-cutting licensing-edge-case repository scans (non-parser reference)      | 🟢 Verified | `nmap/nmap` (500–2k files)<br>`ffmpeg/ffmpeg` (10,200 files)<br>`mongodb/mongo` (11k files)                                                                                      | Use this lane when the main goal is license-classification accuracy rather than parser breadth. `nmap/nmap @ d9199d7` is now a verified reference: the important NPSL/source-available regressions were fixed, Provenant’s NPSL rebucketing on Nmap-owned reference-notice files was kept as the more correct result, and the remaining ScanCode-only license tail was reviewed down to weak translated-manpage GPL bare-word duplicates plus low-signal placeholder detections rather than stronger ScanCode-better findings. `mongodb/mongo @ d6877a33` is now also a verified reference: the package-value tail is reviewed down to richer namespace/PURL identity on Provenant’s side, and the remaining license tail stays in lockfile/SBOM rebucketing rather than SSPL/source-available misses. `ffmpeg/ffmpeg @ 056562a5` is now also a verified reference: the file-level Autotools `configure` identity was aligned with ScanCode, Provenant kept one extra top-level Autotools package as the better result, and the remaining license tail was reviewed down to weak `configure` variable-name / bare-word GPL hits rather than stronger ScanCode-better evidence. |
-| 1        | npm / yarn / pnpm (+ Bun)                                                      | 🟢 Verified | `npm/cli` (500–2k files)<br>`yarnpkg/berry` (500–2k files)<br>`vercel/next.js` (5k files)<br>`oven-sh/bun` (500–2k files)<br>`microsoft/vscode` (3k files)                       | Highest-value JS family. `npm/cli` is the npm-first manifest/lock/workspace reference, `yarnpkg/berry` covers modern Yarn metadata, `oven-sh/bun` covers Bun lockfile variants, and `vercel/next.js` plus `microsoft/vscode` add large TypeScript monorepo realism. Watch package-manager-specific lockfile and workspace-assembly mismatches before blaming generic README, vendored, or generated JS noise.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
+| 1        | npm / yarn / pnpm (+ Bun)                                                      | 🟢 Verified | `npm/cli` (500–2k files)<br>`yarnpkg/berry` (500–2k files)<br>`vercel/next.js` (5k files)<br>`oven-sh/bun` (500–2k files)<br>`microsoft/vscode` (3k files)                       | Highest-value JS family. `npm/cli` is the npm-first manifest/lock/workspace reference, `yarnpkg/berry` covers modern Yarn metadata, `oven-sh/bun` covers Bun lockfile variants, and `vercel/next.js` plus `microsoft/vscode` add large TypeScript monorepo realism. `oven-sh/bun @ 700fc117` is now a verified Bun reference: legacy seven-field `bun.lockb` parsing was aligned with Bun's own loader expectations, Provenant kept much broader Bun/npm-family package extraction, and the remaining reviewed tail is down to one over-specific `BSD-2-Clause-Views` vs plain `BSD-2-Clause` rebucketing plus compare-normalization artifacts on otherwise matching raw `package.json` dependencies. Watch package-manager-specific lockfile and workspace-assembly mismatches before blaming generic README, vendored, or generated JS noise.                                                                                                                                                                                                                                                                                                                                |
 | 2        | Python / PyPI                                                                  | ⚪ Planned  | `pandas-dev/pandas` (1.2k files)<br>`scipy/scipy` (1.3k files)<br>`django/django` (2.5k files)<br>`python-poetry/poetry` (500–2k files)<br>`astral-sh/uv` (500–2k files)         | Broad Python family with both classic and modern metadata. `pandas-dev/pandas`, `scipy/scipy`, and `django/django` add realistic mixed source/doc/test trees, while `python-poetry/poetry` and `astral-sh/uv` cover Poetry- and uv-era lockfile/group behavior. Watch interactions between `pyproject.toml`, legacy setup metadata, extras/groups, and large doc/test subtrees that can dominate common-profile deltas.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
 | 3        | Maven / Java                                                                   | ⚪ Planned  | `apache/maven` (500–2k files)<br>`apache/camel` (2k–10k files)<br>`spring-projects/spring-boot` (2k–10k files)<br>`apache/felix-dev` (2k–10k files)                              | High-value JVM lane. `apache/maven` is the clearest parent/module inheritance reference, `apache/camel` and `spring-projects/spring-boot` stress large nested multi-module builds, and `apache/felix-dev` adds OSGi plus `MANIFEST.MF` bundle metadata. Watch inherited metadata, nested-module aggregation, and bundle-manifest extraction rather than treating every Java delta as leaf-`pom.xml` parsing failure.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
 | 4        | Go                                                                             | ⚪ Planned  | `containerd/containerd` (2k–10k files)<br>`go-gitea/gitea` (2k–10k files)<br>Go build-info sample binaries via local `--target-path` lane                                        | Use both source and binary lanes here. `containerd/containerd` and `go-gitea/gitea` cover large real-world module graphs, while the local binary lane is the only way to verify embedded Go build info that repo scans cannot see. Watch nested modules, vendored trees, and source-versus-binary coverage gaps explicitly during compare review.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |

diff --git a/src/parsers/bun_lockb.rs b/src/parsers/bun_lockb.rs
@@ -14,6 +14,8 @@ pub struct BunLockbParser;
 
 const HEADER_BYTES: &[u8] = b"#!/usr/bin/env bun\nbun-lockfile-format-v0\n";
 const SUPPORTED_FORMAT_VERSION: u32 = 2;
+const FIELD_COUNT_WITHOUT_SCRIPTS: usize = 7;
+const FIELD_COUNT_WITH_SCRIPTS: usize = 8;
 const PACKAGE_FIELD_LENGTHS: [usize; 8] = [8, 8, 64, 8, 8, 88, 20, 48];
 const DEPENDENCY_ENTRY_SIZE: usize = 26;
 
@@ -125,10 +127,10 @@ pub(crate) fn parse_bun_lockb(bytes: &[u8]) -> Result<PackageData, String> {
     }
 
     let field_count = cursor.read_u64()? as usize;
-    if field_count != PACKAGE_FIELD_LENGTHS.len() {
+    if field_count != FIELD_COUNT_WITHOUT_SCRIPTS && field_count != FIELD_COUNT_WITH_SCRIPTS {
         return Err(format!(
-            "Unexpected bun.lockb package field count {}",
-            field_count
+            "Unexpected bun.lockb package field count {} (supported: {} or {})",
+            field_count, FIELD_COUNT_WITHOUT_SCRIPTS, FIELD_COUNT_WITH_SCRIPTS
         ));
     }
 
@@ -141,7 +143,7 @@ pub(crate) fn parse_bun_lockb(bytes: &[u8]) -> Result<PackageData, String> {
         return Err("Invalid bun.lockb package section bounds".to_string());
     }
 
-    let mut packages = parse_packages(bytes, list_len, packages_begin, packages_end)?;
+    let mut packages = parse_packages(bytes, list_len, field_count, packages_begin, packages_end)?;
     cursor.pos = packages_end;
     let buffers = parse_buffers(bytes, &mut cursor, total_buffer_size)?;
     materialize_packages(&mut packages, buffers.string_bytes)?;
@@ -152,6 +154,7 @@ pub(crate) fn parse_bun_lockb(bytes: &[u8]) -> Result<PackageData, String> {
 fn parse_packages(
     bytes: &[u8],
     list_len: usize,
+    field_count: usize,
     packages_begin: usize,
     packages_end: usize,
 ) -> Result<Vec<BunLockbPackage>, String> {
@@ -175,7 +178,8 @@ fn parse_packages(
         .get(packages_begin..packages_end)
         .ok_or_else(|| "Invalid bun.lockb package region".to_string())?;
 
-    let expected_size: usize = PACKAGE_FIELD_LENGTHS.iter().sum::<usize>() * list_len;
+    let expected_size: usize =
+        PACKAGE_FIELD_LENGTHS[..field_count].iter().sum::<usize>() * list_len;
     if package_region.len() < expected_size {
         return Err("bun.lockb package region is truncated".to_string());
     }
@@ -213,7 +217,14 @@ fn parse_packages(
         field_offset += 88;
     }
 
-    let _ = field_offset + 20 * list_len + 48 * list_len;
+    field_offset += 20 * list_len;
+    if field_count == FIELD_COUNT_WITH_SCRIPTS {
+        field_offset += 48 * list_len;
+    }
+
+    if field_offset != expected_size {
+        return Err("bun.lockb package region layout is malformed".to_string());
+    }
 
     Ok(packages)
 }
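
The size check in the hunks above can be illustrated in isolation. The following is a minimal sketch, not the parser's actual code: it reuses the constants shown in the diff, while `expected_region_size` and `walked_offset` are hypothetical helper names introduced only for this example. It assumes, as the diff suggests, that the optional eighth (scripts) field is the trailing 48-byte column and that a seven-field lockfile simply omits it.

```rust
// Constants copied from the diff; everything else below is illustrative.
const PACKAGE_FIELD_LENGTHS: [usize; 8] = [8, 8, 64, 8, 8, 88, 20, 48];
const FIELD_COUNT_WITHOUT_SCRIPTS: usize = 7;
const FIELD_COUNT_WITH_SCRIPTS: usize = 8;

/// Hypothetical helper: size in bytes of the package region for a given
/// field count and package count, rejecting unsupported field counts.
fn expected_region_size(field_count: usize, list_len: usize) -> Result<usize, String> {
    if field_count != FIELD_COUNT_WITHOUT_SCRIPTS && field_count != FIELD_COUNT_WITH_SCRIPTS {
        return Err(format!(
            "Unexpected bun.lockb package field count {} (supported: {} or {})",
            field_count, FIELD_COUNT_WITHOUT_SCRIPTS, FIELD_COUNT_WITH_SCRIPTS
        ));
    }
    // Only the first `field_count` columns are present in the file.
    let per_package: usize = PACKAGE_FIELD_LENGTHS[..field_count].iter().sum();
    per_package
        .checked_mul(list_len)
        .ok_or_else(|| "bun.lockb package region size overflows usize".to_string())
}

/// Hypothetical helper: walks the columns one at a time, the way the parser
/// advances `field_offset`, skipping the scripts column when it is absent.
fn walked_offset(field_count: usize, list_len: usize) -> usize {
    let mut field_offset = 0;
    // The first six columns are always present.
    for len in &PACKAGE_FIELD_LENGTHS[..6] {
        field_offset += len * list_len;
    }
    // Seventh column (20 bytes per package) is present in both layouts.
    field_offset += 20 * list_len;
    // Eighth column (48 bytes per package) only exists in the newer layout.
    if field_count == FIELD_COUNT_WITH_SCRIPTS {
        field_offset += 48 * list_len;
    }
    field_offset
}
```

For a supported field count the two helpers agree, which is the invariant the new `field_offset != expected_size` check enforces: a seven-field layout uses 204 bytes per package and an eight-field layout uses 252.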

diff --git a/src/parsers/bun_lockb_golden_test.rs b/src/parsers/bun_lockb_golden_test.rs
@@ -1,10 +1,23 @@
 #[cfg(all(test, feature = "golden-tests"))]
 mod golden_tests {
+    use base64::Engine;
     use std::path::PathBuf;
 
     use crate::parsers::PackageParser;
     use crate::parsers::bun_lockb::BunLockbParser;
     use crate::parsers::golden_test_utils::compare_package_data_parser_only;
+    use tempfile::TempDir;
+
+    fn decode_legacy_no_scripts_fixture() -> Vec<u8> {
+        let fixture = PathBuf::from("testdata/bun/legacy/bun.lockb.v2-no-scripts.base64");
+        base64::engine::general_purpose::STANDARD
+            .decode(
+                std::fs::read_to_string(&fixture)
+                    .expect("fixture should be readable")
+                    .trim(),
+            )
+            .expect("fixture should decode")
+    }
 
     #[test]
     fn test_golden_bun_lockb_v2() {
@@ -18,4 +31,21 @@ mod golden_tests {
             Err(e) => panic!("Golden test failed: {}", e),
         }
     }
+
+    #[test]
+    fn test_golden_bun_lockb_v2_without_scripts_field() {
+        let temp_dir = TempDir::new().expect("Failed to create temp dir");
+        let test_file = temp_dir.path().join("bun.lockb");
+        std::fs::write(&test_file, decode_legacy_no_scripts_fixture())
+            .expect("Failed to write decoded bun.lockb fixture");
+        let expected_file =
+            PathBuf::from("testdata/bun/golden/bun-lockb-v2-no-scripts-expected.json");
+
+        let package_data = BunLockbParser::extract_first_package(&test_file);
+
+        match compare_package_data_parser_only(&package_data, &expected_file) {
+            Ok(_) => (),
+            Err(e) => panic!("Golden test failed: {}", e),
+        }
+    }
 }
diff --git a/src/parsers/bun_lockb_test.rs b/src/parsers/bun_lockb_test.rs
@@ -1,5 +1,7 @@
 #[cfg(test)]
 mod tests {
+    use base64::Engine;
+
     use crate::models::{DatasourceId, PackageType};
     use crate::parsers::bun_lockb::parse_bun_lockb;
     use crate::parsers::{BunLockbParser, PackageParser};
@@ -20,6 +22,17 @@ mod tests {
         (temp_dir, lock_path)
     }
 
+    fn bun_lockb_v2_without_scripts_fixture() -> Vec<u8> {
+        let fixture = PathBuf::from("testdata/bun/legacy/bun.lockb.v2-no-scripts.base64");
+        base64::engine::general_purpose::STANDARD
+            .decode(
+                fs::read_to_string(&fixture)
+                    .expect("fixture should be readable")
+                    .trim(),
+            )
+            .expect("fixture should decode")
+    }
+
     #[test]
     fn test_is_match_bun_lockb_without_sibling_text_lock() {
         assert!(BunLockbParser::is_match(&PathBuf::from(
@@ -142,6 +155,26 @@ mod tests {
         assert_eq!(workspace_dep.is_direct, Some(true));
     }
 
+    #[test]
+    fn test_parse_bun_lockb_v2_without_scripts_field() {
+        let bytes = bun_lockb_v2_without_scripts_fixture();
+        parse_bun_lockb(&bytes).expect("seven-field bun.lockb should parse");
+
+        let (_temp_dir, lock_path) = create_temp_bun_lockb(&bytes);
+        let package_data = BunLockbParser::extract_first_package(&lock_path);
+
+        assert_eq!(package_data.package_type, Some(PackageType::Npm));
+        assert_eq!(package_data.datasource_id, Some(DatasourceId::BunLockb));
+        assert_eq!(package_data.name.as_deref(), Some("bundle"));
+        assert!(package_data.version.is_none());
+        assert!(
+            package_data
+                .dependencies
+                .iter()
+                .any(|dep| dep.purl.as_deref() == Some("pkg:npm/bun-types@0.5.8"))
+        );
+    }
+
     #[test]
     fn test_invalid_bun_lockb_header_returns_default_package() {
         let (_temp_dir, lock_path) = create_temp_bun_lockb(b"not-a-bun-lockb");