
CODEC-335: Add DigestUtils.gitBlob and DigestUtils.gitTree methods#427

Merged
ppkarwasz merged 3 commits into master from feat/git-blob on Mar 29, 2026

Conversation

@ppkarwasz
Contributor

This change adds two methods to `DigestUtils` that compute generalized Git object identifiers using an arbitrary `MessageDigest`, rather than being restricted to SHA-1:

- `gitBlob(digest, input)`: computes a generalized [Git blob object identifier](https://git-scm.com/book/en/v2/Git-Internals-Git-Objects) for a given file or byte content.
- `gitTree(digest, file)`: computes a generalized [Git tree object identifier](https://git-scm.com/book/en/v2/Git-Internals-Git-Objects) for a given directory.

### Motivation

The standard Git object identifiers use SHA-1, which is [in the process of being replaced by SHA-256](https://git-scm.com/docs/hash-function-transition) in Git itself. These methods generalize the identifier computation to support any `MessageDigest`, enabling both forward compatibility and use with external standards.

In particular, the `swh:1:cnt:` (content) and `swh:1:dir:` (directory) identifier types defined by [SWHID (ISO/IEC 18670)](https://www.swhid.org/specification/v1.2/5.Core_identifiers/) are currently compatible with Git blob and tree identifiers respectively (using SHA-1), and can be used to generate canonical, persistent identifiers for unpacked source and binary distributions.
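As background (not code from the PR), a Git blob identifier is simply the digest of the header `blob <size>\0` followed by the raw content. The sketch below illustrates that scheme generalized over `MessageDigest`; the class and method names are hypothetical, not the PR's actual API.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch only: illustrates the Git blob identifier scheme the PR generalizes.
// Class and method names here are hypothetical, not the PR's actual API.
public class GitBlobSketch {

    // A Git blob id is digest("blob " + <content length> + "\0" + <content>).
    static byte[] blobId(final MessageDigest digest, final byte[] content) {
        digest.reset();
        digest.update(("blob " + content.length + "\0").getBytes(StandardCharsets.US_ASCII));
        return digest.digest(content);
    }

    static String toHex(final byte[] bytes) {
        final StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (final byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(final String[] args) throws NoSuchAlgorithmException {
        final byte[] content = "hello world\n".getBytes(StandardCharsets.US_ASCII);
        // Matches `git hash-object --stdin` for the same content.
        System.out.println(toHex(blobId(MessageDigest.getInstance("SHA-1"), content)));
        // prints 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
    }
}
```

Swapping in `MessageDigest.getInstance("SHA-256")` yields the identifier Git's SHA-256 object format would assign to the same content, which is the forward compatibility the PR is after.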

Before you push a pull request, review this list:

  • Read the contribution guidelines for this project.
  • Read the ASF Generative Tooling Guidance if you use Artificial Intelligence (AI).
  • I used AI to create any part of, or all of, this pull request. Which AI tool was used, and to what extent did it contribute? Claude Code was used for tests and to review the main code.
  • Run a successful build using the default Maven goal with mvn; that's mvn on the command line by itself.
  • Write unit tests that match behavioral changes, where the tests fail if the changes to the runtime are not applied. This may not always be possible, but it is a best practice.
  • Write a pull request description that is detailed enough to understand what the pull request does, how, and why.
  • Each commit in the pull request should have a meaningful subject line and body. Note that a maintainer may squash commits during the merge process.

@garydgregory
Member

Hi @ppkarwasz

Should all this git related code be in a new GitDigest class instead?

Curious: isn't all this in jgit?

You'll need to run 'mvn' by itself and fix build issues before you push.

@ppkarwasz
Contributor Author

Should all this git related code be in a new GitDigest class instead?

gitBlob fits naturally in DigestUtils: it follows the same pattern as the existing digest methods (wrap content with a header, hash it).

gitTree is more specialised: it only accepts a directory and recursively computes hashes for the entire tree, which is arguably outside DigestUtils's scope. I'm open to moving both to a new GitDigest class.

Curious: isn't all this in jgit?

JGit does provide the building blocks via ObjectInserter, but it has two significant limitations that make it unsuitable here:

  1. ObjectInserter is hardcoded to SHA-1.
  2. The caller is responsible for walking the directory, sorting entries, and building each sub-tree formatter correctly. The proposed gitTree method handles all of that automatically.

For reference, here is the equivalent JGit code for a two-file tree:

// Requires org.eclipse.jgit.lib.{FileMode, ObjectId, ObjectInserter, TreeFormatter}
// and the static import org.eclipse.jgit.lib.Constants.OBJ_BLOB.
final byte[] aBytes = ...; // contents of a.txt
final byte[] bBytes = ...; // contents of nested/b.txt
try (ObjectInserter inserter = new ObjectInserter.Formatter()) {
    final ObjectId aBlob = inserter.idFor(OBJ_BLOB, aBytes);
    final ObjectId bBlob = inserter.idFor(OBJ_BLOB, bBytes);
    // Each directory level needs its own TreeFormatter, built bottom-up.
    final TreeFormatter nestedTreeFormatter = new TreeFormatter();
    nestedTreeFormatter.append("b.txt", FileMode.REGULAR_FILE, bBlob);
    final ObjectId nestedTree = inserter.idFor(nestedTreeFormatter);
    final TreeFormatter rootTreeFormatter = new TreeFormatter();
    rootTreeFormatter.append("a.txt", FileMode.REGULAR_FILE, aBlob);
    rootTreeFormatter.append("nested", FileMode.TREE, nestedTree);
    return inserter.idFor(rootTreeFormatter).name(); // hex SHA-1 of the root tree
}
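As background on what TreeFormatter is doing (not code from the PR or from JGit): a Git tree object serializes each entry as `<mode> <name>\0` followed by the raw binary object id, with entries sorted by name (directories compare as if suffixed with `/`), and the tree id is the digest of `tree <payload size>\0` plus that payload. A minimal sketch with hypothetical names:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch only: shows the tree-object layout that TreeFormatter produces.
// Names are hypothetical; entries must already be in Git's sort order.
public class GitTreeSketch {

    // Appends one entry: "<mode> <name>\0" followed by the raw (binary) object id.
    static void appendEntry(final ByteArrayOutputStream out, final String mode,
                            final String name, final byte[] rawId) {
        final byte[] head = (mode + " " + name + "\0").getBytes(StandardCharsets.UTF_8);
        out.write(head, 0, head.length);
        out.write(rawId, 0, rawId.length);
    }

    // Tree id = digest("tree " + <payload size> + "\0" + <payload>).
    static byte[] treeId(final MessageDigest digest, final byte[] payload) {
        digest.reset();
        digest.update(("tree " + payload.length + "\0").getBytes(StandardCharsets.US_ASCII));
        return digest.digest(payload);
    }

    public static void main(final String[] args) throws NoSuchAlgorithmException {
        final ByteArrayOutputStream payload = new ByteArrayOutputStream();
        appendEntry(payload, "100644", "a.txt", new byte[20]);  // placeholder blob id
        appendEntry(payload, "40000", "nested", new byte[20]);  // placeholder subtree id
        final byte[] id = treeId(MessageDigest.getInstance("SHA-1"), payload.toByteArray());
        System.out.println(id.length); // prints 20 (SHA-1 digest size)
    }
}
```

The sorting and recursion over subtrees is exactly the bookkeeping the proposed `gitTree` method would take off the caller's hands.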

@ppkarwasz
Contributor Author

@garydgregory,

What would you say about an API like the one below? It would have the advantage of being reusable in other contexts. For example, Commons Compress could use it to compute a SWHID without extracting an archive.

public final class GitId {

    public enum FileMode {

        /** Regular, non-executable file ({@code 100644}). */
        REGULAR_FILE("100644"),

        /** Executable file ({@code 100755}). */
        EXECUTABLE_FILE("100755"),

        /** Symbolic link ({@code 120000}). */
        SYMBOLIC_LINK("120000"),

        /** Directory / subtree ({@code 40000}). */
        DIRECTORY("40000");

    }

    public static byte[] blobId(MessageDigest digest, byte[] content);

    public static byte[] blobId(MessageDigest digest, InputStream input) throws IOException;

    public static byte[] blobId(MessageDigest digest, Path path) throws IOException;

    public static TreeBuilder treeBuilder(MessageDigest digest);


    public static final class TreeBuilder {

        public TreeBuilder addFile(String name, FileMode mode, byte[] content);

        public TreeBuilder addFile(String name, FileMode mode, InputStream input) throws IOException;

        public TreeBuilder addFile(String name, FileMode mode, Path path) throws IOException;

        public TreeBuilder addDirectory(String name, TreeBuilder subtree);

        public byte[] build();
    }
}

@garydgregory
Member

@garydgregory,

What would you say about an API like the one below? It would have the advantage of being reusable in other contexts. For example Commons Compress could use it to compute a SWHID without extracting an archive.

[proposed GitId API omitted; quoted from the previous comment]

Hi @ppkarwasz

I'm not sure which Commons component the above should belong to. I think you mean it to belong in Codec, but I can't tell what's supposed to be an interface vs. an implementation. Would this PR be reimplemented in terms of the above? Or would this PR provide the implementation for the above?

The name TreeBuilder is confusing to me without Javadoc. It's not building a tree, it's building a byte array. Do you mean it processes a directory tree? I can't tell.

In the PR description, you write:

In particular, the swh:1:cnt: (content) and swh:1:dir: (directory) identifier types defined by SWHID (ISO/IEC 18670) are currently compatible with Git blob and tree identifiers respectively (using SHA-1), and can be used to generate canonical, persistent identifiers for unpacked source and binary distributions.

Since Git has been migrating to SHA-256, does this still matter? You only mention SHA-1 in the above.

From an API design perspective, API inflation is already present with the byte[], InputStream, and Path overloads, and hints that File, Channel, Buffer, and URI should also be available; this is the problem Commons IO's builder package attempts to solve.

Aside from that, the current PR seems focused on narrow functionality without introducing framework code, so it fits in nicely. Let me review it again in the morning.

@ppkarwasz
Contributor Author

ppkarwasz commented Mar 27, 2026

I'm not sure what Commons component the above should belong. I think you mean it to belong in Codec but I can't tell what's supposed to be an interface vs. implementation. Would this PR be reimplemented in terms of the above? Or would this PR provide the implementation for the above?

The name TreeBuilder is confusing to me without Javadoc. It's not building a tree, it's building a byte array. Do you mean it processes a directory tree? I can't tell.

I am not sure which component this belongs to either.

To add more context: I am trying to create SLSA Provenance attestations for Java builds. For such attestations to have some value, they need to record some invariants of the build toolchain. When you build on your local machine, the most important build data is what you usually add to the vote e-mail: the Maven and JDK version.

Maven and JDK are already unpacked on your build machine, so it's not possible to get a classical hash of their distribution, but it is possible to make a “gitTree” hash, which is also among the digests allowed in SLSA.

That's why I am looking to introduce some support for gitBlob and gitTree in Commons Codec. It is probably the best choice, because three main libraries provide digest helpers in the Java ecosystem: plexus-digest (tiny and rarely updated), Commons Codec, and Guava.

I am trying to introduce support for “gitTree” in two steps:

Step 1

Initially, I just need to compute gitTree on a file system. This PR tries to introduce that with minimal API changes.

Step 2

Once we compute the gitTree SHA-1 or SHA-256 hash of an unpacked Maven distribution, we would probably like to compare it with the packed Maven tarball. This is where we should offer users a more extensive API to compute the “gitTree” of a virtual tree of files (like a TAR archive).

Devising the best API is complex, so I would leave it for now, but I will take it into consideration when deciding where to put gitBlob and gitTree, so we don't need to deprecate methods later.

TL;DR What would you say about refactoring this PR to create some helper methods in a new GitIdentifiers class?

public final class GitIdentifiers {

    public static byte[] blobId(MessageDigest digest, byte[] content);

    public static byte[] blobId(MessageDigest digest, InputStream input) throws IOException;

    public static byte[] blobId(MessageDigest digest, Path path) throws IOException;

    public static byte[] treeId(MessageDigest digest, Path path) throws IOException;
}

Later on, we could extend that class to allow computing a treeId for other types of tree data.
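For illustration (not part of this PR), a core SWHID is just the prefix `swh:1:<type>:` followed by the 40-character hex digest, where `cnt` corresponds to a Git blob id and `dir` to a Git tree id. A trivial sketch, with a hypothetical helper name:

```java
// Sketch only: formats a core SWHID per the SWHID spec (ISO/IEC 18670).
// The helper name is hypothetical.
public class SwhidSketch {

    // Core SWHID: "swh:1:" + object type ("cnt", "dir", ...) + ":" + hex digest.
    static String swhid(final String objectType, final String hexDigest) {
        return "swh:1:" + objectType + ":" + hexDigest;
    }

    public static void main(final String[] args) {
        // Hex SHA-1 blob id of the content "hello world\n".
        System.out.println(swhid("cnt", "3b18e512dba79e4c8300dd08aeb37f8e728b8dad"));
        // prints swh:1:cnt:3b18e512dba79e4c8300dd08aeb37f8e728b8dad
    }
}
```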

@garydgregory
Member

Hi @ppkarwasz

Maven and JDK are already unpacked on your build machine, so it's not possible to get a classical hash of their distribution, but it is possible to make a “gitTree” hash, which is also among the digests allowed in SLSA.

Do you plan on computing a hash of a Maven install folder?

Does that mean anything without accounting for files in a user's home .m2 folder like settings-security.xml, settings.xml, and toolchains.xml? What about the local .m2/repository/ cache? Anything can be in there in the sense that I can override existing JARs with local builds or manual installs.

Will the project's attestation say a project was built with a list of plugins, those plugins' hashes, and all the hashes of their plugin and non-plugin dependencies?

I'm trying to grasp what it is we are capturing and the value.

@garydgregory
Member

Hi @ppkarwasz
I think I'm OK with this PR in its low-level building block form.

The DigestUtils class is starting to feel broken in the sense that we keep adding methods for an InputStream, a Path, a byte[], and sometimes a File; all of that feels like it needs a redo using common input processing like Commons IO's builder package. Maybe in the future we only add InputStream methods; I'm not 100% sure.

@ppkarwasz
Contributor Author

Do you plan on computing a hash of a Maven install folder?

Yes, the installation folder is usually unmodified, so we can expect the same value for each version.

Does that mean anything without accounting for files in a user's home .m2 folder like settings-security.xml, settings.xml, and toolchains.xml? What about the local .m2/repository/ cache? Anything can be in there in the sense that I can override existing JARs with local builds or manual installs.

I have a proof-of-concept plugin that lists all dependencies in the attestation, but that might be overkill, since we already have the same information in the SBOM and we validate the SBOM by doing a reproducibility check.

Will the projects attestation say a project was built with a list of plugins, those plugin hashes and all the hashes of their plugins and non-plugin dependencies?

Good point! I have seen that the CycloneDX Gradle plugin puts all the components that appear during the build in the SBOM, while the CycloneDX Maven plugin does not. So I am unsure whether the hashes for plugins should go in the SBOM or in the attestation.

@ppkarwasz
Contributor Author

Since you are more or less OK with this, I'll merge this PR and generate a snapshot. This way I'll be able to push a proof-of-concept plugin that compiles.

If any fixes are necessary, I'll add them to follow-up PRs.

@ppkarwasz ppkarwasz merged commit 3019feb into master Mar 29, 2026
15 checks passed
@ppkarwasz ppkarwasz deleted the feat/git-blob branch March 29, 2026 18:16