clearcode/store_scans.py fails to bootstrap shard repositories reliably #847
Description
Summary
I found an issue in the scan storage flow in clearcode/store_scans.py.
This module appears to be responsible for taking ClearlyDefined scan data, deriving a purl-based hash, mapping that hash to a GitHub repository shard, and then storing scans in that repository. Because of that, the repo creation and clone logic here is part of an important project workflow, not just a helper utility.
The current implementation of get_or_init_repo() breaks in two common cases: when the repository already exists remotely but has no local checkout, and when it has to be created and cloned for the first time.
Problem
Current logic:
    if repo_name not in get_github_repos(user_name=user_name):
        repo_url = create_github_repo(repo_name=repo_name)
    repo_path = work_dir / repo_name
    if repo_path.exists():
        repo = Repo(repo_path)
        if pull:
            repo.origin.pull()
    else:
        repo = Repo.clone_from(repo_url, repo_path)
This has two clear failure cases:
1. If the remote repository already exists, but the local directory does not:
   - repo_url is never assigned
   - Repo.clone_from(repo_url, repo_path) fails because repo_url is undefined (an UnboundLocalError at runtime)
2. If the remote repository does not exist:
   - create_github_repo() creates the repo through the GitHub API, but does not return a clone URL
   - repo_url becomes None
   - Repo.clone_from(repo_url, repo_path) still fails
So the bootstrap flow is unreliable in both scenarios:
- cloning an existing shard repository
- creating and then cloning a new shard repository
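To make the first failure case concrete, here is a minimal reproduction sketch. It copies only the branching of the logic quoted above. The function name get_or_init_repo_current, the existing_repos list, and the clone callable are hypothetical stand-ins I introduced so the snippet runs without GitHub access or GitPython; they are not the project's actual code.

```python
import tempfile
from pathlib import Path


def get_or_init_repo_current(repo_name, existing_repos, work_dir, clone):
    """Same branching as the quoted logic, with GitHub and git calls stubbed out."""
    if repo_name not in existing_repos:
        # Stand-in for create_github_repo(), which currently returns None.
        repo_url = None
    repo_path = Path(work_dir) / repo_name
    if repo_path.exists():
        return "opened existing checkout"
    # repo_url was never bound if the membership test above was False.
    return clone(repo_url, repo_path)


def reproduce():
    """Return the name of the exception raised when cloning an existing remote."""
    with tempfile.TemporaryDirectory() as work_dir:
        try:
            # The repo is "already listed" remotely, but work_dir has no checkout.
            get_or_init_repo_current("abc", ["abc"], work_dir, lambda url, path: url)
        except UnboundLocalError as exc:
            return type(exc).__name__
    return "no error"
```

In the second case (repo not listed), the same stub passes None as the clone URL, which matches the behavior described for a freshly created repository.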
Additional issue
get_github_repos() yields full_name values such as org/repo, but the membership check compares those values with repo_name only.
That means this check may be incorrect:
if repo_name not in get_github_repos(user_name=user_name):
If repo_name is only something like abc, and get_github_repos() yields values like my-org/abc, the membership test never matches. The code then treats the repository as missing and tries to create it again, even though it already exists.
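The mismatch can be shown in a few lines. This is only an illustrative sketch: remote_repo_exists and the sample names are hypothetical, and comparing against "namespace/name" is one possible way to make the two sides consistent, not necessarily the one the project should adopt.

```python
def remote_repo_exists(namespace, repo_name, full_names):
    """Check membership using the same 'owner/name' shape that the listing yields."""
    return f"{namespace}/{repo_name}" in set(full_names)


listed = ["my-org/abc", "my-org/def"]  # example values in full_name form

# What the current check effectively does: a bare name against full names.
naive_result = "abc" in listed

# Comparing like with like.
fixed_result = remote_repo_exists("my-org", "abc", listed)
```

Here naive_result is False although the repository is listed, while fixed_result is True.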
Why this matters
This issue affects a core storage workflow in the project:
ClearlyDefined scan -> purl -> purl hash -> shard repo -> clone/init -> commit/push
Because this file uses purl-derived hashes to distribute scan data across many Git repositories, a bug here can block the whole scan archival process, which is why I think it is important to fix.
Expected behavior
get_or_init_repo() should:
- correctly detect whether the remote shard repo already exists
- create it if it does not exist
- always have a valid clone URL before trying to clone
- clone when the local checkout is missing
- pull only when the local checkout already exists and pull=True
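The behavior listed above could look roughly like the following sketch. It is not a drop-in patch: the keyword-only callables (list_repos, create_repo, clone, open_repo) are hypothetical injection points I added so the control flow can be exercised without network access, and building the URL of an existing repository as https://github.com/<namespace>/<name>.git is an assumption that may not fit setups using SSH or token-based remotes.

```python
from pathlib import Path


def get_or_init_repo(repo_name, namespace, work_dir, *, list_repos, create_repo,
                     clone, open_repo, pull=False):
    """Return a local checkout of the shard repo, creating or cloning as needed."""
    full_name = f"{namespace}/{repo_name}"

    if full_name in set(list_repos()):
        # Assumption: an existing repo is reachable at the usual HTTPS URL.
        repo_url = f"https://github.com/{full_name}.git"
    else:
        # create_repo is expected to return the clone URL of the new repo.
        repo_url = create_repo(repo_name)
        if not repo_url:
            raise ValueError(f"no clone URL returned for {full_name}")

    repo_path = Path(work_dir) / repo_name
    if repo_path.exists():
        repo = open_repo(repo_path)
        if pull:
            # GitPython-style access to the 'origin' remote.
            repo.remotes.origin.pull()
        return repo

    # repo_url is always bound and non-empty at this point.
    return clone(repo_url, repo_path)
```

With GitPython, open_repo=Repo and clone=Repo.clone_from would be the natural arguments; in tests, plain lambdas are enough to cover each branch.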
Suggested fix
A good fix would likely include:
- making create_github_repo() return the created repo clone URL
- making get_github_repos() return repo names in a format consistent with the membership check
- ensuring repo_url is always defined before Repo.clone_from(...)
- clarifying whether repo_namespace and user_name should support org repositories differently
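For the first point, a possible shape for create_github_repo() is sketched below. The GitHub "create a repository for the authenticated user" endpoint responds with a repository object that includes a clone_url field, so the function can return that value. The post_json parameter is a hypothetical callable that I introduced to stand in for whatever authenticated HTTP client the module already uses; the real signature in store_scans.py may differ.

```python
def create_github_repo(repo_name, post_json):
    """Create the repository and return its HTTPS clone URL.

    post_json is a hypothetical callable (url, body) -> dict that performs the
    authenticated POST and returns the decoded JSON response.
    """
    payload = post_json("https://api.github.com/user/repos", {"name": repo_name})

    # The repository object in the response carries the clone URL.
    clone_url = payload.get("clone_url")
    if not clone_url:
        raise RuntimeError(f"GitHub did not return a clone_url for {repo_name!r}")
    return clone_url
```

Raising when clone_url is missing keeps a None value from reaching Repo.clone_from(...) silently.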
References
GitHub REST API documentation:
https://docs.github.com/en/rest/repos/repos#create-a-repository-for-the-authenticated-user
GitHub REST API list repositories:
https://docs.github.com/en/rest/repos/repos#list-repositories-for-the-authenticated-user
Package URL specification:
https://github.com/package-url/purl-spec