
clearcode/store_scans.py fails to bootstrap shard repositories reliably #847

@Shashidar123

Summary
I found an issue in the scan storage flow in clearcode/store_scans.py.

This module appears to be responsible for taking ClearlyDefined scan data, deriving a purl-based hash, mapping that hash to a GitHub repository shard, and then storing scans in that repository. Because of that, the repo creation and clone logic here is part of an important project workflow, not just a helper utility.

The current implementation in get_or_init_repo() seems to break in common cases where a repository already exists remotely or needs to be created and cloned for the first time.

Problem
Current logic:

    if repo_name not in get_github_repos(user_name=user_name):
        repo_url = create_github_repo(repo_name=repo_name)

    repo_path = work_dir / repo_name
    if repo_path.exists():
        repo = Repo(repo_path)
        if pull:
            repo.origin.pull()
    else:
        repo = Repo.clone_from(repo_url, repo_path)

This has two clear failure cases:

1. If the remote repository already exists, but the local directory does not:
   - repo_url is never assigned
   - Repo.clone_from(repo_url, repo_path) fails because repo_url is unbound (inside a function, Python raises UnboundLocalError)
2. If the remote repository does not exist:
   - create_github_repo() creates the repo through the GitHub API, but does not return a clone URL
   - repo_url becomes None
   - Repo.clone_from(repo_url, repo_path) still fails, this time because it receives None instead of a URL

So the bootstrap flow is unreliable in both scenarios:

- cloning an existing shard repository
- creating and then cloning a new shard repository
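
Both failure cases can be reproduced without touching GitHub or Git. The sketch below is a simplified stand-in for the control flow quoted above, not the project code: fake_create_github_repo, fake_clone, buggy_get_or_init_repo, and failure_for are hypothetical names, the local-checkout branch is left out, and the ValueError raised by the stub only stands in for whatever Repo.clone_from does when it is handed None.

```python
def fake_create_github_repo(repo_name):
    """Stub mirroring the reported behavior: 'creates' the repo, returns nothing."""
    return None


def fake_clone(repo_url, repo_path):
    """Stub for Repo.clone_from (assumption): rejects a missing URL."""
    if not repo_url:
        raise ValueError(f"cannot clone into {repo_path}: no URL given")
    return (repo_url, repo_path)


def buggy_get_or_init_repo(repo_name, remote_repos):
    """Simplified copy of the quoted logic for the 'no local checkout' path."""
    if repo_name not in remote_repos:
        repo_url = fake_create_github_repo(repo_name)
    # repo_url is unbound here when the remote repository already exists.
    return fake_clone(repo_url, repo_name)


def failure_for(repo_name, remote_repos):
    """Return the name of the exception class raised, or None on success."""
    try:
        buggy_get_or_init_repo(repo_name, remote_repos)
    except Exception as exc:
        return type(exc).__name__
    return None


# Case 1: remote exists, no local checkout.
print(failure_for("abc", ["abc"]))
# Case 2: remote does not exist yet, so the stub "creates" it and returns None.
print(failure_for("abc", []))
```

Neither scenario reaches a successful clone in this sketch, which matches the two cases described above.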

Additional issue
get_github_repos() yields full_name values such as org/repo, but the membership check compares those values with repo_name only.

That means this check may be incorrect:

    if repo_name not in get_github_repos(user_name=user_name):

If repo_name is only something like abc, and get_github_repos() yields values like my-org/abc, then the `not in` test is always true. The code treats the repo as missing even when it already exists, and tries to create it again.
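
The mismatch is easy to see with plain strings. In this small sketch, repo_exists is a hypothetical helper (not part of store_scans.py) that compares only the part after the last slash. It ignores the owner, which may or may not be what the project wants.

```python
def repo_exists(repo_name, full_names):
    """Return True if any 'owner/name' entry has the given short name."""
    return any(
        full_name.rsplit("/", 1)[-1] == repo_name
        for full_name in full_names
    )


# Example values in the shape described above (owner/name).
listed = ["my-org/abc", "my-org/def"]

# The current check: a bare name never equals an 'owner/name' string.
treated_as_missing = "abc" not in listed

# Comparing short names finds the existing repository.
found = repo_exists("abc", listed)

print(treated_as_missing, found)
```

An alternative would be to build the full name (for example, owner plus "/" plus repo_name) before the membership test, which keeps the owner in the comparison.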

Why this matters
This issue affects a core storage workflow in the project:

ClearlyDefined scan -> purl -> purl hash -> shard repo -> clone/init -> commit/push

Since this file is using purl-derived hashes to distribute scan data across many Git repositories, a bug here can block or break the scan archival process.

This is why I think this is an important issue to fix.

Expected behavior
get_or_init_repo() should:

- correctly detect whether the remote shard repo already exists
- create it if it does not exist
- always have a valid clone URL before trying to clone
- clone when the local checkout is missing
- pull only when the local checkout already exists and pull=True
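
One way to express this behavior is sketched below. It is not the project's signature: remote_urls, create_remote, open_local, and clone are hypothetical injected parameters, chosen so the control flow can be exercised with fakes instead of GitHub and GitPython. In real use, open_local and clone would presumably be git.Repo and git.Repo.clone_from.

```python
import tempfile
from pathlib import Path
from types import SimpleNamespace


def get_or_init_repo(repo_name, work_dir, remote_urls, create_remote,
                     open_local, clone, pull=False):
    """Return a local repo object for repo_name, creating/cloning as needed.

    Hypothetical parameters (assumptions, not the project's API):
      remote_urls:   dict of existing remote short names to clone URLs
      create_remote: callable(repo_name) returning the new repo's clone URL
      open_local:    callable(path) returning a repo object
      clone:         callable(url, path) returning a repo object
    """
    # Resolve the clone URL first, for both the "exists" and "created" cases.
    repo_url = remote_urls.get(repo_name)
    if repo_url is None:
        repo_url = create_remote(repo_name)
    if not repo_url:
        raise ValueError(f"no clone URL available for {repo_name!r}")

    repo_path = Path(work_dir) / repo_name
    if repo_path.exists():
        repo = open_local(repo_path)
        if pull:
            repo.remotes.origin.pull()  # GitPython-style remote access
    else:
        repo = clone(repo_url, repo_path)
    return repo


# Dry run with fakes: no network access and no Git calls.
events = []


def fake_create(name):
    events.append(("create", name))
    return f"https://example.invalid/{name}.git"


def fake_clone(url, path):
    events.append(("clone", url))
    Path(path).mkdir(parents=True)
    return "cloned"


def fake_open(path):
    origin = SimpleNamespace(pull=lambda: events.append(("pull",)))
    return SimpleNamespace(remotes=SimpleNamespace(origin=origin))


with tempfile.TemporaryDirectory() as tmp:
    known = {"abc": "https://example.invalid/abc.git"}
    get_or_init_repo("abc", tmp, known, fake_create, fake_open, fake_clone)
    get_or_init_repo("new", tmp, known, fake_create, fake_open, fake_clone)
    get_or_init_repo("abc", tmp, known, fake_create, fake_open, fake_clone,
                     pull=True)

print(events)
```

The dry run covers the three paths: cloning an existing remote, creating and then cloning a new one, and pulling into an existing checkout. The explicit check on repo_url turns the two reported failure cases into one clear error.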

Suggested fix
A good fix would likely include:

- making create_github_repo() return the created repo clone URL
- making get_github_repos() return repo names in a format consistent with the membership check
- ensuring repo_url is always defined before Repo.clone_from(...)
- clarifying whether repo_namespace and user_name should support org repositories differently
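
As a rough illustration of the first point, the sketch below shows create_github_repo() returning a clone URL. It assumes the "create a repository for the authenticated user" endpoint (POST /user/repos), whose response is a repository object with a clone_url field. The token parameter and the two helper functions are hypothetical, and the standard library's urllib is used here only to keep the example self-contained; the project may use a different HTTP client. Organization-owned repositories use a different endpoint and are not covered.

```python
import json
import urllib.request

GITHUB_API = "https://api.github.com"


def build_create_repo_request(repo_name, token):
    """Build the POST /user/repos request; nothing is sent here."""
    body = json.dumps({"name": repo_name}).encode("utf-8")
    return urllib.request.Request(
        f"{GITHUB_API}/user/repos",
        data=body,
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def clone_url_from_payload(payload):
    """Extract the HTTPS clone URL from a repository object."""
    clone_url = payload.get("clone_url")
    if not clone_url:
        raise ValueError("GitHub response did not include clone_url")
    return clone_url


def create_github_repo(repo_name, token):
    """Create the repository and return its clone URL instead of None."""
    request = build_create_repo_request(repo_name, token)
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return clone_url_from_payload(payload)
```

Splitting request construction and payload parsing from the network call keeps the parts that matter for this bug (the returned URL) testable without API access.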

References
GitHub REST API, create a repository for the authenticated user:
https://docs.github.com/en/rest/repos/repos#create-a-repository-for-the-authenticated-user

GitHub REST API, list repositories for the authenticated user:
https://docs.github.com/en/rest/repos/repos#list-repositories-for-the-authenticated-user

Package URL specification:
https://github.com/package-url/purl-spec
