Add forge instance entity #800

Merged
gabriel-samfira merged 12 commits into cloudbase:main from gabriel-samfira:add-forge-instance-entity on Jul 1, 2026

@gabriel-samfira

Member

This change adds a new forge entity type alongside Repository{}, Organization{} and Enterprise{}, called ForgeInstance{}. This new entity type is currently available only on Gitea, and possibly on Forgejo if we ever get around to adding it.

The new ForgeInstance{} entity requires gitea credentials that have rw access to the admin api. This feels a bit too much to allow GARM to have, but if instance level pools are required, it cannot be helped.

The API is similar to the other entity types. WebUI has been updated to accommodate the new ForgeInstance type, as has the CLI.
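As an illustration of the Gitea-only restriction, here is a minimal sketch of how a capability check such as the SupportsInstancePools() method named in the commit notes might look. The type name follows the commit notes, but the constant names and string values are assumptions made for this example, not GARM's actual definitions.

```go
package main

import "fmt"

// EndpointType names the forge flavor an endpoint talks to.
// The constant names and string values are assumptions for this sketch.
type EndpointType string

const (
	GithubEndpointType EndpointType = "github"
	GiteaEndpointType  EndpointType = "gitea"
)

// SupportsInstancePools reports whether instance-level pools can be
// created on this forge type. Per the PR description, only Gitea
// qualifies for now.
func (e EndpointType) SupportsInstancePools() bool {
	return e == GiteaEndpointType
}

func main() {
	for _, t := range []EndpointType{GithubEndpointType, GiteaEndpointType} {
		fmt.Printf("%s supports instance pools: %v\n", t, t.SupportsInstancePools())
	}
}
```

Keeping the check on the endpoint type means callers do not need to hardcode forge names when deciding whether to offer the forge instance option.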

This was referenced Jun 30, 2026
Add webhook management (create, list, get, delete) and runner operations
(list, remove, registration token) at the Gitea instance level. These
use the Gitea admin API endpoints (/admin/hooks, /admin/actions/runners).

Also add SupportsInstancePools() method on EndpointType and the
MetricsLabelInstanceScope constant for metrics recording.

Add the ForgeInstance and ForgeInstanceEvent models with UUID primary key
and a unique index on endpoint_name. Add full CRUD operations modeled
after enterprises, with endpoint name cross-validation against credentials.

Add database migration 0003_forge_instances to create the new tables and
add forge_instance_id columns to pools and workflow_jobs.
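The endpoint name cross-validation mentioned in this commit can be pictured roughly as follows. This is a sketch under assumptions: the Credentials struct, the error value and the validateEndpoint helper are hypothetical stand-ins, not the types used in the PR.

```go
package main

import (
	"errors"
	"fmt"
)

// Credentials is a trimmed-down, hypothetical stand-in for stored
// forge credentials, which are tied to one endpoint.
type Credentials struct {
	Name         string
	EndpointName string
}

// ErrEndpointMismatch signals credentials that belong to another endpoint.
var ErrEndpointMismatch = errors.New("credentials do not belong to the requested endpoint")

// validateEndpoint checks that the credentials were defined for the same
// endpoint the forge instance is being created on.
func validateEndpoint(endpointName string, creds Credentials) error {
	if endpointName == "" {
		return errors.New("endpoint name is required")
	}
	if creds.EndpointName != endpointName {
		return fmt.Errorf("%w: credentials %q are bound to %q, not %q",
			ErrEndpointMismatch, creds.Name, creds.EndpointName, endpointName)
	}
	return nil
}

func main() {
	creds := Credentials{Name: "token", EndpointName: "gitea-test"}
	fmt.Println(validateEndpoint("gitea-test", creds))
	fmt.Println(validateEndpoint("other-gitea", creds))
}
```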

Wire ForgeInstance into all entity type switches in pools, jobs, util
(hasGithubEntity, GetForgeEntity, AddEntityEvent, SetEntityPoolManagerStatus,
updateEntityCredentials).

Add ForgeInstance params type with GetEntity(), CreateForgeInstanceParams
with forge type validation, and ForgeInstancePoolManager interface.

Implement pool manager lifecycle (create, start, stop, retry) with
WebSocket event subscriptions and cache worker integration.

Add runner CRUD: create, list, get, update, delete forge instances,
plus pool and instance management. Wire webhook dispatch for system
hooks and add ForgeInstance to all entity type switches in pool manager,
watcher filters, metadata service name, and job association.

Remove unused FetchTools method from basePoolManager.

Add REST API endpoints for forge instance CRUD, pool management,
instance listing, and webhook install/uninstall/info under
/forge-instances/{forgeInstanceID}/...

Add garm-cli forge-instance command (add, list, show, update, delete)
with endpoint name or UUID resolution, --random-webhook-secret, and
--install-webhook support.
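The "endpoint name or UUID" resolution could work along these lines. This is a sketch, not the CLI's actual code: the resolveForgeInstanceID helper and the in-memory map are hypothetical, and the UUID used in main is only a sample value.

```go
package main

import (
	"fmt"
	"regexp"
)

// uuidPattern matches the canonical 8-4-4-4-12 hexadecimal UUID layout.
var uuidPattern = regexp.MustCompile(
	`^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$`)

// resolveForgeInstanceID returns the argument unchanged when it already
// looks like a UUID; otherwise it treats it as an endpoint name and looks
// the ID up. The map stands in for an API call (hypothetical).
func resolveForgeInstanceID(nameOrID string, idsByEndpoint map[string]string) (string, error) {
	if uuidPattern.MatchString(nameOrID) {
		return nameOrID, nil
	}
	if id, ok := idsByEndpoint[nameOrID]; ok {
		return id, nil
	}
	return "", fmt.Errorf("no forge instance found for endpoint %q", nameOrID)
}

func main() {
	known := map[string]string{"gitea-test": "404760f1-a6d2-47d3-9248-aea6b0da4756"}
	for _, arg := range []string{"gitea-test", "404760f1-a6d2-47d3-9248-aea6b0da4756", "missing"} {
		id, err := resolveForgeInstanceID(arg, known)
		fmt.Println(arg, "->", id, err)
	}
}
```

Because the commit notes describe a unique index on endpoint_name, a lookup by name can return at most one forge instance, which is what makes the name usable as an identifier.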

Add --forge-instance/-f flag to pool list and pool add commands.

Regenerate swagger spec and OpenAPI client.

Add Forge Instances section to sidebar navigation with list and detail
pages. The list page shows endpoint name with forge type icon,
credentials, pool balancing type, status, and CRUD actions. The detail
page shows entity information, pools, instances, and events with
real-time WebSocket updates.

Add CreateForgeInstanceModal with endpoint selector (Gitea only),
credentials selector, pool balancer type, agent mode, and webhook
secret auto-generation.

Update all entity type unions across components to include forge_instance.
Wire ForgeInstance into eager cache, WebSocket subscriptions, pool entity
resolution (getEntityName, getEntityType, getEntityUrl), CreatePoolModal
entity level selector, UpdatePoolModal agent mode lookup, and
EndpointCell clickable links.

Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>

Preload the entity's Endpoint relation in listEntityPools so that
pools listed via entity-specific endpoints include endpoint info.

Update the hardcoded SQL query in TestListAllPoolsDBFetchErr to include
the new forge_instance_id column.

Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
@gabriel-samfira gabriel-samfira force-pushed the add-forge-instance-entity branch from 3a98410 to c87fc58 Compare June 30, 2026 09:19
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
@Infinoid

Contributor

I've set up a test instance to try this out, but I'm unsure whether I set it up properly. Here are the steps I followed:

  • build docker image for gabriel-samfira/add-forge-instance-entity (commit eee6176), and push it somewhere my k8s cluster can see it
  • build test copy of cli tool
  • deploy docker image into test environment (via a temporary terraform workspace)
  • configure garm-cli to talk to test manager (add test [[manager]] to config.toml and set it as default)
  • set up garm-test service account in gitea
  • set up personal access token for garm-test service account, with admin:readwrite scope
  • update gitea config to add test instance's hostname to webhook.ALLOWED_HOST_LIST
  • ./garm-cli gitea endpoint create --name=gitea-test …
  • ./garm-cli gitea credentials add --name=token …
  • ./garm-cli forge-instance add --forge-type=gitea --credentials=token --endpoint=gitea-test --install-webhook --random-webhook-secret
  • ./garm-cli pool add --forge-instance=gitea-test --tags=garm-test …
  • add a descriptive name to the webhook in gitea (it was empty)
  • try to trigger a job that runs-on: garm-test

Does this look right? I'm seeing the test environment receive webhook notifications from gitea, but rejecting them. Logs:

time=2026-06-30T13:11:35.235Z level=ERROR msg="failed to find pool manager" error="error fetching repo: error fetching repo: not found" hook_target_type=repository
time=2026-06-30T13:11:35.235Z level=ERROR msg="got not found error from DispatchWorkflowJob. webhook not meant for us?" error="error fetching poolManager: error fetching repo: error fetching repo: not found"
time=2026-06-30T13:11:35.235Z level=INFO msg=access_log method=POST uri=/webhooks/404760f1-a6d2-47d3-9248-aea6b0da4756 user_agent=Go-http-client/1.1 ip=[2600:1f18:7bbe:ba49:87aa::9]:38818 code=200 bytes=0 request_time=2.114832ms
time=2026-06-30T13:11:36.763Z level=ERROR msg="failed to find pool manager" error="error fetching repo: error fetching repo: not found" hook_target_type=repository
time=2026-06-30T13:11:36.763Z level=ERROR msg="got not found error from DispatchWorkflowJob. webhook not meant for us?" error="error fetching poolManager: error fetching repo: error fetching repo: not found"
time=2026-06-30T13:11:36.763Z level=INFO msg=access_log method=POST uri=/webhooks/404760f1-a6d2-47d3-9248-aea6b0da4756 user_agent=Go-http-client/1.1 ip=[2600:1f18:7bbe:ba49:87aa::9]:38818 code=200 bytes=0 request_time=711.585µs

@Infinoid

Contributor

I added a debug message here, to print the values of forgeType and giteaTargetType. I got:

  • hookType=repository
  • giteaTargetType=repository

This is despite the fact that the gitea webhook was created at the instance level. (The edit link in the gitea UI is https://gitea.my-site/-/admin/hooks/20649)

So that doesn't seem like a good way to detect SystemHooks.

@gabriel-samfira

Member Author

./garm-cli forge-instance add --forge-type=gitea --credentials=token --endpoint=gitea-test --install-webhook --random-webhook-secret

After you ran this, did the webhook show up under Site administration --> integration --> webhooks ?

@Infinoid

Infinoid commented Jun 30, 2026

Contributor

Wait, no, I'm wrong! It didn't create a system webhook, it created a default webhook.

A system webhook triggers for any repo on the system, and operates at the system level.
A default webhook is a template that gets copied into new repos (including the fork I was using for testing), and operates at the repo level.

image

I think that's the problem.

Set is_system_webhook=true in the webhook config when creating
instance-level hooks. Without this, Gitea creates a "default" webhook
that gets copied into new repositories instead of a system webhook
that fires for all events on the instance.

Add garm-cli forge-instance webhook subcommands: install, show,
and uninstall. All commands accept either an endpoint name or UUID
for identification.
@Infinoid

Contributor

It seems poorly documented, but I think the solution is adding is_system_webhook=true to the Config map when creating the webhook. I'm building a test image which does that now.
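For readers following along, the fix amounts to one extra key in the webhook's config map. The sketch below builds such a request body with the standard library only. The struct, the helper name, the event name and the example URL are assumptions for illustration; only the is_system_webhook=true entry in the config map comes from this thread.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// createHookRequest is a hypothetical, trimmed-down request body for
// creating a webhook through the Gitea admin API.
type createHookRequest struct {
	Type   string            `json:"type"`
	Active bool              `json:"active"`
	Events []string          `json:"events"`
	Config map[string]string `json:"config"`
}

// newSystemHookRequest builds a request for an instance-level hook.
// Without the is_system_webhook entry, Gitea creates a "default" webhook
// that is copied into new repositories instead of a system webhook.
func newSystemHookRequest(targetURL, secret string) createHookRequest {
	return createHookRequest{
		Type:   "gitea",
		Active: true,
		Events: []string{"workflow_job"}, // assumed event name
		Config: map[string]string{
			"url":               targetURL,
			"content_type":      "json",
			"secret":            secret,
			"is_system_webhook": "true",
		},
	}
}

func main() {
	req := newSystemHookRequest("https://garm.example.com/webhooks/example", "s3cret")
	body, err := json.MarshalIndent(req, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```

Note that the value is the string "true" rather than a boolean, since the config map holds string values.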

@Infinoid

Contributor

With the is_system_webhook=true config tweak, I deleted the default webhook, the old pool and the old forge instance, and then recreated them again, using the same commands as above. It created the right type of webhook:

image

With that in place, I triggered another job, garm spawned an ec2 worker and it ran to completion.

@gabriel-samfira

Member Author

Yup. There were several issues that needed fixing. Adding this requires changes in many places throughout the code.

One more thing: GARM tries to determine the endpoint from which a webhook originates. To do this, it looks at the job payload and compares the HTML URL in the payload with the base URL of the endpoint you defined in GARM. Make sure the endpoint uses the same base URL that appears in the job payload. You should be able to see a payload in gitea if you click on the webhook that gets added, after a hook is sent. If your endpoint base URL does not match what you have in the HTML URL, define a new endpoint in GARM.
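To make the base URL comparison concrete, here is a rough sketch of one way such a match could be done. It is an assumption about the general idea only: the matchesEndpoint helper is hypothetical, the URLs are examples, and GARM's real comparison may differ in detail.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// matchesEndpoint reports whether htmlURL (taken from a job payload) lives
// under baseURL (the endpoint's configured base URL). Scheme and host are
// compared case-insensitively; the base path must be a path prefix.
// URLs that fail to parse are treated as non-matching.
func matchesEndpoint(baseURL, htmlURL string) bool {
	base, err := url.Parse(baseURL)
	if err != nil {
		return false
	}
	target, err := url.Parse(htmlURL)
	if err != nil {
		return false
	}
	if !strings.EqualFold(base.Scheme, target.Scheme) || !strings.EqualFold(base.Host, target.Host) {
		return false
	}
	basePath := strings.TrimSuffix(base.Path, "/")
	return target.Path == basePath || strings.HasPrefix(target.Path, basePath+"/")
}

func main() {
	payloadURL := "https://gitea.my-site/org/repo/actions/runs/1"
	fmt.Println(matchesEndpoint("https://gitea.my-site", payloadURL))
	fmt.Println(matchesEndpoint("https://gitea.internal.example", payloadURL))
}
```

The second call shows the failure mode described above: an endpoint defined with an internal hostname will not match payloads that carry the public one.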

When removing a Forge instance, the webhook is automatically cleaned up
just like for repos and orgs.

Fixed places where I forgot to set ForgeInstanceID in various structs.

Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
@gabriel-samfira gabriel-samfira force-pushed the add-forge-instance-entity branch from f1133c8 to 9c9f0da Compare June 30, 2026 16:12
@gabriel-samfira

Member Author

Grab the latest commit from this branch. There was also an issue with recording jobs and another with consuming them by the forge instance pool manager. Should work now. I need to also add tests.

@Infinoid

Contributor

Ok, trying 9c9f0da.

@Infinoid

Contributor

I think the solution is adding is_system_webhook=true to the Config map

Damn, I felt so clever with this, but it looks like you got there 6 minutes earlier. 😛

@gabriel-samfira

gabriel-samfira commented Jun 30, 2026

Member Author

I think the solution is adding is_system_webhook=true to the Config map

Damn, I felt so clever with this, but it looks like you got there 6 minutes earlier. 😛

Still clever! Not a race! And you were right. That was the fix. I didn't even notice that there were 2 types of webhooks until you mentioned system webhooks 😀

But after adding that, runners were still not being spun up correctly. The pool manager ignored the job. The watcher wasn't recording the job, so no runner was being spun up in a zero idle runner pool. There were some places where I forgot to populate the ForgeInstanceID. Hopefully I got them all. Most of this PR is the generated client code for the web UI and golang. But it's still a pretty big mechanical change. So it's easy to miss stuff.

@Infinoid

Contributor

But after adding that, runners were still not being spun up correctly. The pool manager ignored the job. The watcher wasn't recording the job, so no runner was being spun up in a zero idle runner pool.

I had noticed that in the previous round of testing, actually. It started working when I set --min-idle-workers=1.

With 9c9f0da, it works even when --min-idle-workers=0. Looking good!

(Though it waits 30 seconds before it starts to spawn a runner; is that configurable?)

@gabriel-samfira

Member Author

(Though it waits 30 seconds before it starts to spawn a runner; is that configurable?)

yes. You can edit that in the WebUI --> Dashboard --> Controller Information --> Settings --> Minimum Job Age Backoff (seconds) (set to 0)

or you can just:

garm-cli controller update --minimum-job-age-backoff 0
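As a rough mental model of what this setting controls, a queued job is only acted on once it has reached a minimum age. The sketch below is an assumption about that idea, not GARM's code; the shouldSpawn helper and the sample timestamp are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// exampleQueuedAt is a fixed sample timestamp used for the demonstration.
var exampleQueuedAt = time.Date(2026, time.June, 30, 13, 11, 35, 0, time.UTC)

// shouldSpawn reports whether a job queued at queuedAt is old enough at
// time now to justify spawning a runner, given the configured backoff.
// A backoff of zero means the job qualifies immediately.
func shouldSpawn(queuedAt, now time.Time, backoff time.Duration) bool {
	return now.Sub(queuedAt) >= backoff
}

func main() {
	now := exampleQueuedAt.Add(10 * time.Second)
	fmt.Println(shouldSpawn(exampleQueuedAt, now, 30*time.Second))
	fmt.Println(shouldSpawn(exampleQueuedAt, now, 0))
}
```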

@gabriel-samfira

Member Author

Let me know how it works for you and whether you see any weirdness. If not, I can just merge this and improve it incrementally later.

@Infinoid

Infinoid commented Jul 1, 2026

Contributor

Well, so far so good. But I'll put it into heavier rotation and let you know what we see.

Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
@gabriel-samfira gabriel-samfira force-pushed the add-forge-instance-entity branch from 18a3507 to 555dc6a Compare July 1, 2026 10:30
@gabriel-samfira

Member Author

merging this as-is. If you see anything weird, we can fix in a separate PR.

@gabriel-samfira gabriel-samfira merged commit 2d1a7af into cloudbase:main Jul 1, 2026
5 checks passed
@gabriel-samfira gabriel-samfira deleted the add-forge-instance-entity branch July 1, 2026 10:43
@Infinoid

Infinoid commented Jul 1, 2026

Contributor

Sounds good, thanks!
