[Bug][CircleCI] collectWorkflows aborts entire sync when pipeline workflow API returns HTTP 500 #8948

Description

@jbsmith7741

Search before asking

  • I had searched in the issues and found no similar issues.

What happened

The CircleCI collectWorkflows subtask fails when DevLake calls
GET /v2/pipeline/{pipeline_id}/workflow for a pipeline whose workflow
endpoint returns HTTP 500 from CircleCI, even though GET /v2/pipeline/{id}
returns valid metadata (200 OK with a populated body).

The subtask retries three times, then aborts. Because collectWorkflows runs
before collectJobs, no workflows or jobs are collected for the entire
project on that run, so CI/CD and DORA metrics go stale.

Environment

DevLake version            v1.0.3-beta12
Plugin                     circleci
Database                   MySQL 8.x
CircleCI deployment type   CircleCI Server (self-hosted), not CircleCI Cloud
CircleCI Server version    4.9.4
Trigger                    Project blueprint data collection (full or incremental)

Note on CircleCI Server vs Cloud: This bug is reproducible on a
self-hosted CircleCI Server instance. It has not been verified on
CircleCI Cloud (circleci.com), but the DevLake code path is identical
for both. CircleCI Server exposes the same /v2/ API surface; the broken
workflow endpoint behaviour described here may be specific to self-hosted
versions where individual pipeline records can become corrupt or stuck.

Error / logs

subtask collectWorkflows ended unexpectedly
caused by: Retry exceeded 3 times calling /v2/pipeline/<pipeline-id>/workflow.
The last error was: Http DoAsync error calling [method:GET path:/v2/pipeline/<pipeline-id>/workflow query:map[]].
Response: {"message":"Internal Server Error"} (500)

Reproduced against CircleCI Server API directly

Pipeline metadata succeeds:

curl -s -H "Circle-Token: $TOKEN" \
  "https://<your-circleci-server>/api/v2/pipeline/<pipeline-id>"
# Returns: {"id":"<pipeline-id>","errors":[],"project_slug":"gh/<org>/<repo>","state":"created", ...}

Workflow list fails with 500:

curl -s -H "Circle-Token: $TOKEN" \
  "https://<your-circleci-server>/api/v2/pipeline/<pipeline-id>/workflow"
# Returns: {"message":"An internal server error occurred."} (HTTP 500)

The same pipeline ID returns 200 on the /pipeline/{id} endpoint but 500 on
/pipeline/{id}/workflow. This is a CircleCI Server-side condition (corrupt
or stuck pipeline record) that DevLake cannot prevent, but must handle
gracefully.

Affected pipeline (example shape)

Field         Value
Pipeline ID   <uuid> (valid, returned by /project/{slug}/pipeline pagination)
Project       gh/<org>/<repo>
State         created (stuck; workflows never materialized)
Trigger       Webhook / pull request event

Pipelines in created state with no workflows are a known occurrence on
CircleCI Server when a webhook fires but the server fails to create workflow
records internally.

Root cause (DevLake side)

collectWorkflows iterates every row in _tool_circleci_pipelines for the
project on full sync (no SyncPolicy.TimeAfter filter on the DB query). For each
row it calls the workflow API. The plugin only skips 404 responses via
ignoreDeletedBuilds in shared.go. A 500 is not skipped, so one bad
pipeline record kills the subtask.

This is not the same bug as #8907
(empty workflow ID → /v2/workflow//job 500 in collectJobs).

Root cause (CircleCI Server side)

For at least one pipeline, CircleCI Server returns 500 on the workflow endpoint
while pipeline metadata is available. The likely cause is a corrupt or stuck
pipeline record on the server (the pipeline has stayed in the created state
since it was created, and its workflows never materialized).
This has been observed on a self-hosted CircleCI Server instance. It is unclear
whether CircleCI Cloud can produce this condition.

What do you expect to happen

  1. When CircleCI returns 404 or 500 for a single pipeline's workflow
    endpoint, DevLake should log and skip that pipeline and continue collecting
    workflows for the rest of the project.
  2. collectWorkflows should respect the blueprint Data Time Range
    (SyncPolicy.TimeAfter) when choosing which _tool_circleci_pipelines rows
    to iterate, so full sync does not call the workflow API for every historical
    pipeline row ever stored in the tool table.
  3. A single bad pipeline on CircleCI Server should not block CI/CD collection
    for an entire project.

How to reproduce

  1. Configure a CircleCI connection pointing at a CircleCI Server instance.
  2. Ensure _tool_circleci_pipelines contains at least one pipeline ID where
    GET /v2/pipeline/{id} returns 200 but
    GET /v2/pipeline/{id}/workflow returns 500.
    • These are typically pipelines in created state with no associated workflows,
      caused by a failed or corrupt webhook trigger on the server.
  3. Run CircleCI data collection for that project (full sync is the most reliable
    trigger because collectWorkflows iterates all DB pipeline rows without a
    time filter).
  4. Observe collectWorkflows fail with retry-exceeded 500; collectJobs and
    downstream converters do not run for the entire project.

To find candidate pipelines on your CircleCI Server instance:

# List project pipelines and look for state=created with no items in /workflow
curl -s -H "Circle-Token: $TOKEN" \
  "https://<your-circleci-server>/api/v2/project/gh/<org>/<repo>/pipeline" \
  | jq '.items[] | select(.state=="created") | .id'

# Then test each candidate:
curl -s -H "Circle-Token: $TOKEN" \
  "https://<your-circleci-server>/api/v2/pipeline/<candidate-id>/workflow"

Anything else

Operator workaround (per pipeline)

Delete the bad pipeline row from the tool table, then re-sync:

DELETE FROM _tool_circleci_pipelines
WHERE id = '<pipeline-id-returning-500>';

This is not durable — new bad records or full-sync iteration over remaining
historical rows can trigger the same failure again.

Related issues (not duplicates)

Issue   Relationship
#7797   collectWorkflows 404 after retention; time-range fix on collectPipelines only - closed
#8907   /v2/workflow//job 500 in collectJobs (empty workflow ID) - closed, #8912
#8309   Malformed workflow JSON in convert phase - closed

No open issue covers 500 on /v2/pipeline/{valid-id}/workflow in
collectWorkflows.

Frequency

Occurs whenever collection reaches a pipeline with a broken workflow endpoint.
On full sync over a project with an extended pipeline history in
_tool_circleci_pipelines, the probability of hitting such a record increases
significantly. Projects that have been active for over a year are most at risk.

Version

v1.0.3-beta12

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
