
Conversation


@jcpunk (Contributor) commented Jan 2, 2026

There is a lot going on here.

I recommend reviewing the commits one at a time rather than the full diff as a blob.

In theory, each commit is a rational step with clear, self-contained logic.

Copilot and CodeRabbit identified bits of code that probably never worked. These could likely be dropped, but that would require an expert review.

The changes to the application-specific code should probably be reviewed.
With NP04 offline right now, I couldn't check some things against the current runtime.

To be clear, I really only care about the containers...

I've tried to set the Docker build step for the microservices container to run after the microservices_dependencies container. Those workflows won't really work until this PR is merged.


@eflumerf (Member) left a comment


Things look reasonable to me; however, I do not have any kind of testing environment, so I can't speak much to functionality.

@jcpunk jcpunk force-pushed the modernize branch 12 times, most recently from 72b65f9 to c8d8f52 on January 2, 2026 22:19
@jcpunk jcpunk marked this pull request as draft January 3, 2026 05:12
@jcpunk jcpunk force-pushed the modernize branch 12 times, most recently from 284efa9 to 1258c6a on January 3, 2026 20:08
@jcpunk jcpunk force-pushed the modernize branch 3 times, most recently from 962d078 to 016b2d5 on January 6, 2026 18:16
* Fix Python 3.9 compatibility by replacing union syntax with Optional

Co-authored-by: jcpunk <3534830+jcpunk@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: jcpunk <3534830+jcpunk@users.noreply.github.com>
@jcpunk jcpunk marked this pull request as ready for review January 6, 2026 18:43

jcpunk commented Jan 6, 2026

Autogenerated by CodeRabbit:

📊 Executive Summary

This PR represents a major modernization initiative touching every service in the repository. The changes fall into five key categories:

1. 🐋 Docker Infrastructure Overhaul

  • Old: Root Dockerfile + separate workflows for dependencies and microservices
  • New: Consolidated dockerfiles/ directory with layered build approach
  • New Workflow: build_docker_layers.yaml with SBOM/provenance attestation
  • Impact: Complete CI/CD pipeline rebuild required

2. 🔐 Security Hardening (Kubernetes)

  • Pod security elevated: baseline → restricted
  • All deployments now enforce: runAsNonRoot, readOnlyRootFilesystem, dropped capabilities
  • Resource requests/limits added across all services
  • Temporary volumes properly configured with size limits

3. 🔧 Configuration Standardization

  • Breaking: Unified database connection via DATABASE_URI (replacing individual params)
  • New Required: APP_DATA environment variable for persistent storage paths
  • Impact: All secrets and deployment manifests must be updated

4. 💻 Code Modernization

  • SQLAlchemy: ORM → Core queries where appropriate
  • UTC handling: datetime.utcnow() → utc_now() utility
  • Type hints added throughout
  • Entrypoints: Switched to exec pattern for proper signal handling
  • Error handling: Generic exceptions → specific types

5. 📉 Service Changes & Removals

  • Removed: ers-dbwriter/ (entire service)
  • Removed: opmon-dbwriter/kafka-to-influx.py
  • Removed Docs: README_config-service.md, README_ers-dbwriter.md
  • Port Change: runnumber-rest (5005 → 5000)
  • Export Rename: RunRegistryConfig → RunRegistryConfigs
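To make the code-modernization item concrete, the ORM-to-Core shift generally looks like the following sketch; the table, column names, and in-memory SQLite engine are illustrative stand-ins, not the repository's actual schema:

```python
from sqlalchemy import Column, Integer, MetaData, String, Table, create_engine, select

metadata = MetaData()
# Placeholder table standing in for something like the run-number schema
runs = Table(
    "runs",
    metadata,
    Column("run_number", Integer, primary_key=True),
    Column("detector", String),
)

engine = create_engine("sqlite:///:memory:")  # stand-in for DATABASE_URI
metadata.create_all(engine)

with engine.begin() as conn:
    conn.execute(runs.insert().values(run_number=1, detector="np04"))
    # SQLAlchemy 2.0 Core style: select() statements instead of session.query()
    latest = conn.execute(
        select(runs.c.run_number).order_by(runs.c.run_number.desc())
    ).scalar_one()
```

Core statements are explicit about the SQL they emit, which is the usual motivation for this migration in thin REST services.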

🗺️ Human Reviewer Roadmap

Phase 1: Infrastructure & Build System (Est. 45 min)

Goal: Understand the new build and deployment model

  1. Review Docker Changes (20 min)

    • Read .github/workflows/build_docker_layers.yaml
    • Compare dockerfiles/Dockerfile.dependencies with old microservices-dependencies.dockerfile
    • Review dockerfiles/Dockerfile and understand the two-layer approach
    • Verify dockerfiles/requirements.txt dependency updates
    • Check dockerfiles/cern.repo for CERN-specific YUM configuration
    • Key Question: Does your CI/CD reference the old workflows?
  2. Review Entrypoint Changes (15 min)

    • Read entrypoint.sh - new strict error handling
    • Review entrypoint_functions.sh - new ensure_required_variables() with sensitive var masking
    • Key Question: Are environment validation improvements adequate?
  3. Deployment Manifest Changes (10 min)

    • Pick one service (e.g., runnumber-rest/runnumber-rest-deployment.yaml)
    • Review security context changes
    • Note volume mount patterns (emptyDir for /tmp, data volumes)
    • Key Question: Do you need persistent volumes instead of emptyDir?
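The actual ensure_required_variables() is a shell function in entrypoint_functions.sh; the behavior described (fail fast on unset variables, redact values whose names look sensitive) can be sketched in Python like this — the name-matching heuristic here is an assumption, not the script's real rule:

```python
import os
import re

# Assumed heuristic for which variable names count as sensitive
SENSITIVE = re.compile(r"PASS|SECRET|TOKEN|KEY", re.IGNORECASE)

def ensure_required_variables(names, env=None):
    """Fail fast if any variable is unset; return values with sensitive ones masked."""
    env = os.environ if env is None else env
    missing = [n for n in names if not env.get(n)]
    if missing:
        raise RuntimeError("Missing required variables: " + ", ".join(missing))
    return {n: "********" if SENSITIVE.search(n) else env[n] for n in names}
```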

Phase 2: Breaking Changes & Migration (Est. 60 min)

Goal: Identify impact on your systems

  1. Environment Variables Audit (20 min)

    • ers-protobuf-dbwriter: Map old vars to DATABASE_URI
      • Old: ERS_DBWRITER_HOST, _PORT, _USER, _PASS, _NAME
      • New: Single DATABASE_URI
    • runregistry-rest: Note new APP_DATA requirement
    • opmon-protobuf-dbwriter: Map InfluxDB params to DATABASE_URI
    • Action Item: Update your Kubernetes secrets
  2. API Signature Changes (20 min)

    • runregistry-rest/database.py: Export list changed
      • Old: ['RunRegistryConfig', 'RunRegistryMeta']
      • New: ["RunRegistryConfigs", "RunRegistryMeta", "utc_now"]
    • elisa-logbook/credmgr.py: Review signature changes
      • get_login(service, user=None) - new optional user param
      • new_kerberos_ticket(...) - expanded signature
    • ers-protobuf-dbwriter/dbwriter.py: CLI changed from individual DB params to --db-uri
    • Action Item: Identify dependent code
  3. Service Removals (20 min)

    • ers-dbwriter: Entire directory removed
      • Review ers-dbwriter/dbwriter.py in diff to understand what was lost
      • Check if ers-protobuf-dbwriter is the replacement
    • opmon-dbwriter/kafka-to-influx.py: Script removed
      • Check if opmon-protobuf-dbwriter covers this functionality
    • Documentation: docs/README_config-service.md and docs/README_ers-dbwriter.md removed
    • Action Item: Verify no active deployments use removed services
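The secret migration in step 1 amounts to collapsing the five legacy variables into one URI. A hypothetical helper for planning that migration (the URI scheme, default port, and helper name are assumptions; check dbwriter.py for the exact format it expects):

```python
import os
from urllib.parse import quote_plus

def legacy_vars_to_database_uri(env=None):
    """Assemble a single DATABASE_URI from the old ERS_DBWRITER_* variables."""
    env = os.environ if env is None else env
    host = env["ERS_DBWRITER_HOST"]
    port = env.get("ERS_DBWRITER_PORT", "5432")  # assumed default
    user = env["ERS_DBWRITER_USER"]
    password = quote_plus(env["ERS_DBWRITER_PASS"])  # escape special characters
    name = env["ERS_DBWRITER_NAME"]
    return f"postgresql://{user}:{password}@{host}:{port}/{name}"
```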

Phase 3: Service-by-Service Review (Est. 90 min)

Goal: Validate code quality and logic changes

  1. elisa-logbook (20 min)

    • credmgr.py: Credential management improvements
    • logbook.py: New get_required_env() helper and validation
    • elisa.py: Error handling with ElisaError
    • entrypoint.sh: Gunicorn now with exec, timeout 9000s
    • elisa-logbook-deployment.yaml: Security + volume changes
  2. ers-protobuf-dbwriter (15 min)

    • dbwriter.py: PostgreSQL → SQLAlchemy Core migration
    • Note new retry logic with MAX_RETRIES
    • Review table lifecycle functions
    • entrypoint.sh: Updated to use --db-uri
  3. opmon-protobuf-dbwriter (15 min)

    • dbwriter.py: InfluxDB connection now uses URI
    • Improved error handling (ConnectionError, TimeoutError)
    • entrypoint.sh: Consolidated to --influxdb-uri
  4. runnumber-rest (20 min)

    • api.py: ORM → Core queries, authentication added
    • database.py: New utc_now() utility
    • rest.py: Port changed 5005 → 5000
    • entrypoint.sh: Timeout reduced from 5000000000 → 9000
    • Action Item: Update service discovery for new port
  5. runregistry-rest (20 min)

    • api.py: ORM → Core queries, upload path handling
    • database.py: Export rename and utc_now() addition
    • entrypoint.sh: Now requires APP_DATA, timeout 9000s
    • Review file handling improvements
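The MAX_RETRIES pattern mentioned for ers-protobuf-dbwriter can be sketched as a generic retry loop with linear backoff (the 0.1 * attempt delay matches the snippet quoted later in this thread; the function names and retry count are illustrative):

```python
import logging
import time

MAX_RETRIES = 3  # the PR's actual value may differ

def deliver_with_retry(write, payload, max_retries=MAX_RETRIES, sleep=time.sleep):
    """Attempt a write, retrying transient failures with linear backoff."""
    for attempt in range(1, max_retries + 1):
        try:
            return write(payload)
        except Exception:
            if attempt >= max_retries:
                logging.error("Failed to deliver after %d attempts", attempt)
                raise
            sleep(0.1 * attempt)  # wait a bit longer before each retry
```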

Phase 4: Tooling & Edge Cases (Est. 20 min)

Goal: Understand supporting tools

  1. PostgreSQL Partition Tools (20 min)
    • tools/postgresql_partitions_for_all_tables.sh: New orchestrator
    • tools/sample_postgresql_partitions_for_table.sh: Enhanced validation
    • Note input sanitization and transaction framing
    • Key Question: Do you need to run partitioning for existing tables?
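The sanitization and transaction framing noted above boil down to: validate the identifier before interpolating it into SQL, and wrap the DDL in BEGIN/COMMIT. A Python rendering of that idea (the identifier rule and DDL shape are assumptions; the real logic lives in the shell scripts):

```python
import re

IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")  # assumed safe-identifier rule

def monthly_partition_ddl(table, year, month):
    """Build transaction-wrapped DDL for one monthly partition, rejecting unsafe names."""
    if not IDENT.match(table):
        raise ValueError(f"Unsafe table name: {table!r}")
    next_year, next_month = (year + 1, 1) if month == 12 else (year, month + 1)
    return [
        "BEGIN;",
        f"CREATE TABLE IF NOT EXISTS {table}_{year}_{month:02d} "
        f"PARTITION OF {table} "
        f"FOR VALUES FROM ('{year}-{month:02d}-01') TO ('{next_year}-{next_month:02d}-01');",
        "COMMIT;",
    ]
```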

Phase 5: Testing & Validation Plan (Est. 30 min)

Goal: Create deployment validation checklist

  1. Create Your Test Plan (30 min)
    • Build Test: Can you build the new Docker images?
    • Secret Migration: Have you created new secret format?
    • Deployment Test: Can you deploy to a test namespace?
    • Functional Tests:
      • elisa-logbook: Can create threads and replies?
      • runnumber-rest: Can get/create run numbers?
      • runregistry-rest: Can upload configs?
      • ers-protobuf-dbwriter: Receiving and storing ERS messages?
      • opmon-protobuf-dbwriter: Writing to InfluxDB?
    • Security Validation: Pods running as non-root?
    • Resource Monitoring: Are limits appropriate?

🚦 Risk Assessment

| Risk Level | Area | Mitigation |
| --- | --- | --- |
| 🔴 HIGH | Docker build changes | Test build pipeline before merge |
| 🔴 HIGH | Environment variable changes | Comprehensive secret migration guide |
| 🔴 HIGH | Removed services | Confirm deprecation or provide alternatives |
| 🟡 MEDIUM | API signature changes | Identify and update all consumers |
| 🟡 MEDIUM | Port changes | Update ingress/service discovery |
| 🟢 LOW | Code quality improvements | Low risk, high value |
| 🟢 LOW | Security hardening | Beneficial; test pod startup |

📝 Pre-Merge Checklist

  • Migration Guide Created: Document all breaking changes
  • Secrets Updated: New format for all environments
  • CI/CD Updated: New workflow references
  • Staging Tested: All services deployed and validated
  • Rollback Plan: Document how to revert if needed
  • Team Notification: Alert all stakeholders of changes
  • Documentation: Restore or replace removed READMEs

⏱️ Estimated Review Time

  • Quick Pass (focus on breaking changes only): 1 hour
  • Thorough Review (following roadmap): 4 hours
  • Deep Dive (with testing): 1-2 days

I'll now trigger a complete re-review of all commits to provide fresh insights.

✅ Actions performed

Full review triggered.


Copilot AI left a comment


Pull request overview

This PR modernizes the microservices container infrastructure with significant improvements to security, maintainability, and best practices. The changes focus on containerization improvements, database access pattern updates, and shell script hardening.

Key Changes:

  • Migrated from AlmaLinux 9 to AlmaLinux 10 (requires verification) with multi-stage Docker builds separating dependencies from application code
  • Enhanced Kubernetes security posture by enforcing restricted pod security standards, adding read-only root filesystems, and implementing proper user/group configurations
  • Modernized database access by migrating from legacy query patterns to SQLAlchemy 2.0's select() API and consolidating database URIs

Reviewed changes

Copilot reviewed 42 out of 43 changed files in this pull request and generated 15 comments.

| File | Description |
| --- | --- |
| dockerfiles/Dockerfile.dependencies | New multi-stage build for dependency layer using AlmaLinux 10 base |
| dockerfiles/Dockerfile | Application layer that builds on dependencies image |
| dockerfiles/requirements.txt | Streamlined Python dependencies with updated versions |
| .github/workflows/build_docker_layers.yaml | New CI/CD workflow with conditional dependency rebuilds |
| entrypoint_functions.sh | Enhanced variable validation with sensitive data redaction |
| runregistry-rest/, runnumber-rest/, elisa-logbook/* | Database access modernization and security improvements |
| ers-protobuf-dbwriter/dbwriter.py | Complete rewrite using SQLAlchemy with retry logic |
| opmon-protobuf-dbwriter/dbwriter.py | Consolidated database configuration using URI pattern |
| */deployment.yaml | Enhanced security contexts and resource management |
| tools/sample_postgresql_partitions_for_table.sh | Added input validation and transaction wrapping |
| tools/postgresql_partitions_for_all_tables.sh | New batch processing script for partition management |


Comment on lines +146 to +155
```python
if table_recreated:
    # Already tried recreating, this is a persistent schema issue
    logging.error("Table already recreated, schema issue persists")
if attempt >= MAX_RETRIES:
    logging.error("Failed to deliver issue after all retry attempts")
    logging.error(pb_json.MessageToJson(chain))
    raise
# Wait a bit before retrying
time.sleep(0.1 * attempt)
continue
```

Copilot AI Jan 6, 2026


The condition checks if table_recreated is True, but table_recreated is only set within the except ProgrammingError block. If a ProgrammingError occurs again after table recreation, the code will correctly detect it. However, the logic could be clearer by moving the table_recreated check after attempting recreation, not before, to avoid confusion about when it's actually being used to prevent infinite recreation loops.
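One way to read that suggestion, keeping the names from the quoted snippet but with the surrounding loop assumed rather than copied from the PR:

```python
# Hedged sketch: consult table_recreated only at the moment a second
# recreation would be attempted, so the guard's purpose is explicit.
def recreate_once(recreate_table, table_recreated):
    """Recreate the table on first ProgrammingError; refuse a second recreation."""
    if table_recreated:
        # A recreation already happened for this message, so the
        # schema problem is persistent -- stop retrying recreation.
        raise RuntimeError("Table already recreated, schema issue persists")
    recreate_table()
    return True  # caller stores this as the new table_recreated flag
```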

Copilot uses AI. Check for mistakes.
Comment on lines +83 to +86
```yaml
needs: [build-dependencies]
if: |
  always() &&
  (needs.build-dependencies.result == 'success' || needs.build-dependencies.result == 'skipped')
```

Copilot AI Jan 6, 2026


The job build-microservices has needs: [build-dependencies] which means it will always wait for the build-dependencies job, even if build-dependencies is skipped. The if condition handles this with needs.build-dependencies.result == 'skipped', but this creates unnecessary job dependencies. Consider restructuring to make the dependency conditional only when build-dependencies actually runs.

Comment on lines +36 to +43
```yaml
(github.event_name == 'push' && github.ref == format('refs/heads/{0}', github.event.repository.default_branch) && (
  contains(github.event.head_commit.modified, 'dockerfiles/requirements.txt') ||
  contains(github.event.head_commit.added, 'dockerfiles/requirements.txt') ||
  contains(github.event.head_commit.modified, 'dockerfiles/Dockerfile.dependencies') ||
  contains(github.event.head_commit.added, 'dockerfiles/Dockerfile.dependencies') ||
  contains(github.event.head_commit.modified, 'dockerfiles/Dockerfile') ||
  contains(github.event.head_commit.added, 'dockerfiles/Dockerfile')
))
```

Copilot AI Jan 6, 2026


The workflow condition on lines 36-42 checks for modified/added files in the head_commit, but this may not work correctly for all push events. The github.event.head_commit object is not always populated (e.g., for force pushes). Consider using a paths filter at the workflow level or checking out the code and using git diff to determine if relevant files changed.

Suggested change

Before:

```yaml
(github.event_name == 'push' && github.ref == format('refs/heads/{0}', github.event.repository.default_branch) && (
  contains(github.event.head_commit.modified, 'dockerfiles/requirements.txt') ||
  contains(github.event.head_commit.added, 'dockerfiles/requirements.txt') ||
  contains(github.event.head_commit.modified, 'dockerfiles/Dockerfile.dependencies') ||
  contains(github.event.head_commit.added, 'dockerfiles/Dockerfile.dependencies') ||
  contains(github.event.head_commit.modified, 'dockerfiles/Dockerfile') ||
  contains(github.event.head_commit.added, 'dockerfiles/Dockerfile')
))
```

After:

```yaml
(github.event_name == 'push' && github.ref == format('refs/heads/{0}', github.event.repository.default_branch) &&
 github.event.head_commit != null &&
 (
  contains(github.event.head_commit.modified, 'dockerfiles/requirements.txt') ||
  contains(github.event.head_commit.added, 'dockerfiles/requirements.txt') ||
  contains(github.event.head_commit.modified, 'dockerfiles/Dockerfile.dependencies') ||
  contains(github.event.head_commit.added, 'dockerfiles/Dockerfile.dependencies') ||
  contains(github.event.head_commit.modified, 'dockerfiles/Dockerfile') ||
  contains(github.event.head_commit.added, 'dockerfiles/Dockerfile')
))
```

```diff
 ensure_required_variables "DATABASE_URI"

-exec gunicorn -b 0.0.0.0:5000 --workers=1 --worker-class=gevent --timeout 5000000000 --log-level=debug rest:app
+exec gunicorn -b 0.0.0.0:5000 --workers=1 --worker-class=gevent --timeout 9000 --log-level=debug rest:app
```

Copilot AI Jan 6, 2026


The timeout value of 9000 seconds (2.5 hours) is extremely high for a web service. This could lead to resource exhaustion as connections hold workers for extended periods. Consider using a more reasonable timeout value (e.g., 300-600 seconds) or implementing proper request timeout handling at the application level instead of relying on such a high gunicorn timeout.

Comment on lines +8 to +10
```python
def utc_now():
    # Returns naive UTC datetime matching your existing schema (no timezone)
    return datetime.datetime.now(datetime.timezone.utc).replace(tzinfo=None)
```

Copilot AI Jan 6, 2026


The deprecated datetime.datetime.utcnow() has been replaced with timezone-aware datetime that's then made naive by removing tzinfo. While this works, consider keeping the timezone information throughout the application for better timestamp handling, especially in a distributed system. If naive timestamps are required for database compatibility, document why tzinfo is being removed.

Comment on lines +28 to +29
```python
def start_new_thread(
    self, subject: str, body: str, command: str, author: str, systems: list[str]
```

Copilot AI Jan 6, 2026


The type hint list[str] uses Python 3.9+ syntax. Ensure the base Docker image and all deployment environments support Python 3.9 or later, otherwise use typing.List[str] for backward compatibility.

Comment on lines +188 to +192
```diff
 def get_login(self, service: str, user: Optional[str] = None):
     for auth in self.authentications:
         if service == auth.service:
-            return auth
-    self.log.error(f"Couldn't find login for service: {service}")
+            if user is None or user == auth.username:
+                return auth
```

Copilot AI Jan 6, 2026


The function get_login has been modified to accept an optional user parameter but the old logic checking username against auth.user should check auth.username (as seen on line 201) for consistency. Verify this change doesn't break existing callers.

Comment on lines +99 to +101
```shell
  echo "dep_tag=$(git rev-parse --short HEAD)" >> "${GITHUB_OUTPUT}"
else
  echo "dep_tag=latest" >> "${GITHUB_OUTPUT}"
```

Copilot AI Jan 6, 2026


The logic for determining dep_tag uses the result of the build-dependencies job. When skipped, it falls back to 'latest', but this might not be the correct tag if build-dependencies was explicitly skipped due to no dependency changes. The 'latest' tag might not exist yet or might be stale. Consider pulling the actual latest commit SHA of the dependencies image from the container registry or using a more robust fallback strategy.

Suggested change

Before:

```shell
  echo "dep_tag=$(git rev-parse --short HEAD)" >> "${GITHUB_OUTPUT}"
else
  echo "dep_tag=latest" >> "${GITHUB_OUTPUT}"
```

After:

```shell
  # Dependencies image was rebuilt for this commit, so use current short SHA
  echo "dep_tag=$(git rev-parse --short HEAD)" >> "${GITHUB_OUTPUT}"
else
  # Dependencies were not rebuilt; find the most recent dependencies image tag from GHCR
  GH_ORG="DUNE-DAQ"
  PACKAGE_NAME="microservices_dependencies"
  API_URL="https://api.github.com/orgs/${GH_ORG}/packages/container/${PACKAGE_NAME}/versions?per_page=1"
  # Fetch latest container package version metadata from GitHub Container Registry
  latest_tag=$(
    curl -fsSL \
      -H "Authorization: Bearer ${{ github.token }}" \
      -H "Accept: application/vnd.github+json" \
      -H "X-GitHub-Api-Version: 2022-11-28" \
      "${API_URL}" | \
    jq -r '.[0].metadata.container.tags[] | select(test("^[0-9a-f]{7}$"))' | \
    head -n 1
  )
  if [[ -n "${latest_tag}" && "${latest_tag}" != "null" ]]; then
    echo "dep_tag=${latest_tag}" >> "${GITHUB_OUTPUT}"
  else
    # Fallback: use 'latest' if no SHA-like tag was found
    echo "dep_tag=latest" >> "${GITHUB_OUTPUT}"
  fi
```


```shell
python3 ./logbook.py

exec gunicorn -b 0.0.0.0:5005 --workers=1 --worker-class=gevent --timeout 9000 --log-level=debug logbook:app
```

Copilot AI Jan 6, 2026


The timeout value of 9000 seconds (2.5 hours) is extremely high for a web service. This could lead to resource exhaustion as connections hold workers for extended periods. Consider using a more reasonable timeout value (e.g., 300-600 seconds) or implementing proper request timeout handling at the application level instead of relying on such a high gunicorn timeout.

Suggested change

```diff
-exec gunicorn -b 0.0.0.0:5005 --workers=1 --worker-class=gevent --timeout 9000 --log-level=debug logbook:app
+GUNICORN_TIMEOUT="${GUNICORN_TIMEOUT:-600}"
+exec gunicorn -b 0.0.0.0:5005 --workers=1 --worker-class=gevent --timeout "${GUNICORN_TIMEOUT}" --log-level=debug logbook:app
```

@DUNE-DAQ DUNE-DAQ deleted a comment from Copilot AI Jan 6, 2026


3 participants