
bug: SPIFFE service identifiers in InternalRBACRules not updated after carbide→NICo rename — all internal gRPC calls return 403 #1891

@vipulagarwal

Version

main

Describe the bug.

NOTE: Written with assistance from Claude Sonnet 4.6

Summary

After the platform rename from carbide to NICo, all deployed services present
SPIFFE identifiers using the nico-* prefix (e.g. nico-dns, nico-dhcp).
However, InternalRBACRules in crates/api/src/auth/internal_rbac_rules.rs
still matches against hardcoded carbide-* strings. As a result, every internal
service-to-API gRPC call fails mTLS authorization with HTTP 403, silently
breaking all service-to-service communication.

Affected services

All internal principals that authenticate via SpiffeServiceIdentifier:

| Service | Old (broken) identifier | Expected identifier |
| --- | --- | --- |
| DNS | carbide-dns | nico-dns |
| DHCP | carbide-dhcp | nico-dhcp |
| SSH Console | carbide-ssh-console | nico-ssh-console |
| SSH Console RS | carbide-ssh-console-rs | nico-ssh-console-rs |
| PXE | carbide-pxe | nico-pxe |
| Hardware Health | carbide-hardware-health | nico-hardware-health |
| RLA / Flow | carbide-rla | nico-rla |
| Maintenance Jobs | carbide-maintenance-jobs | nico-maintenance-jobs |
| DSX Exchange Consumer | carbide-dsx-exchange-consumer | nico-dsx-exchange-consumer |

Root cause

InternalRBACRules::principal_to_rule_principal() maps each RulePrincipal
variant to a Principal::SpiffeServiceIdentifier string. These strings were
hardcoded at implementation time and never updated when services were renamed:

// crates/api/src/auth/internal_rbac_rules.rs
RulePrincipal::Dns => {
    Principal::SpiffeServiceIdentifier("carbide-dns".to_string())  // wrong after rename
}

When nico-api validates an inbound gRPC call from nico-dns, it resolves the
presented SPIFFE URI spiffe://forge.local/forge-system/nico-dns → extracts
service name nico-dns → compares against carbide-dns → no match → 403.
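
The failing comparison can be sketched roughly as follows. This is a minimal illustration, not the actual nico-api code: `service_name_from_spiffe_uri` and `is_authorized` are hypothetical helpers, and treating the last path segment of the URI as the service name is an assumption based on the example URI above.

```rust
/// Hypothetical helper: returns the last path segment of a SPIFFE URI,
/// e.g. "nico-dns" for "spiffe://forge.local/forge-system/nico-dns".
/// Assumes the service name is always the final path segment.
fn service_name_from_spiffe_uri(uri: &str) -> Option<&str> {
    let rest = uri.strip_prefix("spiffe://")?;
    // Drop the trust domain; everything after the first '/' is the path.
    let (_trust_domain, path) = rest.split_once('/')?;
    path.rsplit('/').next().filter(|segment| !segment.is_empty())
}

/// Hypothetical check: the presented service name must equal the
/// identifier the RBAC rule expects, otherwise the call is denied.
fn is_authorized(presented_uri: &str, expected_service: &str) -> bool {
    service_name_from_spiffe_uri(presented_uri) == Some(expected_service)
}
```

With the stale rule, `is_authorized("spiffe://forge.local/forge-system/nico-dns", "carbide-dns")` is false, which corresponds to the 403 described above; with `"nico-dns"` as the expected value it is true.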

Impact

  • nico-dns → nico-api: LookupRecordLegacy denied — DNS resolution for
    provisioned hosts broken
  • nico-dhcp → nico-api: DHCP lease lookups denied — host boot broken
  • nico-pxe → nico-api: GetCloudInitInstructions denied — PXE boot broken
  • nico-ssh-console / nico-ssh-console-rs → nico-api: SSH console access denied
  • nico-hardware-health / nico-rla → nico-api: health reporting and maintenance
    scheduling denied
  • All failures surface as 403 with no indication that the SPIFFE identifier
    is the cause — no certificate error, no TLS handshake failure

Fix

Update all carbide-* strings to nico-* in InternalRBACRules:

File: crates/api/src/auth/internal_rbac_rules.rs
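
A sketch of what the corrected mapping might look like, assuming a shape similar to the snippet under "Root cause". Only `RulePrincipal::Dns` and `Principal::SpiffeServiceIdentifier` appear in the real code quoted above; the other variant names and the free function `spiffe_principal_for` are hypothetical stand-ins, and the types are redefined here only to keep the example self-contained.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
enum Principal {
    SpiffeServiceIdentifier(String),
}

// Only `Dns` is taken from the issue; `Dhcp` and `Pxe` are assumed names.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RulePrincipal {
    Dns,
    Dhcp,
    Pxe,
}

/// Hypothetical stand-in for the mapping in InternalRBACRules,
/// with the identifiers updated from carbide-* to nico-*.
fn spiffe_principal_for(principal: RulePrincipal) -> Principal {
    let name = match principal {
        RulePrincipal::Dns => "nico-dns",   // was "carbide-dns"
        RulePrincipal::Dhcp => "nico-dhcp", // was "carbide-dhcp"
        RulePrincipal::Pxe => "nico-pxe",   // was "carbide-pxe"
    };
    Principal::SpiffeServiceIdentifier(name.to_string())
}
```

The remaining services in the table above would need the same change.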

Prevention

These identifiers are stringly-typed and have no compile-time link to the actual
service names. Consider:

  1. Deriving SPIFFE identifiers from a shared constant / config rather than
    duplicating strings in both the service cert configuration and InternalRBACRules
  2. Adding an integration test that verifies each RulePrincipal variant resolves
    to a SPIFFE identifier that matches the cert subject of the corresponding deployed service
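
Both suggestions could be combined along these lines. This is only a sketch under assumptions: the constant names, the `ALL` array, `spiffe_service_name`, and `stale_principals` are hypothetical, and the list of deployed service names is passed in as a plain slice, whereas a real integration test would have to read it from the service cert configuration.

```rust
// Hypothetical shared constants: the single place a service name is spelled out.
const NICO_DNS: &str = "nico-dns";
const NICO_DHCP: &str = "nico-dhcp";
const NICO_PXE: &str = "nico-pxe";

// Only `Dns` is taken from the issue; the other variants are assumed names.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RulePrincipal {
    Dns,
    Dhcp,
    Pxe,
}

impl RulePrincipal {
    /// Every variant, so a check can iterate over all of them.
    const ALL: [RulePrincipal; 3] =
        [RulePrincipal::Dns, RulePrincipal::Dhcp, RulePrincipal::Pxe];

    /// Service name derived from the shared constants instead of a local literal.
    fn spiffe_service_name(self) -> &'static str {
        match self {
            RulePrincipal::Dns => NICO_DNS,
            RulePrincipal::Dhcp => NICO_DHCP,
            RulePrincipal::Pxe => NICO_PXE,
        }
    }
}

/// Returns the variants whose identifier does not appear among the
/// deployed service names, i.e. the rules that would produce a 403.
fn stale_principals(deployed: &[&str]) -> Vec<RulePrincipal> {
    RulePrincipal::ALL
        .iter()
        .copied()
        .filter(|p| !deployed.contains(&p.spiffe_service_name()))
        .collect()
}
```

A test could then assert that `stale_principals` returns an empty list for the deployed set, so a future rename that misses the RBAC rules fails in CI rather than in a live deployment.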

Minimum reproducible example

### Steps to reproduce

#### Live deployment

1. Deploy `nico-api` and `nico-dns` via `setup.sh`
2. Trigger any DNS lookup for a provisioned host (e.g. attempt a PXE boot). The underlying k8s host DNS was also observed to break, because all DNS requests go through the `nico-dns` pod.
3. Observe in `nico-api` logs:


WARN  auth::internal_rbac_rules — principal SpiffeServiceIdentifier("nico-dns") \
  not authorized for method LookupRecordLegacy — no matching rule

Metadata

Labels: bug
Status: In Progress