Skip to content

Add shared ereport types crate, more sequencer ereports#2399

Open
hawkw wants to merge 15 commits intomasterfrom
eliza/seq-ereports2
Open

Add shared ereport types crate, more sequencer ereports#2399
hawkw wants to merge 15 commits intomasterfrom
eliza/seq-ereports2

Conversation

@hawkw
Copy link
Member

@hawkw hawkw commented Feb 24, 2026

Presently, the compute sled sequencer tasks (cosmo_seq and
gimlet_seq) duplicate a large number of type definitions for ereport
messages. This is due to our present practice of defining a task's
ereports as one big enum of all the ereports it may send, which is
almost always unique to that task. This is unfortunate, as it means that
we must duplicate these messages when we would like them to be
consistent between multiple tasks.

Furthermore, the sequencer tasks currently only emit ereports for a
small subset of events that we care about --- mostly Vcore
regulator PMBus alerts. There are several other events that should
definitely be reported. I had previously attempted to add some new
ereports to the sequencer tasks in #2242, but at the time, neither
@rmustacc nor I were particularly satisfied with the content and class
hierarchy of those ereports, and I was unhappy with the fact that these
ereports added additional duplicated code between the two sequencer
tasks.

This PR is a second attempt to add some of the sequencer ereports I
would like, using the new API added in #2397 to define those ereport
messages in a shared crate, rather than duplicating them between the two
sequencer tasks. In particular, I've done the following:

  1. Introduce a new lib/ereports crate that contains shared ereport
    message type definitions, using the #[ereport(...)] attribute from
    microcbor: add #[ereport(...)] attribute #2397 to define the ereports as separate types.
  2. Refactor the existing gimlet_seq and cosmo_seq ereports to move
    the message types for PMBus alerts and BMR491 mitigation failures to
    the ereports crate, and use them in both sequencer tasks.
  3. Also refactor the psc_seq task to use the shared definition of
    PmbusStatus registers. I didn't change the actual ereport messages
    there, as they're pretty different from the CPU sequencer ones.
  4. Added tracking of when the sequencer transitioned to the current
    power state, so that this can be included in some of the ereports,
    as @rmustacc suggested in this comment.
  5. Add new ereports for the following events:
    • The CPU asserting its THERMTRIP_L pin
    • The CPU asserting its SMERR_L pin (on Cosmo, as I'm not sure if
      the Gimlet sequencer task can notice this)
    • Failure to sequence due to an unsupported CPU type (coretype and
      sp3r{x}/sp5r{x} pins)
  6. Assorted refactoring of the ereport code in both sequencers

I did not add ereports for MAPO or NIC MAPO events, as per a
conversation on Matrix with @rmustacc, I think we will want to report
MAPOs by adding better monitoring of POWER_GOOD status from various
power rails. Some of that is described in issues #2372, #2394, and
perhaps also #2398.

This does not actually close #2142, as that describes sequencing
failures due to timeouts waiting for POWER_GOOD on various rails,
which I will want to implement subsequently.

Closes #2242.

Comment on lines 182 to 226
@@ -219,17 +221,15 @@ fn main() -> ! {

if let Some(last_cause) = last_cause {
// Report the failure even if we eventually succeeded.
try_send_ereport(
&packrat,
&mut ereport_buf[..],
EreportClass::Bmr491MitigationFailure,
EreportKind::Bmr491MitigationFailure {
refdes: FixedStr::from_str(dev.component_id()),
failures,
last_cause,
succeeded,
},
);
let ereport = ereports::pwr::Bmr491MitigationFailure {
refdes: FixedStr::<{ REFDES_LEN }>::from_str(
dev.component_id(),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit annoying that these are generic over REFDES_LEN, since it means that the downstream task must specify it both in the max_cbor_len_for! macro and when actually constructing the FixedStrs (as the type is not inferred there, since it could be any FixedStr).

The alternative to this would be making the ereports crate depend on build_i2c, so that it can get the MAX_COMPONENT_ID_LEN value from there. But, that felt a bit gross for the lib that just defines a bunch of types to have to go build all the I2C stuff. Also, there might be some cases where a task needs to override the length in order to use a refdes that isn't coming from the I2C config. For instance, when we wire up the "look up refdes for a sensor ID" stuff in #2364 (which i started on in #2383), a sensor which is not an I2C device may have a longer refdes than any of the I2C devices in the config (unlikely but who knows!). Hmm.

@hawkw
Copy link
Member Author

hawkw commented Feb 26, 2026

Testing out the SMERR_L and THERMTRIP IRQ ereports on Cosmo, by manually poking the sequencer FPGA Interrupt Generation Register (IGR):

eliza@jeeves ~ $ pfexec humility hiffy -c FmcDemo.poke32 -a addr=0xc0000208,value=0x2
humility: WARNING: archive in environment variable overriding archive in environment file
humility: attached to 1fc9:0143:ISS1EG0OBG2HT via CMSIS-DAP
FmcDemo.poke32() => ()
eliza@jeeves ~ $ pfexec humility hiffy -c Sequencer.get_state
humility: WARNING: archive in environment variable overriding archive in environment file
humility: attached to 1fc9:0143:ISS1EG0OBG2HT via CMSIS-DAP
Sequencer.get_state() => A0Thermtrip
eliza@jeeves ~ $ pfexec humility hiffy -c Sequencer.set_state -a state=A2
humility: WARNING: archive in environment variable overriding archive in environment file
humility: attached to 1fc9:0143:ISS1EG0OBG2HT via CMSIS-DAP
Sequencer.set_state() => Changed
eliza@jeeves ~ $ pfexec humility hiffy -c Sequencer.set_state -a state=A0
humility: WARNING: archive in environment variable overriding archive in environment file
humility: attached to 1fc9:0143:ISS1EG0OBG2HT via CMSIS-DAP
Sequencer.set_state() => Changed
eliza@jeeves ~ $ pilot sp -i e1000g0 exec -e 'ereports' fe80::aa40:25ff:fe04:11c5
Feb 26 18:46:53.957 INFO creating SP handle on interface e1000g0, component: faux-mgs
Feb 26 18:46:53.988 INFO initial discovery complete, addr: [fe80::aa40:25ff:fe04:11c5%2]:11111, interface: e1000g0, socket: control-plane-agent, component: faux-mgs
restart ID: 7b2025d6-239d-9da0-5ddd-d6b13b6fb64b
restart IDs did not match (requested 00000000-0000-0000-0000-000000000000)
count: 2

ereports:
0x1: {
    "baseboard_part_number": String("913-0000023"),
    "baseboard_rev": Number(2),
    "baseboard_serial_number": String("2J80WF9K\0\0\0"),
    "ereport_message_version": Number(0),
    "hubris_caboose": Object {
        "board": String("cosmo-b"),
        "commit": String("579b66a6f5c2a940073f4574228df91ca1f69306-dirty"),
        "version": Null,
    },
    "hubris_task_gen": Number(0),
    "hubris_task_name": String("packrat"),
    "hubris_uptime_ms": Number(0),
    "lost": Null,
}

0x2: {
    "baseboard_part_number": String("913-0000023"),
    "baseboard_rev": Number(2),
    "baseboard_serial_number": String("2J80WF9K\0\0\0"),
    "dev_id": String("sp5-host-cpu"),
    "ereport_message_version": Number(0),
    "hubris_caboose": Object {
        "board": String("cosmo-b"),
        "commit": String("579b66a6f5c2a940073f4574228df91ca1f69306-dirty"),
        "version": Null,
    },
    "hubris_task_gen": Number(0),
    "hubris_task_name": String("cosmo_seq"),
    "hubris_uptime_ms": Number(568146),
    "k": String("hw.cpu.amd.thermtrip"),
    "refdes": String("P0"),
    "state": Object {
        "cur": String("A0PlusHP"),
        "since": Number(314536),
    },
    "v": Number(0),
}

eliza@jeeves ~ $ pfexec humility hiffy -c Sequencer.get_state
humility: WARNING: archive in environment variable overriding archive in environment file
humility: attached to 1fc9:0143:ISS1EG0OBG2HT via CMSIS-DAP
Sequencer.get_state() => A0
eliza@jeeves ~ $ pfexec humility hiffy -c FmcDemo.poke32 -a addr=0xc0000208,value=0x4
humility: WARNING: archive in environment variable overriding archive in environment file
humility: attached to 1fc9:0143:ISS1EG0OBG2HT via CMSIS-DAP
FmcDemo.poke32() => ()
eliza@jeeves ~ $ pilot sp -i e1000g0 exec -e 'ereports' fe80::aa40:25ff:fe04:11c5
Feb 26 18:47:39.106 INFO creating SP handle on interface e1000g0, component: faux-mgs
Feb 26 18:47:39.137 INFO initial discovery complete, addr: [fe80::aa40:25ff:fe04:11c5%2]:11111, interface: e1000g0, socket: control-plane-agent, component: faux-mgs
restart ID: 7b2025d6-239d-9da0-5ddd-d6b13b6fb64b
restart IDs did not match (requested 00000000-0000-0000-0000-000000000000)
count: 3

ereports:
0x1: {
    "baseboard_part_number": String("913-0000023"),
    "baseboard_rev": Number(2),
    "baseboard_serial_number": String("2J80WF9K\0\0\0"),
    "ereport_message_version": Number(0),
    "hubris_caboose": Object {
        "board": String("cosmo-b"),
        "commit": String("579b66a6f5c2a940073f4574228df91ca1f69306-dirty"),
        "version": Null,
    },
    "hubris_task_gen": Number(0),
    "hubris_task_name": String("packrat"),
    "hubris_uptime_ms": Number(0),
    "lost": Null,
}

0x2: {
    "baseboard_part_number": String("913-0000023"),
    "baseboard_rev": Number(2),
    "baseboard_serial_number": String("2J80WF9K\0\0\0"),
    "dev_id": String("sp5-host-cpu"),
    "ereport_message_version": Number(0),
    "hubris_caboose": Object {
        "board": String("cosmo-b"),
        "commit": String("579b66a6f5c2a940073f4574228df91ca1f69306-dirty"),
        "version": Null,
    },
    "hubris_task_gen": Number(0),
    "hubris_task_name": String("cosmo_seq"),
    "hubris_uptime_ms": Number(568146),
    "k": String("hw.cpu.amd.thermtrip"),
    "refdes": String("P0"),
    "state": Object {
        "cur": String("A0PlusHP"),
        "since": Number(314536),
    },
    "v": Number(0),
}

0x3: {
    "baseboard_part_number": String("913-0000023"),
    "baseboard_rev": Number(2),
    "baseboard_serial_number": String("2J80WF9K\0\0\0"),
    "dev_id": String("sp5-host-cpu"),
    "ereport_message_version": Number(0),
    "hubris_caboose": Object {
        "board": String("cosmo-b"),
        "commit": String("579b66a6f5c2a940073f4574228df91ca1f69306-dirty"),
        "version": Null,
    },
    "hubris_task_gen": Number(0),
    "hubris_task_name": String("cosmo_seq"),
    "hubris_uptime_ms": Number(714333),
    "k": String("hw.cpu.amd.smerr"),
    "refdes": String("P0"),
    "state": Object {
        "cur": String("A0"),
        "since": Number(617259),
    },
    "v": Number(0),
}

eliza@jeeves ~ $ pfexec humility hiffy -c Sequencer.get_state
humility: WARNING: archive in environment variable overriding archive in environment file
humility: attached to 1fc9:0143:ISS1EG0OBG2HT via CMSIS-DAP
Sequencer.get_state() => A2
eliza@jeeves ~ $

Base automatically changed from eliza/derive-ereport-for-real to master February 26, 2026 22:45
@hawkw hawkw force-pushed the eliza/seq-ereports2 branch from ba992d6 to 6717b05 Compare February 26, 2026 22:48
@hawkw hawkw changed the title [WIP] shared ereport types crate, more sequencer ereports (soon) Add shared ereport types crate, more sequencer ereports Feb 26, 2026
@hawkw hawkw marked this pull request as ready for review February 27, 2026 19:31
@hawkw
Copy link
Member Author

hawkw commented Feb 27, 2026

@rmustacc I've requested your review on this mostly just for any feedback on the new ereport messages and their contents, since the new ereports are based on your suggestions from #2242. Hopefully, now that we have a shared library with most of the ereport type definitions, it should be a bit easier to review just the ereport messages, without having to wade through a whole bunch of implementation details moving Rust around. In particular, you can see the complete list of all the ereport classes that each sequencer task will emit here (for Gimlet):

const EREPORT_BUF_LEN: usize = microcbor::max_cbor_len_for![
ereports::pwr::PmbusAlert<FixedStr<'static, 9>, { REFDES_LEN }>,
ereports::pwr::Bmr491MitigationFailure<{ REFDES_LEN }>,
ereports::cpu::Thermtrip,
ereports::cpu::UnsupportedCpu<CpuTypeBits>,
];

and here (for Cosmo):

const EREPORT_BUF_LEN: usize = microcbor::max_cbor_len_for![
ereports::pwr::PmbusAlert<vcore::Rail, { REFDES_LEN }>,
ereports::pwr::Bmr491MitigationFailure<{ REFDES_LEN }>,
ereports::cpu::Thermtrip,
ereports::cpu::Smerr,
ereports::cpu::UnsupportedCpu<CpuTypeBits>,
];

and you should be able to find most of these Rust types defined in this library, except for a few whose content is specific to the Cosmo or Gimlet sequencer:

https://github.com/oxidecomputer/hubris/tree/092f4949a3e2a074f6bf4412edd45a070e71e399/lib/ereports/src

Please don't feel like you need to review anything beyond how we feel about the ereports, their contents, and the class hierarchy and naming --- thanks in advance for taking a look!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ereport: sequencing failure

1 participant