cpu-seq: Add more ereports by hawkw · Pull Request #2242 · oxidecomputer/hubris

hawkw · 2025-09-22T18:23:48Z

Depends on #2246

This branch adds new ereports to drv_gimlet_seq_server and
drv_cosmo_seq_server for failures going to A0 and for events that
result in unexpected power-offs, such as THERMTRIP, sequencer FPGA A0
MAPO, and SMERR assertions. This depends on the microcbor derives for
CBOR encoding I added in #2246, which make it much easier for a task to
record a variety of ereports represented by different Rust types,
without having to worry too much about the maximum length of the
encoding buffer.

This branch isn't ready to merge yet, as it depends on #2246. However,
I'd love to get reviews on this now, mostly to get some feedback on the
data included in these ereports and the class hierarchy/general message
schemas. I've kinda made these up; if there's more information that
would be useful to include in these, I'd love to get some advice from
@rmustacc and others.

rmustacc

Thanks for putting this together @hawkw. I have a bunch of thoughts with varying degrees of cogency.

rmustacc · 2025-10-10T00:45:52Z

drv/cosmo-seq-server/src/main.rs

+    //
+    // Interrupts
+    //
+    #[cbor(rename = "hw.cpu.thermtrip")]


I've been going back and forth with myself on whether to phrase this as specific to AMD or not (e.g. hw.cpu.amd.thermtrip), as we're referring to its thermal trip assertion pin. I think it's probably better to keep this as is.

rmustacc · 2025-10-10T00:47:18Z

drv/cosmo-seq-server/src/main.rs

+    //
+    #[cbor(rename = "hw.cpu.thermtrip")]
+    Thermtrip,
+    #[cbor(rename = "hw.seq.smerr")]


So, I don't think we should phrase this in terms of the sequencer, as this is a fact about the CPU. So probably closer to hw.cpu.smerr. Again, it's not clear to me whether we want to namespace this by vendor or try to keep these classes similar to each other. This almost feels like hw.cpu.amd.smerr (though I had to look up what smerr is).

rmustacc · 2025-10-10T00:49:49Z

drv/cosmo-seq-server/src/main.rs

+    Thermtrip,
+    #[cbor(rename = "hw.seq.smerr")]
+    Smerr,
+    #[cbor(rename = "hw.seq.a0_mapo")]


In this case I don't think we should phrase this as specific to the sequencer; this feels more related to power. So I would think of this as hw.pwr.a0.mapo. This kind of organization assumes we want to make a class of A0/A2-specific power events. I also think we should consider hw.pwr.mapo instead, with the power domain passed as an argument. A small part of me prefers this in case we detect A2 issues or we have a more complex situation with the Metro FPGA.
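To make the two options concrete, here is a minimal sketch in plain Rust. It does not use the microcbor derives from #2246, and the `PowerDomain` enum, the `MapoReport` struct, both function names, and the `hw.pwr.a2.mapo` string are hypothetical names invented for illustration; only `hw.pwr.a0.mapo` and `hw.pwr.mapo` come from the comment above.

```rust
/// Hypothetical power domains, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PowerDomain {
    A0,
    A2,
}

/// Option 1: the power domain is part of the class string, so each
/// domain gets its own class. The A2 string is an assumed extrapolation.
fn mapo_class_per_domain(domain: PowerDomain) -> &'static str {
    match domain {
        PowerDomain::A0 => "hw.pwr.a0.mapo",
        PowerDomain::A2 => "hw.pwr.a2.mapo",
    }
}

/// Option 2: a single class, with the power domain carried in the
/// report payload as an argument.
#[derive(Debug, PartialEq, Eq)]
struct MapoReport {
    class: &'static str,
    domain: PowerDomain,
}

fn mapo_report(domain: PowerDomain) -> MapoReport {
    MapoReport {
        class: "hw.pwr.mapo",
        domain,
    }
}
```

With option 2, a consumer matches on one class and reads the domain from the payload, so covering another domain would not require a new class string.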

rmustacc · 2025-10-10T00:54:11Z

drv/cosmo-seq-server/src/main.rs

+    //
+    // Initialization failures
+    //
+    #[cbor(rename = "hw.cpu.a0_fail.unknown")]


I get where you're coming from here, but I think I would basically keep this as hw.cpu.unsup or, you know, the old callback to enotsup. I think unknown here is a bit confusing, and it's not clear to me that an explicit a0_fail subclass makes sense here.

yeah, i like unsup --- i thought about unknown_cpu but that felt like a few too many characters. I agree that unknown feels like it could mean "unknown error", which is not what this is supposed to mean. if we dropped a0_fail, unknown_cpu seems not too bad...
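For reference, a rough sketch of how the class strings in this diff would change if every rename suggested so far in this review were adopted. Nothing here is decided. The `A0Mapo` and `UnsupportedCpu` variant names are assumptions (the diff hunks above cut off before those variants), and the `current`/`proposed` methods are hypothetical helpers, not part of the PR or of microcbor.

```rust
/// Illustrative stand-in for the PR's `EreportClass` enum.
/// `Thermtrip` and `Smerr` appear in the diff; the other two
/// variant names are assumed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum EreportClass {
    Thermtrip,
    Smerr,
    A0Mapo,
    UnsupportedCpu,
}

impl EreportClass {
    /// Class strings as written in the diff under review.
    fn current(self) -> &'static str {
        match self {
            EreportClass::Thermtrip => "hw.cpu.thermtrip",
            EreportClass::Smerr => "hw.seq.smerr",
            EreportClass::A0Mapo => "hw.seq.a0_mapo",
            EreportClass::UnsupportedCpu => "hw.cpu.a0_fail.unknown",
        }
    }

    /// Class strings if the review suggestions were taken as-is
    /// (without the optional `amd` namespace).
    fn proposed(self) -> &'static str {
        match self {
            EreportClass::Thermtrip => "hw.cpu.thermtrip",
            EreportClass::Smerr => "hw.cpu.smerr",
            EreportClass::A0Mapo => "hw.pwr.a0.mapo",
            EreportClass::UnsupportedCpu => "hw.cpu.unsup",
        }
    }
}
```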

rmustacc · 2025-10-10T00:55:06Z

drv/cosmo-seq-server/src/main.rs

+                class: EreportClass::Thermtrip,
+                version: 0,
+                report: &HOST_CPU_REFDES,
+                // TODO(eliza): eventually, it would be nice to include sequencer


What sequencer registers are you imagining for this?

rmustacc · 2025-10-10T01:09:10Z