Conversation
| // | ||
| // Interrupts | ||
| // | ||
| #[cbor(rename = "hw.cpu.thermtrip")] |
I've been going back and forth with myself on whether to phrase this as specific to amd or not. e.g. hw.cpu.amd.thermtrip, as we're referring to its thermal trip assertion pin. I think it's probably better to keep this as is.
| // | ||
| #[cbor(rename = "hw.cpu.thermtrip")] | ||
| Thermtrip, | ||
| #[cbor(rename = "hw.seq.smerr")] |
So, I don't think we should phrase this in terms of the sequencer as this is a fact about the CPU. So probably closer to hw.cpu.smerr. It's not clear to me again if we want to namespace this or if we'd try to make these similar. This almost feels like hw.cpu.amd.smerr (though I had to look up what smerr is).
| Thermtrip, | ||
| #[cbor(rename = "hw.seq.smerr")] | ||
| Smerr, | ||
| #[cbor(rename = "hw.seq.a0_mapo")] |
In this case I don't think we should phrase this as specific to the sequencer, this feels more related to power. So I would think of this as hw.pwr.a0.mapo. This kind of organization assumes we want to make a class of A0/A2-specific power events. I also think we should consider instead hw.pwr.mapo with a power domain present as an argument. A small part of me prefers this in case we detect A2 issues or we have a more complex situation with the Metro FPGA.
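A rough sketch of the "power domain as an argument" option, written in plain Rust rather than with the real `microcbor` derives. `PowerDomain`, `MapoEreport`, `as_str`, and the lowercase domain spellings are all hypothetical names for illustration, not anything in the branch:

```rust
/// Hypothetical: the power domain in which a MAPO event was observed.
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
pub enum PowerDomain {
    A0,
    A2,
}

impl PowerDomain {
    /// Name that would be carried in the ereport payload (assumed spelling).
    pub const fn as_str(self) -> &'static str {
        match self {
            PowerDomain::A0 => "a0",
            PowerDomain::A2 => "a2",
        }
    }
}

/// Hypothetical payload for a single `hw.pwr.mapo` class.
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
pub struct MapoEreport {
    pub domain: PowerDomain,
}

impl MapoEreport {
    /// One class for every domain: the domain is data, not part of the name.
    pub const CLASS: &'static str = "hw.pwr.mapo";
}
```

With this shape, an A2 event would reuse the same class and differ only in the `domain` field, so consumers match on one class string instead of one per domain.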
| // | ||
| // Initialization failures | ||
| // | ||
| #[cbor(rename = "hw.cpu.a0_fail.unknown")] |
I get where you're coming from here, but I think I would basically keep this as hw.cpu.unsup or you know the old callback to enotsup. I think unknown here is a bit confusing and it's not clear to me that an explicit a0_fail subclass makes sense here.
yeah, i like unsup --- i thought about unknown_cpu but that felt like a few too many characters. I agree that unknown feels like it could mean "unknown error", which is not what this is supposed to mean. if we dropped a0_fail, unknown_cpu seems not too bad...
| class: EreportClass::Thermtrip, | ||
| version: 0, | ||
| report: &HOST_CPU_REFDES, | ||
| // TODO(eliza): eventually, it would be nice to include sequencer |
What sequencer registers are you imagining for this?
| #[derive(Copy, Clone, Eq, PartialEq, microcbor::Encode, counters::Count)] | ||
| pub enum EreportClass { | ||
| #[cbor(rename = "hw.cpu.thermtrip")] |
Seeing these in both Gimlet and Cosmo makes me wonder if there should be shared definitions here even if it has to get compiled into both tasks. That way we're consistent about the data that each contains.
| A0Timeout, | ||
| #[cbor(rename = "hw.a0_fail.timeout.groupc")] | ||
| A0TimeoutGroupC, | ||
| #[cbor(rename = "hw.pwr.pmbus.a0_fail.i2c_err")] |
The class on here is weird to me, but I think we want to figure out the above first. It's not really a PMBus-specific happening here right? It's actually that we couldn't communicate over I2C based on my memory, though that might be off.
| record_reg(Addr::FLT_A0_SMSTATUS); | ||
| record_reg(Addr::FLT_GROUPB_PG); | ||
| record_reg(Addr::FLT_GROUPC_PG); | ||
| let seq_status = SeqStatus { |
So this is interesting and suggests that the sequencing stuff may need to be board specific. This almost makes me want to think that this is hw.gimlet.a0.seq.fail or something. This needs more thought, but it's clear that we can't use the same payload between both. Or that we need some way for consumers to know that the board will change the payload. The reason I have the board here is that it's board-specific.
| refdes: &'static str, | ||
| rail: &'static str, | ||
| #[count(children)] | ||
| err: i2c::ResponseCode, |
This feels like something that we need to think carefully about. I think there's a desire to change this around a bit right, but this is starting to bake it into a more public API.
| #[cbor(flatten)] | ||
| refdes: &'static HostCpuRefdes, | ||
| #[cbor(flatten)] | ||
| coretype: Coretype, |
Why does this have coretype but not SP3r1/SP3r2?
Coretype here is a Rust struct that actually includes coretype/sp3r1/sp3r2 fields; it's flattened so we would actually generate the CBOR:
"coretype": <bool>,
"sp3r1": <bool>,
"sp3r2": <bool>,
It's a separate struct because it gets passed around in a few places here. Naming it Coretype may have been a mistake.
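To make the flattening concrete, here is a sketch in plain Rust. It is not the `microcbor` implementation: `flattened_pairs` is a made-up helper, and the declaration-order output is an assumption. It only shows that the three fields end up as sibling keys of the parent map rather than nested under a `coretype` key:

```rust
/// Stand-in for the struct discussed above; field names follow the CBOR keys
/// quoted in the reply.
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
pub struct Coretype {
    pub coretype: bool,
    pub sp3r1: bool,
    pub sp3r2: bool,
}

impl Coretype {
    /// Hypothetical helper: the key/value pairs that a flattened encoding
    /// would contribute directly to the parent map (assumed to be emitted in
    /// declaration order).
    pub fn flattened_pairs(&self) -> [(&'static str, bool); 3] {
        [
            ("coretype", self.coretype),
            ("sp3r1", self.sp3r1),
            ("sp3r2", self.sp3r2),
        ]
    }
}
```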
This commit adds a new attribute to `microcbor`, intended to make defining ereport types more convenient. Presently, we tend to define ereports by using an enum to represent all possible ereport classes that a task may report, with `microcbor`'s `#[cbor(variant_id = "...")]` attribute on the enum definition and `#[cbor(rename = "...")]` on the variants. Then, we define a struct which contains the class enum along with a version, and either an enum or generic field to represent the ereport body. For an example of this usage, consider the `cosmo_seq` task:

https://github.com/oxidecomputer/hubris/blob/aa843e7b937e5a7d8bb21298919440689657ee29/drv/cosmo-seq-server/src/main.rs#L458-L481

This pattern has some disadvantages. In particular, it makes it very difficult for multiple tasks to share the definitions of some ereport types in a shared crate, which is useful in some situations. As we add ereports for sequencer events (see #2242), we would like to be able to share some ereport messages between the Cosmo and Gimlet sequencer tasks (and perhaps also the Tofino and PSC sequencers, in some cases). The current pattern makes this difficult. While we could use `#[derive(EncodeFields)]` to define common types and then embed them in an enum of "all ereport types in this task" in a task crate, the definition of the ereport message's class and version would be in the task rather than where the message is defined, meaning they are duplicated. This would be sad: since the class is important to how upstack software interprets the ereport, ensuring that both tasks emit the same class and version fields is a big chunk of why we would even want shared definitions.

The enum-based approach also has some other disadvantages. When we define separate enums for the class and for message bodies, it is possible to accidentally use the wrong class for a given message body --- nothing ensures that these match. And, using enums for everything means that the size of the message that has to be constructed on the stack is the size of the _largest variant_, which makes stack usage worse when a particular code path always reports a smaller variant.

This branch introduces a new API for defining ereport types as a `struct` for each individual class of ereport message. This is done using a new attribute which can be added to types that `#[derive(microcbor_derive::Encode)]`. The new attribute, `#[ereport(...)]`, takes `class = "a string literal"` and `version = <an int literal>` arguments, and, if present, changes the generated `Encode` implementation to output the `"k" = <class>` and `"v" = <version>` pairs when encoding the type. The maximum CBOR length value is also adjusted to include the length of the additional K/V pairs. The usage of the new attribute is discussed in greater detail in the RustDoc.

Now, we can define individual ereport messages as their own top-level Rust types, and those types will always be serialized with the correct class and version values. Multiple tasks can share these types, and can still use the automatic buffer size calculation by passing _multiple types_ to the `microcbor::max_cbor_len_for!` macro, which is how that API was really intended to be used in the first place. For example, we might imagine something like:

```rust
#[derive(Encode)]
#[ereport(class = "hw.discovery.ae35.fault", version = 0)]
struct Ae35UnitEreport {
    critical_in_hrs: u32,
    detected_by: fixedstr::FixedStr<'static, 8>,
}

#[derive(Encode)]
#[ereport(class = "hw.apollo.undervolt", version = 13)]
#[cbor(variant_id = "bus")]
enum UndervoltEreport {
    MainBusA { volts: f32 },
    MainBusB { volts: f32 }, // "Houston, we've got a main bus B undervolt!"
}

use some_other_crate_that_defines_ereports;

const EREPORT_BUF_SIZE: usize = microcbor::max_cbor_len_for![
    Ae35UnitEreport,
    UndervoltEreport,
    some_other_crate_that_defines_ereports::SomeOtherEreport,
];
```

and that will all just work.
As an aside, I *did* consider the fact that this *could* be an API to add any arbitrary compile-time fields when encoding. I decided *not* to do that, as the goal here was specifically to help with ereports, and I felt like there was some value in having the attribute also enforce the names and types of the conventional ereport fields. That way, you are expressing the intent to say that "this is an ereport message", and the proc-macro ensures you have included the requisite fields and that they have the requisite types. We may consider adding a general-purpose "additional fields with compile time values" attribute in the future if such a thing seems useful, and if we do, the `#[ereport(...)]` attribute could be reimplemented using that internally.
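As a rough illustration of the two points above (class and version living with the type, and a buffer sized from several types without a catch-all enum), here is a sketch in plain Rust. It does not use `microcbor`: the `Ereport` trait, `max_of`, the `MAX_LEN` values, both structs, and the `example.a0_fail` class are hypothetical stand-ins, and the real derive computes encoded lengths itself.

```rust
use core::mem::size_of;

/// Hypothetical trait standing in for what `#[ereport(...)]` attaches to a type.
pub trait Ereport {
    const CLASS: &'static str;
    const VERSION: u32;
    /// Assumed upper bound on the encoded length, made up for this sketch.
    const MAX_LEN: usize;
}

pub struct ThermtripEreport {
    pub refdes: &'static str,
}

pub struct A0FailEreport {
    pub regs: [u8; 32],
}

impl Ereport for ThermtripEreport {
    const CLASS: &'static str = "hw.cpu.thermtrip";
    const VERSION: u32 = 0;
    const MAX_LEN: usize = 48; // illustrative value only
}

impl Ereport for A0FailEreport {
    const CLASS: &'static str = "example.a0_fail"; // placeholder class
    const VERSION: u32 = 0;
    const MAX_LEN: usize = 96; // illustrative value only
}

/// Compile-time maximum over a list of lengths.
pub const fn max_of(lens: &[usize]) -> usize {
    let mut max = 0;
    let mut i = 0;
    while i < lens.len() {
        if lens[i] > max {
            max = lens[i];
        }
        i += 1;
    }
    max
}

/// The buffer is sized from the individual types; no enum is required.
pub const EREPORT_BUF_SIZE: usize =
    max_of(&[ThermtripEreport::MAX_LEN, A0FailEreport::MAX_LEN]);

/// The catch-all enum from the older pattern, kept only for a size comparison.
pub enum AnyEreport {
    Thermtrip(ThermtripEreport),
    A0Fail(A0FailEreport),
}

/// Stack bytes avoided by building the small report directly rather than
/// wrapping it in the catch-all enum.
pub fn stack_bytes_saved() -> usize {
    size_of::<AnyEreport>() - size_of::<ThermtripEreport>()
}
```

Because `CLASS` and `VERSION` are associated constants on each type, a second task that reuses `ThermtripEreport` cannot pair it with a different class by mistake.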
Depends on #2246
This branch adds new ereports to `drv_gimlet_seq_server` and
`drv_cosmo_seq_server` for failures going to A0 and for events that
result in unexpected power-offs, such as THERMTRIP, sequencer FPGA A0
MAPO, and SMERR assertions. This depends on the `microcbor` derives for
CBOR encoding I added in #2246, which make it much easier for a task to
record a variety of ereports represented by different Rust types,
without having to worry too much about the maximum length of the
encoding buffer.
This branch isn't ready to merge yet, as it depends on #2246. However,
I'd love to get reviews on this now, mostly to get some feedback on the
data included in these ereports and the class hierarchy/general message
schemas. I've kinda made these up; if there's more information that
would be useful to include in these, I'd love to get some advice from
@rmustacc and others.