sidecar-seq: re-sequence on bad PCIe link #2402

Draft
Aaron-Hartwig wants to merge 1 commit into master from aaron/sidecar-resequence

Conversation

@Aaron-Hartwig
Contributor

For reasons we've been unable to nail down, sometimes on a cold boot of a rack the Tofino is unhappy with its PCIe link to the SP5 on Cosmo. To date, resequencing the Tofino has reliably worked around the issue. The goal of this commit is to have Sidecar's SP do this automatically. The SP monitors for this case by starting a 30-second timer once it observes that the SP5 has released PERST to the Tofino. When the link comes up, it does so within a few seconds. If the link has not come up by the time the 30 seconds elapse, the SP resequences the Tofino. It does this only once.
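The strategy described above can be sketched as a small state machine. This is a minimal illustration, not the code in this PR: `LinkWatchdog`, `Action`, and `PCIE_LINK_TIMEOUT_TICKS` are hypothetical names, and it assumes the sequencer's timer notification fires roughly once per second.

```rust
/// Ticks to wait for the PCIe link before resequencing.
/// The 30 comes from the description; the name is hypothetical.
const PCIE_LINK_TIMEOUT_TICKS: u8 = 30;

/// What the caller should do after a tick.
#[derive(Debug, PartialEq, Eq)]
enum Action {
    /// Nothing to do on this tick.
    None,
    /// Power the Tofino down and back up.
    Resequence,
}

/// Hypothetical stand-in for the sequencer state touched by this change.
#[derive(Default)]
struct LinkWatchdog {
    /// Consecutive ticks with PERST released but no PCIe link.
    no_pcie_count: u8,
    /// Set once a resequence has been requested; blocks a second attempt.
    resequenced: bool,
}

impl LinkWatchdog {
    /// Called once per timer tick (assumed to be about one second).
    fn tick(&mut self, perst_released: bool, pcie_link: bool) -> Action {
        if !perst_released || pcie_link || self.resequenced {
            // Host still holds PERST, the link is up, or the single
            // retry has already been used: nothing to count.
            self.no_pcie_count = 0;
            return Action::None;
        }

        // The count is reset on reaching the timeout, so it cannot overflow.
        self.no_pcie_count += 1;
        if self.no_pcie_count >= PCIE_LINK_TIMEOUT_TICKS {
            self.no_pcie_count = 0;
            self.resequenced = true;
            return Action::Resequence;
        }
        Action::None
    }
}
```

In this sketch the caller would perform the actual power-down and power-up when `tick` returns `Action::Resequence`; keeping the decision separate from the hardware access makes the timeout logic easy to exercise on a host.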

@Aaron-Hartwig Aaron-Hartwig self-assigned this Feb 26, 2026
@Aaron-Hartwig
Contributor Author

I'm opening this as a draft to get feedback on the strategy before attempting to test this more broadly (i.e., outside my equipment).

Here we can watch the Sidecar SP detect that the problem has occurred and resequence the Tofino:

   1  172        2        1 TofinoInA0
   2  189        2        1 TofinoBar0RegisterValue(SoftwareReset, 0xf3fc0f0)
   3  219        2        1 TofinoBar0RegisterValue(ResetOptions, 0x70000a8)
   4  232        2        1 TofinoEepromIdCode(0x220134)
   5  258        2        1 TofinoBar0RegisterValue(PciePhyLaneControl0, 0xc000c)
   6  258        2        1 TofinoBar0RegisterValue(PciePhyLaneControl1, 0xc000c)
   7  284        2        1 TofinoCfgRegisterValue(KGen, 0xe20f03)
   8  303        2        1 TofinoBar0RegisterValue(SoftwareReset, 0xf3fc000)
   9   62        2        1 SetPCIePresent
  10  405        2      314 TofinoSequencerTick(LatchOffOnFault, A0 { pcie_link: false })
  11   91        2        1 TofinoPcieReset(false)
  12  405        2       29 TofinoSequencerTick(LatchOffOnFault, A0 { pcie_link: false })
  13  407        2        1 TofinoResequence
  14  331        2        1 TofinoPowerDown
  15   62        2        1 ClearPCIePresent
  16   98        2        1 TofinoPowerUp
  17  110        2        1 TofinoVidAttempt(0x0)
  18   52        2        1 SetVddCoreVout(Volts(0.815))
  19  128        2        1 TofinoVidAck
  20  159        2        9 TofinoNotInA0
  21  172        2        1 TofinoInA0
  22  189        2        1 TofinoBar0RegisterValue(SoftwareReset, 0xf3fc0f0)
  23  219        2        1 TofinoBar0RegisterValue(ResetOptions, 0x70000a8)
  24  232        2        1 TofinoEepromIdCode(0x220134)
  25  258        2        1 TofinoBar0RegisterValue(PciePhyLaneControl0, 0xc000c)
  26  258        2        1 TofinoBar0RegisterValue(PciePhyLaneControl1, 0xc000c)
  27  284        2        1 TofinoCfgRegisterValue(KGen, 0xe20f03)
  28  303        2        1 TofinoBar0RegisterValue(SoftwareReset, 0xf3fc000)
  29   62        2        1 SetPCIePresent
  30  405        2        1 TofinoSequencerTick(LatchOffOnFault, A0 { pcie_link: false })

And then we see the link come up:

29   62        2        1 SetPCIePresent
30  405        2       20 TofinoSequencerTick(LatchOffOnFault, A0 { pcie_link: false })
31  405        2       24 TofinoSequencerTick(LatchOffOnFault, A0 { pcie_link: true })

Comment on lines +174 to +176
// used to track how many notification loops elapsed while in A0 without a
// PCIe link
no_pcie_count: u8,
Collaborator

Add a brief comment explaining that this count is bounded, so it won't overflow?
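As an alternative to documenting the bound, the increment itself can be made overflow-proof. A minimal sketch, assuming the counter stays a `u8`; `bump` is a hypothetical helper, not something in this PR:

```rust
/// Increment a tick counter without ever wrapping or panicking.
fn bump(count: u8) -> u8 {
    // `u8::saturating_add` clamps at `u8::MAX` (255) instead of overflowing.
    count.saturating_add(1)
}
```

With this, the counter stays correct even if a later change stops resetting it at the threshold.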

{
self.no_pcie_count += 1;

if self.no_pcie_count >= 30 {
Collaborator

Make 30 a named const?
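Something along these lines, where the constant and function names are only suggestions and the comparison mirrors the `>= 30` check in the diff:

```rust
/// Timer notifications spent in A0 without a PCIe link before the
/// Tofino is resequenced (hypothetical name).
const NO_PCIE_RESEQUENCE_TICKS: u8 = 30;

/// Returns true once the no-link count has reached the threshold.
fn should_resequence(no_pcie_count: u8) -> bool {
    no_pcie_count >= NO_PCIE_RESEQUENCE_TICKS
}
```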

ringbuf_entry!(Trace::TofinoResequence);
self.tofino.power_down()?;
self.tofino.power_up()?;
self.resequenced = true;
Collaborator

It doesn't look like we ever clear self.resequenced after we've successfully reset. This function is still called from the timer notification after bootup, though. I get that this is supposed to be a one-time bootup thing, but something here seems strange.
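One possible shape for this, offered only as a sketch: keep the one-shot flag, but re-arm it whenever the Tofino is deliberately power cycled, so each requested power-up gets its own single retry. `RetryState` and its methods are hypothetical and do not appear in the PR; whether re-arming is desirable at all is the author's call.

```rust
/// Hypothetical container for the retry bookkeeping.
#[derive(Default)]
struct RetryState {
    /// Timer notifications seen in A0 without a PCIe link.
    no_pcie_count: u8,
    /// True once the automatic resequence has been used.
    resequenced: bool,
}

impl RetryState {
    /// Record that the automatic resequence has been used.
    fn mark_resequenced(&mut self) {
        self.no_pcie_count = 0;
        self.resequenced = true;
    }

    /// Re-arm the retry, for example when a power cycle is explicitly
    /// requested rather than triggered by the watchdog (an assumption).
    fn rearm(&mut self) {
        *self = Self::default();
    }

    /// Whether the automatic resequence is still available.
    fn may_resequence(&self) -> bool {
        !self.resequenced
    }
}
```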
