[WIP] Add base feature to enable 24/7 recovery measures#313

Closed
knuton wants to merge 1 commit into dividat:main from knuton:unsupervised-recovery

@knuton
Member

@knuton knuton commented Jan 29, 2026

Checklist

  • Changelog updated
  • Code documented
  • User manual updated

@knuton knuton force-pushed the unsupervised-recovery branch from e1877a4 to 561f190 Compare January 29, 2026 17:34
@knuton knuton force-pushed the unsupervised-recovery branch from 561f190 to 7b456cb Compare January 29, 2026 20:14
Collaborator

@yfyf yfyf left a comment

I looked through the options listed here and also asked LLMs which additional options could be considered; I did not find anything extra that would be useful.

My general thoughts:

  • It is probably not wise to diverge from mainstream (kernel and distro) defaults unless we have a very good reason for it. Customization leads to tricky-to-debug low-level interactions with the hardware.
  • In our setup, false positives are worse than false negatives. If a PlayOS PC freezes and we do not automatically panic/reboot, that sucks, but it's not a deal breaker - users will observe the state and reboot. On the other hand, if we unintentionally trigger a panic and restart during usage, that is very frustrating and users have no control over it.
  • We will probably not get reports about the false positives, since it will seem like a generic "glitch" and in unsupervised settings it can take a long time until we find out we have false positives. We are more likely to get reports on the false negatives (unhandled freezing), since they do require intervention.

Comment thread base/unsupervised.nix
"kernel.hardlockup_panic" = 1;

# panic if a task is in TASK_UNINTERRUPTIBLE state (waiting for I/O) for more than 5 mins
"kernel.hung_task_panic" = 1;
Collaborator

What is the reasoning behind this? This will cause a panic if some (any!) task hangs on I/O. While we do not expect this, the condition also seems very broad: it could mean panics due to things we do not even know exist. This does not seem to be enabled on any distribution.

I would say the risk is too big and getting feedback about false-positives will be very hard.
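
A minimal sketch of the more conservative alternative discussed above, assuming the hung-task detector is compiled into the kernel: leave the detector running so that stuck tasks are still logged, but do not turn the condition into a panic. The 300 second timeout is an assumption taken from the "5 mins" in the diff comment; the PR excerpt does not show where that timeout is set.

```nix
# Hypothetical NixOS module fragment for illustration; not part of this PR.
{ ... }:
{
  boot.kernel.sysctl = {
    # Do not panic when a task is stuck in TASK_UNINTERRUPTIBLE;
    # the kernel still logs a warning with a stack trace.
    "kernel.hung_task_panic" = 0;

    # Seconds a task may stay blocked before the warning fires.
    # Assumed value, matching the "5 mins" mentioned in the diff comment.
    "kernel.hung_task_timeout_secs" = 300;
  };
}
```

With this variant a hung task shows up in the kernel log, which could help judge how often the condition occurs before deciding whether panicking on it is safe.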

Comment thread base/unsupervised.nix

# EXPLICITLY DISABLED
# An oops in a driver may leave the system operational, we avoid panicking to allow
# enough time for systems with minor hardware/driver compatibility issues to update.
Collaborator

By "to update", do you mean a RAUC update? I think this applies not just to updates: in general we would rather continue than force a reboot as long as the system is functional. If there are lock-ups or unrecoverable errors, that is probably best handled by other conditions. All distros seem to disable this by default.

Off-topic thought: might be a good idea to check if updates fail repeatedly on system.A and explicitly switch primary slot to attempt to update via system.B?

Comment thread base/unsupervised.nix
config = lib.mkIf cfg.enable {
boot.kernel.sysctl = {
# reboot only after 60 s to possibly allow onsite personnel to catch screenshot
"kernel.panic" = 30;
Collaborator

@yfyf yfyf Feb 20, 2026

Comment does not match value?
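
One way to keep the comment and the value from drifting apart is to name the delay once and avoid repeating the number in the comment. This is only a sketch: `panicRebootDelaySeconds` is a hypothetical name, and 60 is taken from the comment in the diff; the PR does not say whether 60 or 30 is the intended value.

```nix
# Hypothetical NixOS module fragment for illustration; not part of this PR.
{ ... }:
let
  # Seconds to wait after a panic before rebooting (assumed value, see above).
  panicRebootDelaySeconds = 60;
in
{
  boot.kernel.sysctl = {
    # Delay the reboot after a panic so that onsite personnel
    # may get a chance to capture a screenshot.
    "kernel.panic" = panicRebootDelaySeconds;
  };
}
```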

Comment thread base/unsupervised.nix
# EXPLICITLY DISABLED
# An oops in a driver may leave the system operational, we avoid panicking to allow
# enough time for systems with minor hardware/driver compatibility issues to update.
"kernel.panic_on_oops" = 0;
Collaborator

There is also oops_limit. If desired, we could lower it from the default (10000) to a value that gives a "middle ground" between panic_on_oops = 1 (which is equivalent to oops_limit = 1) and 10000.
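
A rough sketch of what such a middle ground could look like, assuming a kernel recent enough to expose the `kernel.oops_limit` sysctl. The threshold of 100 is purely illustrative and was not proposed in this thread.

```nix
# Hypothetical NixOS module fragment for illustration; not part of this PR.
{ ... }:
{
  boot.kernel.sysctl = {
    # A single oops does not trigger a panic.
    "kernel.panic_on_oops" = 0;

    # Panic once this many oopses have accumulated since boot.
    # Illustrative value between 1 (same as panic_on_oops = 1)
    # and the kernel default of 10000.
    "kernel.oops_limit" = 100;
  };
}
```

A system with an occasional driver oops would keep running under this setting, while one that oopses continuously would eventually panic and reboot after the delay configured via `kernel.panic`.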

@knuton
Member Author

knuton commented Feb 20, 2026

  • If a PlayOS PC freezes and we do not automatically panic/reboot, that sucks, but it's not a deal breaker - users will observe the state and reboot

This is not necessarily true. Some users may reboot, but in some deployment scenarios that might be after 3 days and dozens of potential users who did not dare/know how to reboot. That scenario specifically makes it a bit of a deal breaker and is the motivating factor for looking into these options.

  • We will probably not get reports about the false positives, since it will seem like a generic "glitch" and in unsupervised settings it can take a long time until we find out we have false positives.

Yes, this is one of my chief doubts.

@yfyf
Collaborator

yfyf commented Feb 23, 2026

Some users may reboot, but in some deployment scenarios that might be after 3 days and dozens of potential users who did not dare/know how to reboot. That scenario specifically makes it a bit of a deal breaker and is the motivating factor for looking into these options.

I understand that in certain cases it can mean prolonged periods of confusion, but I don't see it as sufficient reason for aggressively pushing the system into reboots. You cannot eliminate all (pseudo-)freeze cases with the means in this PR: e.g. if any application-level (i.e. userland) issue happens, the kiosk will remain borked until rebooted. You would need some crazy end-to-end watchdog that verifies that the application layer is functional, which is probably unrealistic.

@knuton
Member Author

knuton commented Mar 2, 2026

Closing as currently of unclear value.

@knuton knuton closed this Mar 2, 2026

2 participants