hardware-health: rasdaemon MCE attribution + watchdog auto-reboot on mediaserver
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
parent 707f78c9d1
commit d69c9f624f
2 changed files with 36 additions and 0 deletions
@@ -44,6 +44,7 @@
     ./services/memos.nix
     # ./services/neko.nix  # superseded by selkies.nix (Neko can't handle GW's mouse grab)
     ./services/selkies.nix
+    ./services/hardware-health.nix
   ];
 
   ### Make build time quicker
35
services/hardware-health.nix
Normal file
@@ -0,0 +1,35 @@
+# services/hardware-health.nix — RAS error attribution + watchdog auto-recovery
+#
+# Context: Jun 2026 the dual Xeon E5-2697 v3 began throwing a storm of
+# *corrected* Machine Check Exceptions on both sockets (Bank 5 / Bank 20),
+# ~18k events in 36h, eventually hanging the box. Since this host is the
+# router, a hang takes the whole LAN offline until a manual power-cycle.
+#
+# This module:
+#   - rasdaemon: decodes every MCE to a specific DIMM/channel/socket and
+#     persists a per-component error DB, so a failing part can be named
+#     (needed for the seller's warranty claim). Query with `ras-mc-ctl
+#     --error-count` and `ras-mc-ctl --summary`.
+#   - hardware watchdog: if userspace hangs again, systemd stops petting
+#     /dev/watchdog0 and the chipset watchdog reboots the box (~30s),
+#     restoring the LAN without physical access.
+
+{ config, lib, pkgs, ... }:
+{
+  config = lib.mkIf (config.networking.hostName == "FredOS-Mediaserver") {
+
+    # Decode + log + persist machine-check / memory errors per component.
+    hardware.rasdaemon.enable = true;
+
+    # ras-mc-ctl on PATH for manual inspection.
+    environment.systemPackages = [ pkgs.rasdaemon ];
+
+    # Hardware watchdog: auto-reboot a hung box instead of a dead LAN.
+    # systemd pets /dev/watchdog0 at half the runtime interval; if it stops
+    # (hang), the chipset resets after RuntimeWatchdogSec.
+    systemd.settings.Manager = {
+      RuntimeWatchdogSec = "30s";
+      RebootWatchdogSec = "10min";
+    };
+  };
+}