Reliability
Fluidstack managed Slurm is designed for reliable training at scale.
Passive Health Checks
Fluidstack pre-installs a set of background health check add-ons that detect hardware failures on worker nodes. Detected failures are reported in the node's drain reason field.
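As a rough illustration of how the drain reason field can be read, the sketch below shells out to the standard `sinfo -R` command and parses its output. The helper names `parse_drain_reasons` and `fetch_drain_reasons` are hypothetical, and the exact wording Fluidstack writes into the reason field is not specified here, so any reason strings you see will differ from the placeholder in the comments.

```python
import subprocess


def parse_drain_reasons(sinfo_output: str) -> dict[str, str]:
    """Parse 'nodelist|reason' lines into a {nodelist: reason} mapping."""
    reasons = {}
    for line in sinfo_output.splitlines():
        line = line.strip()
        if not line or "|" not in line:
            continue  # skip blank or malformed lines
        nodes, reason = line.split("|", 1)
        reasons[nodes.strip()] = reason.strip()
    return reasons


def fetch_drain_reasons() -> dict[str, str]:
    """Query Slurm for nodes that are down or drained, with their reasons.

    -h suppresses the header, -R lists reasons, and the format string
    prints the node list (%N) and the reason (%E) separated by '|'.
    """
    completed = subprocess.run(
        ["sinfo", "-h", "-R", "-o", "%N|%E"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_drain_reasons(completed.stdout)


# Example with placeholder text (not real Fluidstack output):
sample = "gpu-001|placeholder hardware failure reason\n"
parsed = parse_drain_reasons(sample)
```

On a cluster login node, calling `fetch_drain_reasons()` would return the live mapping; the parser is kept separate so it can be exercised without Slurm installed.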
Active Health Checks
Fluidstack pre-installs an active health check add-on to detect hardware failures on worker nodes. Active health checks run on idle nodes to detect thermal throttling, performance regressions, and functional errors.
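The three failure classes named above can be pictured with a small classifier. This is only a sketch under stated assumptions: the `CheckResult` fields, the throughput metric, and the 5% tolerance are hypothetical and do not describe how Fluidstack's add-on actually measures or thresholds anything.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    throughput_gbps: float  # hypothetical benchmark measurement
    throttled: bool         # hypothetical thermal-throttle indicator
    errors: int             # hypothetical count of functional errors


def classify(result: CheckResult, baseline_gbps: float,
             tolerance: float = 0.05) -> list[str]:
    """Return the failure classes that a single check result exhibits."""
    findings = []
    if result.throttled:
        findings.append("thermal throttling")
    # Flag a regression when throughput falls below baseline minus tolerance.
    if result.throughput_gbps < baseline_gbps * (1 - tolerance):
        findings.append("performance regression")
    if result.errors > 0:
        findings.append("functional error")
    return findings
```

For example, a node measuring 90 against a baseline of 100 would be flagged for a performance regression, while one measuring 99 would pass.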
Auto-Remediation
Fluidstack automatically drains unhealthy nodes, power-cycles them, and schedules physical maintenance when needed, keeping the node pool healthy and schedulability high. Fluidstack runs comprehensive validation tests before returning a node from maintenance.
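One way to picture this remediation flow is as a small state machine. The states and the escalation order below (drain, then power-cycle, then physical maintenance, with validation gating the return to service) are an assumption drawn from the order in which the steps are listed, not a description of Fluidstack's internal implementation.

```python
from enum import Enum


class NodeState(Enum):
    HEALTHY = "healthy"
    DRAINED = "drained"
    POWER_CYCLED = "power_cycled"
    MAINTENANCE = "maintenance"


def next_state(state: NodeState, check_passed: bool) -> NodeState:
    """Advance a node one step through the assumed remediation flow."""
    if state is NodeState.HEALTHY:
        # A failed health check takes the node out of scheduling.
        return NodeState.HEALTHY if check_passed else NodeState.DRAINED
    if state is NodeState.DRAINED:
        # A power cycle is the first remediation attempt after draining.
        return NodeState.POWER_CYCLED
    # After a power cycle or maintenance, the node returns to service
    # only if validation passes; otherwise it goes to (or stays in)
    # physical maintenance.
    return NodeState.HEALTHY if check_passed else NodeState.MAINTENANCE
```

The point of the sketch is that a node never re-enters the schedulable pool without passing validation, whichever remediation step it came from.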