
Reliability

Fluidstack managed Slurm is designed for reliable training at scale.

Passive Health Checks

Fluidstack pre-installs a set of background health check add-ons that detect hardware failures on worker nodes. Detected failures are reported in the node's drain reason field.
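
Because failures surface in the drain reason field, one way to inspect them is to query Slurm from a login node. The sketch below is illustrative only and not part of the Fluidstack add-ons: it assumes the standard `sinfo` client is on the PATH, and the helper names `parse_drain_reasons` and `fetch_drain_reasons` are hypothetical. It uses `sinfo -R` (list reasons) with a node-oriented, header-free format of node name (`%N`) and reason (`%E`).

```python
import subprocess


def parse_drain_reasons(sinfo_output: str) -> dict[str, str]:
    """Map node name -> reason, given 'node|reason' lines from sinfo."""
    reasons = {}
    for line in sinfo_output.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first '|' only, so a reason containing '|' stays intact.
        node, _, reason = line.partition("|")
        reasons[node.strip()] = reason.strip()
    return reasons


def fetch_drain_reasons() -> dict[str, str]:
    """Run sinfo and return reasons for nodes that have one set.

    Assumption: the Slurm client tools are installed where this runs.
    """
    result = subprocess.run(
        ["sinfo", "-R", "-N", "-h", "-o", "%N|%E"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_drain_reasons(result.stdout)
```

Calling `fetch_drain_reasons()` on a node with Slurm client access returns a dictionary that can be printed or forwarded to alerting. The exact wording of the reason strings depends on the health check that set them, so the sketch does not try to interpret them.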

Active Health Checks

Fluidstack pre-installs an active health check add-on to detect hardware failures on worker nodes. Active health checks run on idle nodes to detect thermal throttling, performance regressions, and functional errors.

Auto-Remediation

Fluidstack automatically drains and power-cycles unhealthy nodes and schedules physical maintenance for them, which keeps the node pool healthy and schedulability high. Before a node returns from maintenance, Fluidstack runs comprehensive validation tests on it.