Reliability

Fluidstack managed Kubernetes is designed for reliable training and inference at scale.

Passive Health Checks

Fluidstack pre-installs a set of background health check add-ons that detect hardware failures on worker nodes. Hardware failures are reported as node conditions, so training and inference workloads can respond to them in a scheduler-native manner.
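
As a rough illustration of what consuming node conditions can look like, the sketch below scans a Node object (shaped like the JSON from `kubectl get node <name> -o json`) and lists the conditions that signal a problem. It assumes the usual Kubernetes convention that `Ready` is healthy when its status is `"True"`, while other conditions describe a problem when `"True"`. The function name `failing_conditions` and the condition type `ExampleGpuFault` are hypothetical; the condition names Fluidstack actually reports are not listed here.

```python
from typing import Any


def failing_conditions(node: dict[str, Any]) -> list[str]:
    """Return the condition types on a Node object that signal a problem."""
    failing = []
    for cond in node.get("status", {}).get("conditions", []):
        ctype, status = cond.get("type"), cond.get("status")
        if ctype == "Ready":
            # Ready is healthy only when its status is "True".
            if status != "True":
                failing.append(ctype)
        elif status == "True":
            # Other conditions describe a problem when "True".
            failing.append(ctype)
    return failing


example_node = {
    "status": {
        "conditions": [
            {"type": "Ready", "status": "True"},
            {"type": "MemoryPressure", "status": "False"},
            # Hypothetical condition type, for illustration only.
            {"type": "ExampleGpuFault", "status": "True"},
        ]
    }
}

print(failing_conditions(example_node))  # ['ExampleGpuFault']
```

A job controller or scheduler plugin could use a check like this to avoid placing work on a node, or to trigger a checkpoint-and-restart, when a hardware condition appears.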

Active Health Checks

Fluidstack pre-installs an active health check add-on that detects hardware failures on worker nodes. Active health checks run on idle nodes to detect thermal throttling, performance regressions, and functional errors.

Auto-Remediation

Fluidstack automatically cordons, drains, and power-cycles unhealthy nodes and schedules physical maintenance for them, keeping the node pool healthy and schedulability high. Before a node returns from maintenance, Fluidstack runs comprehensive validation tests on it.
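
From the workload side, a cordoned node is visible through the standard Kubernetes `spec.unschedulable` field, which is set to `true` when a node is cordoned. The minimal sketch below filters a list of Node objects down to those still accepting new pods. The function name `schedulable_nodes` and the example node names are hypothetical, and the sketch says nothing about how Fluidstack implements remediation internally.

```python
from typing import Any


def schedulable_nodes(nodes: list[dict[str, Any]]) -> list[str]:
    """Return the names of nodes that are not cordoned."""
    return [
        n["metadata"]["name"]
        for n in nodes
        # A cordoned node has spec.unschedulable set to true.
        if not n.get("spec", {}).get("unschedulable", False)
    ]


example_pool = [
    {"metadata": {"name": "node-a"}, "spec": {}},
    # node-b has been cordoned, for example ahead of a drain.
    {"metadata": {"name": "node-b"}, "spec": {"unschedulable": True}},
]

print(schedulable_nodes(example_pool))  # ['node-a']
```

Comparing the length of this list against the size of the node pool gives a quick view of how much capacity is currently available for scheduling.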