Reliability
Fluidstack managed Kubernetes is designed for reliable training and inference at scale.
Passive Health Checks
Fluidstack pre-installs a set of background health check add-ons to detect hardware failures on worker nodes. Hardware failures are reported as node conditions, so training and inference workloads can respond to them in a scheduler-native way.
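As a rough illustration of consuming node conditions, the sketch below scans a Node object (the JSON structure returned by the Kubernetes API) and lists the conditions that indicate a problem. It assumes that hardware health-check conditions follow the same convention as the built-in pressure conditions, where a status of "True" means something is wrong; the exact condition types the add-ons report are not listed here, so "ExampleGpuFault" is a hypothetical name used only for demonstration.

```python
def failing_conditions(node: dict) -> list[str]:
    """Return the condition types on a Node object that indicate a problem.

    "Ready" is healthy only when its status is "True".
    Every other condition is treated as a problem when its status is "True"
    (assumption: hardware health-check conditions follow this convention).
    """
    failing = []
    for cond in node.get("status", {}).get("conditions", []):
        ctype = cond.get("type")
        status = cond.get("status")
        if ctype == "Ready":
            if status != "True":
                failing.append(ctype)
        elif status == "True":
            failing.append(ctype)
    return failing


# Hypothetical Node object; "ExampleGpuFault" is an illustrative condition type.
example_node = {
    "metadata": {"name": "worker-0"},
    "status": {
        "conditions": [
            {"type": "Ready", "status": "True"},
            {"type": "MemoryPressure", "status": "False"},
            {"type": "ExampleGpuFault", "status": "True"},
        ]
    },
}

problems = failing_conditions(example_node)
```

The input can come from `kubectl get node <name> -o json` parsed with `json.loads`. A job controller or operator could use the result to avoid placing work on affected nodes or to trigger a checkpoint-and-restart.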
Active Health Checks
Fluidstack pre-installs an active health check add-on to detect hardware failures on worker nodes. Active health checks run on idle nodes to detect thermal throttling, performance regressions, and functional errors.
Auto-Remediation
Fluidstack automatically cordons, drains, and power-cycles unhealthy nodes and schedules physical maintenance for them, keeping the node pool healthy and schedulability high. Fluidstack runs comprehensive validation tests before returning a node from maintenance.
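From the workload side, a cordoned node is visible through the standard Kubernetes field `spec.unschedulable`. The sketch below is a minimal example of filtering a node list down to the nodes that are still schedulable. It assumes input shaped like the output of `kubectl get nodes -o json`; the node names are hypothetical, and the code does not reflect how Fluidstack's remediation is implemented internally.

```python
def is_cordoned(node: dict) -> bool:
    """A node is cordoned when spec.unschedulable is true."""
    return bool(node.get("spec", {}).get("unschedulable", False))


def schedulable_node_names(node_list: dict) -> list[str]:
    """Return names of nodes in a NodeList that are not cordoned."""
    return [
        node["metadata"]["name"]
        for node in node_list.get("items", [])
        if not is_cordoned(node)
    ]


# Hypothetical NodeList: worker-1 has been cordoned for remediation.
example_nodes = {
    "items": [
        {"metadata": {"name": "worker-0"}, "spec": {}},
        {"metadata": {"name": "worker-1"}, "spec": {"unschedulable": True}},
        {"metadata": {"name": "worker-2"}, "spec": {"unschedulable": False}},
    ]
}

available = schedulable_node_names(example_nodes)
```

A capacity dashboard or a pre-flight check for a large training job could use a filter like this to count the nodes currently available for scheduling.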