Observability
Fluidstack managed Slurm includes an observability suite with managed Grafana dashboards. Common uses are grouped by role below.
AI Engineer
- View per-pod GPU utilization to detect underuse.
- Track GPU memory usage and OOMKilled events to tune batch sizes.
- Monitor pod-to-pod NCCL throughput to spot InfiniBand (IB) issues.
- Correlate runtime with GPU throttling and power capping.
- Check ECC errors and node RAS (reliability, availability, serviceability) events to rule out hardware faults that affect correctness.
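
As an illustration of the first item, the sketch below flags pods whose average GPU utilization falls under a threshold. It assumes the utilization samples have already been exported from a dashboard into a plain dictionary; the function name `find_underused_pods`, the pod names, the 30% threshold, and the minimum sample count are all hypothetical and not part of the Fluidstack product.

```python
from statistics import mean


def find_underused_pods(samples, threshold_pct=30.0, min_samples=3):
    """Return {pod: mean utilization} for pods averaging below threshold_pct.

    `samples` maps a pod name to a list of GPU utilization readings (0-100).
    Pods with fewer than `min_samples` readings are skipped, since a short
    series is too noisy to judge.
    """
    underused = {}
    for pod, values in samples.items():
        if len(values) < min_samples:
            continue
        avg = mean(values)
        if avg < threshold_pct:
            underused[pod] = round(avg, 1)
    return underused


# Hypothetical readings, e.g. copied out of a per-pod utilization panel.
samples = {
    "train-0": [92.0, 88.5, 95.0, 90.0],
    "train-1": [12.0, 8.0, 15.5, 10.5],
    "eval-0": [40.0],  # too few samples, ignored
}
underused = find_underused_pods(samples)
print(underused)
```

A pod flagged this way is a candidate for a larger batch size or a data-loading fix, though the right threshold depends on the workload.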
Cluster Admin
- Monitor cluster-wide GPU and CPU utilization by namespace.
- Track scheduler latency and pending pod age.
- Check node readiness, taints, and pod restart rates.
- Monitor IB fabric utilization and port error counters.
- Correlate pod failures with hardware, cgroup, and eviction events.
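
The last item, correlating pod failures with node-level events, can be sketched as a simple time-window join. This is a minimal sketch under stated assumptions: the record layout (`pod`, `node`, `kind`, `time`), the event kinds, and the five-minute window are hypothetical, and the events are assumed to have been exported from the dashboards into Python dictionaries beforehand.

```python
from datetime import datetime, timedelta


def correlate_failures(pod_failures, node_events, window=timedelta(minutes=5)):
    """Map each failed pod to the node events that preceded the failure.

    An event is matched when it occurred on the same node as the failure
    and no more than `window` before it.
    """
    correlated = {}
    for failure in pod_failures:
        correlated[failure["pod"]] = [
            event["kind"]
            for event in node_events
            if event["node"] == failure["node"]
            and timedelta(0) <= failure["time"] - event["time"] <= window
        ]
    return correlated


# Hypothetical data for illustration only.
t0 = datetime(2024, 1, 1, 12, 0, 0)
pod_failures = [
    {"pod": "train-0", "node": "node-a", "time": t0},
    {"pod": "train-1", "node": "node-b", "time": t0 + timedelta(minutes=30)},
]
node_events = [
    {"node": "node-a", "kind": "ib_port_error", "time": t0 - timedelta(minutes=2)},
    {"node": "node-a", "kind": "eviction", "time": t0 - timedelta(minutes=20)},
    {"node": "node-b", "kind": "ecc_error", "time": t0 + timedelta(minutes=28)},
]
result = correlate_failures(pod_failures, node_events)
print(result)
```

A pod that maps to an empty list had no matching node event in the window, which points toward an application-level cause rather than hardware.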