
Observability

Fluidstack managed Slurm includes a comprehensive observability suite with managed Grafana dashboards.

AI Engineer

View per-pod GPU utilization to detect underuse.
Track GPU memory usage and OOMKilled events to tune batch sizes.
Monitor pod-to-pod NCCL throughput to spot InfiniBand (IB) issues.
Correlate job runtime with GPU throttling and power capping.
Check ECC errors and node RAS (reliability, availability, and serviceability) events to verify hardware correctness.
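
As a rough illustration of the first item, detecting underuse comes down to comparing each pod's average GPU utilization against a threshold. The sketch below assumes utilization samples have already been collected as percentages; the function name `find_underused_pods`, the pod names, and the 30% threshold are hypothetical and are not part of the Fluidstack dashboards.

```python
from statistics import mean


def find_underused_pods(samples, threshold_pct=30.0):
    """Return pods whose mean GPU utilization is below threshold_pct.

    samples: dict mapping pod name -> list of utilization percentages (0-100).
    Pods with no samples are skipped, since no average can be computed.
    """
    underused = {}
    for pod, values in samples.items():
        if not values:
            continue
        avg = mean(values)
        if avg < threshold_pct:
            underused[pod] = avg
    return underused


# Hypothetical sample data, for illustration only.
example = {
    "train-0": [92.0, 88.0, 95.0],
    "train-1": [12.0, 20.0, 16.0],
    "eval-0": [],
}
print(find_underused_pods(example))
```

A sensible threshold depends on the workload: a data-loading-bound job may sit at low utilization for legitimate reasons, so treat the result as a prompt to investigate rather than a verdict.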

Cluster Admin

Monitor cluster-wide GPU and CPU utilization by namespace.
Track scheduler latency and pending pod age.
Check node readiness, taints, and pod restart rates.
Monitor IB fabric utilization and port error counters.
Correlate pod failures with hardware, cgroup, and eviction events.
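
To make "pending pod age" concrete, the following minimal sketch computes how long each pod has been waiting and lists those over a limit, oldest first. It assumes the time each pod entered the pending state is already known; the names `pending_ages_seconds` and `stale_pods`, the pod names, and the 10-minute limit are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta, timezone


def pending_ages_seconds(pending_since, now):
    """Map each pending pod to the number of seconds it has waited as of `now`."""
    return {
        pod: (now - since).total_seconds()
        for pod, since in pending_since.items()
    }


def stale_pods(pending_since, now, max_age_seconds=600.0):
    """Return pods pending longer than max_age_seconds, oldest first."""
    ages = pending_ages_seconds(pending_since, now)
    over_limit = [pod for pod, age in ages.items() if age > max_age_seconds]
    return sorted(over_limit, key=lambda pod: ages[pod], reverse=True)


# Hypothetical timestamps, for illustration only.
now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
pending = {
    "job-a": now - timedelta(minutes=2),
    "job-b": now - timedelta(minutes=45),
    "job-c": now - timedelta(minutes=15),
}
print(stale_pods(pending, now))
```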
