Observability

Fluidstack managed Slurm includes an observability suite with managed Grafana dashboards. The lists below group common monitoring tasks by role.

AI Engineer

  • View per-pod GPU utilization to detect underuse.
  • Track GPU memory and OOMKilled events to tune batch sizes.
  • Monitor pod-to-pod NCCL throughput to spot InfiniBand (IB) issues.
  • Correlate runtime with GPU throttling and power capping.
  • Check ECC errors and node RAS (reliability, availability, serviceability) events to rule out hardware-induced correctness problems.
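
As an illustration of the first check in this list, here is a minimal sketch of flagging underused pods from utilization samples. It assumes the samples have already been exported as (pod name, GPU utilization percent) pairs; the function name `find_underused_pods`, the sample data, and the 30% threshold are all hypothetical and are not part of the Fluidstack dashboards.

```python
from statistics import mean


def find_underused_pods(samples, threshold_pct=30.0):
    """Return sorted pod names whose mean GPU utilization is below threshold_pct.

    `samples` is an iterable of (pod_name, utilization_percent) pairs.
    The 30% default is an arbitrary illustrative threshold.
    """
    by_pod = {}
    for pod, util in samples:
        by_pod.setdefault(pod, []).append(util)
    return sorted(pod for pod, vals in by_pod.items() if mean(vals) < threshold_pct)


# Hypothetical samples: two pods, two readings each.
samples = [
    ("train-0", 95.0),
    ("train-0", 91.0),
    ("train-1", 12.0),
    ("train-1", 20.0),
]
underused = find_underused_pods(samples)
print(underused)
```

Averaging over a window rather than alerting on single readings avoids flagging pods that are briefly idle between steps, for example during checkpointing or data loading.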

Cluster Admin

  • Monitor cluster-wide GPU and CPU utilization by namespace.
  • Track scheduler latency and pending pod age.
  • Check node readiness, taints, and pod restart rates.
  • Monitor InfiniBand (IB) fabric utilization and port error counters.
  • Correlate pod failures with hardware, cgroup, and eviction events.
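
To make "pending pod age" concrete, the following is a rough sketch of how it could be computed from a list of pod records. The record shape (`name`, `phase`, `created`), the function names, and the 10-minute cutoff are assumptions made for illustration; they do not describe how the managed dashboards derive this metric.

```python
from datetime import datetime, timedelta, timezone


def pending_pod_ages(pods, now):
    """Map each Pending pod's name to its age in seconds at time `now`."""
    return {
        p["name"]: (now - p["created"]).total_seconds()
        for p in pods
        if p["phase"] == "Pending"
    }


def stale_pending(pods, now, max_age=timedelta(minutes=10)):
    """Return sorted names of pods that have been Pending longer than max_age."""
    ages = pending_pod_ages(pods, now)
    return sorted(name for name, age in ages.items() if age > max_age.total_seconds())


# Hypothetical pod records with a fixed reference time.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
pods = [
    {"name": "job-a", "phase": "Pending", "created": now - timedelta(minutes=25)},
    {"name": "job-b", "phase": "Pending", "created": now - timedelta(minutes=2)},
    {"name": "job-c", "phase": "Running", "created": now - timedelta(hours=3)},
]
print(stale_pending(pods, now))
```

Pods that stay pending well beyond typical scheduler latency usually point to insufficient capacity, unsatisfied taints, or resource requests that no node can meet.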