Observability

Fluidstack managed Slurm includes a comprehensive observability suite with managed Grafana dashboards.

AI Engineer

  • View per-job GPU and Tensor Core utilization to detect underuse.
  • Track GPU memory usage and OOM kills to tune batch sizes.
  • Monitor InfiniBand throughput to detect communication bottlenecks.
  • Correlate job runtime with GPU throttling events to find the root cause of slowdowns.
  • Check ECC errors and RAS (reliability, availability, serviceability) events to ensure result integrity.
  • Correlate storage, GPU, and network metrics to identify complex bottlenecks in distributed jobs.
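
As an illustration of the first item, the sketch below flags jobs whose average GPU utilization falls under a cutoff. It is not the dashboards' own logic: the function name `find_underused_jobs`, the 30% threshold, and the sample data are all assumptions made for this example, and the utilization samples are assumed to have been exported already as percentages keyed by job ID.

```python
from statistics import mean


def find_underused_jobs(samples, threshold=30.0):
    """Return the IDs of jobs whose mean GPU utilization (percent) is below threshold.

    `samples` maps a job ID to a list of utilization readings. Jobs with no
    readings are skipped, because no conclusion can be drawn from them.
    """
    return sorted(
        job_id
        for job_id, values in samples.items()
        if values and mean(values) < threshold
    )


# Hypothetical readings for three jobs (not real cluster data).
samples = {
    "1001": [95.0, 92.0, 97.0],
    "1002": [12.0, 8.0, 20.0],
    "1003": [],
}
underused = find_underused_jobs(samples)
print(underused)
```

A mean over the whole job is a deliberately simple choice; a job that idles during data loading and then saturates the GPU may need a per-phase view instead.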

Cluster Admin

  • Monitor cluster-wide GPU and CPU utilization for efficiency.
  • Track Slurm scheduling latency and backfill effectiveness.
  • Check node availability, drain reasons, and preemption rates.
  • Monitor InfiniBand (IB) link utilization and port error counters.
  • Correlate job failures with hardware and thermal alerts.
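
The last item can be pictured as a time-window join between job failures and node alerts. The following is a minimal sketch under stated assumptions, not how the managed dashboards implement it: the function `correlate_failures`, the tuple layouts, the node names, and the 10-minute window are all hypothetical.

```python
from datetime import datetime, timedelta


def correlate_failures(failures, alerts, window=timedelta(minutes=10)):
    """Pair each failed job with alerts raised on the same node shortly before it.

    `failures` holds (job_id, node, failed_at) tuples and `alerts` holds
    (node, kind, raised_at) tuples. An alert matches when it was raised on
    the job's node at most `window` before the failure.
    """
    matches = []
    for job_id, node, failed_at in failures:
        for alert_node, kind, raised_at in alerts:
            elapsed = failed_at - raised_at
            if alert_node == node and timedelta(0) <= elapsed <= window:
                matches.append((job_id, kind))
    return matches


# Hypothetical events (not real cluster data).
failures = [
    ("2001", "gpu-node-01", datetime(2024, 1, 1, 12, 5)),
    ("2002", "gpu-node-02", datetime(2024, 1, 1, 12, 30)),
]
alerts = [
    ("gpu-node-01", "thermal", datetime(2024, 1, 1, 12, 0)),
    ("gpu-node-02", "ecc", datetime(2024, 1, 1, 11, 0)),
]
matches = correlate_failures(failures, alerts)
print(matches)
```

Here only job 2001 is paired with an alert, since the ECC alert on the second node was raised 90 minutes before that job failed, outside the assumed window.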