Observability
Fluidstack managed Slurm includes a comprehensive observability suite with managed Grafana dashboards.
AI Engineer
- View per-job GPU and Tensor Core utilization to detect underused GPUs.
- Track GPU memory usage and OOM kills to tune batch sizes.
- Monitor InfiniBand throughput to detect communication bottlenecks.
- Correlate job runtime with GPU throttling events to find the root cause of slowdowns.
- Check ECC errors and RAS events to verify result integrity.
- Correlate storage, GPU, and network metrics to identify complex bottlenecks in distributed jobs.
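The underuse check in the first item can be sketched in a few lines. This is a minimal illustration, not the logic behind the managed dashboards: it assumes you have already exported per-job GPU utilization samples as `(job_id, utilization_percent)` pairs, and the function name `find_underused_jobs` and the 30% threshold are hypothetical choices.

```python
from statistics import mean


def find_underused_jobs(samples, threshold_pct=30.0):
    """Return sorted job IDs whose mean GPU utilization is below threshold_pct.

    `samples` is an iterable of (job_id, utilization_percent) pairs.
    The data layout and the default threshold are assumptions for illustration.
    """
    by_job = {}
    for job_id, util in samples:
        by_job.setdefault(job_id, []).append(util)
    return sorted(
        job_id for job_id, utils in by_job.items() if mean(utils) < threshold_pct
    )


# Hypothetical samples: job 1001 keeps its GPUs busy, job 1002 does not.
samples = [("1001", 92.0), ("1001", 88.0), ("1002", 12.0), ("1002", 20.0)]
print(find_underused_jobs(samples))
```

A job flagged this way is a candidate for a larger batch size, fewer GPUs, or a closer look at its data loading.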
Cluster Admin
- Monitor cluster-wide GPU and CPU utilization for efficiency.
- Track Slurm scheduling latency and backfill effectiveness.
- Check node availability, drain reasons, and preemption rates.
- Monitor InfiniBand link utilization and port error counters.
- Correlate job failures with hardware and thermal alerts.
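As a rough companion to the drain-reason item above, the sketch below tallies how often each reason appears. It assumes pipe-delimited `node|reason` text with one node per line, such as might be produced by a command along the lines of `sinfo -R -N -h -o "%N|%E"`. The function name `count_drain_reasons` and the sample node names are hypothetical.

```python
from collections import Counter


def count_drain_reasons(sinfo_text):
    """Count occurrences of each drain reason in 'node|reason' lines.

    Assumes one node per line; blank lines are skipped and a missing
    reason is counted as "unknown".
    """
    reasons = Counter()
    for line in sinfo_text.splitlines():
        line = line.strip()
        if not line:
            continue
        _node, _sep, reason = line.partition("|")
        reasons[reason.strip() or "unknown"] += 1
    return reasons


# Hypothetical input for illustration only.
sample = "gpu-node-01|GPU ECC error\ngpu-node-02|Not responding\ngpu-node-03|GPU ECC error\n"
for reason, count in count_drain_reasons(sample).most_common():
    print(f"{count}  {reason}")
```

Sorting by frequency puts the most common reason first, which is usually the one worth correlating with hardware and thermal alerts.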