Observability

Fluidstack Managed Slurm includes an observability suite with managed Grafana dashboards.

AI Engineer

View per-job GPU and Tensor Core utilization to detect underused allocations.
Track GPU memory usage and OOM kills to tune batch sizes.
Monitor InfiniBand throughput to detect communication bottlenecks.
Correlate job runtime with GPU throttling events to find the root cause of slowdowns.
Check ECC errors and RAS (reliability, availability, serviceability) events to verify result integrity.
Correlate storage, GPU, and network metrics to identify complex bottlenecks in distributed jobs.
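
As a rough illustration of the first two checks in the list above, the following Python sketch flags jobs whose average GPU utilization is low or whose peak memory use is close to the limit. It does not call any Fluidstack or Grafana API. The job names, sample values, and thresholds are all hypothetical, and in practice the samples would come from whatever metrics export the dashboards are built on.

```python
from statistics import mean

# Hypothetical per-job samples: (gpu_utilization_percent, memory_used_fraction).
# These values are invented for illustration; they are not real dashboard output.
SAMPLES = {
    "job-101": [(92.0, 0.81), (95.0, 0.84), (90.0, 0.80)],
    "job-102": [(23.0, 0.35), (31.0, 0.36), (18.0, 0.34)],
    "job-103": [(88.0, 0.97), (91.0, 0.99), (85.0, 0.98)],
}


def classify_job(samples, low_util=40.0, high_mem=0.95):
    """Return a list of flags for one job.

    low_util and high_mem are assumed thresholds, chosen only for this example.
    """
    avg_util = mean(util for util, _ in samples)
    peak_mem = max(mem for _, mem in samples)
    flags = []
    if avg_util < low_util:
        flags.append("underused")   # GPUs are mostly idle
    if peak_mem >= high_mem:
        flags.append("oom-risk")    # memory is close to the limit
    return flags


def summarize(all_samples):
    """Map each job name to its list of flags."""
    return {job: classify_job(samples) for job, samples in all_samples.items()}


if __name__ == "__main__":
    for job, flags in summarize(SAMPLES).items():
        print(job, ", ".join(flags) if flags else "ok")
```

A job flagged as underused may benefit from a larger batch size or fewer GPUs, while one flagged as an OOM risk may need a smaller batch. The same pattern extends to the other signals in the list, such as throttling events or InfiniBand throughput, once those samples are available.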