Lighthouse

Lighthouse is the monitoring, observability, and reliability suite for your Fluidstack infrastructure.

Managed Metrics & Dashboards

Fluidstack infrastructure includes comprehensive monitoring and observability tooling with managed metrics and Grafana dashboards.

Passive Health Checks

Fluidstack pre-installs background health checks, built on NVIDIA's DCGM background health checks, that automatically monitor your GPUs and related hardware on worker nodes for a wide range of issues. These passive health checks help us detect early signs of hardware problems and ensure your infrastructure remains reliable. Any health check failures are reported as node conditions for Kubernetes nodes and as node reasons for Slurm nodes.

We monitor for the following types of errors:

  • PCIe errors: Excessive PCIe replays, which may indicate communication issues between the GPU and the system.
  • Memory errors:
    • Volatile or non-recoverable double-bit memory errors (DBEs) since the last GPU reset.
    • Pending or excessive retired memory pages, which can signal failing memory modules.
    • Row remapping failures and uncontained errors, which may impact GPU stability.
  • InfoROM corruption: Detection of corrupted GPU InfoROM (firmware information).
  • Thermal issues:
    • Thermal violations, such as GPUs exceeding safe temperature thresholds.
    • CPUs or GPUs operating above warning or critical temperature levels.
  • Power issues:
    • Power violations, such as GPUs exceeding their power limits.
    • CPUs drawing more power than their specified limits.
  • NVLink errors:
    • Critical NVLink replay or recovery errors.
    • High rates of NVLink CRC errors, which can affect GPU-to-GPU communication.
  • NVSwitch errors:
    • Fatal or non-fatal errors reported by NVSwitch components.
    • Detection of NVLinks reported as down.

If any of these issues are detected, they are reported as node conditions to help you quickly identify and address hardware problems.
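
These reports can be inspected with standard scheduler tooling. The commands below are a minimal sketch; <node-name> is a placeholder, and the exact condition and reason strings set by Lighthouse are not listed here.

# Kubernetes: health check failures surface as node conditions
kubectl describe node <node-name>                      # see the "Conditions:" section
kubectl get node <node-name> -o jsonpath='{.status.conditions}'

# Slurm: health check failures surface as node reasons
sinfo -R                                               # list nodes with their reason strings
scontrol show node <node-name>                         # the "Reason=" field carries the details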

Active Health Checks

info

Active health checks are currently a preview feature. Please contact us if you're interested in trying out this functionality.

Fluidstack runs on-demand active health checks, built on NVIDIA's DCGM active health checks, to ensure GPUs are actually ready to handle workloads. Whenever GPUs are idle (no active workloads), we continuously run DCGM diagnostics to validate readiness:

dcgmi diag -r 2
  • Run level: Medium (-r 2)
  • Typical duration: ~2 minutes per run
  • Tests included:
    • Deployment and software: NVIDIA library access and versioning; third-party software conflicts; integration issues
    • Interconnect and topology: Correctable/uncorrectable PCIe/NVLink issues; topological limitations; OS-level device restrictions and cgroups; basic power and thermal constraint checks
    • Hardware diagnostics: GPU hardware and SRAMs; computational robustness; memory; PCIe/NVLink

These diagnostics are preemptible: if a user workload starts, the health check is interrupted so it never blocks machine usage.
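
For reference, the same medium-level diagnostic can be run by hand on an idle node, assuming the DCGM tooling (dcgmi) is installed there. This is a sketch for manual troubleshooting, not the exact invocation Lighthouse uses.

# Confirm the GPUs are idle (no compute processes listed)
nvidia-smi --query-compute-apps=pid --format=csv,noheader

# Run the medium (level 2) DCGM diagnostic; expect roughly two minutes per run
dcgmi diag -r 2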

Any health check failures are reported as node conditions for Kubernetes nodes and as node reasons for Slurm nodes.

Auto-Remediation

info

Auto-Remediation is currently a preview feature. Please contact us if you're interested in trying out this functionality.

Fluidstack automatically drains, power-cycles, and schedules physical maintenance for nodes to ensure a healthy node pool and maintain high schedulability. Fluidstack runs comprehensive validation tests before returning a node from maintenance.
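
For context, the drain step corresponds to the standard scheduler operations sketched below; Lighthouse performs the equivalent automatically, and <node-name> is a placeholder.

# Kubernetes: stop new pods from scheduling and evict existing ones
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# Slurm: mark the node as draining so running jobs finish but no new work starts
scontrol update NodeName=<node-name> State=DRAIN Reason="scheduled maintenance"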