Skip to main content

Prometheus Integration

Rackscope queries Prometheus for all metrics — it does not collect metrics itself.

Connection Configuration

# config/app.yaml
telemetry:
prometheus_url: http://prometheus:9090
identity_label: instance # label that maps to topology instance names

Authentication

telemetry:
prometheus_url: https://prometheus.example.com
auth:
type: bearer
token: your-prometheus-token
# OR basic auth:
auth:
type: basic
username: admin
password: secret
tls:
verify: true # set false to skip cert verification (not recommended)

Health Checks (PromQL)

Checks are defined in config/checks/library/*.yaml:

checks:
- id: node_up
name: "Node Up"
kind: server
scope: node
expr: "up{job=\"node_exporter\", instance=~\"$instances\"}"
output: bool
rules:
- op: "=="
value: 1
severity: OK
- op: "=="
value: 0
severity: CRIT

PromQL Placeholders

PlaceholderReplaced with
$instancesAll node instance IDs (regex joined with |)
$chassisAll chassis IDs
$racksAll rack IDs
$jobsAll Slurm job IDs

The TelemetryPlanner replaces these placeholders with batched queries to avoid per-device query explosion.

Check Scopes

ScopeDescription
nodeEvaluated per Prometheus instance
chassisEvaluated per device chassis
rackEvaluated per rack

Metrics Library

Metrics define how to query and display specific data points:

# config/metrics/library/node_temperature.yaml
id: node_temperature
name: Node Temperature
description: "CPU/IPMI temperature sensor"
expr: "node_hwmon_temp_celsius{instance=\"{instance}\"}"
labels:
instance: "{instance}"
display:
unit: "°C"
chart_type: line
color: "#ef4444"
time_ranges: [1h, 6h, 24h, 7d]
default_range: 24h
aggregation: avg
thresholds:
warn: 70
crit: 85
category: temperature
tags: [compute, hardware]

Query Optimization

Rackscope uses a TelemetryPlanner to avoid per-device query explosion:

  • Groups all instances into batch queries (e.g., up{instance=~"node001|node002|..."})
  • Respects planner.max_ids_per_query limit (default: 100)
  • Caches results for planner.cache_ttl_seconds (default: 30s)

Never write per-node queries — use the planner's batch mechanism via check definitions.