
Prometheus Integration

Rackscope queries Prometheus for all metrics — it does not collect metrics itself.

Connection Configuration

# config/app.yaml
telemetry:
  prometheus_url: http://prometheus:9090
  identity_label: instance  # label that maps to topology instance names

Authentication

telemetry:
  prometheus_url: https://prometheus.example.com
  basic_auth_user: admin        # optional
  basic_auth_password: secret   # optional
  tls_verify: false             # only if using self-signed certificates
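To illustrate how these settings translate into a Prometheus API call, here is a minimal sketch. The function name `build_query_request` is hypothetical (not Rackscope's actual API); it targets Prometheus's standard /api/v1/query endpoint and adds HTTP Basic auth when credentials are configured.

```python
import base64
from urllib.parse import urlencode

def build_query_request(prometheus_url, promql, user=None, password=None):
    """Assemble the URL and headers for a PromQL instant query
    against Prometheus's /api/v1/query endpoint."""
    url = prometheus_url.rstrip("/") + "/api/v1/query?" + urlencode({"query": promql})
    headers = {}
    if user is not None:
        # HTTP Basic auth, as enabled by basic_auth_user / basic_auth_password
        token = base64.b64encode(f"{user}:{password}".encode()).decode()
        headers["Authorization"] = "Basic " + token
    return url, headers
```

A TLS-disabling option like tls_verify would then be passed to whatever HTTP client fires the request; it does not affect URL construction.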

Health Checks (PromQL)

Checks are defined in config/checks/library/*.yaml:

checks:
  - id: node_up
    name: "Node Up"
    kind: server
    scope: node
    expr: "up{job=\"node_exporter\", instance=~\"$instances\"}"
    output: bool
    rules:
      - op: "=="
        value: 1
        severity: OK
      - op: "=="
        value: 0
        severity: CRIT
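The rule list maps a query result to a severity. The evaluator below is an illustrative sketch of that first-match semantics (the rule dictionaries mirror the YAML above, but `evaluate_rules` itself is hypothetical, not Rackscope's implementation):

```python
import operator

# Comparison operators a rule's "op" field may use
OPS = {"==": operator.eq, "!=": operator.ne, "<": operator.lt,
       ">": operator.gt, "<=": operator.le, ">=": operator.ge}

def evaluate_rules(value, rules, default="UNKNOWN"):
    """Return the severity of the first rule that matches, in order."""
    for rule in rules:
        if OPS[rule["op"]](value, rule["value"]):
            return rule["severity"]
    return default

rules = [{"op": "==", "value": 1, "severity": "OK"},
         {"op": "==", "value": 0, "severity": "CRIT"}]
```

With these rules, a sample of 1 yields OK and a sample of 0 yields CRIT; anything else falls through to the default.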

PromQL Placeholders

Placeholder    Replaced with
$instances     All node instance IDs (regex, joined with |)
$chassis       All chassis IDs
$racks         All rack IDs
$jobs          All Slurm job IDs

The TelemetryPlanner replaces these placeholders with batched queries to avoid per-device query explosion.
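A minimal sketch of that substitution, assuming IDs are joined into one regex alternation (`expand_placeholder` is an illustrative helper, not the planner's actual code; regex-escaping guards against metacharacters in instance names):

```python
import re

def expand_placeholder(expr, placeholder, ids):
    """Replace a placeholder like $instances with a regex alternation of IDs."""
    pattern = "|".join(re.escape(i) for i in ids)
    return expr.replace(placeholder, pattern)

expr = 'up{job="node_exporter", instance=~"$instances"}'
result = expand_placeholder(expr, "$instances", ["node001", "node002"])
# result: up{job="node_exporter", instance=~"node001|node002"}
```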

Check Scopes

Scope     Description
node      Evaluated per Prometheus instance
chassis   Evaluated per device chassis
rack      Evaluated per rack

Metrics Library

Metrics define how to query and display specific data points:

# config/metrics/library/node_temperature.yaml
id: node_temperature
name: Node Temperature
description: "CPU/IPMI temperature sensor"
expr: "node_hwmon_temp_celsius{instance=\"{instance}\"}"
labels:
  instance: "{instance}"
display:
  unit: "°C"
  chart_type: line
  color: "#ef4444"
time_ranges: [1h, 6h, 24h, 7d]
default_range: 24h
aggregation: avg
thresholds:
  warn: 70
  crit: 85
category: temperature
tags: [compute, hardware]
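The warn/crit thresholds classify a sample for display. A hedged sketch of that mapping, assuming higher values are worse (the `classify` helper is illustrative, not Rackscope's actual function):

```python
def classify(value, thresholds):
    """Map a metric sample against warn/crit thresholds from a definition.
    Assumes higher values are worse, as with temperature."""
    if value >= thresholds["crit"]:
        return "crit"
    if value >= thresholds["warn"]:
        return "warn"
    return "ok"

temp_thresholds = {"warn": 70, "crit": 85}
```

With the node_temperature thresholds above, 50 °C is ok, 72 °C is warn, and 90 °C is crit.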

Query Optimization

Rackscope uses a TelemetryPlanner to avoid per-device query explosion:

  • Groups all instances into batch queries (e.g., up{instance=~"node001|node002|..."})
  • Respects planner.max_ids_per_query limit (default: 200)
  • Caches results for planner.cache_ttl_seconds (default: 30s)

Never write per-node queries — use the planner's batch mechanism via check definitions.
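A sketch of the batching described above, assuming the planner simply chunks the ID list at the configured limit (both helpers here are illustrative, not the TelemetryPlanner's real code):

```python
import re

def batch_ids(ids, max_ids_per_query=200):
    """Chunk instance IDs so no single query exceeds planner.max_ids_per_query."""
    return [ids[i:i + max_ids_per_query]
            for i in range(0, len(ids), max_ids_per_query)]

def batched_queries(ids, max_ids_per_query=200):
    """One up{...} query per chunk instead of one query per node."""
    return ['up{instance=~"%s"}' % "|".join(re.escape(i) for i in chunk)
            for chunk in batch_ids(ids, max_ids_per_query)]
```

For 450 nodes at the default limit, this yields three queries of 200, 200, and 50 instances rather than 450 separate requests.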

Query Caching Strategy

Prometheus queries are cached at four independent levels:

Layer                Config key                        Default   What it caches
Service cache        cache.service_ttl_seconds         5s        Fully assembled JSON responses (room state, rack state)
Planner snapshot     planner.cache_ttl_seconds         60s       All node/rack health states in one snapshot
Health query cache   cache.health_checks_ttl_seconds   30s       Raw PromQL results for health checks
Metrics query cache  cache.metrics_ttl_seconds         120s      Raw PromQL results for detailed metrics

On a cache hit, zero Prometheus queries are fired. A request reaches Prometheus only when every cache layer above it has expired.
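The core mechanism at each layer is the same: a value is served until its TTL elapses. A minimal TTL-cache sketch, purely illustrative of the pattern (not Rackscope's cache implementation):

```python
import time

class TTLCache:
    """Minimal TTL cache: entries expire after ttl_seconds."""
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, insertion timestamp)

    def get(self, key):
        hit = self._store.get(key)
        if hit is None:
            return None
        value, stamp = hit
        if time.monotonic() - stamp > self.ttl:
            del self._store[key]  # expired: evict and report a miss
            return None
        return value

    def put(self, key, value):
        self._store[key] = (value, time.monotonic())
```

Layering several of these with different TTLs (5s service, 60s planner, and so on) gives the behavior in the table: the fast outer layer absorbs most traffic, and inner layers only see requests the outer ones miss.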

To monitor cache effectiveness:

curl http://localhost:8000/api/stats/telemetry
# Returns: cache_hits, cache_misses, in_flight, query_count, avg_latency_ms

A healthy deployment shows cache_hits / (cache_hits + cache_misses) > 95% under normal load.
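That ratio can be computed directly from the stats payload; a small sketch (field names taken from the endpoint output above, the helper itself is illustrative):

```python
def hit_ratio(stats):
    """Cache hit ratio from the /api/stats/telemetry counters."""
    total = stats["cache_hits"] + stats["cache_misses"]
    return stats["cache_hits"] / total if total else 0.0
```

For example, 97 hits against 3 misses gives a ratio of 0.97, comfortably above the 95% target.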

See Performance & Caching for the full architecture diagram.