Skip to main content

Metrics Simulator

The Rackscope Simulator generates realistic Prometheus metrics for development, testing, and presentations without any real hardware. Prometheus scrapes it on its normal schedule; the backend queries Prometheus exactly as in production — the simulator is completely transparent to the rest of the stack.

Simulator Settings


Architecture: Two Independent Systems

The simulator involves two completely separate concepts that are often confused. Understanding the distinction is critical before making any configuration change.

Simulator Catalog — What Gets Generated

File: config/plugins/simulator/metrics/metrics_full.yaml

The metrics catalog defines which Prometheus metrics the simulator generates and exposes on :9000. It controls metric names, label sets, instance filtering, and how the simulator engine populates values (from behavioral profiles, incidents, and overrides).

The catalog is read by plugins/simulator/process/main.py on every tick. Changes to it do not affect what the UI displays — only what Prometheus scrapes.

The default catalog (metrics_full.yaml) produces 43 distinct metric families grouped by exporter:

GroupMetric prefixNotes
Simulator syntheticup, node_temperature_celsius, node_power_watts, node_load_percent, node_health_statusDerived from behavioral profiles
Node Exporter — CPUnode_cpu_seconds_totalMonotonic idle counter
Node Exporter — Memorynode_memory_MemTotal_bytes, node_memory_MemAvailable_bytes
Node Exporter — Filesystemnode_filesystem_size_bytes, node_filesystem_avail_bytesmountpoint=/
Node Exporter — Disk I/Onode_disk_read_bytes_total, node_disk_written_bytes_totaldevice=sda
Node Exporter — Networknode_network_receive_bytes_total, node_network_transmit_bytes_totaldevice=eth0
IPMIipmi_fan_speed_state, ipmi_power_state, ipmi_sensor_state, ipmi_temperature_state, ipmi_voltage_state, ipmi_upcompute/visu/srv/service only
Switchesswitch_port_oper_status, switch_port_in_octets, switch_port_out_octets*isw*, *esw*, sw-*
Slurmslurm_node_statuscompute/visu, labels_only
E-Series storageeseries_* (8 metrics)da*-r02-* pattern
Sequana3 HYCsequana3_hyc_* (4 metrics)r01-* racks
Sequana3 PMCsequana3_pmc_total_wattr01-* racks
Raritan PDUraritan_pdu_* (6 metrics)all racks

Display Library — How Metrics Are Shown

Directory: config/metrics/library/

The display library defines how the UI renders metrics in charts: units, axis labels, colors, thresholds, and chart types. It does not affect which metrics the simulator generates.

The default library ships with 7 display metrics: node_temperature, node_power, pdu_active_power, pdu_apparent_power, pdu_current, pdu_voltage, and pdu_energy.

Adding a metric to the display library does not make the simulator generate it. Adding a metric to the simulator catalog does not make the UI show a chart for it. Both must be configured independently.


Folder Structure

config/plugins/simulator/
├── config/
│ └── plugin.yaml ← single config file (hot-reloaded every tick)
├── metrics/
│ ├── metrics_full.yaml ← primary catalog (all 43 metric families)
│ ├── metrics_slurm.yaml ← Slurm-only catalog (labels_only format)
│ └── metrics_examples.yaml ← reference examples for custom catalogs
└── overrides/
└── overrides.yaml ← runtime metric overrides (TTL-aware)

How It Works: The Tick Loop

The simulator runs a continuous loop. On every tick:

Every N seconds (update_interval_seconds, default: 20):

1. Reload plugin.yaml
Hot-reload: changes take effect on the next tick without restarting.

2. Load topology
Discovers all instances (compute nodes, switches, storage arrays, PDUs)
from the YAML topology files mounted into the simulator container.

3. Load overrides (with TTL expiration)
Reads config/plugins/simulator/overrides/overrides.yaml.
Overrides with expires_at in the past are silently skipped.

4. Roll incidents (on cycle boundary)
Every `3600 / changes_per_hour / update_interval` ticks, a new random set
of victims is drawn from the topology according to the active incident_mode:
- nodes_crit → up=0, Slurm=down
- nodes_warn → health_status=1, Slurm=drain
- racks_crit → all nodes in rack go down
- aisles_hot → +12 °C temperature boost on all nodes in aisle

5. Generate metrics
For each discovered instance, compute gauge values based on the
active behavioral profile, any active incidents, and overrides.

6. Expose on :9000
All gauges are registered with the prometheus_client library and
served at http://simulator:9000/metrics in Prometheus text format.

7. Prometheus scrapes :9000
The Prometheus container scrapes the simulator at its configured
interval. The backend then queries Prometheus with PromQL exactly
as it would against production exporters.

Enabling Demo Mode

Enabling the simulator requires a single setting in config/app.yaml. The features.demo flag no longer exists — the simulator plugin is fully self-contained.

config/app.yaml:

plugins:
simulator:
enabled: true

When the simulator is active, a small DEMO ribbon appears in the top-left corner of the UI as a visual indicator.

config/plugins/simulator/config/plugin.yaml (authoritative config, hot-reloaded every tick):

update_interval_seconds: 20
seed: null
incident_mode: light # full_ok | light | medium | heavy | chaos | custom
changes_per_hour: 2

Alternatively, navigate to Settings in the UI, find the Plugins — Simulator section, and use the Enable toggle and incident mode selector. Settings are written back to plugin.yaml and app.yaml immediately.

Hot-reload

plugin.yaml is re-read on every simulator tick. You can edit it while the stack is running and the change will take effect within one tick (default: 20 seconds). No container restart is required.


Incident Modes

The incident_mode field selects a named failure preset. On each reshuffle cycle, the simulator draws a new random set of victims within the configured bounds. The reshuffle frequency is controlled by changes_per_hour.

Available Modes

ModeNodes CRITNodes WARNRacks CRITAisles hotBest for
full_ok0000Baseline demos, UI walkthroughs, testing normal state
light1–31–500Day-to-day development, standard demos
medium1–35–1010Testing health aggregation, rack-level propagation
heavy5–1010–2021Stress-testing the dashboard, aisle cooling scenarios
chaos15 %25 %20 %25 %Maximum churn, NOC operator training
customconfigurableconfigurableconfigurableconfigurableReproducible exact-count scenarios

Changing the Incident Mode

Via the Settings UI (recommended):

Navigate to Settings → Plugins → Simulator → Advanced → Incident Mode and select from the dropdown. The change is hot-reloaded within one tick.

Via plugin.yaml (hot-reloaded every tick):

# config/plugins/simulator/config/plugin.yaml
incident_mode: heavy
changes_per_hour: 4

Custom Mode

When incident_mode: custom, the simulator uses exact counts from custom_incidents:

incident_mode: custom
changes_per_hour: 2
custom_incidents:
devices_crit: 5
devices_warn: 10
racks_crit: 1
aisles_hot: 0

Behavioral Profiles

Behavioral profiles define the baseline characteristics of different node types. The simulator assigns each instance to a profile based on its name and then generates metric values by sampling from the profile's ranges.

Profile Definitions (from plugin.yaml)

Profilebase_temptemp_rangebase_powerpower_varload_minload_max
compute24 °C±8 °C200 W±200 W40 %80 %
gpu28 °C±25 °C250 W±500 W5 %100 %
service21 °C±3 °C100 W±20 W2 %8 %
network32 °C±4 °C120 W±10 W15 %15 %

Profile assignment rules (evaluated in order, first match wins):

  • Instance name matches gpu*gpu
  • Instance name matches login*, mngt*, srv*, service*service
  • Instance name matches sw-*, *isw*, *esw*network
  • Everything else → compute

Profile values can be customized in plugin.yaml under the profiles key. These are process-only fields (not validated by the backend Pydantic model).


Incident Effects

Regardless of the active mode, all four incident types have the same observable effects:

Critical nodes (nodes_crit)

Nodes drawn into the CRIT set:

  • up set to 0
  • node_health_status set to 2 (CRIT)
  • Power drops to 50 W (standby/BMC-only draw)
  • CPU load drops to 0
  • Slurm status: down

Warning nodes (nodes_warn)

Nodes drawn into the WARN set:

  • up remains 1
  • node_health_status set to 1 (WARN)
  • Temperature, power and load unaffected
  • Slurm status: drain

Critical racks (racks_crit)

All nodes in a CRIT rack receive the same treatment as nodes_crit above. Additionally, the rack cooling metrics (Sequana3) show:

  • Inlet pressure drops to ~160 kPa
  • Leak sensor reads ~1.2
  • Board temperature rises to ~95 °C
  • PMC power output drops to 0 W

Hot aisles (aisles_hot)

All nodes in a hot aisle receive a +12 °C temperature boost:

  • Temperatures exceed WARN threshold (typically 38 °C) on most nodes
  • High-draw nodes (GPU) may exceed CRIT threshold (45 °C)
  • up is NOT affected — nodes remain reachable
  • Aisle cooling (Sequana3) shows reduced pressure and elevated leak sensor

Metrics Catalog

Catalog Files

metrics_full.yaml — Primary catalog for the default demo topology. Covers all 43 metric families: node exporter, IPMI, switches, Slurm, E-Series, Sequana3, and Raritan PDU. This is the file used by default.

metrics_slurm.yaml — Standalone Slurm catalog. Use this when you want Slurm metrics (slurm_node_status) without the full E-Series and Sequana3 families, for example on a simple air-cooled topology with no storage arrays.

metrics_examples.yaml — Not loaded in production. Contains annotated examples of every catalog feature: static labels, $placeholder variables, labels_only, rack-scoped metrics, and explicit instance lists vs wildcard patterns. Start here when writing a custom catalog.

Catalog Entry Schema

metrics:
- name: metric_name # Prometheus metric name (required)
scope: node | rack # Evaluation scope (required)
instances: ["pattern"] # fnmatch patterns — node-scoped only
racks: ["pattern"] # fnmatch patterns — rack-scoped only
labels: # Extra label key=value pairs (optional)
key: $placeholder # $var resolved at generation time
key: literal_value # static label value
labels_only: true # Omit BASE_LABELS (see below)

Instance and Rack Patterns

Patterns follow fnmatch syntax:

PatternMatches
"*"All instances
"compute*"Instances starting with compute
"*isw*"Instances containing isw (InfiniBand switches)
"da*-r02-*"E-Series arrays in rack group r02
"compute[001-010]"Expanded range: compute001 through compute010

BASE_LABELS

Unless labels_only: true is set, the simulator automatically adds the following topology context labels to every node-scoped metric:

site_id, room_id, rack_id, chassis_id, node_id, instance, job

These labels allow Rackscope's health checks to filter metrics by topology position (e.g. up{rack_id="r01-01"}).

labels_only Mode

Use labels_only: true when the target metric must match a real exporter format that does NOT include topology labels. The primary example is slurm_node_status from slurm_exporter:

- name: slurm_node_status
scope: node
instances: ["compute*", "visu*"]
labels_only: true # Only emit the labels defined below
labels:
node: $node_id
partition: $partition
status: $status

Real output matches the exporter exactly:

slurm_node_status{node="compute001",partition="compute",status="idle"} 1

Available Placeholders

Placeholders in label values are resolved at metric generation time:

PlaceholderResolved to
$node_idInstance name (e.g. compute001)
$rack_idRack identifier (e.g. r01-01)
$statusInferred state (optimal/degraded/failed for storage; idle/allocated/drain for Slurm)
$portPort number (for switch port metrics)
$cpuCPU index
$pduidPDU unit identifier
$pdunamePDU unit name
$inletidPDU inlet identifier (e.g. I1, I2)
$drive_idE-Series drive identifier
$slotSlot number
$trayTray number
$partitionSlurm partition name
$stateSequana3 HYC state string

Runtime Overrides

Overrides allow you to force specific metric values for any instance or rack at runtime — without touching any configuration file. This is the primary mechanism for interactive demos and failure injection during presentations.

Overrides are persisted to config/plugins/simulator/overrides/overrides.yaml and survive backend restarts. The simulator reads this file on every tick and applies overrides before generating the final metric values.

TTL Rules

  • ttl_seconds: 0 — Permanent. The override stays until manually deleted via the API or the Settings UI.
  • ttl_seconds: 300 — Expires after 5 minutes. The simulator skips expired overrides without deleting the file.
  • No ttl_seconds in the request — the backend uses default_ttl_seconds from plugin.yaml (default: 120 seconds).

Common Override Examples

Force a node down permanently:

curl -X POST http://localhost:8000/api/simulator/overrides \
-H "Content-Type: application/json" \
-d '{"instance": "compute001", "metric": "up", "value": 0, "ttl_seconds": 0}'

Spike temperature for 5 minutes:

curl -X POST http://localhost:8000/api/simulator/overrides \
-H "Content-Type: application/json" \
-d '{"instance": "compute001", "metric": "node_temperature_celsius", "value": 92, "ttl_seconds": 300}'

Set a node into WARN health state:

curl -X POST http://localhost:8000/api/simulator/overrides \
-H "Content-Type: application/json" \
-d '{"instance": "compute001", "metric": "node_health_status", "value": 1, "ttl_seconds": 0}'

Take down an entire rack for 2 minutes:

curl -X POST http://localhost:8000/api/simulator/overrides \
-H "Content-Type: application/json" \
-d '{"rack_id": "r02-01", "metric": "rack_down", "value": 1, "ttl_seconds": 120}'

List all active overrides:

curl http://localhost:8000/api/simulator/overrides

Delete a specific override (use the id field from the list response):

curl -X DELETE http://localhost:8000/api/simulator/overrides/compute001-up-1770032278

Clear all overrides:

curl -X DELETE http://localhost:8000/api/simulator/overrides

Overridable Metrics

The following metrics can be set via the override API:

MetricValid valuesNotes
up0 or 1Node reachability
rack_down0 or 1Requires rack_id, not instance
node_temperature_celsiusAny floatDegrees Celsius
node_power_wattsAny floatWatts
node_load_percentAny float0–100
node_health_status0, 1, or 20=OK, 1=WARN, 2=CRIT

Any metric defined in the display library (config/metrics/library/) is also accepted.


Slurm Integration

When the Slurm plugin is enabled alongside the Simulator, the simulator injects random Slurm node status transitions every tick to populate the Slurm wallboard and dashboards with realistic data.

The slurm_random_statuses block in plugin.yaml controls how many nodes are randomly placed into each non-idle state per tick:

# config/plugins/simulator/config/plugin.yaml
slurm_random_statuses:
drain: 1 # 1 random node set to drain each tick
down: 1 # 1 random node set to down each tick
maint: 1 # 1 random node set to maint each tick

slurm_random_match:
- compute* # Only apply to nodes matching these patterns
- visu*

These Slurm statuses are generated via slurm_node_status metrics with labels_only: true. The Rackscope backend queries them with the standard PromQL: slurm_node_status{node=~"..."} — exactly the same query used against a real slurm_exporter.


Custom Metrics Catalog

When to Use a Custom Catalog

  • Your topology has hardware not covered by the default catalog (e.g. a different storage vendor, custom health exporter)
  • You want to reduce the number of generated metrics for a lighter demo
  • You have a standalone topology with no E-Series or Sequana3 hardware

Adding a Custom Catalog

  1. Copy config/plugins/simulator/metrics/metrics_examples.yaml to a new file (e.g. config/plugins/simulator/metrics/my_metrics.yaml).

  2. Define your metrics using the schema described in the Metrics Catalog section above. Use metrics_examples.yaml as a reference for all supported patterns.

  3. Add the new catalog to plugin.yaml:

# config/plugins/simulator/config/plugin.yaml
metrics_catalog_path: config/plugins/simulator/metrics/metrics_full.yaml

metrics_catalogs:
- id: my_hardware
path: config/plugins/simulator/metrics/my_metrics.yaml
enabled: true

Additional catalogs are merged on top of the primary. When two catalogs define an entry with the same name, the later entry (higher in the list) takes precedence.

  1. The change takes effect on the next simulator tick — no restart needed.

Replacing the Primary Catalog

For a minimal deployment (e.g. Slurm-only demo without any storage metrics):

# config/plugins/simulator/config/plugin.yaml
metrics_catalog_path: config/plugins/simulator/metrics/metrics_slurm.yaml
metrics_catalogs: []

Discovering Available Metrics

To list all metrics that the display library can render (used by the Settings UI override form):

curl http://localhost:8000/api/simulator/metrics

To see raw Prometheus output from the simulator directly:

curl http://localhost:9000/metrics

To check whether the simulator is running and what scenario is active:

curl http://localhost:8000/api/simulator/status