Example Configurations

Rackscope ships with four ready-to-use example configurations that run out of the box against the built-in simulator. Each example is self-contained under config/examples/ and can be activated with a single command.

Available examples

Example         Nodes     Racks   Sites   Slurm   Use case
homelab         ~23       3       1       No      Local lab, first discovery
small-cluster   ~600      12      1       Yes     University, SME, small HPC
hpc-cluster     ~1 900    25      1       Yes     Production HPC cluster
exascale        ~14 000   241     3       Yes     Large-scale datacenter

Quick switch

Dev stack

make use-homelab
make use-small-cluster
make use-hpc-cluster
make use-exascale

# Or with an explicit argument
make use EXAMPLE=hpc-cluster

# Check what's currently active
make which-config

The active config is stored in .env (gitignored) and picked up automatically by Docker Compose on the next make up or make restart. Only the backend and simulator are restarted — no rebuild required.

Prod stack

make use-prod-homelab
make use-prod EXAMPLE=hpc-cluster

Default config

config/app.yaml points to hpc-cluster out of the box. Run make up and the HPC cluster example starts immediately — no extra command needed.
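The switch works by pointing all config paths at one example directory. As a rough sketch of what that pointer file could look like (the key names below are assumptions, not the actual schema — the shipped config/app.example.hpc-cluster.yaml is the authoritative reference):

```yaml
# Hypothetical sketch of config/app.yaml — key names are assumptions;
# only the directory layout comes from the "Config structure" section.
topology:
  path: config/examples/hpc-cluster/topology
templates:
  path: config/examples/hpc-cluster/templates
checks:
  library: config/examples/hpc-cluster/checks/library
metrics:
  library: config/examples/hpc-cluster/metrics/library
plugins:
  simulator: config/examples/hpc-cluster/plugins/simulator/config/plugin.yaml
  slurm: config/examples/hpc-cluster/plugins/slurm/config.yml
```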


homelab

The minimal starting point. Good for first-time setup, local testing, or development.

my-lab
└── server-room
    └── main-aisle
        ├── rack-01 — 8 × 1U compute servers (compute001–008)
        ├── rack-02 — 8 × 1U compute servers (compute009–016)
        └── rack-03 — management + network rack
  • ~23 nodes — compute, mgmt, login, storage-head, switches
  • Rack type: rack-42u-air (PDU ×2, FAN module, PSU module at rear)
  • No Slurm — standalone compute, no workload manager
  • Simulator: incident_mode: light
  • Checks: node_up, temperature (°C), power (W), PDU load, fan state
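The checks listed above live in the example's checks/library/ as PromQL health checks (see "Config structure" below). A minimal sketch of what a temperature check definition might look like — the field names, metric name, and thresholds here are assumptions for illustration, not Rackscope's actual check schema:

```yaml
# Hypothetical check definition (checks/library/temperature.yaml) —
# schema and metric name are assumptions; only "PromQL check on node
# temperature in °C" comes from the text above.
id: temperature
scope: node
expr: avg by (node) (node_temperature_celsius)
warn: "value > 70"   # °C
crit: "value > 85"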

small-cluster

A realistic small HPC cluster — university scale or departmental computing.

dc-main
└── machine-room (1 room)
    ├── aisle-compute — 6 racks air (2U quad-chassis, 4 nodes each)
    ├── aisle-gpu — 2 racks DCW (4U GPU chassis, 4 nodes each)
    └── aisle-infra — login/visu/mgmt + storage + network
  • ~600 nodes — compute, GPU, login, visu, mgmt, storage, switches
  • Rack types: rack-42u-air (infra) + rack-42u-dcw (GPU — PMC + 2×HMC)
  • Slurm: enabled · partitions: cpu, gpu
  • Simulator: incident_mode: medium
  • Checks: + IB port state, switch port errors, PDU current

hpc-cluster

A production-grade HPC cluster with water-cooled compute aisles.

dc-paris (1 site)
├── machine-room-a (Compute Floor)
│   ├── aisle-compute-std — 10 racks DCW (1U 3-node chassis)
│   ├── aisle-compute-hm — 5 racks DCW (1U 2-node high-memory)
│   ├── aisle-gpu — 5 racks DCW (4U quad-GPU chassis)
│   └── aisle-storage — storage + network racks (air)
└── machine-room-b (Services)
    └── aisle-services — login, visu, mgmt
Category                            Racks   Nodes
Standard compute (1U 3-node DCW)    10      ~400
High-memory (1U 2-node DCW)         5       ~200
GPU (4U quad DCW)                   5       ~200
Services + storage                  5       ~60
  • Rack types: DCW dominant (rack-42u-dcw with PMC + 2×HMC)
  • Slurm: enabled · partitions: cpu, gpu, hm, visu
  • Simulator: incident_mode: medium
  • Checks: + HMC temperature/flow/leak, PMC power

exascale

A large-scale datacenter across three sites — Paris, Toulouse, Berlin.

site-paris (Paris, 48°N 2°E)
├── room-compute-a (4 aisles)
├── room-compute-b (2 aisles)
├── room-gpu-hm (2 aisles)
└── room-infra (services+network)

site-toulouse (Toulouse, 44°N 1°E)
├── room-compute-a (3 aisles)
├── room-gpu-hm (2 aisles)
└── room-infra (services+network)

site-berlin (Berlin, 52°N 13°E)
├── room-compute-a (2 aisles)
└── room-infra (services)
Category                        Total nodes
Compute (1U 3-node DCW)         ~4 000
GPU (4U quad DCW)               ~800
High-memory (1U 2-node DCW)     ~520
Services (login, visu, mgmt)    ~150
Total                           ~14 000
  • World Map: lat/lon on all 3 sites
  • Slurm: enabled · partitions: cpu, gpu, hm, visu, bigmem, transfer
  • Simulator: incident_mode: heavy
  • Note: the planner needs ~30s after startup to process all nodes
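The World Map feature relies on each site carrying coordinates in the topology files. A hedged sketch of what such a site entry could look like — the field names are assumptions; only the three sites and their approximate coordinates come from the diagram above:

```yaml
# Hypothetical topology/sites.yaml entry — schema is an assumption;
# coordinates follow the site descriptions above.
sites:
  - name: site-paris
    lat: 48.0
    lon: 2.0
  - name: site-toulouse
    lat: 44.0
    lon: 1.0
  - name: site-berlin
    lat: 52.0
    lon: 13.0
```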
Performance

With 14 000 nodes, the exascale example requires max_ids_per_query: 300 (default) and a few extra seconds for the initial planner snapshot. All racks will be green once the first full cycle completes.
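For reference, the tuning knob mentioned above could be expressed as follows — where max_ids_per_query lives in the configuration is an assumption; only the key name and its default of 300 come from the paragraph above:

```yaml
# Hypothetical placement — the surrounding "planner" section is an
# assumption; max_ids_per_query: 300 is the documented default.
planner:
  max_ids_per_query: 300   # keep at the default for ~14 000 nodes
```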


Rack types

Both rack templates include IB + ETH switches in each rack slot.

Template       Cooling        Rear components
rack-42u-air   Air            PDU ×2 (sides) + FAN module + PSU module
rack-42u-dcw   Direct water   PDU ×2 (sides) + PMC power module + HMC module ×2
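Rack templates are defined under each example's templates/ directory (see "Config structure" below). A minimal sketch of what the water-cooled template might contain — field names are assumptions; the component list follows the table above:

```yaml
# Hypothetical rack template (templates/rack-42u-dcw.yaml) — schema
# is an assumption, components come from the table above.
name: rack-42u-dcw
height_u: 42
cooling: direct-water
rear_components:
  - pdu          # ×2, mounted on the sides
  - pdu
  - pmc          # power module
  - hmc          # hydraulic module ×2
  - hmc
slots:
  - ib-switch    # both templates include IB + ETH switches
  - eth-switch
```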

Config structure

Each example is self-contained:

config/examples/{name}/
├── topology/         ← sites, rooms, aisles, racks, devices
├── templates/        ← device + rack + rack_component templates
├── checks/library/   ← PromQL health checks
├── metrics/library/  ← metric definitions (temp, power, PDU…)
└── plugins/
    ├── simulator/config/plugin.yaml   ← incident mode, catalogs
    └── slurm/config.yml               ← Slurm labels, status map

The corresponding config/app.example.{name}.yaml points all paths to config/examples/{name}/.
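The simulator's incident behavior is the main thing that varies between examples. A hedged sketch of the relevant part of plugins/simulator/config/plugin.yaml — only the incident_mode key and its light/medium/heavy values appear in the examples above; everything else in the file is elided:

```yaml
# plugins/simulator/config/plugin.yaml — partial, hypothetical sketch.
# incident_mode per example: homelab=light, small-cluster=medium,
# hpc-cluster=medium, exascale=heavy.
incident_mode: medium   # light | medium | heavy
```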


Non-regression tests

A complete test suite validates all examples:

# Run full suite (~25 min)
python3 scripts/validate_examples.py all

# Run single example
python3 scripts/validate_examples.py hpc-cluster

Each example runs two loops:

  1. Normal mode — baseline validation (topology, metrics, rack states)
  2. Incident mode — injects 10 CRIT, 10 WARN, and 1 rack-level CRIT incident to validate the check engine

See config/examples/TEST_PLAN.md for the full test specification.