Example Configurations

Rackscope ships with four ready-to-use example configurations that run out of the box against the built-in simulator. Each example is self-contained under config/examples/ and can be activated with a single command.

Available examples

Example         Nodes     Racks   Sites   Slurm   Use case
homelab         ~23       3       1       No      Local lab, first discovery
small-cluster   ~600      12      1       Yes     University, SME, small HPC
hpc-cluster     ~1 900    25      1       Yes     Production HPC cluster
exascale        ~14 000   241     3       Yes     Large-scale datacenter

Quick switch

Dev stack

make use-homelab
make use-small-cluster
make use-hpc-cluster
make use-exascale

# Or with an explicit argument
make use EXAMPLE=hpc-cluster

# Check what's currently active
make which-config

The active config is stored in .env (gitignored) and picked up automatically by Docker Compose on the next make up or make restart. Only the backend and simulator are restarted — no rebuild required.

Prod stack

make use-prod-homelab
make use-prod EXAMPLE=hpc-cluster

Default config

config/app.yaml points to hpc-cluster out of the box. Run make up and the HPC cluster example starts immediately — no extra command needed.
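The switch works by pointing all config paths at one example directory. As a rough sketch of what that pointer file could look like (the key names below are assumptions, not the actual schema — the shipped config/app.example.hpc-cluster.yaml is the authoritative reference):

```yaml
# Hypothetical sketch of config/app.yaml — key names are assumptions;
# only the directory layout comes from the "Config structure" section.
topology:
  path: config/examples/hpc-cluster/topology
templates:
  path: config/examples/hpc-cluster/templates
checks:
  library: config/examples/hpc-cluster/checks/library
metrics:
  library: config/examples/hpc-cluster/metrics/library
plugins:
  simulator: config/examples/hpc-cluster/plugins/simulator/config/plugin.yaml
  slurm: config/examples/hpc-cluster/plugins/slurm/config.yml
```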


homelab

The minimal starting point. Good for first-time setup, local testing, or development.

my-lab
└── server-room
    └── main-aisle
        ├── rack-01 — 8 × 1U compute servers (compute001–008)
        ├── rack-02 — 8 × 1U compute servers (compute009–016)
        └── rack-03 — management + network rack
  • ~23 nodes — compute, mgmt, login, storage-head, switches
  • Rack type: rack-42u-air (PDU ×2, FAN module, PSU module at rear)
  • No Slurm — standalone compute, no workload manager
  • Simulator: incident_mode: light
  • Checks: node_up, temperature (°C), power (W), PDU load, fan state
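The checks listed above live in the example's checks/library/ as PromQL health checks (see "Config structure" below). A minimal sketch of what a temperature check definition might look like — the field names, metric name, and thresholds here are assumptions for illustration, not Rackscope's actual check schema:

```yaml
# Hypothetical check definition (checks/library/temperature.yaml) —
# schema and metric name are assumptions; only "PromQL check on node
# temperature in °C" comes from the text above.
id: temperature
scope: node
expr: avg by (node) (node_temperature_celsius)
warn: "value > 70"   # °C
crit: "value > 85"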

small-cluster

A realistic small HPC cluster — university scale or departmental computing.

dc-main
└── machine-room (1 room)
    ├── aisle-compute — 6 racks air (2U quad-chassis, 4 nodes each)
    ├── aisle-gpu — 2 racks DCW (4U GPU chassis, 4 nodes each)
    └── aisle-infra — login/visu/mgmt + storage + network
  • ~600 nodes — compute, GPU, login, visu, mgmt, storage, switches
  • Rack types: rack-42u-air (infra) + rack-42u-dcw (GPU — PMC + 2×HMC)
  • Slurm: enabled · partitions: cpu, gpu
  • Simulator: incident_mode: medium
  • Checks: + IB port state, switch port errors, PDU current

hpc-cluster

A production-grade HPC cluster with water-cooled compute aisles.

dc-paris (1 site)
├── machine-room-a (Compute Floor)
│   ├── aisle-compute-std — 10 racks DCW (1U 3-node chassis)
│   ├── aisle-compute-hm — 5 racks DCW (1U 2-node high-memory)
│   ├── aisle-gpu — 5 racks DCW (4U quad-GPU chassis)
│   └── aisle-storage — storage + network racks (air)
└── machine-room-b (Services)
    └── aisle-services — login, visu, mgmt
Category                            Racks   Nodes
Standard compute (1U 3-node DCW)    10      ~400
High-memory (1U 2-node DCW)         5       ~200
GPU (4U quad DCW)                   5       ~200
Services + storage                  5       ~60
  • Rack types: DCW dominant (rack-42u-dcw with PMC + 2×HMC)
  • Slurm: enabled · partitions: cpu, gpu, hm, visu
  • Simulator: incident_mode: medium
  • Checks: + HMC temperature/flow/leak, PMC power

exascale

A large-scale datacenter across three sites — Paris, Toulouse, Berlin.

site-paris (Paris, 48°N 2°E)
├── room-compute-a (4 aisles)
├── room-compute-b (2 aisles)
├── room-gpu-hm (2 aisles)
└── room-infra (services+network)

site-toulouse (Toulouse, 44°N 1°E)
├── room-compute-a (3 aisles)
├── room-gpu-hm (2 aisles)
└── room-infra (services+network)

site-berlin (Berlin, 52°N 13°E)
├── room-compute-a (2 aisles)
└── room-infra (services)
Category                        Total nodes
Compute (1U 3-node DCW)         ~4 000
GPU (4U quad DCW)               ~800
High-memory (1U 2-node DCW)     ~520
Services (login, visu, mgmt)    ~150
Total                           ~14 000
  • World Map: lat/lon on all 3 sites
  • Slurm: enabled · partitions: cpu, gpu, hm, visu, bigmem, transfer
  • Simulator: incident_mode: heavy
  • Note: the planner needs ~30s after startup to process all nodes
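The World Map feature relies on each site carrying coordinates in the topology files. A hedged sketch of what such a site entry could look like — the field names are assumptions; only the three sites and their approximate coordinates come from the diagram above:

```yaml
# Hypothetical topology/sites.yaml entry — schema is an assumption;
# coordinates follow the site descriptions above.
sites:
  - name: site-paris
    lat: 48.0
    lon: 2.0
  - name: site-toulouse
    lat: 44.0
    lon: 1.0
  - name: site-berlin
    lat: 52.0
    lon: 13.0
```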
Performance

With 14 000 nodes, the exascale example requires max_ids_per_query: 300 (default) and a few extra seconds for the initial planner snapshot. All racks will be green once the first full cycle completes.
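For reference, the tuning knob mentioned above could be expressed as follows — where max_ids_per_query lives in the configuration is an assumption; only the key name and its default of 300 come from the paragraph above:

```yaml
# Hypothetical placement — the surrounding "planner" section is an
# assumption; max_ids_per_query: 300 is the documented default.
planner:
  max_ids_per_query: 300   # keep at the default for ~14 000 nodes
```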


Rack types

Both rack templates include IB + ETH switches in each rack slot.

Template       Cooling        Rear components
rack-42u-air   Air            PDU ×2 (sides) + FAN module + PSU module
rack-42u-dcw   Direct water   PDU ×2 (sides) + PMC power module + HMC module ×2
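Rack templates are defined under each example's templates/ directory (see "Config structure" below). A minimal sketch of what the water-cooled template might contain — field names are assumptions; the component list follows the table above:

```yaml
# Hypothetical rack template (templates/rack-42u-dcw.yaml) — schema
# is an assumption, components come from the table above.
name: rack-42u-dcw
height_u: 42
cooling: direct-water
rear_components:
  - pdu          # ×2, mounted on the sides
  - pdu
  - pmc          # power module
  - hmc          # hydraulic module ×2
  - hmc
slots:
  - ib-switch    # both templates include IB + ETH switches
  - eth-switch
```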

Config structure

Each example is self-contained:

config/examples/{name}/
├── topology/         ← sites, rooms, aisles, racks, devices
├── templates/        ← device + rack + rack_component templates
├── checks/library/   ← PromQL health checks
├── metrics/library/  ← metric definitions (temp, power, PDU…)
└── plugins/
    ├── simulator/config/plugin.yaml   ← incident mode, catalogs
    └── slurm/config.yml               ← Slurm labels, status map

The corresponding config/app.example.{name}.yaml points all paths to config/examples/{name}/.
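The simulator's incident behavior is the main thing that varies between examples. A hedged sketch of the relevant part of plugins/simulator/config/plugin.yaml — only the incident_mode key and its light/medium/heavy values appear in the examples above; everything else in the file is elided:

```yaml
# plugins/simulator/config/plugin.yaml — partial, hypothetical sketch.
# incident_mode per example: homelab=light, small-cluster=medium,
# hpc-cluster=medium, exascale=heavy.
incident_mode: medium   # light | medium | heavy
```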


Non-regression tests

A complete test suite validates all examples:

# Run full suite (~25 min)
python3 scripts/validate_examples.py all

# Run single example
python3 scripts/validate_examples.py hpc-cluster

Each example runs two loops:

  1. Normal mode — baseline validation (topology, metrics, rack states)
  2. Incident mode — injects 10 CRIT, 10 WARN, and 1 rack-level CRIT incident to validate the check engine

See config/examples/TEST_PLAN.md for the full test specification.