
User Guide Overview

Rackscope provides a set of physical views and management tools for monitoring data center and HPC infrastructure through Prometheus metrics.


The drill-down approach

Rackscope is organized around a drill-down model. The operator starts from a global overview (all sites, active alerts, aggregate health) and navigates toward progressively finer levels of detail: data center, room, aisle, rack, device, and finally the individual instance.

At each level, health states aggregate upward from child entities. A single failing node elevates its rack to CRIT, which propagates to the room level. This makes it immediately apparent where in the physical infrastructure an issue is located, without having to cross-reference multiple tools.

The sidebar provides access to all views, organized by domain:

  • Infrastructure: World map, room views, rack views, device views
  • Workload (Slurm plugin): Overview, Nodes, Partitions, Alerts, Wallboard
  • Editors: Topology, Rack, Templates, Checks, Settings
  • Simulator (demo plugin): Scenario control, overrides

Core Concepts

Health States

Every entity in the topology has a health state:

| State   | Meaning                     |
|---------|-----------------------------|
| OK      | All checks passing          |
| WARN    | At least one warning        |
| CRIT    | At least one critical issue |
| UNKNOWN | No data or check error      |

States aggregate upward: Node → Chassis → Rack → Room → Site. The worst state wins.
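
The "worst state wins" rule can be sketched as a maximum over an ordered set of states. This is a minimal illustration, not Rackscope's actual implementation: the `Health` enum and the `aggregate` function are hypothetical names, and the position of UNKNOWN in the severity order (here, between OK and WARN) is an assumption that the text above does not specify.

```python
from enum import IntEnum


class Health(IntEnum):
    # Assumed severity order; where UNKNOWN ranks is a guess, not documented behavior.
    OK = 0
    UNKNOWN = 1
    WARN = 2
    CRIT = 3


def aggregate(states):
    """Return the worst state among the children, or UNKNOWN if there are none."""
    return max(states, default=Health.UNKNOWN)


# One failing node makes its rack CRIT, and the rack makes its room CRIT.
rack = aggregate([Health.OK, Health.OK, Health.CRIT])
room = aggregate([rack, Health.WARN, Health.OK])
print(rack.name, room.name)
```

Using an integer-backed enum keeps the rule to a single `max` call; a different ranking for UNKNOWN only requires changing the numeric values.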

Physical Hierarchy

Site
└── Room
    └── Aisle
        └── Rack
            └── Device
                └── Instance (Prometheus node)

Each level shows the aggregated health of everything below it.
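
One way to picture aggregation across the whole hierarchy is a recursive walk over a tree of entities. The sketch below is illustrative only: the `Entity` class, its fields, the entity names, and the `SEVERITY` ranking (including where UNKNOWN sits) are assumptions made for the example and are not taken from Rackscope.

```python
from dataclasses import dataclass, field
from typing import List

# Assumed ranking; higher means worse. The position of UNKNOWN is a guess.
SEVERITY = {"OK": 0, "UNKNOWN": 1, "WARN": 2, "CRIT": 3}


@dataclass
class Entity:
    name: str
    state: str = "UNKNOWN"  # only meaningful for leaves (instances)
    children: List["Entity"] = field(default_factory=list)

    def health(self) -> str:
        """A leaf reports its own state; a parent reports the worst child state."""
        if not self.children:
            return self.state
        return max((c.health() for c in self.children), key=SEVERITY.__getitem__)


# Hypothetical topology: one site, one room, one aisle, two racks.
rack_a = Entity("rack-a", children=[
    Entity("device-1", children=[Entity("inst-1", "OK"), Entity("inst-2", "CRIT")]),
])
rack_b = Entity("rack-b", children=[
    Entity("device-2", children=[Entity("inst-3", "WARN")]),
])
site = Entity("site", children=[
    Entity("room", children=[Entity("aisle", children=[rack_a, rack_b])]),
])

for entity in (rack_a, rack_b, site):
    print(entity.name, entity.health())
```

In this example `rack-b` stays at WARN while the single CRIT instance under `rack-a` determines the state of every ancestor up to the site.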

Views Overview

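| View      | URL                              | Purpose                        |
|-----------|----------------------------------|--------------------------------|
| World Map | /views/worldmap                  | Site overview with geolocation |
| Room      | /views/room/:id                  | Floor plan with rack grid      |
| Rack      | /views/rack/:id                  | Front/rear rack views          |
| Device    | /views/device/:rackId/:deviceId  | Instance-level detail          |
| Cluster   | /views/cluster                   | Cluster overview               |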
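
The `:id`, `:rackId`, and `:deviceId` segments in the URL patterns are placeholders to be replaced with real identifiers. As a rough sketch of how a script might build deep links from these patterns, the helper below fills the placeholders and percent-encodes the values. The `ROUTES` dictionary keys, the `view_path` function, and the example identifiers are hypothetical; only the URL patterns themselves come from the table above.

```python
import re
from urllib.parse import quote

# URL patterns from the views table; the dictionary keys are made up for this sketch.
ROUTES = {
    "worldmap": "/views/worldmap",
    "room": "/views/room/:id",
    "rack": "/views/rack/:id",
    "device": "/views/device/:rackId/:deviceId",
    "cluster": "/views/cluster",
}


def view_path(view, **params):
    """Fill the :placeholders of a view's URL pattern with encoded values."""

    def fill(match):
        name = match.group(1)
        if name not in params:
            raise KeyError(f"missing route parameter: {name}")
        # Encode everything, including "/", so an ID cannot add path segments.
        return quote(str(params[name]), safe="")

    return re.sub(r":(\w+)", fill, ROUTES[view])


# Example identifiers are invented for illustration.
print(view_path("room", id="room-a"))
print(view_path("device", rackId="r01", deviceId="node-12"))
```

The resulting paths are relative; they would be appended to whatever base URL the Rackscope instance is served from.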