Telemetry API

Telemetry endpoints expose live health states, active alerts, and aggregated metrics derived from Prometheus queries. These endpoints are read-only — they do not modify topology or configuration.

Configuration dependency

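While the backend configuration is still loading, the health-state endpoints return UNKNOWN states, empty collections, or zeroed counts rather than errors. This allows the frontend to render a degraded-but-functional view during startup. The topology endpoints GET /api/rooms/{room_id}/layout and GET /api/racks/{rack_id} are the exception: they return 503 until topology is loaded.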
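
On the client side, UNKNOWN can be handled as an ordinary display state instead of a failure. The following is a minimal sketch of that idea; the function name `display_class` and the class names are hypothetical and not part of the API.

```python
# Hypothetical mapping from documented health states to UI class names.
STATE_CLASSES = {
    "OK": "state-ok",
    "WARN": "state-warn",
    "CRIT": "state-crit",
    "UNKNOWN": "state-unknown",
}


def display_class(payload: dict) -> str:
    """Pick a UI class for a state payload.

    A missing or unrecognized "state" value is rendered like UNKNOWN,
    so a backend that is still starting up never breaks the view.
    """
    state = payload.get("state", "UNKNOWN")
    return STATE_CLASSES.get(state, STATE_CLASSES["UNKNOWN"])
```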


Stats

GET /api/stats/global

Returns a summary of the global infrastructure health state. Counts are derived from rack-level health aggregation via the TelemetryPlanner.

curl http://localhost:8000/api/stats/global

Response

{
  "total_rooms": 4,
  "total_racks": 48,
  "active_alerts": 14,
  "crit_count": 3,
  "warn_count": 11,
  "status": "CRIT"
}

| Field | Type | Description |
|---|---|---|
| total_rooms | integer | Number of rooms in the first site |
| total_racks | integer | Total rack count across all sites and rooms |
| active_alerts | integer | crit_count + warn_count |
| crit_count | integer | Number of racks in CRIT state |
| warn_count | integer | Number of racks in WARN state |
| status | string | Worst state across all racks: OK, WARN, or CRIT |

Error cases

  • Returns {"total_rooms": 0, "total_racks": 0, "active_alerts": 0, "crit_count": 0, "warn_count": 0, "status": "OK"} when topology is not loaded.
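
The relationship between the count fields and status can be illustrated with a short sketch. This is not the TelemetryPlanner implementation; `summarize_racks` is a hypothetical helper that only reproduces the documented field semantics from a list of rack states.

```python
def summarize_racks(rack_states):
    """Derive the count fields and overall status from per-rack states.

    `rack_states` is a list of strings such as "OK", "WARN", "CRIT".
    An empty list yields zero counts and status "OK", matching the
    documented response when topology is not loaded.
    """
    crit = sum(1 for state in rack_states if state == "CRIT")
    warn = sum(1 for state in rack_states if state == "WARN")
    # Worst state wins: CRIT over WARN over OK.
    if crit:
        status = "CRIT"
    elif warn:
        status = "WARN"
    else:
        status = "OK"
    return {
        "total_racks": len(rack_states),
        "active_alerts": crit + warn,
        "crit_count": crit,
        "warn_count": warn,
        "status": status,
    }
```

With 3 CRIT racks, 11 WARN racks, and 34 OK racks, this produces the counts shown in the example response above.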

GET /api/stats/prometheus

Returns Prometheus client latency statistics and heartbeat timing. Useful for diagnosing connectivity issues and measuring query performance.

curl http://localhost:8000/api/stats/prometheus

Response

{
  "last_ms": 42.7,
  "avg_ms": 38.1,
  "last_ts": 1709295600000,
  "heartbeat_seconds": 60,
  "next_ts": 1709295660000
}

| Field | Type | Description |
|---|---|---|
| last_ms | float or null | Latency of the most recent query in milliseconds |
| avg_ms | float or null | Rolling average latency over the last 20 queries |
| last_ts | integer or null | Unix timestamp (ms) of the last successful query |
| heartbeat_seconds | integer | Configured heartbeat interval from app.yaml |
| next_ts | integer or null | Projected timestamp of the next scheduled heartbeat |

Returns {"last_ms": null, "avg_ms": null, "last_ts": null} if no queries have been made yet.
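
A monitoring script could use these fields to detect a stalled heartbeat. The sketch below assumes that next_ts equals last_ts plus one heartbeat interval, which matches the example response but is not stated by the API. The helper names and the two-interval grace period are illustrative choices.

```python
def expected_next_ts(stats):
    """Project the next heartbeat (ms), or None if no query has run yet.

    Assumption: next heartbeat = last_ts + heartbeat_seconds * 1000.
    """
    last_ts = stats.get("last_ts")
    heartbeat = stats.get("heartbeat_seconds")
    if last_ts is None or heartbeat is None:
        return None
    return last_ts + heartbeat * 1000


def is_stale(stats, now_ms, grace_intervals=2):
    """True when no successful query was seen within the grace period."""
    last_ts = stats.get("last_ts")
    heartbeat = stats.get("heartbeat_seconds")
    if last_ts is None or heartbeat is None:
        return True  # nothing recorded yet, treat as not connected
    return now_ms - last_ts > grace_intervals * heartbeat * 1000
```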


GET /api/stats/telemetry

Returns detailed telemetry planner statistics including cache performance and in-flight query tracking. Intended for operators debugging query load and cache behavior.

curl http://localhost:8000/api/stats/telemetry

Response

{
  "query_count": 1247,
  "cache_hits": 1189,
  "cache_misses": 58,
  "in_flight": 0,
  "last_batch": {
    "total_ids": 128,
    "query_count": 4,
    "max_ids_per_query": 50,
    "ts": 1709295600000
  },
  "last_ms": 41.3,
  "avg_ms": 37.9
}

| Field | Type | Description |
|---|---|---|
| query_count | integer | Total Prometheus queries issued since startup |
| cache_hits | integer | Number of responses served from cache |
| cache_misses | integer | Number of cache misses triggering a real query |
| in_flight | integer | Number of queries currently awaiting a Prometheus response |
| last_batch | object or null | Metadata from the most recent planner batch run |
| last_batch.total_ids | integer | Number of node/rack IDs included in the batch |
| last_batch.query_count | integer | Number of PromQL queries generated for the batch |
| last_batch.max_ids_per_query | integer | Configured ID limit per query (planner.max_ids_per_query) |
| last_batch.ts | float | Unix timestamp (ms) when the batch ran |
| last_ms | float or null | Latency of the last query in milliseconds |
| avg_ms | float or null | Rolling average latency over the last 20 queries |
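
When debugging cache behavior, the hit ratio is usually the first number to look at. The helper below is a small sketch that derives it from cache_hits and cache_misses; `cache_hit_ratio` is a hypothetical name, not an API field.

```python
def cache_hit_ratio(stats):
    """Return cache hits as a fraction of all lookups, or None if there were none."""
    hits = stats.get("cache_hits", 0)
    misses = stats.get("cache_misses", 0)
    total = hits + misses
    if total == 0:
        return None  # avoid dividing by zero right after startup
    return hits / total
```

For the example response above, the ratio is 1189 / 1247, or roughly 95%.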

Alerts

GET /api/alerts/active

Returns all active WARN and CRIT alerts enriched with full topology context (site, room, rack, device). Combines both node-level and rack-level alert sources from the TelemetryPlanner snapshot.

curl http://localhost:8000/api/alerts/active

Response

{
  "alerts": [
    {
      "type": "node",
      "node_id": "compute042",
      "state": "CRIT",
      "checks": [
        { "id": "ipmi_temp_crit", "severity": "CRIT" }
      ],
      "site_id": "dc1",
      "site_name": "Primary DC",
      "room_id": "dc1-r001",
      "room_name": "Server Room A",
      "rack_id": "a01-r03",
      "rack_name": "Rack A01-R03",
      "device_id": "blade-chassis-01",
      "device_name": "Blade Chassis 01"
    },
    {
      "type": "rack",
      "rack_id": "a02-r07",
      "state": "WARN",
      "checks": [
        { "id": "pdu_current_warn", "severity": "WARN" }
      ],
      "site_id": "dc1",
      "site_name": "Primary DC",
      "room_id": "dc1-r001",
      "room_name": "Server Room A",
      "rack_name": "Rack A02-R07"
    }
  ]
}

Alert object fields

| Field | Type | Present on | Description |
|---|---|---|---|
| type | string | all | "node" or "rack" |
| node_id | string | node only | Prometheus instance name |
| rack_id | string | all | Rack identifier. For node alerts, this is the parent rack. |
| state | string | all | WARN or CRIT |
| checks | array | all | Failed checks with their severities |
| checks[].id | string | all | Check identifier (e.g., ipmi_temp_crit) |
| checks[].severity | string | all | WARN or CRIT |
| site_id | string | all | Parent site identifier |
| site_name | string | all | Parent site display name |
| room_id | string | all | Parent room identifier |
| room_name | string | all | Parent room display name |
| rack_name | string | all | Rack display name |
| device_id | string | node only | Parent device identifier |
| device_name | string | node only | Parent device display name |

Returns {"alerts": []} when topology or checks are not loaded.
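
Because node and rack alerts share most fields but identify their subject differently, client code typically branches on type. The sketch below groups CRIT alerts by room; the helper names are hypothetical and the logic relies only on the fields documented above.

```python
from collections import defaultdict


def alert_subject(alert):
    """Return the ID of what the alert is about: a node or a rack."""
    if alert["type"] == "node":
        return alert["node_id"]
    return alert["rack_id"]


def crit_subjects_by_room(payload):
    """Map room_id to the IDs of nodes and racks currently in CRIT."""
    grouped = defaultdict(list)
    for alert in payload.get("alerts", []):
        if alert["state"] == "CRIT":
            grouped[alert["room_id"]].append(alert_subject(alert))
    return dict(grouped)
```

An empty alerts list, as returned while topology or checks are not loaded, yields an empty mapping.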


Rooms

GET /api/rooms

Returns all rooms across all sites with basic metadata and aisle/rack counts. This is the primary endpoint for the room list view.

curl http://localhost:8000/api/rooms

Response

[
  {
    "id": "dc1-r001",
    "name": "Server Room A",
    "site_id": "dc1",
    "site_name": "Primary DC",
    "aisle_count": 3,
    "rack_count": 36,
    "standalone_rack_count": 2
  }
]

Returns [] when topology is not loaded.


GET /api/rooms/{room_id}/layout

Returns the full room object including aisle definitions, rack references, and optional floor plan metadata (grid layout, compass orientation, door markers).

curl http://localhost:8000/api/rooms/dc1-r001/layout

Response

The response is a full Room Pydantic model. Key fields:

{
  "id": "dc1-r001",
  "name": "Server Room A",
  "aisles": [
    {
      "id": "a01",
      "name": "Aisle 01",
      "racks": [
        { "id": "a01-r01", "name": "Rack A01-R01", "u_height": 42 }
      ]
    }
  ],
  "standalone_racks": []
}

Error cases

  • 404 if the room ID does not exist in the loaded topology.
  • 503 if topology is not loaded.
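
Unlike the health-state endpoints, this endpoint reports problems through HTTP status codes, so clients need to distinguish "does not exist" from "not ready yet". The following sketch uses only the Python standard library; `fetch_topology`, its retry count, and its delay are illustrative choices, not recommendations from the API.

```python
import json
import time
import urllib.error
import urllib.request


def fetch_topology(url, opener=urllib.request.urlopen, retries=3, delay=1.0):
    """Fetch a topology resource such as a room layout.

    Returns the parsed JSON body, or None when the server answers 404.
    A 503 (topology not loaded) is retried, up to `retries` attempts in total.
    `opener` is injectable so the function can be exercised without a server.
    """
    for attempt in range(1, retries + 1):
        try:
            with opener(url) as response:
                return json.load(response)
        except urllib.error.HTTPError as err:
            if err.code == 404:
                return None  # unknown ID in the loaded topology
            if err.code == 503 and attempt < retries:
                time.sleep(delay)  # topology still loading, try again
                continue
            raise  # out of retries, or an unexpected status
```

A call such as fetch_topology("http://localhost:8000/api/rooms/dc1-r001/layout") would then return the room object, None for an unknown room, or raise after repeated 503 responses.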

GET /api/rooms/{room_id}/state

Returns the aggregated health state for a room, with a per-rack breakdown including node counts. This is the primary endpoint used by the room floor plan view to color-code racks.

curl http://localhost:8000/api/rooms/dc1-r001/state

Response

{
  "room_id": "dc1-r001",
  "state": "CRIT",
  "racks": {
    "a01-r01": {
      "state": "OK",
      "node_total": 16,
      "node_crit": 0,
      "node_warn": 0
    },
    "a01-r02": {
      "state": "WARN",
      "node_total": 16,
      "node_crit": 0,
      "node_warn": 3
    },
    "a01-r03": {
      "state": "CRIT",
      "node_total": 16,
      "node_crit": 1,
      "node_warn": 2
    }
  }
}

| Field | Type | Description |
|---|---|---|
| room_id | string | The requested room identifier |
| state | string | Worst rack state in the room: OK, WARN, CRIT, or UNKNOWN |
| racks | object | Map of rack ID to per-rack summary |
| racks[id].state | string | Rack health state |
| racks[id].node_total | integer | Total number of Prometheus instances in the rack |
| racks[id].node_crit | integer | Number of instances in CRIT state |
| racks[id].node_warn | integer | Number of instances in WARN state |

Returns {"room_id": "...", "state": "UNKNOWN", "racks": {}} when topology or planner is not loaded.
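
A floor plan view might want a count per state and a list of racks to highlight first. The sketch below works on the response shape documented above; the helper names and the ordering (CRIT before WARN, then by rack ID) are illustrative assumptions.

```python
from collections import Counter

# Lower number sorts first. Only non-OK states are listed.
ATTENTION_ORDER = {"CRIT": 0, "WARN": 1}


def rack_state_counts(room_state):
    """Count racks per health state, e.g. for a legend next to the floor plan."""
    racks = room_state.get("racks", {})
    return Counter(rack["state"] for rack in racks.values())


def racks_needing_attention(room_state):
    """Return IDs of CRIT and WARN racks, most severe first."""
    racks = room_state.get("racks", {})
    flagged = [
        (ATTENTION_ORDER[rack["state"]], rack_id)
        for rack_id, rack in racks.items()
        if rack["state"] in ATTENTION_ORDER
    ]
    return [rack_id for _, rack_id in sorted(flagged)]
```

Both helpers return empty results for the UNKNOWN payload with an empty racks map.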


Racks

GET /api/racks/{rack_id}

Returns the full rack object with all devices, as defined in topology. Does not include live health or metric data.

curl http://localhost:8000/api/racks/a01-r01

Response

The response is a full Rack Pydantic model. Key fields:

{
  "id": "a01-r01",
  "name": "Rack A01-R01",
  "u_height": 42,
  "template_id": "standard-42u",
  "devices": [
    {
      "id": "blade-chassis-01",
      "name": "Blade Chassis 01",
      "template_id": "bullsequana-x440-quad",
      "u_position": 1,
      "instance": ["compute001", "compute002", "compute003", "compute004"]
    }
  ]
}

Error cases

  • 404 if the rack ID does not exist.
  • 503 if topology is not loaded.

GET /api/racks/{rack_id}/state

The primary rack telemetry endpoint. Returns the aggregated rack health state, per-node states, check results, and — optionally — live metric values.

Performance

Use include_metrics=false (the default) for all list and grid views. Only request include_metrics=true on detail views where metric values are actually displayed.

  • Without metrics: ~30–40 ms (health states from planner snapshot only)
  • With metrics: ~743 ms (20+ additional Prometheus queries for temperature, power, and component metrics)

# Fast: health states only (default)
curl http://localhost:8000/api/racks/a01-r01/state

# Full: health + metrics
curl "http://localhost:8000/api/racks/a01-r01/state?include_metrics=true"

Query parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| include_metrics | boolean | false | When true, fetches temperature, power, and component metrics from Prometheus |
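
One way to follow the performance guidance is to make the cheap request the default in client code and require callers to opt in to metrics. A minimal sketch follows; `rack_state_url` and `BASE_URL` are hypothetical names, and the host is the one used in the curl examples.

```python
from urllib.parse import quote, urlencode

BASE_URL = "http://localhost:8000"  # host from the curl examples


def rack_state_url(rack_id, detail_view=False):
    """Build the rack state URL, requesting metrics only for detail views."""
    url = f"{BASE_URL}/api/racks/{quote(rack_id, safe='')}/state"
    if detail_view:
        # Opt in explicitly: this triggers the extra Prometheus queries.
        url += "?" + urlencode({"include_metrics": "true"})
    return url
```

List and grid views call rack_state_url(rack_id) and get the fast path. Only a detail page passes detail_view=True.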

Response (without metrics)

{
  "rack_id": "a01-r01",
  "state": "WARN",
  "checks": [
    { "id": "pdu_current_warn", "severity": "WARN" }
  ],
  "alerts": [
    { "id": "pdu_current_warn", "severity": "WARN" }
  ],
  "metrics": {
    "temperature": 0,
    "power": 0
  },
  "infra_metrics": {
    "components": {}
  },
  "nodes": {
    "compute001": {
      "state": "OK",
      "temperature": 0,
      "power": 0,
      "checks": [],
      "alerts": []
    },
    "compute002": {
      "state": "WARN",
      "temperature": 0,
      "power": 0,
      "checks": [{ "id": "ipmi_temp_warn", "severity": "WARN" }],
      "alerts": [{ "id": "ipmi_temp_warn", "severity": "WARN" }]
    }
  }
}

Response (with include_metrics=true)

When metrics are included, the metrics, infra_metrics, and nodes fields are populated with live values from Prometheus:

{
  "rack_id": "a01-r01",
  "state": "WARN",
  "checks": [
    { "id": "pdu_current_warn", "severity": "WARN" }
  ],
  "alerts": [
    { "id": "pdu_current_warn", "severity": "WARN" }
  ],
  "metrics": {
    "temperature": 43.7,
    "power": 3240.0
  },
  "infra_metrics": {
    "components": {
      "pdu-left": {
        "active_power": 1620.0,
        "current": 7.3
      },
      "pdu-right": {
        "active_power": 1620.0,
        "current": 7.3
      }
    }
  },
  "nodes": {
    "compute001": {
      "state": "OK",
      "temperature": 41.0,
      "power": 380.0,
      "checks": [],
      "alerts": []
    },
    "compute002": {
      "state": "WARN",
      "temperature": 67.5,
      "power": 410.0,
      "checks": [{ "id": "ipmi_temp_warn", "severity": "WARN" }],
      "alerts": [{ "id": "ipmi_temp_warn", "severity": "WARN" }]
    }
  }
}

Response fields

| Field | Type | Description |
|---|---|---|
| rack_id | string | The requested rack identifier |
| state | string | Aggregated rack state: OK, WARN, CRIT, or UNKNOWN |
| checks | array | Rack-level check results (from rack template checks) |
| alerts | array | Rack-level checks currently in a non-OK state |
| metrics.temperature | float | Average inlet/CPU temperature across all nodes (degrees C). 0 when not requested. |
| metrics.power | float | Total power draw across all nodes (watts). 0 when not requested. |
| infra_metrics.components | object | Per-component metrics keyed by component ID (PDUs, switches, etc.). Empty when not requested. |
| nodes | object | Per-instance health and metrics, keyed by Prometheus instance name |
| nodes[id].state | string | Instance health state |
| nodes[id].temperature | float | Instance temperature in degrees C. 0 when not requested. |
| nodes[id].power | float | Instance power in watts. 0 when not requested. |
| nodes[id].checks | array | All check results for this instance |
| nodes[id].alerts | array | Non-OK checks for this instance |

Returns {"rack_id": "...", "state": "UNKNOWN", "metrics": {}, "nodes": {}} when topology or planner is not loaded.
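
Since temperature and power are 0 when metrics were not requested, client code should avoid treating 0 as a real reading. The sketch below finds the hottest node in a rack state response and returns None when no usable readings are present; `hottest_node` is a hypothetical helper.

```python
def hottest_node(rack_state):
    """Return (instance_name, temperature) for the hottest node, or None.

    Temperatures of 0 are skipped because the API uses 0 as the
    placeholder when include_metrics was not set.
    """
    nodes = rack_state.get("nodes", {})
    readings = {
        name: node.get("temperature", 0)
        for name, node in nodes.items()
        if node.get("temperature", 0)  # drop 0 placeholders
    }
    if not readings:
        return None
    name = max(readings, key=readings.get)
    return name, readings[name]
```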


GET /api/devices/{rack_id}/{device_id}/metrics

Returns live metrics for a single device, querying only the instances belonging to that device. This is faster than loading full rack metrics when only one device needs to be refreshed.

curl http://localhost:8000/api/devices/a01-r01/blade-chassis-01/metrics

Response

{
  "device_id": "blade-chassis-01",
  "rack_id": "a01-r01",
  "metrics": {
    "compute001": {
      "node_temperature_celsius": 41.0,
      "node_power_watts": 380.0,
      "node_cpu_usage": 0.72
    },
    "compute002": {
      "node_temperature_celsius": 67.5,
      "node_power_watts": 410.0,
      "node_cpu_usage": 0.85
    }
  }
}

| Field | Type | Description |
|---|---|---|
| device_id | string | The requested device identifier |
| rack_id | string | The parent rack identifier |
| metrics | object | Map of instance ID to metric values. Metric names and presence depend on the device template's metrics list. |

Returns {"device_id": "...", "rack_id": "...", "metrics": {}} in all error cases:

  • Rack not found in topology
  • Device not found in the rack
  • Device template has no metrics defined
  • Topology or catalog not loaded
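
Because every error case produces the same empty metrics object, a client cannot tell the causes apart and should simply treat an empty map as "no data". The sketch below sums one metric across a device's instances. `device_metric_total` is a hypothetical helper, and the default metric name is taken from the example response; actual names depend on the device template.

```python
def device_metric_total(payload, metric="node_power_watts"):
    """Sum one metric across all instances of a device.

    Returns None when the metrics map is empty or no instance reports
    the metric, so callers can show "no data" instead of a misleading 0.
    """
    per_instance = payload.get("metrics", {})
    values = [m[metric] for m in per_instance.values() if metric in m]
    if not values:
        return None
    return sum(values)
```

For the example response above, the total for node_power_watts is 790.0 watts.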