Health checks
A probe tells Rune whether an instance is doing its job. Failed probes mean the reconciler kills and replaces the instance. Get this right.
Probe types
Section titled “Probe types”| Type | What it does |
|---|---|
http | GETs a path. Healthy if 2xx/3xx within timeout. |
tcp | Opens a TCP connection. Healthy if the connect succeeds. |
Process-level health (PID alive) is implicit — that’s what the runner watches. Probes are for application-level health.
HTTP probe
Section titled “HTTP probe”health: liveness: type: http path: /healthz port: 8080 scheme: http # or https headers: X-Health-Check: rune initialDelaySeconds: 5 intervalSeconds: 10 timeoutSeconds: 2 failureThreshold: 3 successThreshold: 1Tunables:
| Field | Default | Notes |
|---|---|---|
initialDelaySeconds | 0 | Wait this long after container start before first probe. |
intervalSeconds | 10 | Time between probes. |
timeoutSeconds | 1 | Per-probe timeout. |
failureThreshold | 3 | Consecutive failures before marking unhealthy. |
successThreshold | 1 | Consecutive successes before marking healthy after a failure. |
TCP probe
Section titled “TCP probe”health: liveness: type: tcp port: 5432 intervalSeconds: 5 timeoutSeconds: 2Use TCP when there’s no HTTP endpoint to hit — databases, message queues, custom protocols.
Designing a good probe
Section titled “Designing a good probe”- Probe the dependency, not the framework. A
/healthzthat always returns 200 tells you nothing. A/healthzthat pings the DB and returns 503 if it can’t is useful. - Be cheap. Probes run every 10s × N instances. Don’t hit a slow query.
- Don’t probe upstream services. If your probe checks “can I reach the auth API?” then a brief auth blip kills your whole service. Keep probe scope local.
- Set
initialDelaySecondsgenerously for slow-starting services — JVM, Rails, anything with warm-up. failureThreshold * intervalSecondsis your tolerance window. Default is 30s before kill — usually right.
Inspecting probe state
Section titled “Inspecting probe state”rune health api --checksrune health instance api-instance-abc123 --checksOutput includes the most recent probe result, last failure message, and consecutive counts.
When probes lie
Section titled “When probes lie”The two failure modes:
- False negatives — your service is fine but probes fail. Cause: tight timeout, expensive endpoint, network blip. The reconciler kills working instances. Fix: relax timeouts, simplify the endpoint.
- False positives — your service is broken but probes pass. Cause: probe doesn’t actually exercise the failing path. Fix: have
/healthztouch the things that actually matter (DB connection, cache, queue).
Liveness vs. readiness
Section titled “Liveness vs. readiness”Today Rune has a single probe per service (liveness). Readiness — “ready to serve traffic but not yet probed-as-healthy” — is on the roadmap as part of multi-node ingress (RUNE-063).
For now, your liveness probe is your readiness probe. Treat it as both.