Load balancing

rustunnel supports group-based load balancing for HTTP and TCP tunnels. Multiple clients can register against the same subdomain (HTTP) or share the same TCP port (TCP), and inbound connections are dispatched at random across healthy members of the group. Optional client-side health probes automatically remove sick backends from the rotation. The design is modeled on FRP’s loadBalancer.group / healthCheck config — same shape, slightly different wire format.

Concepts

  • Group — a logical pool of tunnel members sharing the same subdomain (HTTP) or TCP port. Identified by a user-supplied group name plus a shared group_key. The server stores only the SHA-256 hash of the key and uses it to authorise joins; the raw key never leaves the client.
  • Member — one tunnel inside a group. Running two clients with the same (group, group_key) produces a 2-member pool.
  • Health bit — every member has a healthy flag. Dispatch routes around members whose flag is false. Without a health check configured, members are permanently healthy (the server trusts the client’s presence).
  • Dispatch — for each new public connection, the server picks one healthy member uniformly at random. There’s no weighting and no sticky sessions today.
                          +--> client A --> backend on :3000
public --> group "web" ---+
                          +--> client B --> backend on :3001

Configuration

Server (server.toml)

A single flag is the kill switch:
[load_balancing]
enabled = true
When false (the default), the server accepts the new fields on the wire but ignores them — every registration is a solo tunnel. When true, members sharing (subdomain, group_key_hash) (HTTP) or (group_name, group_key_hash) (TCP) form a real pool.
The switch is per-region in self-hosted multi-region deployments. Flip it on one region at a time during a rollout — false is the safe default that preserves single-tunnel-per-key behaviour.

Client (~/.rustunnel/config.yml)

Add group, group_key, and optionally health_check to a tunnel definition:
server: tunnel.example.com:4040
auth_token: "your-token"

tunnels:
  a:
    proto: http
    local_port: 3000
    subdomain: pool
    group: web
    group_key: shared-secret-for-this-pool
    health_check:
      type: tcp
      interval_secs: 10
      timeout_secs: 3
      max_failed: 3
| Field | Required | Default | Meaning |
|---|---|---|---|
| `group` | yes (for LB) | | Display name of the pool. The first joiner sets TunnelGroup.name; later joiners are accepted regardless of what they pass. |
| `group_key` | yes (for LB) | | Shared secret. SHA-256-hashed before transmission. Members of one pool MUST agree on this value; the server rejects a join with a mismatched key. |
| `health_check.type` | no | | `tcp` (open a connection) or `http` (issue a GET). Omit to disable probing. |
| `health_check.path` | yes when `type: http` | | Path to GET against the local service. |
| `health_check.interval_secs` | no | 10 | Probe period. |
| `health_check.timeout_secs` | no | 3 | Per-probe deadline. |
| `health_check.max_failed` | no | 3 | Consecutive failures before reporting TunnelUnhealthy. |
| `health_check.expect_2xx` | no | true | When false, any HTTP response counts as healthy. |
| `health_check.alert_webhook` | no | | Per-tenant URL the server POSTs to when this group transitions to 0 healthy members. See Webhook alerts below. |

Behaviour rules

  • Protocol: members must declare the same protocol (http vs https); a mismatch is rejected with a clear error. The subdomain is the routing key — every member of a group shares one subdomain.
  • TCP ports: the first member of a (group, group_key) allocates a port from the configured tcp_port_range. Subsequent members reuse that port; the server returns the same assigned_port to all joiners. No member after the first sees a Registered listener event — the listener is already bound.
  • Ownership: registering a solo (no-group) tunnel against an existing group’s subdomain is rejected with subdomain '...' is already in use. Registering a grouped tunnel against an existing solo tunnel is rejected with group key does not match. A subdomain is owned by exactly one identity at a time.
  • Cleanup: the group entry is removed when its last member disconnects, and the TCP port (if any) is returned to the pool. New registrations after that point start a fresh group with a fresh port.
  • Atomicity: the create / join / remove paths are serialised atomically via the routing-table entry API, so two concurrent first registrations produce one group, not two (see the sketch below).
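A minimal sketch of that entry-style create-or-join, assuming a mutex-guarded map; the server's real routing table and types will differ:
use std::collections::hash_map::Entry;
use std::collections::HashMap;
use std::sync::Mutex;

type MemberId = u64;

struct TunnelGroup {
    name: String,       // set by the first joiner
    key_hash: [u8; 32], // SHA-256 of group_key
    members: Vec<MemberId>,
}

struct RoutingTable {
    // Keyed by subdomain (HTTP) or group name (TCP).
    groups: Mutex<HashMap<String, TunnelGroup>>,
}

impl RoutingTable {
    fn register(&self, key: &str, name: &str, key_hash: [u8; 32], member: MemberId) -> Result<(), String> {
        // One lock plus one entry lookup serialises create vs join:
        // two concurrent first registrations yield exactly one group.
        let mut groups = self.groups.lock().unwrap();
        match groups.entry(key.to_owned()) {
            Entry::Occupied(mut e) => {
                if e.get().key_hash != key_hash {
                    return Err("group key does not match".to_owned());
                }
                e.get_mut().members.push(member); // join an existing pool
                Ok(())
            }
            Entry::Vacant(v) => {
                // First joiner creates the group and sets its name.
                v.insert(TunnelGroup { name: name.to_owned(), key_hash, members: vec![member] });
                Ok(())
            }
        }
    }
}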

Health checks

Probes run on the client against local_addr. The server never opens a connection to the upstream itself — it just trusts the client’s TunnelHealthy / TunnelUnhealthy reports.
  • TCP probe: opens a TCP connection. Success = connect within timeout_secs.
  • HTTP probe: sends GET <path> HTTP/1.0 and reads the status line. Success = response within timeout_secs and (when expect_2xx) status in [200, 300).
Probe state is reported only on edges:
1. First probe success — emits TunnelHealthy, lifting the initial healthy=false state for members that opted into probing.
2. `max_failed` consecutive failures — emits TunnelUnhealthy; the server flips the healthy bit to false and excludes the member from dispatch.
3. First success after a failure streak — emits TunnelHealthy; the server resets the consecutive-failure counter and re-includes the member.
A member with no health_check is permanently healthy. A member with a spec starts unhealthy and only joins dispatch after the first successful probe.
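A minimal sketch of a client-side TCP probe loop implementing these edge rules, assuming tokio; report_healthy / report_unhealthy are hypothetical stand-ins for the client's frame-sending machinery:
use std::time::Duration;
use tokio::{net::TcpStream, time};

async fn report_healthy() { /* send a TunnelHealthy frame (stub) */ }
async fn report_unhealthy() { /* send a TunnelUnhealthy frame (stub) */ }

async fn probe_loop(local_addr: String, interval_secs: u64, timeout_secs: u64, max_failed: u32) {
    let mut failures = 0u32;
    let mut healthy = false; // members with a probe spec start unhealthy
    let mut ticker = time::interval(Duration::from_secs(interval_secs));
    loop {
        ticker.tick().await;
        // TCP probe: success = connect within timeout_secs.
        let ok = time::timeout(
            Duration::from_secs(timeout_secs),
            TcpStream::connect(local_addr.as_str()),
        )
        .await
        .map(|res| res.is_ok())
        .unwrap_or(false);
        if ok {
            failures = 0;
            if !healthy {
                healthy = true;
                report_healthy().await; // edge: first success, or first after a streak
            }
        } else {
            failures += 1;
            if healthy && failures >= max_failed {
                healthy = false;
                report_unhealthy().await; // edge: max_failed consecutive failures
            }
        }
    }
}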

Webhook alerts

When a load-balancing group transitions to 0 healthy members, public traffic to that subdomain or port starts returning 502. The server can POST a JSON alert to one or more URLs at the moment of that transition so an operator or tenant can react. There are two distinct destinations, each addressing a different audience:

Operator URL — [load_balancing] alert_webhook_url in server.toml

Set on the edge. Fires for every group on that edge that goes 0/N, regardless of which tenant owns the group. Useful for self-hosted deployments and for ops awareness on a managed multi-tenant edge.
[load_balancing]
enabled = true
alert_webhook_url = "https://hooks.slack.com/services/operator-channel/..."

Per-tenant URL — health_check.alert_webhook in the client config

Set on the client. Fires only when the group containing this tunnel goes 0/N. Each tenant points it at their Slack / PagerDuty / email gateway. The URL is sent on the wire as part of HealthCheckSpec and stored on the affected GroupMember; only the server holds it (the URL is never returned by /api/groups — dashboards see a presence-only flag).
tunnels:
  a:
    proto: http
    local_port: 3000
    subdomain: pool
    group: web
    group_key: shared-secret-for-this-pool
    health_check:
      type: tcp
      alert_webhook: "https://hooks.slack.com/services/my-team/..."
Both destinations can be configured independently. Both fire on the same 0/N transition. The server collects unique URLs from the affected group’s members (so two members of one tenant pointing at the same URL receive a single POST per transition, not two), then fans out to each unique URL plus the operator URL.

Payload

Same JSON body sent to every destination:
{
  "event": "group_zero_healthy",
  "region_id": "eu",
  "protocol": "http",
  "label": "pool",
  "group_name": "web",
  "key_hash_short": "deadbeef",
  "member_count": 2,
  "at": "2026-05-06T13:24:55+00:00"
}
key_hash_short is the first 8 hex chars of the group’s SHA-256 key hash — stable across reconnects, useful for correlating alerts when a single team runs multiple pools with the same group_name.
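For consumers in Rust, a struct matching this body could look like the following (a sketch, assuming serde and serde_json; field names mirror the payload above):
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct GroupZeroHealthyAlert {
    event: String,          // always "group_zero_healthy"
    region_id: String,
    protocol: String,       // "http" or "tcp"
    label: String,          // subdomain (HTTP) or port label (TCP)
    group_name: String,
    key_hash_short: String, // first 8 hex chars of the key hash
    member_count: u32,
    at: String,             // RFC 3339 timestamp
}

fn parse_alert(body: &str) -> serde_json::Result<GroupZeroHealthyAlert> {
    serde_json::from_str(body)
}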

Debounce

The server tracks a per-group zero_healthy_alerted flag. Once an alert fires, subsequent TunnelUnhealthy frames against the same already-down group do not re-fire. The flag resets the moment any member becomes healthy again — the next 0/N transition then fires fresh. In practice: if your pool flaps badly (down → up → down → up), each downward edge generates one alert per destination. Steady-state “everyone is still down” generates none.
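A sketch of that debounce state machine; field and method names are illustrative, not the server's actual types:
struct GroupHealth {
    healthy_members: usize,
    zero_healthy_alerted: bool,
}

impl GroupHealth {
    /// Returns true when the 0/N alert should fire (once per downward edge).
    fn on_member_unhealthy(&mut self) -> bool {
        self.healthy_members = self.healthy_members.saturating_sub(1);
        if self.healthy_members == 0 && !self.zero_healthy_alerted {
            self.zero_healthy_alerted = true;
            return true; // first downward edge: fire one alert per destination
        }
        false // group already down: no re-fire
    }

    fn on_member_healthy(&mut self) {
        self.healthy_members += 1;
        self.zero_healthy_alerted = false; // next 0/N transition fires fresh
    }
}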

Delivery

Best-effort. The server uses a 5-second per-request timeout, no retry, no queue. If your webhook receiver is down at the moment of the transition, the alert is lost. For high-stakes paging, point the URL at something durable — a queueing alertmanager, or a service like Pushover with retry — rather than relying on the rustunnel server for delivery guarantees.
The fire happens in a detached tokio::spawn, so a slow webhook receiver never blocks the server’s frame-handling hot path.
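A sketch of that fire-and-forget path, assuming reqwest with the json feature enabled (the real sender may differ):
use std::time::Duration;

fn fire_webhook(client: reqwest::Client, url: String, body: serde_json::Value) {
    // Detached task: a slow receiver never blocks the frame-handling hot path.
    tokio::spawn(async move {
        // Best-effort: 5-second deadline, no retry, errors dropped on the floor.
        let _ = client
            .post(&url)
            .timeout(Duration::from_secs(5))
            .json(&body)
            .send()
            .await;
    });
}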

Testing the feature locally

Quick end-to-end smoke test against a self-hosted edge with [load_balancing] enabled = true. Spin up two clients with the same (group, group_key), point them at separate local backends, and hammer the public URL — both backends should serve.
1. Build the client from source:
git clone https://github.com/joaoh82/rustunnel
cd rustunnel
cargo build --release -p rustunnel-client
2. Drop a config that opts into a group:
cat > /tmp/lb-test.yml <<'EOF'
server: tunnel.example.com:4040
auth_token: "your-token"

tunnels:
  a:
    proto: http
    local_port: 3000
    subdomain: lbtest
    group: web
    group_key: shared-secret-for-lb-test
    health_check:
      type: tcp
EOF
3. Start backend A on :3000:
python3 -m http.server 3000
4. Start client A pointing at backend A:
./target/release/rustunnel start --config /tmp/lb-test.yml
5. Start backend B on :3001, in a separate terminal:
python3 -m http.server 3001
6. Start client B with `local_port: 3001`: either edit /tmp/lb-test.yml and run a second rustunnel start, or use a second config file with the same group / group_key and local_port: 3001, as shown below.
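A second config could look like this (the file name is arbitrary; only local_port differs from /tmp/lb-test.yml):
cat > /tmp/lb-test-b.yml <<'EOF'
server: tunnel.example.com:4040
auth_token: "your-token"

tunnels:
  a:
    proto: http
    local_port: 3001
    subdomain: lbtest
    group: web
    group_key: shared-secret-for-lb-test
    health_check:
      type: tcp
EOF
./target/release/rustunnel start --config /tmp/lb-test-b.yml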
7. Hammer the public URL:
for i in $(seq 1 50); do
  curl -fsS https://lbtest.tunnel.example.com/ -o /dev/null -w "%{http_code}\n"
done
Both backends should see roughly half of the requests in their access logs.
8. Verify via the metrics endpoint:
ssh root@tunnel.example.com 'curl -sf http://127.0.0.1:9090/metrics' \
  | grep '^rustunnel_group_'
Expect output like:
rustunnel_group_members{group="web",region="eu",healthy="true"} 2
rustunnel_group_members{group="web",region="eu",healthy="false"} 0
rustunnel_group_dispatches_total{group="web",region="eu"} 50
rustunnel_group_health_failures_total{group="web",region="eu",kind="tcp"} 0
9. Validate failover: kill one of the local backends. The probe loop on that client marks it unhealthy after roughly max_failed × interval_secs seconds (about 30 s with the defaults of 3 × 10 s); subsequent requests all land on the survivor. Restart the backend — the probe re-registers it as healthy and dispatch distributes across both again.

Observability

When [load_balancing] enabled = true, the Prometheus exporter on :9090 emits three additional series:
| Metric | Type | Labels | What it measures |
|---|---|---|---|
| `rustunnel_group_members` | gauge | group, region, healthy | Count of registered members partitioned by their health bit. |
| `rustunnel_group_dispatches_total` | counter | group, region | Total dispatched connections, summed across the group’s members. Per-group rather than per-member to keep label cardinality bounded. |
| `rustunnel_group_health_failures_total` | counter | group, region, kind | Total TunnelUnhealthy frames received across the group’s members. kind is tcp / http / none based on the probe type. |
The pre-existing rustunnel_active_tunnels_* and rustunnel_requests_total gauges/counters keep counting members (not groups) so historical dashboards stay accurate.
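For example, a Prometheus alerting expression for the 0/N condition could be built on the members gauge (a sketch, not a shipped rule):
# Fires when a group still has registered members but none of them healthy.
sum by (group, region) (rustunnel_group_members{healthy="true"}) == 0
  and
sum by (group, region) (rustunnel_group_members) > 0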

Per-tunnel timeline + live event stream

Two REST surfaces let dashboards reconstruct recent health behaviour without polling all of /api/tunnels:
  • GET /api/tunnels/:id/health-events — last 50 health-state transitions for that tunnel ({ at, healthy, reason }[], oldest first). Records edges only — steady-state probe reports are not stored. Use this to render a per-tunnel timeline panel.
  • GET /api/groups/:protocol/:label/events — Server-Sent Events stream emitting one group_event per member health-bit transition affecting the named group. 30s keep-alive ping. Use this for live dashboards that want push instead of polling. A lagged SSE event means the consumer fell behind — resync via /api/groups.
Both endpoints are gated by the same auth as /api/tunnels (admin token or DB token), and they apply the same per-tenant scope: a user-scoped DB token sees only groups containing at least one of its own members; aggregate counters reflect just the visible members; groups the caller can’t see return 404 rather than 403. Admin tokens see everything.
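For example (the Bearer header is an assumption; substitute whatever auth scheme your deployment already uses for /api/tunnels):
# Last 50 health-state transitions for one tunnel
curl -sf -H "Authorization: Bearer $TOKEN" \
  "https://tunnel.example.com/api/tunnels/<id>/health-events"

# Live SSE stream for the HTTP group labelled "pool" (-N disables buffering)
curl -sfN -H "Authorization: Bearer $TOKEN" \
  "https://tunnel.example.com/api/groups/http/pool/events"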

Limitations & non-goals

rustunnel’s load balancing is intentionally minimal. If you need any of the features below, layer them at the application or DNS level.
  • No weighted dispatch — random uniform only.
  • No sticky sessions — every new connection is dispatched independently. Long-lived WebSocket connections that need affinity must handle reconnects at the application layer.
  • No active session draining on member removal — in-flight connections finish naturally; new connections route elsewhere.
  • No UDP groups — UDP is connectionless; there’s no obvious unit to dispatch.
  • No P2P groups — P2P publishers are 1-to-many by design.
  • No cross-region pools — members must be on the same edge server. Layer DNS-based routing on top for global LB.
  • No group_key rotation — once a group exists, rotating its key requires dropping all members.

See also

Client guide

CLI flags, config file, and the full set of tunnel modes.

Architecture

How HTTP / TCP / UDP / P2P tunnels flow through the system.