Monitoring

Health check

GET https://<server>:7701/health

Returns:

{ "status": "ok", "db": "ok", "version": "0.5.2" }

Use it for load-balancer health checks, Kubernetes readiness probes, and uptime monitoring.
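For an external uptime check, a probe along these lines may be enough. This is a minimal sketch, not part of Tyr: the function names `is_healthy` and `probe` are hypothetical, and treating any value other than "ok" in `status` or `db` as unhealthy is an assumption, since only the all-ok response is documented above.

```python
import json
import urllib.request


def is_healthy(body: str) -> bool:
    """Return True only if both "status" and "db" report "ok".

    Assumption: any other value (or a missing field) means unhealthy.
    """
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    if not isinstance(payload, dict):
        return False
    return payload.get("status") == "ok" and payload.get("db") == "ok"


def probe(url: str, timeout_s: float = 5.0) -> bool:
    """Fetch the health endpoint and evaluate the response body."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            body = resp.read().decode("utf-8")
            return resp.status == 200 and is_healthy(body)
    except (OSError, ValueError):
        # Connection errors, timeouts, HTTP errors, and undecodable
        # bodies all count as "not healthy".
        return False
```

Call `probe("https://<server>:7701/health")` with your own host name; a `False` result covers both an unreachable server and a reachable one that reports a failing component.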

Agent liveness

Agents send a heartbeat every 30 s (heartbeat_interval_s). An agent is considered stale after ~90 s, roughly three heartbeat intervals.

Query pattern (lists agents not seen for more than 2 minutes, which is past the stale threshold):

SELECT agent_id, name, now() - last_seen AS staleness
FROM agents
WHERE last_seen < now() - interval '2 minutes'
ORDER BY staleness DESC;

Alert on:

Any agent daemon that is unexpectedly stale for more than 5 minutes.
An agent daemon that has been online for more than 7 days with zero events (likely broken eBPF).
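The thresholds above can be combined into a small classifier for use in an alerting script. This is a sketch under stated assumptions: the names `classify` and `is_silent` and the `online_since` and `event_count` inputs are hypothetical, and you would populate them from your own queries against the agents and events tables.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(seconds=90)   # ~3 missed 30 s heartbeats
ALERT_AFTER = timedelta(minutes=5)    # alert threshold from the list above
SILENT_AFTER = timedelta(days=7)      # online this long with zero events


def classify(last_seen: datetime, now: datetime) -> str:
    """Map an agent's last heartbeat time to "live", "stale", or "alert"."""
    staleness = now - last_seen
    if staleness > ALERT_AFTER:
        return "alert"
    if staleness > STALE_AFTER:
        return "stale"
    return "live"


def is_silent(online_since: datetime, event_count: int, now: datetime) -> bool:
    """True if the agent has been online for over 7 days with no events.

    `online_since` is a hypothetical input, not a documented column.
    """
    return event_count == 0 and (now - online_since) > SILENT_AFTER
```

Passing `now` explicitly, for example `datetime.now(timezone.utc)`, keeps the functions deterministic and easy to test.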

Event rate

SELECT date_trunc('minute', ts) AS minute, count(*) AS events, verdict
FROM events
WHERE ts > now() - interval '1 hour'
GROUP BY 1, 3 ORDER BY 1 DESC;

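A sudden spike in deny verdicts means a policy is firing. The query above aggregates across all agents, so add agent_id to the SELECT and GROUP BY to see which agent is responsible, then investigate.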
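What counts as a "sudden spike" depends on the deployment. One simple heuristic, shown here purely as an illustration, flags a minute whose deny count is both above a floor and several times the median of the minutes before it. The function name `find_spikes` and the `factor` and `min_count` defaults are hypothetical, not Tyr settings.

```python
from statistics import median


def find_spikes(counts, factor=5.0, min_count=10):
    """Return the minute labels whose deny count looks like a spike.

    counts: list of (minute, deny_count) pairs ordered oldest first.
            The event-rate query sorts newest first, so reverse its rows.
    A minute is flagged when its count is at least `min_count` and
    exceeds `factor` times the median of all earlier minutes.
    """
    spikes = []
    for i, (minute, count) in enumerate(counts):
        history = [c for _, c in counts[:i]]
        if not history:
            continue  # no baseline yet for the first minute
        baseline = median(history)
        if count >= min_count and count > factor * baseline:
            spikes.append(minute)
    return spikes
```

Minutes with no events produce no row in the query result, so the baseline here only reflects minutes in which at least one event was recorded.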

Drift

SELECT agent_id, count(*) AS drifts
FROM events
WHERE drift_detected = true
  AND ts > now() - interval '1 day'
GROUP BY 1 ORDER BY 2 DESC;

Any non-zero count deserves attention. Sustained drift from one agent suggests tampering or a badly deployed policy.

PostgreSQL

Standard Postgres monitoring applies. Key things to watch:

Connection count (pg_stat_activity)
Table sizes — the events table is the fastest grower
Index bloat on the events(ts desc) index

If you don’t need long retention, consider a cron job that runs DELETE FROM events WHERE ts < now() - interval '30 days'. Partitioning by day is planned for high-volume deployments.

Logs

Server: structured tracing to stderr. Ship via journald → your log pipeline.

RUST_LOG=tyr_server=info,sqlx=warn,tower_http=info tyr-server

Agents: same structured tracing to stderr. Read per-host logs with journalctl -u tyr-agent.

Metrics (roadmap)

A Prometheus /metrics endpoint on the server is planned. Key metrics will include:

tyr_events_total{verdict,kind}
tyr_agents_online
tyr_drift_detected_total
tyr_policy_reload_duration_seconds
tyr_grpc_requests_total{status}

For now, derive these numbers from the SQL queries above.

SSE for live dashboards

Building custom dashboards? Subscribe to the live event stream:

curl -sS -N -H "Authorization: Bearer $JWT" \
  https://tyr.example.com:7701/api/v1/events/stream

Each event arrives as a line of the form data: {json event}. Pipe the stream into Grafana Live, a Slack bot, etc.
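A consumer only needs to pick out the data: lines and decode the JSON that follows. The sketch below assumes one complete JSON object per data: line, as described above; the function name `iter_events` is hypothetical, and the fields inside each event are whatever the server sends, which this page does not specify.

```python
import json
from typing import Iterable, Iterator


def iter_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one decoded event per "data:" line of an SSE stream.

    Blank separator lines and any other SSE lines (comments, other
    fields) are skipped. Assumes each event fits on a single line.
    """
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if not payload:
            continue
        yield json.loads(payload)
```

To try it against the live stream, pipe the curl command above into a script that calls `iter_events(sys.stdin)` and forwards each event to your dashboard or bot.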
