Monitoring
Health check
GET https://<server>:7701/health
Returns:
    { "status": "ok", "db": "ok", "version": "0.5.2" }
Use for load-balancer health checks, Kubernetes readiness probes, and uptime monitoring.
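For uptime checks that need more than an HTTP status code, the response body can be inspected directly. The following is a minimal sketch, not part of the product: the function name is_healthy is hypothetical, and it assumes that any value other than "ok" in the status or db field means the server is unhealthy, which the documentation above does not spell out.

```python
import json


def is_healthy(body: str) -> bool:
    """Return True only if both "status" and "db" are "ok".

    Assumption: any other value (or a missing field) means unhealthy.
    """
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        # A body that is not valid JSON is treated as a failed check.
        return False
    if not isinstance(payload, dict):
        return False
    return payload.get("status") == "ok" and payload.get("db") == "ok"
```

Pass it the raw body fetched by whatever HTTP client your monitoring already uses; the version field is ignored on purpose so that upgrades do not trip the check.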
Agent liveness
Agents heartbeat every 30 s (heartbeat_interval_s). Stale threshold is ~90 s.
Query pattern:
    SELECT agent_id, name, now() - last_seen AS staleness
    FROM agents
    WHERE last_seen < now() - interval '2 minutes'
    ORDER BY staleness DESC;
Alert on:
- Any daemon stale for > 5 minutes unexpectedly.
- A daemon that’s been online >7 days with zero events (likely broken eBPF).
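The thresholds above can be turned into a small classifier for rows returned by the staleness query. This is a sketch under assumptions: the function name classify and the three labels are hypothetical, and the 90-second value is the approximate stale threshold quoted above rather than an exact server constant.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(seconds=90)  # approximate stale threshold from the text
ALERT_AFTER = timedelta(minutes=5)   # alert threshold from the list above


def classify(last_seen: datetime, now: datetime) -> str:
    """Label an agent "online", "stale", or "alert" from its last heartbeat."""
    staleness = now - last_seen
    if staleness > ALERT_AFTER:
        return "alert"
    if staleness > STALE_AFTER:
        return "stale"
    return "online"
```

Passing now explicitly, rather than reading the clock inside the function, keeps the check deterministic and easy to test. Whether a stale agent is "unexpected" (as opposed to a host that was shut down on purpose) still needs context the function does not have.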
Event rate
    SELECT date_trunc('minute', ts) AS minute, count(*) AS events, verdict
    FROM events
    WHERE ts > now() - interval '1 hour'
    GROUP BY 1, 3
    ORDER BY 1 DESC;
Sudden spikes of deny from a specific agent mean a policy is firing; investigate. To attribute a spike to one agent, add agent_id to the SELECT list and the GROUP BY.
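One way to flag a "sudden spike" automatically is to compare the newest minute against the median of the preceding minutes. This is only an illustrative heuristic: the function name is_spike, the factor of 3, and the floor of 10 events are all assumptions to tune for your own traffic, not values taken from the product.

```python
from statistics import median


def is_spike(counts: list[int], factor: float = 3.0, min_events: int = 10) -> bool:
    """Return True if the newest per-minute count stands out from its history.

    counts holds per-minute deny counts, oldest first; the last item is
    the newest minute. factor and min_events are arbitrary defaults.
    """
    if len(counts) < 2:
        return False  # no history to compare against
    *history, latest = counts
    baseline = median(history)
    # max(baseline, 1) avoids treating any activity after silence as a spike
    # unless it also clears the min_events floor.
    return latest >= min_events and latest > factor * max(baseline, 1)
```

Feed it the deny counts for a single agent, reversed into oldest-first order, since the query returns the newest minute first.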
Drift
    SELECT agent_id, count(*) AS drifts
    FROM events
    WHERE drift_detected = true AND ts > now() - interval '1 day'
    GROUP BY 1
    ORDER BY 2 DESC;
Any non-zero count deserves attention. Sustained drift from one agent suggests tampering or a badly deployed policy.
PostgreSQL
Standard Postgres monitoring applies. Key queries:
- Connection count (pg_stat_activity)
- Table sizes: the events table is the fastest grower
- Index bloat on the events(ts desc) index
If you don't need long retention, consider a cron job that runs DELETE FROM events WHERE ts < now() - interval '30 days'. Partitioning by day is planned for high-volume deployments.
Logs
Server: structured tracing to stderr. Ship via journald → your log pipeline.
    RUST_LOG=tyr_server=info,sqlx=warn,tower_http=info tyr-server
Agents: same. Per-host logs via journalctl -u tyr-agent.
Metrics (roadmap)
Prometheus /metrics endpoint on the server is planned. Key metrics will include:
- tyr_events_total{verdict,kind}
- tyr_agents_online
- tyr_drift_detected_total
- tyr_policy_reload_duration_seconds
- tyr_grpc_requests_total{status}
For now, derive from SQL.
SSE for live dashboards
Building custom dashboards? Subscribe to the live event stream:
    curl -sS -N -H "Authorization: Bearer $JWT" \
      https://tyr.example.com:7701/api/v1/events/stream
Each line is data: {json event}. Use it to pipe into Grafana Live, a Slack bot, etc.
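A consumer of the stream only needs to pick out the data: lines and decode the JSON that follows. The sketch below assumes nothing about the event schema beyond "a JSON object per data: line"; the function name parse_sse_events is hypothetical, and skipping malformed lines silently is a design choice you may want to replace with logging.

```python
import json
from typing import Iterable, Iterator


def parse_sse_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per well-formed "data: {json}" line."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.startswith("data:"):
            continue  # blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if not payload:
            continue
        try:
            event = json.loads(payload)
        except json.JSONDecodeError:
            continue  # skip malformed payloads rather than abort the stream
        if isinstance(event, dict):
            yield event
```

Any iterable of text lines works as input, so the same function can read from sys.stdin when the curl output is piped into a script, or from a streaming HTTP client's line iterator.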