Heartbeat & Monitoring
Patterns · 5 min
The Problem
When an agent hangs or burns through tokens uncontrollably, you usually only notice when the bill arrives. You need monitoring.
Heartbeat Pattern
A regular signal that tells the system: "I'm still alive."
// Heartbeat loop (pseudocode)
every 60 seconds:
    status = agent.healthCheck()
    metrics.publish('agent_heartbeat', {
        status: status,
        timestamp: now(),
        uptime: uptime(),
        tokens_used: tokens.total()
    })
    if status != 'healthy':
        alert.oncall('Agent unhealthy', status)
Metrics to Capture
Agent Metrics
- Requests per minute
- Average latency
- Token usage (input/output)
- Error rate
- Queue length
System Metrics
- CPU / RAM usage
- GPU utilization (for local models)
- Disk I/O
- Network traffic
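The heartbeat loop above can be sketched in plain Python. The `agent`, `metrics`, and `alert` objects and their methods are illustrative assumptions, not a specific library; in production the loop would publish to whatever metrics backend you run.

```python
import time

def heartbeat_loop(agent, metrics, alert, interval=60, iterations=None):
    """Publish a heartbeat every `interval` seconds and page on-call
    when the agent reports an unhealthy status.

    `agent`, `metrics`, and `alert` are hypothetical interfaces:
    agent.health_check() -> str, agent.tokens_total() -> int,
    metrics.publish(name, payload), alert.oncall(message, status).
    Set `iterations` to a number for testing; None runs forever.
    """
    start = time.monotonic()
    count = 0
    while iterations is None or count < iterations:
        status = agent.health_check()
        metrics.publish("agent_heartbeat", {
            "status": status,
            "timestamp": time.time(),
            "uptime": time.monotonic() - start,   # seconds since loop start
            "tokens_used": agent.tokens_total(),
        })
        if status != "healthy":
            alert.oncall("Agent unhealthy", status)
        count += 1
        if iterations is None or count < iterations:
            time.sleep(interval)
```

Keeping the loop in its own thread or sidecar process matters: if the heartbeat runs inside the agent's main loop, a hung agent also silences its own heartbeat, which defeats the purpose.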
Alerting Rules
# Prometheus alert rules
groups:
  - name: agent-alerts
    rules:
      - alert: AgentDown
        expr: up{job="ai-agent"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Agent {{ $labels.instance }} is down"
      - alert: HighTokenUsage
        # increase() gives tokens per hour; rate() would be per second
        expr: increase(token_usage_total[1h]) > 1000000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High token usage detected"
Tools
- Grafana + Prometheus: Standard for metrics
- Uptime Kuma: Simple health check dashboard
- n8n Webhook Monitor: Built-in error tracking
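The HighTokenUsage rule assumes the agent exposes a `token_usage_total` counter for Prometheus to scrape. Normally an official Prometheus client library handles this; as a dependency-free illustration of what the scraped payload looks like, the exposition format can be rendered by hand (function name and help text are assumptions):

```python
def render_metrics(token_usage_total: int) -> str:
    """Render a minimal Prometheus text-exposition payload containing
    the counter that the HighTokenUsage alert rule queries.
    Sketch only; a real agent should use a Prometheus client library.
    """
    lines = [
        "# HELP token_usage_total Total tokens consumed by the agent.",
        "# TYPE token_usage_total counter",
        f"token_usage_total {token_usage_total}",
    ]
    return "\n".join(lines) + "\n"
```

Serving this text at `/metrics` (plus a scrape job named `ai-agent` in the Prometheus config) is all the alert rules above need.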
Practice Tip
Start with simple health endpoint checks (HTTP 200 OK). Only once that works should you extend to detailed metrics. Everything else is over-engineering.
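Such a health endpoint fits in a few lines of standard-library Python. The `/health` path and port are illustrative choices, not a convention the tools above require:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    """Answer GET /health with 200 OK so uptime checkers stay green."""

    def do_GET(self):
        if self.path == "/health":
            body = b"OK"
            self.send_response(200)  # status code is all a checker needs
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, fmt, *args):
        pass  # keep the periodic checks out of the logs

def serve(port=8080):
    HTTPServer(("", port), HealthHandler).serve_forever()
```

Point Uptime Kuma (or any HTTP monitor) at this endpoint first; detailed metrics can come later.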
Next step: move from knowledge to implementation
If you want more than theory: setups, workflows, and templates from real operations, for teams that want local, documented AI systems.
Why AI Engineering
- Local and self-hosted by default
- Documented and auditable
- Built from our own runtime
- Made in Austria
Not legal advice.