Heartbeat & Monitoring
Patterns · 5 min
The Problem
When an agent hangs or burns through tokens uncontrollably, you usually only notice when the bill arrives. You need monitoring.
Heartbeat Pattern
A regular signal that tells the system: "I'm still alive."
// Heartbeat loop (pseudocode)
every 60 seconds:
    status = agent.healthCheck()
    metrics.publish('agent_heartbeat', {
        status: status,
        timestamp: now(),
        uptime: uptime(),
        tokens_used: tokens.total()
    })
    if status != 'healthy':
        alert.oncall('Agent unhealthy', status)
Metrics to Capture
Agent Metrics
- Requests per minute
- Average latency
- Token usage (input/output)
- Error rate
- Queue length
System Metrics
- CPU / RAM usage
- GPU utilization (for local models)
- Disk I/O
- Network traffic
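The heartbeat loop above can be sketched in plain Python. The `agent`, `metrics`, and `alert` objects and their methods are illustrative assumptions, not a specific library; in production the loop would publish to whatever metrics backend you run.

```python
import time

def heartbeat_loop(agent, metrics, alert, interval=60, iterations=None):
    """Publish a heartbeat every `interval` seconds and page on-call
    when the agent reports an unhealthy status.

    `agent`, `metrics`, and `alert` are hypothetical interfaces:
    agent.health_check() -> str, agent.tokens_total() -> int,
    metrics.publish(name, payload), alert.oncall(message, status).
    Set `iterations` to a number for testing; None runs forever.
    """
    start = time.monotonic()
    count = 0
    while iterations is None or count < iterations:
        status = agent.health_check()
        metrics.publish("agent_heartbeat", {
            "status": status,
            "timestamp": time.time(),
            "uptime": time.monotonic() - start,   # seconds since loop start
            "tokens_used": agent.tokens_total(),
        })
        if status != "healthy":
            alert.oncall("Agent unhealthy", status)
        count += 1
        if iterations is None or count < iterations:
            time.sleep(interval)
```

Keeping the loop in its own thread or sidecar process matters: if the heartbeat runs inside the agent's main loop, a hung agent also silences its own heartbeat, which defeats the purpose.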
Alerting Rules
# Prometheus alert rules
groups:
  - name: agent-alerts
    rules:
      - alert: AgentDown
        expr: up{job="ai-agent"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Agent {{ $labels.instance }} is down"
      - alert: HighTokenUsage
        # increase() gives tokens per hour; rate() would be per second
        expr: increase(token_usage_total[1h]) > 1000000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High token usage detected"
Tools
- Grafana + Prometheus: Standard for metrics
- Uptime Kuma: Simple health check dashboard
- n8n Webhook Monitor: Built-in error tracking
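The HighTokenUsage rule assumes the agent exposes a `token_usage_total` counter for Prometheus to scrape. Normally an official Prometheus client library handles this; as a dependency-free illustration of what the scraped payload looks like, the exposition format can be rendered by hand (function name and help text are assumptions):

```python
def render_metrics(token_usage_total: int) -> str:
    """Render a minimal Prometheus text-exposition payload containing
    the counter that the HighTokenUsage alert rule queries.
    Sketch only; a real agent should use a Prometheus client library.
    """
    lines = [
        "# HELP token_usage_total Total tokens consumed by the agent.",
        "# TYPE token_usage_total counter",
        f"token_usage_total {token_usage_total}",
    ]
    return "\n".join(lines) + "\n"
```

Serving this text at `/metrics` (plus a scrape job named `ai-agent` in the Prometheus config) is all the alert rules above need.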
Practice Tip
Start with simple health endpoint checks (HTTP 200 OK). Only once that works should you extend to detailed metrics. Everything else is over-engineering.
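Such a health endpoint fits in a few lines of standard-library Python. The `/health` path and port are illustrative choices, not a convention the tools above require:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    """Answer GET /health with 200 OK so uptime checkers stay green."""

    def do_GET(self):
        if self.path == "/health":
            body = b"OK"
            self.send_response(200)  # status code is all a checker needs
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, fmt, *args):
        pass  # keep the periodic checks out of the logs

def serve(port=8080):
    HTTPServer(("", port), HealthHandler).serve_forever()
```

Point Uptime Kuma (or any HTTP monitor) at this endpoint first; detailed metrics can come later.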
Next step: move from knowledge to implementation
If you want more than theory: setups, workflows, and templates from real operations, for teams that want local, documented AI systems.
Why AI Engineering
- Local and self-hosted by default
- Documented and auditable
- Built from our own runtime
- Made in Austria
Not legal advice.