AI EngineeringWiki

Heartbeat & Monitoring

Patterns · 5 min

The Problem

When an agent hangs or burns through tokens uncontrollably, you usually only notice when the bill arrives. Monitoring closes that gap.

Heartbeat Pattern

A regular signal that tells the system: "I'm still alive".

// Heartbeat Loop (pseudocode)
every 60 seconds:
  status = agent.healthCheck()
  metrics.publish('agent_heartbeat', {
    status: status,
    timestamp: now(),
    uptime: uptime(),
    tokens_used: tokens.total()
  })
  
  if status != 'healthy':
    alert.oncall('Agent unhealthy', status)
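The pseudocode above can be sketched in Python. `Agent`, `publish`, and `alert` are illustrative stand-ins for your real agent, metrics backend, and paging integration, not a real library API:

```python
import time

# Hypothetical agent interface -- swap in your real health check.
class Agent:
    def health_check(self) -> str:
        return "healthy"

def emit_heartbeat(agent, publish, alert, tokens_total, started_at):
    """Run one heartbeat iteration: publish status, page on-call if unhealthy."""
    status = agent.health_check()
    payload = {
        "status": status,
        "timestamp": time.time(),
        "uptime": time.time() - started_at,
        "tokens_used": tokens_total,
    }
    publish("agent_heartbeat", payload)
    if status != "healthy":
        alert("Agent unhealthy", status)
    return payload

def heartbeat_loop(agent, publish, alert, get_tokens, interval=60):
    """Call emit_heartbeat every `interval` seconds, forever."""
    started_at = time.time()
    while True:
        emit_heartbeat(agent, publish, alert, get_tokens(), started_at)
        time.sleep(interval)
```

Keeping the single iteration (`emit_heartbeat`) separate from the loop makes it easy to unit-test without sleeping.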

Metrics to Capture

Agent Metrics

  • Requests per minute
  • Average latency
  • Token usage (input/output)
  • Error rate
  • Queue length

System Metrics

  • CPU / RAM usage
  • GPU utilization (for local models)
  • Disk I/O
  • Network traffic
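Several of the agent metrics above are derived from raw counters. A minimal sketch of that derivation, with illustrative field names (not a specific metrics library):

```python
from dataclasses import dataclass

@dataclass
class AgentCounters:
    """Raw counters sampled over one window (names are illustrative)."""
    requests: int
    errors: int
    input_tokens: int
    output_tokens: int
    window_seconds: float

def derive_metrics(c: AgentCounters) -> dict:
    """Turn raw counters into the per-window metrics listed above."""
    minutes = c.window_seconds / 60
    return {
        "requests_per_minute": c.requests / minutes if minutes else 0.0,
        "error_rate": c.errors / c.requests if c.requests else 0.0,
        "tokens_total": c.input_tokens + c.output_tokens,
    }
```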

Alerting Rules

# Prometheus Alert Rules
groups:
- name: agent-alerts
  rules:
  - alert: AgentDown
    expr: up{job="ai-agent"} == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Agent {{ $labels.instance }} is down"
      
  - alert: HighTokenUsage
    # increase() counts tokens over the window; rate() would be per-second
    expr: increase(token_usage_total[1h]) > 1000000
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "High token usage detected"

Tools

  • Grafana + Prometheus: Standard for metrics
  • Uptime Kuma: Simple health check dashboard
  • n8n Webhook Monitor: Built-in error tracking

Practice Tip

Start with simple health-endpoint checks (HTTP 200 OK). Only once those work should you extend to detailed metrics; anything more is over-engineering at this stage.
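Such a health endpoint can be as small as a few lines of stdlib Python; the `/health` path and port are arbitrary choices, not a convention the article prescribes:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    """Answers GET /health with 200 OK; everything else gets 404."""

    def do_GET(self):
        if self.path == "/health":
            body = b"OK"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo quiet

def run(port: int = 8080):
    HTTPServer(("127.0.0.1", port), HealthHandler).serve_forever()
```

Uptime Kuma or any external checker can then poll `http://host:8080/health` and alert on anything other than a 200.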
