Agent Heartbeat Autonomy: Engineering Practices for Keeping AI Working in the Background
"Autonomous AI" sounds cool, but most implementations look like this: write a while loop, have the LLM run a task every few minutes, and hope nothing goes wrong.
That isn't autonomy. That's polling plus prayer.
Agent autonomy that runs reliably in production needs an engineered heartbeat mechanism. This article shares the approach I developed in my Agentic Engineering System (30+ microservices).
The heartbeat is a baseline capability of every Agent in the system; together with the five-layer memory architecture, it forms the skeleton that keeps an Agent running continuously.
What Is a Heartbeat Mechanism
The heartbeat is an old concept in distributed systems: a node periodically sends a signal to prove it is still alive.
In an Agent system, the heartbeat does more:
- Liveness proof: update the last_seen timestamp
- Routine checks: scan pending tasks and check system status
- Triggered actions: run low-priority background work
- Constraint enforcement: verify on every cycle that no constraints have been violated

A healthy heartbeat loop turns the Agent from "passively responding" into "actively operating".
Heartbeat File Structure
Each Agent's abilities/heartbeat.yaml defines its heartbeat behavior:

```yaml
# heartbeat.yaml
agent_id: my_project_agent
last_seen: 2026-03-16T10:00:00+08:00
status: active  # active | paused | degraded

schedule:
  interval: 30m
  timezone: Asia/Shanghai
  quiet_hours: "00:00-07:00"  # no execution late at night

tasks:
  - id: check_requirements
    desc: Scan tasks.yaml for requirements in in_progress and open status
    priority: high
  - id: update_daily_log
    desc: Append today's work record to logs/
    priority: medium
  - id: sync_knowledge
    desc: Check whether any knowledge files need updating
    priority: low

constraints:
  hard:
    - "Never change the slug of any published article"
    - "Never roll back tasks.yaml to an older version"
    - "npm run build must pass before deploying"
    - "Never delete any file under memory/"
  soft:
    - "Prioritize requirements in in_progress status"
    - "No more than 3 tasks per heartbeat cycle"
    - "When a decision is uncertain, record it in daily.md and wait for human confirmation"

recovery:
  on_error: log_and_continue  # one failing task doesn't abort the whole cycle
  max_retries: 2
  notify_on_failure: feishu  # send a Feishu notification on failure
```

This file is more than configuration; it is the Agent's behavioral constitution.
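As a sketch of how a cycle might honor the quiet_hours field before running any tasks (the helper and its parsing are my own, not part of the system):

```python
from datetime import time

def in_quiet_hours(now, quiet_hours):
    """Return True if `now` falls inside a window like "00:00-07:00".

    Also handles windows that wrap past midnight (e.g. "23:00-07:00").
    """
    start_s, end_s = quiet_hours.split("-")
    start = time(*map(int, start_s.split(":")))
    end = time(*map(int, end_s.split(":")))
    if start <= end:
        return start <= now < end
    return now >= start or now < end  # window wraps past midnight

# A heartbeat cycle would check this before executing its task list:
print(in_quiet_hours(time(3, 30), "00:00-07:00"))   # True: skip this cycle
print(in_quiet_hours(time(10, 0), "00:00-07:00"))   # False: proceed
```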
Trigger Methods: Three Options
Option 1: cron + Claude Code
The simplest implementation. In a terminal:

```
# Trigger every 30 minutes
/loop 30m "Run heartbeat tasks: check pending items, update logs, handle low-priority work"
```

Pros: zero infrastructure, usable immediately. Cons: depends on the terminal staying open; a sleeping machine interrupts it.
Good for: individual developers in day-to-day use, when 24/7 operation isn't required.
Option 2: System-Level Scheduled Tasks
On a server, use cron or Windows Task Scheduler:

```xml
<!-- macOS launchd (~/Library/LaunchAgents/com.a9.heartbeat.plist) -->
<key>StartInterval</key>
<integer>1800</integer> <!-- 30 minutes -->
```

Or call the Agent API directly:

```shell
# crontab -e
*/30 * * * * curl -s -X POST localhost:8000/api/agent/heartbeat
```

Pros: runs truly in the background; executes whenever the machine is online. Cons: requires a service to keep running.
Option 3: Agent as a Resident Service
The Agent implements the heartbeat loop internally, as part of a Flask service:

```python
# agent.py (simplified sketch)
import threading
import time

def heartbeat_loop():
    while True:
        try:
            run_heartbeat_tasks()
        except Exception as e:
            log_error(e)
            notify_feishu(f"Heartbeat error: {e}")
        time.sleep(1800)  # 30 minutes

# at service startup
threading.Thread(target=heartbeat_loop, daemon=True).start()
```

This is the approach I use now: the Agent manages its own heartbeat, with no dependency on an external scheduler.
Constraint Declarations: Keeping the Agent Inside Its Boundaries
This is the most easily overlooked part of the whole heartbeat mechanism, and the most important.
A real incident: during one cycle, my heartbeat Agent scanned tasks.yaml, decided one requirement description was "outdated", and updated it on its own initiative. But that description was a version note I had written by hand, not an error.
The Agent had no ill intent; it was "helping". It just helped in the wrong place.
The fix: turn every hard-won lesson into a constraint declaration.

```yaml
constraints:
  hard:
    - "Never change the slug of any published article"
    # Reason: a slug change 404s every external link and resets SEO
    - "Never roll back tasks.yaml"
    # Reason: learned the hard way; the Agent once "fixed" a hand-written version note
    - "Never delete any file under memory/"
    # Reason: once a memory file is deleted, its historical context is gone for good
    - "Never push without build verification"
    # Reason: vite build skips type checking, so runtime bugs only surface after deployment
```

Hard constraints are absolute prohibitions, no matter the circumstances. Soft constraints are preferences that may be overridden with good reason.
A principle for writing constraints: every constraint should carry a comment explaining why. Without the reason, the Agent has no basis for weighing trade-offs in ambiguous situations.
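Most constraints are enforced by the Agent reading them, but path-based rules can also get a code-level backstop. A minimal sketch of such a guard (function and list names are mine, not part of the system):

```python
from pathlib import Path

PROTECTED_DIRS = [Path("memory")]  # hypothetical protected roots

def deletion_allowed(target):
    """Refuse deletions under protected directories: a code-level backstop
    for the natural-language constraint "never delete files under memory/"."""
    p = Path(target).resolve()
    return not any(p.is_relative_to(d.resolve()) for d in PROTECTED_DIRS)

print(deletion_allowed("memory/daily.md"))           # False: blocked
print(deletion_allowed("logs/2026-03-16_daily.md"))  # True: allowed
```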
Task Priorities: Keeping the Heartbeat From Doing Too Much
A common mistake in first-draft heartbeat designs: the task list grows too long and tries to do everything.
The result: each heartbeat takes longer and longer, one failing task drags down all the others, the context window fills with "check" chatter, and the genuinely important work suffers.
Principles from practice:
- At most 3 tasks per heartbeat cycle
- high priority tasks: run every cycle
- medium priority tasks: run every other cycle
- low priority tasks: run once a week

This maps onto the schedule design in the heartbeat file:

```yaml
tasks:
  - id: check_critical_alerts
    priority: high
    always_run: true
  - id: update_daily_log
    priority: medium
    run_every: 2  # run on every 2nd heartbeat
  - id: knowledge_audit
    priority: low
    run_on_day: monday  # once a week, on Monday
```
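A sketch of how a scheduler might gate tasks with these fields (the helper is hypothetical; field names match the yaml above):

```python
def tasks_for_cycle(tasks, cycle_count, weekday):
    """Select which task ids run on this heartbeat cycle.

    `tasks` are dicts mirroring the yaml above; `cycle_count` is the
    number of completed heartbeats; `weekday` is e.g. "monday".
    """
    selected = []
    for task in tasks:
        if task.get("always_run"):
            selected.append(task["id"])
        elif "run_every" in task:
            if cycle_count % task["run_every"] == 0:
                selected.append(task["id"])
        elif "run_on_day" in task:
            if task["run_on_day"] == weekday:
                selected.append(task["id"])
        else:
            selected.append(task["id"])  # no gate: run every cycle
    return selected

tasks = [
    {"id": "check_critical_alerts", "priority": "high", "always_run": True},
    {"id": "update_daily_log", "priority": "medium", "run_every": 2},
    {"id": "knowledge_audit", "priority": "low", "run_on_day": "monday"},
]
print(tasks_for_cycle(tasks, cycle_count=4, weekday="tuesday"))
# → ['check_critical_alerts', 'update_daily_log']
```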
Failure Recovery: When the Heartbeat Breaks
A failing Agent heartbeat is a normal event, not an exceptional one. The key is fault isolation:

```yaml
recovery:
  on_task_error: log_and_continue
  # a single task failure is logged; the cycle moves on to the next task
  on_critical_error: pause_and_notify
  # a critical error (e.g. filesystem inaccessible) pauses the heartbeat and notifies
  max_consecutive_failures: 3
  # after 3 consecutive failures, downgrade to degraded status
  degraded_behavior:
    - "Run only check_critical_alerts"
    - "Send a Feishu notification on every failure"
    - "Wait for human confirmation before returning to active"
```

The degraded state is the key design choice: rather than stopping outright, the Agent keeps running at minimal capacity while signaling for intervention.
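The downgrade logic above fits in a few lines. A sketch (class and method names are mine; only the threshold and status names come from the config):

```python
class HeartbeatRecovery:
    """Track consecutive failures and downgrade to degraded status."""

    def __init__(self, max_consecutive_failures=3):
        self.max_failures = max_consecutive_failures
        self.consecutive_failures = 0
        self.status = "active"

    def record_success(self):
        self.consecutive_failures = 0  # any success resets the streak

    def record_failure(self):
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.max_failures:
            self.status = "degraded"  # minimal capacity, await a human

    def resume(self):
        """Human confirmed recovery: back to active."""
        self.consecutive_failures = 0
        self.status = "active"

rec = HeartbeatRecovery(max_consecutive_failures=3)
for _ in range(3):
    rec.record_failure()
print(rec.status)  # degraded
```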
Daily Logs: The Heartbeat's Memory Trace
After each heartbeat cycle, update logs/YYYY-MM-DD_daily.md:

```markdown
### 10:00 Heartbeat #42
- ✅ Requirements check: 3 in_progress, 0 blocked
- ✅ Log update: appended this entry
- ⏭ Knowledge audit: skipped (not Monday)

### 10:30 Heartbeat #43
- ✅ Requirements check: same as above
- ⚠️ Found: FR-ANC-030 is done but its status wasn't updated; marked as implemented
- ✅ Log update
```

This log serves two purposes:
- Debugging: when something breaks, you can trace back to what heartbeat #X did that time
- Handoff: when a human steps in, reading the latest log gives an immediate picture of system state
State Machine: The Heartbeat's Lifecycle
An Agent's operational state isn't a simple on/off switch; it's a state machine:

```
[init] → [active] → [paused]
           ↓ ↑
       [degraded]
           ↓
       [stopped]
```

- active: running normally, executing tasks per the schedule
- paused: deliberately suspended (for example during a major change); tasks don't run but the service stays up
- degraded: an anomaly was detected; running in degraded mode, waiting for human intervention (returns to active after confirmation)
- stopped: the service is stopped (unlike paused, it does not resume automatically)

The state lives in the status field of heartbeat.yaml; every heartbeat checks it to decide whether to execute tasks.
From Theory to Practice
My system currently runs heartbeats on 7 Agents at once, covering different domains and tasks.
A few lessons:
Don't chase 100% automation. Some decisions should wait for human confirmation; the heartbeat's soft constraints need a "when uncertain, record it and wait" principle.
Iterate on constraints continuously. Whenever the Agent does something it shouldn't, the first reaction isn't to scold the AI but to add that scenario to the constraint list.
Logs are the cheapest insurance. Even if a heartbeat only did three things, record them clearly. You'll thank yourself when debugging.
Pick a sensible heartbeat interval. Too short (5 minutes) produces piles of meaningless execution records; too long (24 hours) defeats the point of autonomy. 30 minutes is my current rhythm, and it covers most scenarios.
Autonomy isn't letting the AI run without limits; it's operating efficiently within explicit boundaries. The clearer the boundaries, the more reliable the Agent.
Series: Agent Architecture Explorations · Previous: A One-Person AI Company
What a Heartbeat Actually Is
The naive version of an autonomous agent is a while loop:
```python
while True:
    do_something()
    sleep(3600)
```
This works until it doesn’t. The agent has no way to declare constraints, handle failure gracefully, or communicate its state to anything outside the loop.
A production heartbeat is different. It’s a declarative specification of an agent’s autonomous behavior — what it does, when it does it, what it will never touch, and how it recovers when things go wrong.
The specification lives in a heartbeat.yaml file. The agent reads this file at the start of every cycle. This means you can change the agent’s behavior without restarting it — just edit the yaml and the next cycle picks up the new config.
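A minimal sketch of that pick-up-on-next-cycle behavior (shown with JSON to stay dependency-free; heartbeat.yaml would load the same way through a YAML parser):

```python
import json
import tempfile
from pathlib import Path

def load_spec(path):
    """Fresh read of the spec, called at the top of every cycle, so edits
    to the file take effect on the next cycle without a restart."""
    return json.loads(path.read_text(encoding="utf-8"))

spec_file = Path(tempfile.mkdtemp()) / "heartbeat.json"
spec_file.write_text(json.dumps({"schedule": {"interval_minutes": 30}}))
print(load_spec(spec_file)["schedule"]["interval_minutes"])  # 30

# Edit the file between cycles; the next load picks it up:
spec_file.write_text(json.dumps({"schedule": {"interval_minutes": 15}}))
print(load_spec(spec_file)["schedule"]["interval_minutes"])  # 15
```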
The heartbeat.yaml Structure
Here’s a representative example:
```yaml
agent: my_liaison_agent
version: "1.0"

schedule:
  mode: interval
  interval_minutes: 30
  active_hours: "07:00-23:00"

tasks:
  - id: check_feishu_inbox
    priority: high
    description: "Read unread messages, route to appropriate coordinators"
    timeout_minutes: 5
  - id: send_daily_summary
    priority: normal
    description: "Post summary of today's activity to coordinator group"
    schedule_override: "20:00"
    timeout_minutes: 3
  - id: sync_task_status
    priority: low
    description: "Update task tracker with completed items"
    timeout_minutes: 2

constraints:
  hard:
    - "Never modify files outside memory/ directory"
    - "Never send messages to external contacts without explicit approval"
    - "Never delete or overwrite existing log entries"
  soft:
    - "Prefer brief responses over comprehensive ones when time-constrained"
    - "Skip low-priority tasks if previous cycle ran long"

recovery:
  on_task_failure: log_and_continue
  on_repeated_failure: degraded_state
  degraded_state_behavior: "high_priority_tasks_only"
  alert_channel: feishu_coordinator_group
```
Let’s walk through each section.
Schedule
The schedule block controls when the agent runs.
mode: interval means the agent runs every N minutes. mode: cron supports standard cron expressions for more complex schedules. active_hours restricts operation to a time window — useful for agents that shouldn’t run at 3am.
Some agents run on a fixed daily schedule rather than an interval. A content agent might run once in the morning to draft a summary and once in the evening to file a log. These use schedule_override on specific tasks.
Tasks
The tasks list is ordered by priority. Each task has:
- id: unique identifier, also used in logs
- priority: `high`, `normal`, or `low`
- description: plain language; this is what the agent reads to understand the task
- timeout_minutes: hard cutoff; the agent stops the task and logs a timeout if exceeded
Keep the task list short. Three tasks per cycle is a good ceiling. More than that and you’re building a monolith, not an agent. If you have ten things the agent “should” do every hour, split them across multiple agents or reduce scope.
Priority is not just a suggestion — it determines what gets dropped when the cycle runs long. Low-priority tasks are first to be skipped. High-priority tasks run even if everything else is deferred.
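One way to sketch that dropping rule, treating each task's timeout_minutes as its worst-case cost (my simplification, not the system's exact rule):

```python
PRIORITY_ORDER = {"high": 0, "normal": 1, "low": 2}

def plan_cycle(tasks, budget_minutes):
    """Fit tasks into the cycle's time budget, dropping lowest priority first.

    High-priority tasks always run; once a lower-priority task no longer
    fits, it and everything below it is deferred to a later cycle.
    """
    planned, remaining = [], budget_minutes
    for task in sorted(tasks, key=lambda t: PRIORITY_ORDER[t["priority"]]):
        if task["priority"] != "high" and task["timeout_minutes"] > remaining:
            break  # this task and everything below it is deferred
        planned.append(task["id"])
        remaining -= task["timeout_minutes"]
    return planned

tasks = [
    {"id": "check_feishu_inbox", "priority": "high", "timeout_minutes": 5},
    {"id": "send_daily_summary", "priority": "normal", "timeout_minutes": 3},
    {"id": "sync_task_status", "priority": "low", "timeout_minutes": 2},
]
print(plan_cycle(tasks, budget_minutes=7))   # only the high-priority task fits
print(plan_cycle(tasks, budget_minutes=20))  # everything fits
```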
Three Ways to Trigger a Heartbeat
1. /loop in Claude Code
The simplest approach during development. You open Claude Code in the agent’s directory, type /loop, and the agent runs its heartbeat tasks in a continuous loop until you stop it.
Good for testing and for agents that run while you’re at your computer. Not good for 24/7 operation.
2. System cron
A cron job calls a script that invokes Claude Code non-interactively. The script handles setup, invocation, and log capture.
```shell
# Example cron entry: run liaison heartbeat every 30 minutes
*/30 7-23 * * * /path/to/run_heartbeat.sh my_liaison_agent >> /var/log/liaison.log 2>&1
```
More reliable than manual /loop. Works on any machine. Requires Claude Code to be installed and authenticated.
3. Agent as persistent service
The agent runs as a long-lived process managed by a process supervisor (systemd, supervisor, NSSM on Windows). It maintains its own loop internally and exposes a health endpoint.
This is the most robust approach for production. It also requires the most infrastructure. The Liaison agent in my system runs this way — it’s a Python process that handles Feishu WebSocket connections and runs heartbeat tasks on its own schedule.
Constraint Declarations
This is the most important section, and the one most often skipped.
Hard constraints are absolute prohibitions. The agent must refuse a task rather than violate them. Examples from real failures:
Without this constraint: an agent asked to "organize the project files" moved source code out of version control into a backup directory, then couldn't find it later.

```yaml
- "Never move or delete files outside the memory/ directory"
```

Without this constraint: an agent drafting a customer response actually sent it, because the tool call for "draft" and "send" looked similar from the agent's perspective.

```yaml
- "Never call send_message without explicit human approval in this session"
```
Hard constraints are cheap to write and expensive to discover by omission. Write them before the agent runs, not after.
Soft constraints are preferences that can be overridden by circumstances. They guide judgment calls rather than prohibit actions. The agent reads them but doesn’t treat them as blockers.
Failure Recovery
The recovery block defines what happens when tasks fail.
log_and_continue is the right default for most tasks. The agent logs the failure with enough context to diagnose later, marks the task as failed in its state, and moves to the next task. It doesn’t retry immediately — retrying a failing task in a tight loop wastes tokens and usually doesn’t fix the underlying problem.
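A sketch of the log_and_continue policy (the task-registry and logger shapes are assumptions):

```python
def run_tasks(tasks, logger):
    """Run each task in order; a failure is logged with context and the
    loop moves on to the next task instead of aborting the cycle."""
    results = {}
    for task_id, fn in tasks.items():
        try:
            fn()
            results[task_id] = "ok"
        except Exception as exc:
            logger(f"{task_id} failed: {exc!r}")  # enough context to diagnose later
            results[task_id] = "failed"
    return results

failures = []
results = run_tasks(
    {"good": lambda: None, "bad": lambda: 1 / 0, "after": lambda: None},
    failures.append,
)
print(results)  # 'bad' fails, 'after' still runs
```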
degraded_state activates after repeated failures. In degraded state, the agent runs only high-priority tasks and sends an alert. This prevents a broken agent from silently accumulating failures while burning through API quota.
The alert should go somewhere you actually check. For me that’s the Feishu coordinator group. A Slack channel, email, or even a file in a watched directory all work.
Daily Logs as Memory Traces
Every heartbeat cycle writes a log entry. This is not optional — it’s how the agent maintains continuity across sessions.
A minimal log entry:
```markdown
### 14:30 Heartbeat cycle
Tasks run: check_feishu_inbox, sync_task_status
Tasks skipped: send_daily_summary (not scheduled until 20:00)

check_feishu_inbox: 3 messages processed, 1 routed to my_coordinator
sync_task_status: 4 tasks marked complete

No failures. Next cycle in 30 minutes.
```
When you resume an agent after a gap, it reads these logs to reconstruct what happened. The logs are its memory of time passing.
Keep log entries structured but brief. Verbose logs are hard to scan. The agent should be able to read the last week of logs in one context window and have a clear picture of its operational history.
State Machine
An agent with a heartbeat has four operational states:
```
active → paused → degraded → stopped
  ↑                  │
  └──────────────────┘  (resume)
```

- active: running normally on schedule
- paused: schedule suspended; the agent will not start new cycles; resumes on command
- degraded: high-priority tasks only, alert sent, awaiting diagnosis
- stopped: process terminated, requires explicit restart
State is stored in agent_state.json in the agent’s root directory. It’s a simple key-value file the agent reads at startup and updates after significant state changes.
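A sketch of that read-at-startup, write-on-change pattern (key names beyond status are my guesses):

```python
import json
import tempfile
from pathlib import Path

def load_state(path):
    """Read agent_state.json at startup; fall back to defaults on first run."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"status": "active", "cycle_count": 0}

def save_state(path, state):
    """Rewrite the file after significant state changes (pause, degrade, resume)."""
    path.write_text(json.dumps(state, indent=2), encoding="utf-8")

state_file = Path(tempfile.mkdtemp()) / "agent_state.json"
state = load_state(state_file)   # first run: defaults
state["status"] = "paused"       # operator pauses before maintenance
save_state(state_file, state)
print(load_state(state_file)["status"])  # paused
```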
Pause is useful when you’re doing maintenance on a system the agent touches. Rather than stopping the agent entirely, you pause it, do your work, then resume. The agent’s memory is preserved.
Lessons from Production
A few things I’ve learned the hard way:
Don’t start with autonomy. Run the tasks manually first. Understand what the agent actually does when it executes each task. Only schedule it after you’ve seen it work.
Logs are the first thing you check when something goes wrong. If the log entries are too sparse to diagnose the failure, the log format is wrong. Fix it before the next incident.
Token cost per cycle compounds. An agent that runs every 15 minutes with a 10K context costs real money over a month. Measure it. Optimize the context passed to each cycle — the agent doesn’t need its full memory for every task.
The heartbeat spec is documentation. When you come back to an agent after three months, the heartbeat.yaml tells you what it does and why certain constraints exist. Treat it with the same care you’d give a README.
One hard constraint can save a lot of damage. The effort to write it takes two minutes. The effort to recover from the failure it prevents can take hours.