Shared mental model
The Loop
Both systems implement the same six-step loop. Learn it once, apply it to either problem.
Why a loop, not a pipeline
Pipelines run once. Loops iterate on feedback.
A pipeline takes an input and transforms it into an output. That's not enough here — both the harness and the campaign start from something suboptimal, and the signal you care about (benchmark score, campaign performance) only emerges after execution. So the loop has to execute, measure, and decide whether the latest mutation helped before proposing the next one.
The six steps below are the smallest set that makes that work. Skip any of them and you lose signal.
The six-step loop
[1] DIRECTIVE        human-authored — the only file humans edit regularly
      │
      ▼
[2] INSPECT          read current state: agent.py / container state
      │
      ▼
[3] RUN              execute against tasks / campaigns
      │
      ▼
[4] SCORE            Harbor result / per-client eval (0.00–1.00)
      │
      ▼
[5] MUTATE           edit agent.py / emit RFC 6902 JSON Patch
      │
      ▼
[6] KEEP / DISCARD   append to results.tsv / append to run-history.json
      │
      └──▶ go to [2]
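Stripped of system-specific detail, the loop is plain hill-climbing. A minimal sketch of the shape both systems share — `mutate`, `score`, and `log` are hypothetical stand-ins for either system's real implementations, not actual APIs:

```python
def hill_climb(state, mutate, score, log, rounds=10):
    """Generic keep/discard loop: propose one change, re-score, keep only improvements."""
    best = score(state)                      # [3]+[4] baseline run
    for r in range(rounds):
        candidate = mutate(state)            # [5] propose one small change
        s = score(candidate)                 # [3]+[4] execute and measure
        kept = s > best
        if kept:                             # [6] keep only if strictly better
            state, best = candidate, s
        log({"round": r, "score": s, "kept": kept})
        # discarded candidates are simply dropped — that is the rollback
    return state, best
```

Everything in steps 1–6 is a concrete binding of one of these lines: the directive shapes `mutate`, the run-plus-score pair is `score`, and the round log is `log`.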
Step-by-step
1 · Directive
A Markdown file the human writes. Describes what kind of agent / campaign you're trying to build, what constraints matter, and what success looks like. Not code. The meta-agent reads this every round.
- AutoAgent: program.md at repo root.
- Autoresearch: DOCUMENTATION/loops/gtm-autoresearch/program.md.
2 · Inspect
The meta-agent reads the current state. For AutoAgent that's agent.py. For autoresearch it's the current GTM/sGTM container snapshot plus the last N rounds of patches replayed from KV.
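A sketch of that replay, assuming a simplified flat container state and only top-level add / replace / remove ops (real GTM containers nest much deeper, and real paths would need full RFC 6901 pointer handling):

```python
import copy

def replay(snapshot, patches):
    """Reconstruct current state: start from a snapshot, replay one patch per kept round."""
    state = copy.deepcopy(snapshot)          # never mutate the stored snapshot
    for patch in patches:                    # oldest round first
        for op in patch:
            key = op["path"].lstrip("/")     # simplified: top-level keys only
            if op["op"] in ("add", "replace"):
                state[key] = op["value"]
            elif op["op"] == "remove":
                state.pop(key, None)
    return state
```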
3 · Run
Execute the artefact against the objective surface.
- AutoAgent: uv run harbor run -p tasks/ ... --agent-import-path agent:AutoAgent.
- Autoresearch:
scripts/run-gtm-loop.ts kicks off Apify watcher → dataset fetch → adjacency analysis → eval scoring.
4 · Score
A single numeric score in [0, 1].
- AutoAgent: Harbor aggregates test pass rate across all tasks.
- Autoresearch: per-client eval suite generated by the client-eval-generator skill from Meta Ads insights + GTM export.
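Both scorers reduce to the same shape: a pass fraction over a set of checks. A minimal sketch of that reduction (illustrative only — not Harbor's or the eval suite's actual aggregation code):

```python
def aggregate_score(outcomes):
    """Map a list of pass/fail outcomes to a single score in [0, 1]."""
    if not outcomes:
        return 0.0                     # no signal counts as zero, not as a pass
    return sum(1 for passed in outcomes if passed) / len(outcomes)
```

Keeping the score a single scalar is what makes step 6 a one-line comparison instead of a judgment call.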
5 · Mutate
Propose a change.
- AutoAgent: the meta-agent rewrites agent.py directly — whatever Python change it decides on (new prompt, new tool, new sub-agent, re-routed orchestration).
- Autoresearch: the meta-agent emits an RFC 6902 JSON Patch operation list against the container state. Each op has a reverse, so rollback is deterministic.
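The reverse of each op can be derived mechanically, which is what makes the rollback deterministic. A sketch for the three common ops — `old_value` is the value at the path before the patch was applied, captured at mutate time:

```python
def reverse_op(op, old_value=None):
    """Derive the inverse of one RFC 6902 op so a round can be rolled back."""
    if op["op"] == "add":
        return {"op": "remove", "path": op["path"]}
    if op["op"] == "remove":
        return {"op": "add", "path": op["path"], "value": old_value}
    if op["op"] == "replace":
        return {"op": "replace", "path": op["path"], "value": old_value}
    raise ValueError(f"no inverse defined for op: {op['op']}")
```

To undo a whole round, apply the reversed ops in reverse order — later ops may depend on the effects of earlier ones.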
6 · Keep or discard
Compare new score to best-so-far. Keep if improved. Discard (rollback) if not. Either way, append to the round log so you can audit the run later.
- AutoAgent: results.tsv in the repo root (gitignored).
- Autoresearch: data/signals/run-history.json (append-only) + per-round KV entry.
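The audit trail only works if the log is strictly append-only. A sketch of that discipline — JSON Lines is used here for simplicity, and the field names are illustrative, not the actual run-history.json schema:

```python
import json, os, tempfile

def log_round(path, record):
    """Append one round's outcome as a JSON line; never rewrite earlier rounds."""
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

# demo: two rounds appended, then replayed for auditing
path = os.path.join(tempfile.mkdtemp(), "run-history.jsonl")
log_round(path, {"round": 1, "score": 0.42, "kept": True})
log_round(path, {"round": 2, "score": 0.40, "kept": False})
with open(path) as f:
    history = [json.loads(line) for line in f]
```

Because discarded rounds are logged too, the audit later shows not just what the loop kept but what it tried and rejected.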
In plain terms
Hill-climbing means "take one step, check if you're higher, step back if not, try another direction." It's the simplest search strategy that works without gradient information. You don't know which edit will help — but you can always measure after the fact.
The magic is that each step is small and reversible. Nothing dramatic happens per round; dramatic improvement happens across dozens of rounds while you sleep.