
AutoAgent × Autoresearch

Two systems built on the same idea: give a coding agent a clear directive, a measurable objective, and the ability to mutate its own working material. Let it hill-climb overnight. Accept whatever it builds — provided the score cleared the bar.

AutoAgent mutates agent harnesses (agent.py) against Harbor benchmark scores. Autoresearch mutates GTM campaign configuration (via RFC 6902 JSON Patch) against per-client eval scores. Identical philosophy, different artefacts, complementary roster.

Meta-agent hill-climb
Measurable objective
Reversible mutations
Overnight cadence
Why this guide exists. The two systems share most of their mental model and a lot of their plumbing. Learning them side-by-side is faster than reading two disjoint docs — and lets you pick the right loop for a given problem instead of forcing one shape on the other.

Side-by-side

AutoAgent

Author: kevinrgu · thirdlayer.inc

  • Mutation target: agent.py — a single-file agent harness
  • Score: Harbor benchmark total score
  • Directive: program.md (human-authored)
  • Runtime stack: uv + Docker + Harbor
  • Output: an improved harness you can graduate

Autoresearch

Workspace: gtm-autoresearch · Organized-AI

  • Mutation target: GTM / sGTM container state
  • Score: per-client eval suite (0–1)
  • Directive: DOCUMENTATION/loops/gtm-autoresearch/program.md
  • Runtime stack: tsx scripts + Cloudflare KV + wrangler
  • Output: container patches + per-client reports

Explore

The Loop
Shared mental model. Directive → inspect → run → score → mutate → keep/discard → repeat.
hill-climb · meta-agent
AutoAgent deep dive
agent.py, program.md, tasks/, .agent/, Harbor integration, overnight walk-through.
agent.py · program.md · Harbor
Autoresearch deep dive
Pipeline · per-client evals · client registry · Obsidian + Slack surfaces.
gtm-autoresearch · per-client evals
Harbor
The benchmark runner that powers AutoAgent's score. Task format, parallel runs, trajectory output.
laude-institute/harbor
JSON Patch + KV
Autoresearch's mutation protocol. RFC 6902 · drift history in Cloudflare KV · reconstruct any state.
RFC 6902 · Cloudflare KV
Model Escalation
Sonnet → Opus 4.6 at 0.92 (one-way). Why that threshold, what it costs, how to tune it.
Sonnet 4.6 · Opus 4.6
Hermes Runtime
How winning harnesses graduate. launchd-managed Pi harnesses on claws-mac-mini. Slack-facing.
Hermes · Pi harness · launchd
Quick Start
Side-by-side setup. Run an AutoAgent experiment or an autoresearch round from a clean machine.
uv · tsx · wrangler

The Hill-Climb, at a glance

Round 0 baseline               0.42
After tool fix                 0.58
After prompt rewrite           0.71
After sub-agent (reverted)     0.68 ✗
After context-budget tune      0.89
After escalation to Opus 4.6   0.94

Illustrative — actual runs look messier, with more discards and occasional regressions.

Shared mental model

The Loop

Both systems implement the same six-step loop. Learn it once, apply it to either problem.

Why a loop, not a pipeline

Pipelines run once. Loops iterate on feedback.

A pipeline takes an input and transforms it into an output. That's not enough here — both the harness and the campaign start from something suboptimal, and the signal you care about (benchmark score, campaign performance) only emerges after execution. So the loop has to execute, measure, and decide whether the latest mutation helped before proposing the next one.

The six steps below are the smallest set that makes that work. Skip any of them and you lose signal.

The six-step loop

[1] DIRECTIVE       human-authored — the only file humans edit regularly
[2] INSPECT         read current state: agent.py / container state
[3] RUN             execute against tasks / campaigns
[4] SCORE           Harbor result / per-client eval (0.00–1.00)
[5] MUTATE          edit agent.py / emit RFC 6902 JSON Patch
[6] KEEP / DISCARD  append to results.tsv / append to run-history.json
      └──▶ go to [2]
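The six steps can be sketched in a few lines of Python. Everything here is illustrative — the real meta-agent is an LLM coding session, not a function — but the keep/discard mechanics are exactly this:

```python
# Illustrative sketch of the six-step loop. All callables are stand-ins:
# `run` executes the artefact, `score` measures it, `mutate` proposes a change.
def hill_climb(directive, state, run, score, mutate, rounds=10):
    best_state = state
    best_score = score(run(state, directive))        # [2]-[4] baseline
    log = [("baseline", best_score, True)]           # append-only round log
    for n in range(rounds):
        candidate = mutate(best_state, directive)    # [5] propose a change
        s = score(run(candidate, directive))         # [3]+[4] execute + measure
        kept = s > best_score                        # [6] keep only improvements
        if kept:
            best_state, best_score = candidate, s
        log.append((f"round-{n}", s, kept))
    return best_state, best_score, log

# Toy usage: state is a number, each "mutation" nudges it toward 1.0.
best, best_s, log = hill_climb(
    directive=None,
    state=0.4,
    run=lambda st, d: st,
    score=lambda out: out,
    mutate=lambda st, d: st + 0.1,
    rounds=3,
)
```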

Step-by-step

1 · Directive

A Markdown file the human writes. Describes what kind of agent / campaign you're trying to build, what constraints matter, and what success looks like. Not code. The meta-agent reads this every round.

  • AutoAgent: program.md at repo root.
  • Autoresearch: DOCUMENTATION/loops/gtm-autoresearch/program.md.

2 · Inspect

The meta-agent reads the current state. For AutoAgent that's agent.py. For autoresearch it's the current GTM/sGTM container snapshot plus the last N rounds of patches replayed from KV.

3 · Run

Execute the artefact against the objective surface.

  • AutoAgent: uv run harbor run -p tasks/ ... --agent-import-path agent:AutoAgent.
  • Autoresearch: scripts/run-gtm-loop.ts kicks off Apify watcher → dataset fetch → adjacency analysis → eval scoring.

4 · Score

A single numeric score in [0, 1].

  • AutoAgent: Harbor aggregates test pass rate across all tasks.
  • Autoresearch: per-client eval suite generated by the client-eval-generator skill from Meta Ads insights + GTM export.
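As a sketch of the aggregation — assuming a simple unweighted mean of per-task pass rates, which may differ from Harbor's actual weighting:

```python
# Hedged sketch: collapse per-task test results into one score in [0, 1].
# Harbor's real aggregation may weight tasks differently.
def aggregate_score(task_results):
    """task_results: list of (tests_passed, tests_total) per task."""
    if not task_results:
        return 0.0
    rates = [p / t for p, t in task_results if t > 0]
    return sum(rates) / len(rates)
```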

5 · Mutate

Propose a change.

  • AutoAgent: the meta-agent rewrites agent.py directly — whatever Python change it decides on (new prompt, new tool, new sub-agent, re-routed orchestration).
  • Autoresearch: the meta-agent emits an RFC 6902 JSON Patch operation list against the container state. Each op has a reverse, so rollback is deterministic.

6 · Keep or discard

Compare new score to best-so-far. Keep if improved. Discard (rollback) if not. Either way, append to the round log so you can audit the run later.

  • AutoAgent: results.tsv in the repo root (gitignored).
  • Autoresearch: data/signals/run-history.json (append-only) + per-round KV entry.

In plain terms
Hill-climbing means "take one step, check if you're higher, step back if not, try another direction." It's the simplest search strategy that works without gradient information. You don't know which edit will help — but you can always measure after the fact.

The magic is that each step is small and reversible. Nothing dramatic happens per round; dramatic improvement happens across dozens of rounds while you sleep.
kevinrgu / thirdlayer.inc

AutoAgent

Give a coding agent a task, let it build and iterate on an agent harness autonomously overnight. It modifies the system prompt, tools, agent configuration, and orchestration, runs the benchmark, checks the score, keeps or discards the change, and repeats.

Core idea. You don't touch the harness Python files directly. You edit program.md — the Markdown file that provides context to the meta-agent and defines the agent-engineering loop. The meta-agent does the Python editing on your behalf.

The four files that matter

agent.py

The entire harness under test in a single file. Config · tool definitions · agent registry · routing · orchestration. The Harbor adapter section is explicitly fixed; everything else is the meta-agent's edit surface.

program.md

Instructions for the meta-agent + the directive. The only file a human edits. Becomes the source of truth for harness intent.

tasks/

Evaluation tasks in Harbor format. Clean baseline branches may omit payloads; benchmark-specific branches add them.

.agent/

Optional workspace artefacts — reusable instructions, notes, prompts, or skill files the meta-agent can draw on between runs.

Architecture

agent.py                     # single-file harness under test
  editable harness section   # prompt · registries · tools · routing
  fixed adapter section      # Harbor integration · trajectory serialisation
program.md                   # meta-agent instructions + directive
Dockerfile.base              # base image for sandboxed runs
.agent/                      # optional workspace artefacts
tasks/                       # Harbor benchmark tasks (often branch-specific)
jobs/                        # Harbor job outputs (gitignored)
results.tsv                  # experiment log — meta-agent writes this
run.log                      # latest run output

Overnight walk-through

A single experiment — 12 hours

From directive to improved harness

  1. [evening] Human edits program.md: "build a bash-tool agent that solves Harbor's file-organizer benchmark at >80% score."
  2. [t+0] Operator prompts the coding agent: "Read program.md and let's kick off a new experiment!"
  3. [t+5min] Meta-agent inspects current agent.py, runs baseline on all tasks. Score: 0.42.
  4. [t+20min] Trajectory analysis: failures cluster around incorrect directory listing. Meta-agent adds a recursive ls tool. Score: 0.58. Kept.
  5. [t+45min] Failures now cluster on over-terse system prompt. Meta-agent rewrites prompt with explicit output shape. Score: 0.71. Kept.
  6. [t+2h] Meta-agent tries adding a verification sub-agent. Score: 0.68. Discarded.
  7. [t+3h] Context-budget tune — shorter task recap, larger tool spec. Score: 0.89. Kept.
  8. [morning] Operator reads results.tsv, picks the winning variant, merges to main. Harness delivered overnight.

Running the meta-agent

Point a coding agent at the repo and prompt:

Read program.md and let's kick off a new experiment!

Tips for writing program.md

  • State the directive loudly. One paragraph near the top describing the target harness and the success bar. The meta-agent re-reads this every round.
  • Constrain the search space. If you don't want sub-agents, say so. If certain libraries are off-limits, say so. Otherwise the meta-agent will try everything.
  • Explain the adapter boundary. Re-state that the Harbor adapter section is frozen, or the meta-agent may wander in and break the benchmark interface.
  • Provide a stop condition. "Stop when score > 0.85 for two consecutive rounds" beats running until the human interrupts.
  • Include anti-patterns. "Do not add a second LLM call per turn" prevents expensive regressions.
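The stop condition from the tips above ("stop when score > 0.85 for two consecutive rounds") is simple enough to make explicit. A minimal sketch — the directive states this in prose and the meta-agent enforces it; function and parameter names are illustrative:

```python
# Hedged sketch of a "N consecutive rounds above the bar" stop condition.
def should_stop(scores, bar=0.85, streak=2):
    """scores: round scores in chronological order, newest last."""
    if len(scores) < streak:
        return False
    return all(s > bar for s in scores[-streak:])
```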

Common failure modes

  • Meta-agent edits the adapter section and breaks Harbor integration. Fix: re-emphasise the fixed boundary in program.md.
  • Score noise masks real gains. Run each variant on all tasks, not a subset, until you trust the score is stable.
  • Over-specialisation. The meta-agent tunes to the benchmark rather than the underlying capability. Mitigate by rotating task sets between branches.
  • Silent rollbacks. If the meta-agent doesn't log discards, you lose the "why did this not work" trace. Enforce results.tsv discipline in the directive.

Links

AutoAgent repo
kevinrgu/autoagent
thirdlayer.inc
Self-configuring agents (WIP)
Harbor
Benchmark runner
uv
Python package manager
Organized-AI · gtm-autoresearch

Autoresearch

AutoAgent for GTM campaigns. A meta-agent mutates Google Tag Manager / server-side GTM container state via RFC 6902 JSON Patch against a per-client eval suite, hill-climbing on score. Round history persists in Cloudflare KV so any prior state is reconstructible.

Pipeline

autoresearch — one round

[fswatch]               ~/.claude/projects/**/*.jsonl session log changes
[significance-check]    worth a round?
[run-actor]             Apify plugin watcher · optional ad-hoc webhook
[Cloudflare KV]         drift history · wrangler kv put
[fetch-dataset]         Apify dataset → local SQLite
[analyze-adjacency]     gap detection vs. known skills/plugins
[generate-experiments]  Obsidian note · skill stub · marketplace stub
[score round]           per-client eval · 0.00–1.00
[emit RFC 6902 patch]   mutation against container state
[append run manifest]   data/signals/run-history.json

Per-client evals

The client-eval-generator skill materialises an eval set + business profile for each client from:

  • Meta Ads insights — campaign / adset / ad level; last-30d + historical.
  • GTM + sGTM container exports — tags · triggers · variables · clients · transformations.
  • Client profile — the Markdown business description the operator commits with the client.

Today's client roster

HRE Beauty

Shopify ecom DTC. Profile: content/clients/hre-beauty/profile.md.

BLADE Web

Lead-gen + high-ticket. Profile: content/clients/blade-web/profile.md.

BLADE Server

sGTM CAPI-focused. Profile: content/clients/blade-server/profile.md.

Where the code lives

DOCUMENTATION/ARCHITECTURE.md
    System diagram · per-round data flow · mutation protocol · KV schema
DOCUMENTATION/loops/gtm-autoresearch/program.md
    Client registry · strategy order · constraints · stop conditions
scripts/run-gtm-loop.ts
    Main round runner — orchestrates phases + publishes reports
scripts/generate-client-reports.ts
    Materialise per-client report HTML + deploy via wrangler
scripts/lib/kv-store.ts
    Idempotent KV writes via wrangler
scripts/lib/client-reports.ts
    Client-report templating
.claude/skills/client-eval-generator/SKILL.md
    Eval + profile generator skill
data/experiments.sqlite
    Experiment log — rounds, scores, patches
data/signals/run-history.json
    Run manifest (append-only)

Conventions (from CLAUDE.md)

  • All scripts in scripts/ run via npx tsx scripts/<name>.ts.
  • Errors logged to data/errors/{timestamp}.log — never crash silently.
  • Outputs are idempotent — re-running never duplicates.
  • Console logs use phase prefix: [Phase0], [Phase1], ….
  • Every run appends to data/signals/run-history.json.
  • scripts/run-all.sh chains the full pipeline in order.
  • Never modify data/signals/known-*.json manually — auto-maintained.

Obsidian + Slack surfaces

Round output lands in two places humans read:

  • Obsidian vault — $OBSIDIAN_VAULT_PATH/Planning/experiments/. Each round generates a dated experiment note with the directive, the patch, the score delta, and trajectories.
  • Slack — the Hermes Pi harness posts round outcomes to a designated channel. See Hermes Runtime.

Links

gtm-autoresearch
Organized-AI/gtm-autoresearch
Guide (fine-tune pipeline)
gtm-autoresearch-guide.pages.dev
RFC 6902
JSON Patch spec
Apify
Actor + dataset runtime
laude-institute / harbor

Harbor — The Objective Function

Harbor is the benchmark runner AutoAgent uses to score every round. A Harbor task is a directory with a setup, an agent entry point, and a test suite; Harbor runs your agent against each task and emits a score.

In plain terms
You don't write the benchmark — you write the directive. The agent writes the harness. Harbor measures the result. The meta-agent hill-climbs.

Task format

tasks/my-task/
  setup/       # fixture — files, env, state
  task.yaml    # name, description, evaluator
  tests/       # pass/fail assertions
  README.md    # human context

Tasks are typically added in benchmark-specific branches so baseline branches stay clean.

Running

# One task
uv run harbor run \
  -p tasks/ \
  --task-name "file-organizer" \
  -l 1 -n 1 \
  --agent-import-path agent:AutoAgent \
  -o jobs --job-name latest

# All tasks in parallel (concurrency 100)
uv run harbor run \
  -p tasks/ \
  -n 100 \
  --agent-import-path agent:AutoAgent \
  -o jobs --job-name latest

What Harbor emits per run

  • Per-task score — pass / fail / partial.
  • Total score — aggregate used by the meta-agent for hill-climbing.
  • Trajectory — full record of agent decisions, tool calls, and reasoning. The meta-agent diffs trajectories across rounds to identify failure clusters.
  • Cost / timing — useful for gating "did this variant get slower or more expensive."

Why trajectories matter

Read the traces

Score tells you if something changed. Trajectories tell you what.

The meta-agent reads trajectories between rounds to identify why specific tasks are failing. Without that, every mutation is blind. With it, mutations are targeted — "the agent got lost in directory recursion, let me add a tool that avoids that" — and the hill-climb converges faster.
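A minimal sketch of that clustering step. The record shape here ({"task": ..., "passed": ..., "error": ...}) is hypothetical — Harbor's actual trajectory JSON is much richer — but the grouping idea is the same:

```python
from collections import Counter

# Hedged sketch: group failed tasks by a coarse error signature so the
# meta-agent can target the biggest failure cluster first. The input
# record shape is hypothetical, not Harbor's real trajectory schema.
def failure_clusters(results):
    sigs = Counter()
    for r in results:
        if not r["passed"]:
            # First line of the error message as a coarse signature.
            sigs[r["error"].splitlines()[0]] += 1
    return sigs.most_common()
```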

Mutation protocol · drift history

JSON Patch + Cloudflare KV

Autoresearch's answer to "how do we mutate without breaking the rollback guarantee." Every proposed change is an RFC 6902 JSON Patch operation list. Every applied patch is logged to Cloudflare KV so prior states are reconstructible.

Why JSON Patch

Reviewable

  • Diffs are small and readable.
  • Each op maps 1:1 to a container change.
  • Human can sanity-check patches before apply.

Reversible

  • Each op has a canonical reverse.
  • A failed round rolls back deterministically.
  • No "state drift" across rounds.

Replayable

  • KV stores patch sequence per client.
  • Replay from any anchor snapshot to reach any prior state.
  • Audit = scroll through patch history.

Composable

  • Multiple patches batch into one round.
  • Attribution preserved per op.
  • Lets the meta-agent propose multi-part changes without sacrificing auditability.

Example patch

[ { "op": "replace", "path": "/tags/ga4_purchase/parameter/value_currency", "value": "USD" }, { "op": "add", "path": "/triggers/purchase_qualified", "value": { "filter": [ { "variable": "ecom_value", "op": "gte", "value": 50 } ] } } ]

KV schema

Key                      Value
client/<slug>/anchor     Baseline container snapshot (full state)
client/<slug>/round/<n>  Patch applied at round n + score + meta
client/<slug>/head       Pointer to latest successful round
client/<slug>/eval/<n>   Eval output for round n

Reconstructing prior state

# Replay any prior state
anchor = kv.get("client/hre-beauty/anchor")
state = anchor
for n in range(1, target_round + 1):
    patch = kv.get(f"client/hre-beauty/round/{n}")["patch"]
    state = apply_rfc6902(state, patch)
# `state` is now the container state at `target_round`

Writer

scripts/lib/kv-store.ts is a thin wrapper around wrangler kv {put,get,list}. Idempotent — re-running the same write for the same key is a no-op unless the value changes. Relies on local wrangler being logged in against your Cloudflare account.
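The idempotency check reduces to read-compare-write. A hedged Python sketch of that logic — the real implementation is TypeScript and shells out to wrangler, so names here are illustrative:

```python
# Hedged sketch of the idempotent-write logic in scripts/lib/kv-store.ts.
# `kv` is any dict-like store; the real version wraps `wrangler kv put/get`.
def put_if_changed(kv, key, value):
    if kv.get(key) == value:
        return False   # no-op: identical value already stored
    kv[key] = value
    return True        # value actually changed
```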

Model escalation

Sonnet → Opus 4.6 at 0.92

Both loops use the same model escalation policy. Claude Sonnet drives most rounds. The first time a round's score crosses ≥ 0.92, the loop escalates to Claude Opus 4.6 — and stays there. One-way.

Default driver      Claude Sonnet — all rounds while score < 0.92
Escalation trigger  First round whose score ≥ 0.92 — one-way promotion
Escalated driver    Claude Opus 4.6 — all subsequent rounds

Why 0.92

Sonnet finds the hill. Opus polishes the summit.

Sonnet is fast and cheap enough to run dozens of rounds overnight while the score is still climbing the easy gradient. Once you're in the top decile, each additional point gets harder, and you want a stronger driver that can reason about subtle interactions in the harness / campaign.

0.92 is empirically where the returns to model quality start to dominate the returns to more rounds. Below it, you want iteration speed. Above it, you want reasoning depth.

Why one-way

Don't re-earn the threshold

If the escalation were two-way, a single regression would bump the loop back to Sonnet, and you'd spend cheap rounds re-climbing ground Opus already covered. One-way means once you've earned the right to Opus, you keep it.

This also makes cost predictable — you know exactly how many Opus rounds a project has consumed by counting rounds since the first ≥ 0.92.

Tuning

  • Different threshold. Projects with noisier score functions can raise to 0.94 to prevent accidental promotion. Projects with cheap Opus tokens can lower to 0.88.
  • Two-way, with cooldown. Variant: allow demotion to Sonnet, but only after 5 consecutive rounds below the threshold. Useful in long-running loops.
  • Per-phase driver. Different loop phases (mutate, score, summarise) can pin different models. Summary-model is already google/gemini-3-flash-preview in the Hermes config.
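The one-way policy fits in a few lines. A sketch with illustrative model identifiers — the real loops wire this into their round runners:

```python
# Hedged sketch of the one-way escalation policy described above.
# Model name strings are illustrative, not real API identifiers.
class ModelPolicy:
    def __init__(self, threshold=0.92):
        self.threshold = threshold
        self.escalated = False   # one-way: never reset, even on regression

    def driver(self, last_score=None):
        if last_score is not None and last_score >= self.threshold:
            self.escalated = True
        return "claude-opus-4.6" if self.escalated else "claude-sonnet"
```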

Graduation path

Hermes as Runtime

AutoAgent and autoresearch both produce artefacts. Neither runs them in production. That's what Hermes does — take a proven harness or a proven autoresearch loop and host it as a long-lived Pi harness on claws-mac-mini.

Pi = Persistent, Independent. Each Pi harness has its own launchd label (ai.hermes.harness.<name>), its own Slack handle, its own narrow tool surface, and its own session store. A Pi harness crashing never takes the main Hermes gateway down.

Graduation flow

From experiment to production

[AutoAgent / Autoresearch]  hill-climb until score ≥ bar
[graduate checkpoint]       freeze agent.py / patch sequence
[Hermes synthesise]
  ├─▶ ~/.hermes/harnesses/<name>.yaml
  ├─▶ ~/Library/LaunchAgents/ai.hermes.harness.<name>.plist
  ├─▶ Slack app handle · scopes · channel bindings
  └─▶ MCP wiring + env vars
[launchctl kickstart]       Pi harness live
[ongoing operations]        logs · restart · graceful upgrade

Target roster of Pi harnesses

autoresearch

Watches ~/.claude/projects/**/*.jsonl, runs the gtm-autoresearch loop, posts findings. Replaces scripts/run-all.sh cron trigger.

repo-ingest

GitHub URL in Slack → repo-digest + Gemma summary reply. Runs on Gemma tier — zero marginal cost.

client-eval

Wraps the client-eval-generator skill. Scheduled eval regen + Cloudflare Pages publish.

harness-foundry

Long-running AutoAgent driver. Accepts new program.md directives via Slack, kicks off experiments, posts results.

Why this split

  • Iteration and production have different needs. Iteration wants fast, cheap, frequent runs with disposable containers. Production wants uptime, memory, and human-facing surfaces.
  • Hermes already solves production. Launchd supervision, Slack Socket Mode, MCP plumbing, memory persistence, self-heal. Reusing it beats rebuilding those for each experiment output.
  • AutoAgent / autoresearch stay pure. They don't need to know about launchd or Slack. Their output is an artefact. Hermes takes it from there.

Claws host specifics

Key          Value
Host         claws-mac-mini · 100.82.244.127
User         claw
Gateway      ai.hermes.gateway (LaunchAgent)
Logs         ~/.hermes/logs/*.log
Models       Codex OAuth primary · Gemma-4 fallback (:8080)
Shell tools  gh · gitingest · repo-digest · wrangler

See the standalone Hermes Pi Harness guide for host-level operations.

Side-by-side

Quick Start

Two independent setups. Pick the loop you need; the other runs happily on the same machine later.

AutoAgent

Requirements: Docker · Python 3.10+ · uv · model-provider creds.

# 1. Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. Clone + deps
git clone https://github.com/kevinrgu/autoagent
cd autoagent
uv sync

# 3. Env
cat > .env <<'EOF'
OPENAI_API_KEY=...
EOF

# 4. Base image
docker build -f Dockerfile.base -t autoagent-base .

# 5. Add Harbor tasks under tasks/ (benchmark-specific branches)

# 6. Baseline run (all tasks, parallel 100)
rm -rf jobs; mkdir -p jobs && \
uv run harbor run -p tasks/ -n 100 \
  --agent-import-path agent:AutoAgent \
  -o jobs --job-name latest > run.log 2>&1

# 7. Kick off the meta-agent (in your coding agent):
#    "Read program.md and let's kick off a new experiment!"

Autoresearch

Requirements: Node 20+ · tsx · wrangler logged in · Apify + Meta creds in .env.

# 1. Clone
git clone https://github.com/Organized-AI/gtm-autoresearch
cd gtm-autoresearch

# 2. Env — see .env.example
cp .env.example .env && $EDITOR .env

# 3. Install deps
npm install

# 4. Auth wrangler (one-off)
wrangler login
wrangler whoami

# 5. Run a single round
npx tsx scripts/run-gtm-loop.ts

# 6. Chain the full pipeline
bash scripts/run-all.sh

# 7. Publish per-client reports (deploys to Cloudflare Pages)
npx tsx scripts/generate-client-reports.ts

Useful cross-cutting commands

# Tail round history
tail -f data/signals/run-history.json

# KV operations (autoresearch)
wrangler kv list --namespace-id <id>
wrangler kv get "client/hre-beauty/head" --namespace-id <id>

# Harbor trajectory dump (autoagent)
cat jobs/latest/<task-name>/trajectory.json | jq .

Terms + Frequently Asked

Glossary + FAQ

The words that come up a lot when the two loops show up in conversation.

Glossary

Meta-agent
A coding agent (Claude Code, Codex, etc.) pointed at the AutoAgent or autoresearch repo with a directive. It reads program.md, edits the artefact, runs the benchmark, and iterates. The human doesn't touch Python / config directly.
Directive
The human-authored Markdown description of what the loop is trying to achieve. The only file humans edit regularly. Lives in program.md.
Harness
In AutoAgent: the single-file agent under test (agent.py). In Hermes: a long-lived Pi harness — an always-on agent instance with its own launchd label, Slack handle, and tool surface.
Hill-climb
A search strategy that proposes a local change, measures whether the score went up, keeps it if so and reverts it if not. No gradient info required. Simplest thing that works when you can only measure after the fact.
Harbor
Benchmark runner from the Laude Institute. Powers AutoAgent's score. Tasks are directories with a setup, an entry point, and a test suite.
Trajectory
Full record of an agent's decisions, tool calls, and reasoning during a Harbor run. The meta-agent diffs trajectories between rounds to identify failure clusters.
RFC 6902 JSON Patch
IETF spec for describing JSON document changes as a list of operations (add / replace / remove / move / copy / test). Each op has a canonical reverse, which is what makes autoresearch's mutations deterministically reversible.
Drift history
The sequence of JSON Patches applied to a client's GTM container state. Stored in Cloudflare KV. Replay from an anchor snapshot to reconstruct any prior state.
Per-client eval
Score suite generated by the client-eval-generator skill from Meta Ads insights + GTM export + client profile Markdown. Produces a 0–1 score per round.
Pi harness
Persistent, Independent — a long-lived agent process hosted by Hermes on claws-mac-mini (or any fleet node). Own launchd label, own Slack handle, own tool surface. Not the same thing as a Raspberry Pi.
Graduation
The act of taking an experiment output (a winning AutoAgent checkpoint or a stable autoresearch loop) and installing it as a Pi harness in Hermes. Turns an iterating experiment into a production surface.
Sonnet → Opus 4.6
One-way model escalation policy. Claude Sonnet drives rounds while score < 0.92. First time a round crosses ≥ 0.92, the loop promotes to Opus 4.6 and stays there. Optimises cost vs. reasoning depth.
Anchor snapshot
A full container state captured at a known-good moment. Autoresearch replays patches from the anchor to reach any later state — so the KV never needs to store the full state per round, only deltas.

FAQ

Is AutoAgent a framework or a repo?

Both. The repo kevinrgu/autoagent is the reference harness — a single-file agent.py plus the program.md + tasks/ + .agent/ convention. The framework is the pattern: edit a directive, let a coding agent iterate on the harness, score with Harbor, hill-climb.

Do I need Harbor to run autoresearch?

No. Autoresearch uses its own scoring surface — per-client evals generated from Meta Ads + GTM exports. Harbor is specific to AutoAgent.

Can I use one loop without the other?

Yes. They're independent systems. The shared mental model makes them easier to learn together, but you can adopt either one alone. Many teams start with autoresearch (because GTM pain is concrete) and add AutoAgent later (because harness engineering is more abstract).

What if score is noisy?

Run each variant against the full task / eval set before deciding. If noise is still high, require N consecutive rounds above the best-so-far before promoting. AutoAgent's program.md can encode that rule; autoresearch's program.md can too.

How do I stop a runaway loop?

AutoAgent: interrupt the coding agent session; the harness stays at the last kept variant. Autoresearch: touch .pause (checked at the top of each round) or launchctl stop on the Pi harness label. Both write a manifest per round so you always know where you are.

Why RFC 6902 instead of just storing full state per round?

Patch sequences compress better, diff better, and review better. Replay from anchor is O(round_count) but rounds per client are in the hundreds, not millions — cheap. Full-state-per-round loses the mutation intent, which is exactly what trajectories + patches make legible.

Does the meta-agent need to be Claude?

No. Any coding agent that can read a repo, edit files, and run shell commands will work. Claude Sonnet is the default here because the escalation policy targets Opus 4.6 at 0.92 — but pointing Codex or a local model at program.md is supported.

Where do I see a round's result?

AutoAgent: results.tsv in the repo root (gitignored). Autoresearch: data/signals/run-history.json (append-only) plus the per-round KV entry. Once the Pi harness is live, both surface to a dedicated Slack channel.
