GOAL-001: LXD Codex Orchestration + Multi-Hardware CI
Status: active
Owner: arch:codex
Last updated: 2026-02-12
Objective
Establish a durable workflow where:
- the Architecture Agent (`arch`) in this repository remains the code-authoring and coordination control plane,
- the Architecture Agent can drive an Infrastructure Agent (`infra`) for host/container/VM maintenance,
- the Architecture Agent can drive a Runner Agent (`runner`) inside the provisioned VM for CI/CD and hardware execution,
- the LXD VM becomes the operational base for hardware-aware CI, and
- multiple hardware-under-test (HUT) nodes can be flashed/tested in parallel with safe, locked power control.
Follow-On Goals
- `documentation/short-term/active/GOAL-002-hut-surface-first-test.md` for first cross-slot HUT surfacing execution.
- `documentation/short-term/active/GOAL-003-agent-doc-lint-ci.md` for documentation linting and gardening flow in CI/CD.
CI Environment Split (required contract)
There are two CI lanes and they must remain distinct:
- Upstream GitHub lane (no physical hardware):
- compile, package, static checks, unit/integration tests that do not require physical devices
- never assume serial ports, relay access, or USB-attached targets
- Local LXD hardware lane (physical HUT available):
- flash/monitor, hardware integration tests, soak runs, power-cycle recovery flows
Seamless operation requirement:
- hardware lane consumes the same commit SHA/artifacts produced by upstream lane (or rebuilds deterministically from same SHA),
- result reporting flows back to shared status artifacts and short-term board,
- no hidden environment-only behavior differences outside explicitly documented hardware dependencies.
Note:
- CI lanes define where tests run.
- Agent roles define which Codex agent controls which environment.
- These are orthogonal and must not be conflated.
Agent Roles (required contract)
Short names (use these in board ownership and reports):
- `arch`: Architecture Agent (main engineering/control plane)
- `infra`: Infrastructure Agent (platform/host/VM operations)
- `runner`: Runner Agent (VM CI/CD + test execution)
Role responsibilities:
`arch`
- owns product-repo development, task coordination, and status-reporting governance
- delegates remote work to role-local Codex runtimes when available (`infra:codex`, `runner:codex`); SSH is treated as a transport layer, not as the execution model
- mentors and improves `infra`/`runner` operating efficiency over time; must propose improvements when asked or when major inefficiencies degrade team throughput
- governs documentation quality, tracking hygiene, and baseline environment decisions
- keeps governance documentation enabling and operational, not artificially restrictive
`infra`
- manages host/container/VM provisioning and maintenance
- never clones or edits this product repository
- maintains its own repository and AGENT documentation
- does not communicate directly with `runner` except when only local host commands can reach the VM
`runner`
- executes CI/CD and test workflows inside VM scope
- reports structured artifacts/status back to `arch`
- maintains its own repository and AGENT documentation
Owner Semantics (Prevent Drift)
Owner values encode where execution happens:
`<role>:codex`
- Codex is installed/authenticated in that role environment.
- Coordination means prompting that remote Codex runtime to execute work in its role repository (not SSH-direct command execution).

`<role>:agent`
- The `arch:codex` control plane is acting as a temporary proxy executor for that role (typically via SSH transport).
- Allowed only for bootstrap/emergency.
- Must be called out in handoff summaries as `executor=ssh_direct`.
Repository Boundary Contract (required target)
- Product repository (`squeezelite-esp32`) for `arch`.
- Infrastructure repository for `infra`.
- Runner orchestration repository for `runner`.
Required behavior:
- each role has a dedicated repository
- each remote role (`infra`, `runner`) has an independent `AGENTS.md`, workstream board, and handoff log
- strict execution boundary:
  - `arch` must never implement `runner` workstreams by modifying or running commands inside the runner repository on the development machine (example local path: `/workspaces/codex-runner-agent`)
  - `runner` workstreams are executed only by `runner:codex` inside the runner VM, with results reported back via runner evidence + commit SHA
- cross-agent handoffs include commit SHA, agent role, status, and artifact/log pointers
- optional local GitLab can be used as shared upstream for portability and document access when the operator enables it
Operator Touchpoints
- Operator assistance is explicitly expected for:
- runner Codex interactive authentication
- optional local GitLab setup/onboarding for agent repos
`arch` should request operator help when these gates are reached instead of introducing brittle workarounds.
GitLab Credential Contract (ADHOC-20260212-02)
This contract standardizes secret location, auth mode, and verification
for agent-repo bootstrap on git.lecsys.net.
Canonical secret root:
${XDG_CONFIG_HOME:-$HOME/.config}/codex/credentials/gitlab/git.lecsys.net/
Expected role files:
`codex.env`, `arch.env`, `infra.env`, `runner.env`
Minimum variables per role file:
- `GITLAB_HOST`
- `GITLAB_USER`
- `GITLAB_PASSWORD` (bootstrap-only)
- `GITLAB_PAT` (required for non-interactive flows)
Required controls:
- permissions: root `0700`, files `0600`
- auth mode: PAT over HTTPS for automation; password only for bootstrap
- propagation: transfer role-local file only to its target runtime
- evidence hygiene: record command outcome, never raw secret values
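The controls above can be sketched as a shell check; the function name is illustrative, and `GITLAB_PASSWORD` is intentionally omitted from the required-variable check since it is bootstrap-only.

```shell
# Sketch: verify the credential layout and controls described above.
CRED_ROOT="${XDG_CONFIG_HOME:-$HOME/.config}/codex/credentials/gitlab/git.lecsys.net"

check_role_file() {
  f="$CRED_ROOT/$1.env"
  [ -f "$f" ] || { echo "missing: $f"; return 1; }
  # Files must be 0600 (the root directory must be 0700).
  perms=$(stat -c '%a' "$f")
  [ "$perms" = "600" ] || { echo "bad perms $perms on $f"; return 1; }
  for var in GITLAB_HOST GITLAB_USER GITLAB_PAT; do
    # Check presence only; never echo secret values into evidence.
    grep -q "^${var}=" "$f" || { echo "missing $var in $f"; return 1; }
  done
  echo "ok: $1"
}

# Run the check for every expected role file when the root exists:
if [ -d "$CRED_ROOT" ]; then
  for role in codex arch infra runner; do check_role_file "$role"; done
fi
```

Recording only the `ok`/`missing`/`bad perms` outcome keeps evidence logs secret-free.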
Agent Repo Remote Bootstrap (ADHOC-20260212-03)
Bootstrapped remotes and default branch:
- `infra` repo: `http://git.lecsys.net/infra/infra-agent.git` (`main`)
- `runner` repo: `http://git.lecsys.net/runner/runner-agent.git` (`main`)
Bootstrap checkout contract:
- clone role repo to its canonical workspace path
- checkout `main`
- verify `origin` remote URL and branch tracking
- verify remote reachability with `git ls-remote`
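The checkout contract above can be sketched as one shell function; the function name is illustrative, and the verification order (reachability first) is a sketch choice, not part of the contract.

```shell
# Sketch of the bootstrap checkout contract: clone, check out main,
# verify origin and tracking, verify reachability.
bootstrap_role_repo() {
  url="$1"; dest="$2"
  # Fail fast if the remote is unreachable or auth is wrong.
  git ls-remote --heads "$url" >/dev/null || return 1
  # Clone to the canonical workspace path and check out main.
  git clone -q --branch main "$url" "$dest" || return 1
  # Verify the origin remote URL and branch tracking.
  [ "$(git -C "$dest" remote get-url origin)" = "$url" ] || return 1
  git -C "$dest" rev-parse --abbrev-ref --symbolic-full-name '@{upstream}' |
    grep -qx 'origin/main'
}

# Example:
# bootstrap_role_repo http://git.lecsys.net/runner/runner-agent.git \
#   /workspaces/codex-runner-agent
```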
Evidence locations:
- `/workspaces/codex-infra-agent/README.md`
- `/workspaces/codex-runner-agent/README.md`
- `documentation/short-term/coordination/handoff_log.md`
Container Agent Pickup (first 15 minutes)
- Read:
  - `documentation/short-term/coordination/workstream_board.md`
  - `documentation/short-term/coordination/handoff_log.md`
- Claim one workstream by updating board owner/status.
- Execute only the claimed workstream scope.
- Log handoff with concrete next step before stopping.
Workstreams
- WS12: Codex App Server Transport (evaluate -> prototype -> implement)
  - Objective: evaluate Codex App Server as a more structured transport than SSH for prompting `runner:codex`/`infra:codex`, especially for parallel multi-step threads and stable context reuse.
  - Mechanism lock (required):
    - WS12 must produce and maintain a single transport decision record in long-lived docs: `documentation/agents/remote_transport_lock.md`
    - Once `remote_transport_lock.md` sets `status: locked`, follow-on WS12 work must implement that locked mechanism (no re-litigating transport per session).
    - SSH remains the break-glass fallback unless explicitly retired in the lock record.
  - Acceptance (v1):
    - Evaluation: documented transport decision criteria and current status (evaluating/locked) in `remote_transport_lock.md`
    - Prototype: a minimal proof-of-viability plan that can be executed by the appropriate role(s) without `arch` becoming the executor
    - Implementation: documented invocation pattern(s) (no secrets) that preserve the repo boundary contract and keep `thread_ref` correlation usable across sessions and in parallel
Delegation Gate (required before remote-role execution)
Before executing any infra or runner workstream:
- assign a concrete owner (`infra:<id>` or `runner:<id>`)
- verify role activation state is at least `assigned`
- log the first role-context handoff entry
- record the operator gate for the action: `operator_required=yes` or `operator_required=no`
- record the execution mechanism in the handoff summary:
  - `executor=remote_codex` (prompt remote Codex runtime), or
  - `executor=ssh_direct` (bootstrap/emergency only)
Handoff lines should include:
- `context: arch-local|infra-live|runner-live`
- `action_type: scaffold|delegate|execute|replan`
- `operator_required: yes|no`
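A handoff line with those fields can be emitted by a small formatter; the exact line layout below (UTC timestamp prefix, space-separated `key:value` pairs) is an illustrative assumption, not a fixed contract.

```shell
# Illustrative handoff-line formatter for the fields listed above.
handoff_line() {
  # $1=context  $2=action_type  $3=operator_required  $4=free-text summary
  printf '%s context:%s action_type:%s operator_required:%s %s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "$2" "$3" "$4"
}

# Example:
handoff_line runner-live delegate no "WS8 slot inventory captured"
```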
Ad-hoc request reflex:
- ask the user: `Track this as a ticket? (yes/no)`
- if `yes`, create/update `documentation/short-term/coordination/ad_hoc_ticket_queue.md`
- if `no`, continue and include `ticket_tracking=declined` in the handoff summary
Evidence naming convention:
`arch_*`, `infra_*`, `runner_*`
WS1: Infrastructure SSH + Codex passthrough
Outcome
arch can reliably trigger/drive commands on the infra runtime over
SSH.
Tasks
- Validate key-based SSH path and non-interactive command execution.
- Standardize connection variables in `.lxd.env` (host/user/key/path).
- Create a minimal remote-control wrapper script (local side) for repeatable calls.
- Add a heartbeat check command (`hostname`, `uptime`, `whoami`, disk availability).
Acceptance
- One command from `arch` context can run a remote shell on `infra` without manual interaction.
- Failures return non-zero status and readable diagnostics.
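A minimal local-side wrapper meeting this acceptance could look like the sketch below. The variable names sourced from `.lxd.env` (`LXD_HOST`, `LXD_USER`, `LXD_KEY`) are assumptions, not part of the contract.

```shell
# Minimal local-side remote-control wrapper sketch.
lxd_remote() {
  . ./.lxd.env || return 1
  # BatchMode forbids interactive prompts, so failures exit non-zero
  # instead of hanging on a password prompt.
  ssh -o BatchMode=yes -i "$LXD_KEY" "$LXD_USER@$LXD_HOST" "$@"
}

heartbeat() {
  # Any failing probe propagates a non-zero status to the caller.
  lxd_remote 'hostname && whoami && uptime && df -h /'
}
```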
WS2: Multi-agent operating model
Outcome
arch can delegate infrastructure operations to infra and CI/CD
execution to runner, then collect structured status back without role
ambiguity.
Tasks
- Define the role-responsibility contract for `arch`/`infra`/`runner`, including the explicit no-product-repo boundary for `infra`.
- Define the contract for remote job invocation:
  - target role (`infra` or `runner`)
  - job id
  - working directory
  - command bundle
  - expected artifacts/log paths
- Define the report format (single machine-readable summary + human note) including `agent_role`, `commit_sha`, and `artifact_paths`.
- Add guardrails:
  - one remote job per lock file
  - timeout policy
  - cancellation behavior
- Define the minimum required files for each agent-role repository:
  - `AGENTS.md` with startup path and discovery-first checks
  - workstream board/handoff format for cross-session continuation
- Smoke test one `infra` maintenance command through `build-scripts/lxd_remote.sh` and log evidence.
Acceptance
- The remote execution contract documents all three agent roles and their boundaries.
- `infra` instructions explicitly avoid assuming existing host/VM resources.
- At least one `infra` smoke task is executed with evidence.
WS2 Contract Draft (v1)
Role Responsibility Contract
- `arch` (Architecture Agent, this repo):
  - prepares the job request and selects the target role
  - may edit product code
  - collects reports and updates coordination docs
- `infra` (Infrastructure Agent):
  - manages host/VM lifecycle and host maintenance only
  - must not edit product code repositories directly
  - must run discovery-first checks before mutating host state
- `runner` (Runner Agent inside VM):
  - executes CI/CD and hardware workflows in VM scope
  - may operate on product-repo clones/worktrees inside the VM
  - does not own the host virtualization lifecycle
Invocation Schema (machine-readable)
```json
{
  "schema_version": "ws2.invocation.v1",
  "job_id": "GOAL-001-WS2-0001",
  "target_role": "infra",
  "requested_by": "arch",
  "request_utc": "2026-02-12T15:30:00Z",
  "commit_sha": "<git sha or none>",
  "repo": "squeezelite-esp32",
  "workdir": "/home/codexsvc",
  "command": [
    "hostname",
    "whoami",
    "uptime"
  ],
  "timeout_sec": 600,
  "lock_key": "goal001-ws2-smoke",
  "expected_artifacts": [
    "test/build/log/lxd_ws2_smoke_YYYYMMDD.log"
  ],
  "notes": "WS2 role routing smoke test"
}
```
Required fields: `schema_version`, `job_id`, `target_role`, `requested_by`, `request_utc`, `command`, `timeout_sec`, `lock_key`.
Allowed `target_role` values: `infra`, `runner`.
Report Schema (machine-readable + human note)
```json
{
  "schema_version": "ws2.report.v1",
  "job_id": "GOAL-001-WS2-0001",
  "agent_role": "infra",
  "status": "success",
  "start_utc": "2026-02-12T15:30:10Z",
  "end_utc": "2026-02-12T15:30:12Z",
  "duration_sec": 2,
  "exit_code": 0,
  "lock_key": "goal001-ws2-smoke",
  "executor": {
    "hostname": "hpi5-2",
    "user": "codexsvc"
  },
  "commit_sha": "<git sha or none>",
  "artifact_paths": [
    "test/build/log/lxd_ws2_smoke_YYYYMMDD.log"
  ],
  "human_summary": "Infra-role smoke command completed successfully.",
  "next_action": "Proceed with role-specific AGENT repo skeletons."
}
```
Required fields: `schema_version`, `job_id`, `agent_role`, `status`, `start_utc`, `end_utc`, `duration_sec`, `exit_code`, `lock_key`, `human_summary`.
Allowed `status` values: `success`, `failed`, `timeout`, `cancelled`.
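A report consumer can enforce the required fields and allowed status values before acting on a report; a minimal sketch, using `python3` (a listed baseline dependency) for the JSON handling, with an illustrative function name:

```shell
# Sketch validator for ws2.report.v1 payloads: required keys present
# and status within the allowed set.
validate_ws2_report() {
  python3 - "$1" <<'PY'
import json, sys
required = ["schema_version", "job_id", "agent_role", "status",
            "start_utc", "end_utc", "duration_sec", "exit_code",
            "lock_key", "human_summary"]
doc = json.load(open(sys.argv[1]))
missing = [k for k in required if k not in doc]
if missing:
    sys.exit("missing fields: %s" % ",".join(missing))
if doc["status"] not in {"success", "failed", "timeout", "cancelled"}:
    sys.exit("bad status: %s" % doc["status"])
print("ok")
PY
}
```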
Guardrail Defaults
- Locking:
  - lock file path pattern: `/tmp/codex-<target_role>-<lock_key>.lock`
  - only one active job per `(target_role, lock_key)` tuple
- Timeout defaults:
  - `infra`: 600 seconds default, 3600 seconds max
  - `runner`: 1800 seconds default, 14400 seconds max
- Cancellation behavior:
  - a cancellation request writes `/tmp/codex-cancel-<job_id>.flag`
  - the executor checks for cancellation between command steps and returns status `cancelled`
- Non-zero exits:
  - any non-zero command exit returns report `status=failed`
  - partial outputs are still published in `artifact_paths`
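The cancellation and non-zero-exit rules can be sketched as an executor step loop; the function name and step representation are illustrative, while the flag path and status values come from the defaults above.

```shell
# Sketch: run job steps, honoring the cancellation flag between steps
# and mapping any non-zero exit to status=failed.
run_job_steps() {
  job_id="$1"; shift
  for step in "$@"; do
    if [ -f "/tmp/codex-cancel-${job_id}.flag" ]; then
      echo "status=cancelled"
      return 0
    fi
    sh -c "$step" || { echo "status=failed"; return 1; }
  done
  echo "status=success"
}
```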
Minimum Files For Each Agent-Role Repository
- `AGENTS.md`
  - startup read order
  - role boundaries
  - discovery-first checks
- `documentation/coordination/workstream_board.md`
  - owner/status/blocker/next-action table
- `documentation/coordination/handoff_log.md`
  - timestamped handoff lines with evidence pointers
- `scripts/`
  - executable wrappers for invocation, reporting, and locks
WS2 Evidence
- Smoke command log:
test/build/log/lxd_ws2_smoke_20260212.log
- Structured report:
test/build/log/lxd_ws2_report_20260212.json
WS3: Infrastructure agent bootstrap
Outcome
infra has a dedicated repository, baseline AGENTS.md, and
coordination artifacts so host/VM operations can run independently from
this product repository.
Tasks
- Create/initialize the `infra` repository.
- Add a role-specific `AGENTS.md` with discovery-first startup checks.
- Add `workstream_board.md` and `handoff_log.md` in the infra repo.
- Document the explicit boundary: `infra` must never clone `squeezelite-esp32`.
Acceptance
- Infra repo exists with required files and startup path.
- Role boundary rules are documented and testable.
WS3 Evidence
- Infra repository path:
/workspaces/codex-infra-agent
- Infra governance files:
- `/workspaces/codex-infra-agent/AGENTS.md`
- `/workspaces/codex-infra-agent/documentation/coordination/workstream_board.md`
- `/workspaces/codex-infra-agent/documentation/coordination/handoff_log.md`
WS4: Runner VM provisioning
Outcome
infra can provision the runner VM with deterministic naming and
baseline dependencies.
Tasks
- Provision the VM instance for `runner` workloads.
- Record VM identity (name/IP/image/resources) in infra tracking docs.
- Install baseline packages required for remote management.
Acceptance
- VM is created and reachable from host.
- Provisioning commands and outputs are captured in infra evidence logs.
WS5: Runner SSH reachability
Outcome
arch can reach the runner VM over SSH using non-interactive key auth.
Tasks
- Configure SSH service/user/key on runner VM.
- Validate non-interactive SSH from `arch` context to the runner VM.
- Capture heartbeat evidence (`hostname`, `whoami`, `uptime`, disk).
Acceptance
- Non-interactive SSH to the runner VM succeeds from `arch`.
- Failed SSH attempts return clear non-zero diagnostics.
WS6: Runner Codex authentication (operator-assisted)
Outcome
Runner VM Codex runtime is authenticated and ready for delegated tasks.
Tasks
- Execute Codex auth bootstrap on runner VM.
- Request operator assistance for interactive auth step when needed.
- Record completion evidence and residual access risks.
Acceptance
- Runner Codex auth is confirmed and timestamped.
- Operator-assist step is logged when used.
WS7: Runner agent bootstrap
Outcome
runner has its own repository with AGENT contract and reporting
format, ready to accept delegated CI/test tasks.
Tasks
- Initialize runner repository and baseline docs.
- Add `AGENTS.md`, workstream board, and handoff log.
- Define the runner reporting contract back to `arch`.
Acceptance
- Runner repo and governance docs exist.
- A smoke report from runner to arch is captured.
WS8: Multi-HUT hardware CI topology
Outcome
Runner-based CI can schedule and run hardware jobs across multiple attached boards in parallel with deterministic ownership.
Lane Boundary Rules
- Hardware job definitions stay out of mandatory upstream checks.
- Upstream jobs must succeed without any hardware runner availability.
- Hardware jobs are triggered from local lane and mapped back to same commit/branch context.
Topology (recommended)
- Runner VM runs orchestration:
- GitHub self-hosted runners (or local runner service)
- artifact store path
- health checks
- One runner label per HUT slot, e.g. `esp32-hut-01`, `esp32-hut-02`
- Each HUT slot has a stable serial identity: `/dev/serial/by-id/...`
- Job routing uses labels and lock files to prevent slot collision.
Acceptance
- At least two HUT slots can run independent jobs without resource conflict.
Required Runner Artifacts (WS8 completion evidence)
Produced in runner repository (by runner:codex on the runner VM):
- topology documentation describing labels + lock strategy
- per-slot lock wrapper (`flock`) and a self-test proving:
  - same-slot operations serialize
  - different-slot operations can run concurrently
- slot inventory evidence capturing `/dev/serial/by-id` visibility
- a local (ignored) slot-mapping config with at least two slots defined (`hut-01`, `hut-02`) mapping to stable `/dev/serial/by-id/...` paths
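The per-slot lock wrapper can be as small as the sketch below; the actual wrapper lives in the runner repository, and the lock directory here is an assumption for the sketch. `flock` on the same path serializes same-slot operations, while different slots use different lock files and run concurrently.

```shell
# Per-slot lock wrapper sketch using flock(1).
with_slot_lock() {
  slot="$1"; shift
  # -w 30: fail after 30 s instead of queueing indefinitely, so a
  # stuck job surfaces as a non-zero exit rather than a silent wait.
  flock -w 30 "/tmp/hut-${slot}.lock" "$@"
}

# Example:
# with_slot_lock hut-01 idf.py -p /dev/serial/by-id/... flash
```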
Current Status (WS8)
- Runner VM: two stable `/dev/serial/by-id/*` devices are visible.
- Evidence (runner repo): `runner_hil_topo_inventory_20260213_215705_utc.log`
- Evidence (runner repo): `runner_hil_topo_lock_selftest_20260213_215839_utc.log`
- GitLab: WS8 closeout pushed to `runner/runner-agent@c03a9a0`.
WS9: Hard power-cycle via Home Assistant relay
Outcome
Power control is scripted and lock-protected so recovery sequences are deterministic and safe.
Service Contract
- Inputs:
- relay entity id
- slot id
- on/off durations
- Safety:
  - per-slot lock (`/var/lock/hut-<id>.lock`)
  - max retry count
  - cooldown interval
- Output:
  - structured result (`ok`, `timeout`, `ha_error`)
Acceptance
- CI job can request power-cycle for a slot and receive deterministic status.
- Concurrent power-cycle requests for the same slot are serialized.
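The service contract above could be sketched as follows, using the Home Assistant REST API (`POST /api/services/switch/turn_on|turn_off` with a long-lived access token). The `HA_URL`/`HA_TOKEN` variables, entity id, and overridable lock directory are deployment-specific assumptions; retry/cooldown handling is omitted for brevity.

```shell
# Sketch: lock-protected power-cycle against Home Assistant.
ha_call() {
  # $1 = turn_on|turn_off, $2 = relay entity id
  curl -fsS -X POST \
    -H "Authorization: Bearer $HA_TOKEN" \
    -H "Content-Type: application/json" \
    -d "{\"entity_id\": \"$2\"}" \
    "$HA_URL/api/services/switch/$1" >/dev/null
}

power_cycle_slot() {
  slot="$1"; entity="$2"; off_sec="${3:-5}"
  lock="${HUT_LOCK_DIR:-/var/lock}/hut-${slot}.lock"
  (
    # Serialize concurrent power-cycle requests for the same slot.
    flock -w 30 9 || { echo "timeout"; exit 1; }
    ha_call turn_off "$entity" || { echo "ha_error"; exit 1; }
    sleep "$off_sec"
    ha_call turn_on "$entity" || { echo "ha_error"; exit 1; }
    echo "ok"
  ) 9>"$lock"
}
```

The single-word result on stdout (`ok`, `timeout`, `ha_error`) matches the structured-result contract, so a CI job can branch on it directly.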
WS10: Branch/agent parallelism model
Outcome
High parallel throughput with low conflict/overhead across agents.
Recommended Pattern
- Keep GitHub as source of truth (no mandatory local git server).
- Use local bare mirror/cache only as accelerator (optional).
- Spawn one worktree per agent/task: `~/workspace/wt/<task-id>`
- Run isolated build/test output folders per worktree.
- Standardize artifact naming by task id and commit hash.
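The worktree and artifact-naming steps above can be sketched in one helper; the function name, branch naming (`task/<task-id>`), and artifact-prefix format are illustrative assumptions.

```shell
# Sketch: one worktree per task, isolated build dir, artifact naming
# keyed by task id + commit SHA.
new_task_worktree() {
  task_id="$1"; base="${2:-$HOME/workspace/wt}"
  wt="$base/$task_id"
  git worktree add "$wt" -b "task/$task_id" >/dev/null || return 1
  # Isolated build output per worktree avoids workspace contamination.
  mkdir -p "$wt/build"
  sha=$(git -C "$wt" rev-parse --short HEAD)
  echo "artifact_prefix=${task_id}_${sha}"
}
```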
Seamless Lane Handoff
- Upstream lane publishes build artifacts keyed by commit SHA.
- Local lane pulls/uses artifact for that same SHA when possible.
- Hardware result summary references:
- commit SHA
- lane (`upstream` or `local-hil`)
- HUT slot id
- power-cycle count/recovery events
Acceptance
- Multiple agents can run concurrently without workspace contamination.
Achievable "Dream" Environment (pragmatic target)
Base VM image
- Ubuntu 24.04 LTS (LXD VM, not container, for better device/systemd behavior)
Core dependencies
- `git`, `openssh-client`, `curl`, `jq`, `python3`, `pip`
- Docker engine + buildx + compose plugin
- ESP-IDF toolchain usage via the project container (an `espressif/idf:release-v5.5` derivative)
- Optional observability: `tmux`, `htop`, `nvme-cli`/disk monitors
CI/CD strategy
- Primary CI remains GitHub-hosted + self-hosted hardware jobs.
- Optional local preflight pipeline on LXD for fast feedback before push.
- Trigger model:
- on push to feature branches: local preflight + optional hardware smoke
- on PR: gated hardware jobs by label/comment trigger
Why no mandatory local git server?
Local Git service adds operational overhead and split-source risk. Prefer:
- GitHub canonical origin
- optional local mirror for speed/caching only
Long-run autonomy target for role-based Codex
- `infra` has:
  - persistent auth
  - host/VM bootstrap scripts
  - health checks
  - clearly scoped sudo permissions
- `runner` has:
  - persistent auth
  - CI/CD task-runner scripts
  - artifact/report emission to `arch`
- `arch` can delegate operations to `infra`/`runner` and receive structured reports without direct manual host intervention.
- A human operator is only needed for:
- policy decisions
- hardware maintenance
- credential rotation
Deliverables Checklist
- SSH passthrough and remote command wrapper verified
- Multi-agent contract documented and smoke-tested
- Infrastructure agent repository + governance bootstrap completed
- Runner VM provisioned with reproducible baseline
- Architecture-to-runner SSH connectivity verified
- Runner Codex authentication completed (operator-assisted)
- Runner agent repository + governance bootstrap completed
- Multi-HUT runner topology implemented with label routing
- Home Assistant relay power-cycle lock and retry logic implemented
- Parallel worktree branch workflow documented and operational
- Closure notes moved to archive when complete