squeezelite-esp32/documentation/short-term/active/GOAL-001-lxd-codex-hardware-ci.md
2026-02-13 22:08:20 +00:00

GOAL-001: LXD Codex Orchestration + Multi-Hardware CI

Status: active
Owner: arch:codex
Last updated: 2026-02-12

Objective

Establish a durable workflow where:

  1. the Architecture Agent (arch) in this repository remains the code-authoring and coordination control plane,
  2. the Architecture Agent can drive an Infrastructure Agent (infra) for host/container/VM maintenance,
  3. the Architecture Agent can drive a Runner Agent (runner) inside the provisioned VM for CI/CD and hardware execution,
  4. the LXD VM becomes the operational base for hardware-aware CI, and
  5. multiple hardware-under-test (HUT) nodes can be flashed/tested in parallel with safe, locked power control.

Follow-On Goals

  • documentation/short-term/active/GOAL-002-hut-surface-first-test.md for first cross-slot HUT surfacing execution.
  • documentation/short-term/active/GOAL-003-agent-doc-lint-ci.md for documentation linting and gardening flow in CI/CD.

CI Environment Split (required contract)

There are two CI lanes and they must remain distinct:

  1. Upstream GitHub lane (no physical hardware):
    • compile, package, static checks, unit/integration tests that do not require physical devices
    • never assume serial ports, relay access, or USB-attached targets
  2. Local LXD hardware lane (physical HUT available):
    • flash/monitor, hardware integration tests, soak runs, power-cycle recovery flows

Seamless operation requirement:

  • hardware lane consumes the same commit SHA/artifacts produced by upstream lane (or rebuilds deterministically from same SHA),
  • result reporting flows back to shared status artifacts and short-term board,
  • no hidden environment-only behavior differences outside explicitly documented hardware dependencies.

Note:

  • CI lanes define where tests run.
  • Agent roles define which Codex agent controls which environment.
  • These are orthogonal and must not be conflated.

Agent Roles (required contract)

Short names (use these in board ownership and reports):

  1. arch: Architecture Agent (main engineering/control plane)
  2. infra: Infrastructure Agent (platform/host/VM operations)
  3. runner: Runner Agent (VM CI/CD + test execution)

Role responsibilities:

  1. arch
    • owns product-repo development, task coordination, and status reporting governance
    • delegates remote work to role-local Codex runtimes when available (infra:codex, runner:codex); SSH is treated as a transport layer, not as the execution model
    • mentors and improves infra/runner operating efficiency over time; must propose improvements when asked or when major inefficiencies degrade team throughput
    • governs documentation quality, tracking hygiene, and baseline environment decisions
    • keeps governance documentation enabling and operational, not artificially restrictive
  2. infra
    • manages host/container/VM provisioning and maintenance
    • never clones or edits this product repository
    • maintains its own repository and AGENT documentation
    • does not communicate directly with runner except when only local host commands can reach the VM
  3. runner
    • executes CI/CD and test workflows inside VM scope
    • reports structured artifacts/status back to arch
    • maintains its own repository and AGENT documentation

Owner Semantics (Prevent Drift)

Owner values encode where execution happens:

  1. <role>:codex
    • Codex is installed/authenticated in that role environment.
    • Coordination means prompting that remote Codex runtime to execute work in its role repository (not SSH-direct command execution).
  2. <role>:agent
    • The arch:codex control-plane is acting as a temporary proxy executor for that role (typically via SSH transport).
    • Allowed only for bootstrap/emergency.
    • Must be called out in handoff summaries as executor=ssh_direct.

Repository Boundary Contract (required target)

  1. Product repository (squeezelite-esp32) for arch.
  2. Infrastructure repository for infra.
  3. Runner orchestration repository for runner.

Required behavior:

  • each role has a dedicated repository
  • each remote role (infra, runner) has independent AGENTS.md, workstream board, and handoff log
  • strict execution boundary:
    • arch must never implement runner workstreams by modifying or running commands inside the runner repository on the development machine (example local path: /workspaces/codex-runner-agent)
    • runner workstreams are executed only by runner:codex inside the runner VM, with results reported back via runner evidence + commit SHA
  • cross-agent handoffs include commit SHA, agent role, status, and artifact/log pointers
  • optional local GitLab can be used as shared upstream for portability and document access when the operator enables it

Operator Touchpoints

  1. Operator assistance is explicitly expected for:
    • runner Codex interactive authentication
    • optional local GitLab setup/onboarding for agent repos
  2. arch should request operator help when these gates are reached instead of introducing brittle workarounds.

GitLab Credential Contract (ADHOC-20260212-02)

This contract standardizes secret location, auth mode, and verification for agent-repo bootstrap on git.lecsys.net.

Canonical secret root:

  • ${XDG_CONFIG_HOME:-$HOME/.config}/codex/credentials/gitlab/git.lecsys.net/

Expected role files:

  • codex.env, arch.env, infra.env, runner.env

Minimum variables per role file:

  • GITLAB_HOST
  • GITLAB_USER
  • GITLAB_PASSWORD (bootstrap-only)
  • GITLAB_PAT (required for non-interactive flows)

Required controls:

  1. permissions: root 0700, files 0600
  2. auth mode: PAT over HTTPS for automation; password only for bootstrap
  3. propagation: transfer role-local file only to its target runtime
  4. evidence hygiene: record command outcome, never raw secret values
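The controls above can be smoke-checked before any role file is sourced. A minimal sketch, assuming a `check_creds` helper name and output line that are illustrative, not a mandated interface:

```shell
#!/usr/bin/env bash
# Sketch: verify the credential-store controls (root 0700, files 0600)
# before sourcing a role env file. check_creds is an illustrative helper.
set -euo pipefail

CRED_ROOT="${XDG_CONFIG_HOME:-$HOME/.config}/codex/credentials/gitlab/git.lecsys.net"

check_creds() {
  local root="$1" role="$2" file mode
  file="$root/$role.env"
  [ -d "$root" ] || { echo "missing credential root: $root" >&2; return 1; }
  mode="$(stat -c '%a' "$root")"
  [ "$mode" = "700" ] || { echo "bad root mode $mode (want 700)" >&2; return 1; }
  mode="$(stat -c '%a' "$file")"
  [ "$mode" = "600" ] || { echo "bad file mode $mode (want 600)" >&2; return 1; }
  . "$file"
  # Evidence hygiene: record the outcome, never raw secret values.
  echo "loaded credentials role=$role host=${GITLAB_HOST:-unset}"
}

# Only run against the real store when it exists on this machine.
if [ -d "$CRED_ROOT" ]; then
  check_creds "$CRED_ROOT" "${1:-arch}"
fi
```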

Agent Repo Remote Bootstrap (ADHOC-20260212-03)

Bootstrapped remotes and default branch:

  • infra repo: http://git.lecsys.net/infra/infra-agent.git (main)
  • runner repo: http://git.lecsys.net/runner/runner-agent.git (main)

Bootstrap checkout contract:

  1. clone role repo to its canonical workspace path
  2. checkout main
  3. verify origin remote URL and branch tracking
  4. verify remote reachability with git ls-remote
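The four checkout steps can be expressed as one idempotent helper. A sketch (the `bootstrap_checkout` name and quiet flags are illustrative; the role URL and workspace path come from the bootstrap record above):

```shell
#!/usr/bin/env bash
# Sketch of the bootstrap checkout contract: clone, checkout main,
# verify origin + tracking, verify reachability with git ls-remote.
set -euo pipefail

bootstrap_checkout() {  # bootstrap_checkout <role_repo_url> <workspace>
  local url="$1" ws="$2" origin
  # 1. clone role repo to its canonical workspace path (skip if present)
  [ -d "$ws/.git" ] || git clone -q "$url" "$ws"
  # 2. checkout main
  git -C "$ws" checkout -q main
  # 3. verify origin remote URL and branch tracking
  origin="$(git -C "$ws" remote get-url origin)"
  [ "$origin" = "$url" ] || { echo "origin mismatch: $origin" >&2; return 1; }
  git -C "$ws" rev-parse --abbrev-ref '@{upstream}' >/dev/null
  # 4. verify remote reachability with git ls-remote
  git ls-remote --exit-code "$url" >/dev/null
  echo "bootstrap ok: $ws tracks $origin"
}

# e.g. bootstrap_checkout http://git.lecsys.net/infra/infra-agent.git \
#        /workspaces/codex-infra-agent
```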

Evidence locations:

  • /workspaces/codex-infra-agent/README.md
  • /workspaces/codex-runner-agent/README.md
  • documentation/short-term/coordination/handoff_log.md

Container Agent Pickup (first 15 minutes)

  1. Read:
    • documentation/short-term/coordination/workstream_board.md
    • documentation/short-term/coordination/handoff_log.md
  2. Claim one workstream by updating board owner/status.
  3. Execute only the claimed workstream scope.
  4. Log handoff with concrete next step before stopping.

Workstreams

  • WS12: Codex App Server Transport (evaluate -> prototype -> implement)
    • Objective: evaluate Codex App Server as a more structured transport than SSH for prompting runner:codex / infra:codex, especially for parallel multi-step threads and stable context reuse.
    • Mechanism lock (required):
      • WS12 must produce and maintain a single transport decision record in long-lived docs:
        • documentation/agents/remote_transport_lock.md
      • Once remote_transport_lock.md sets status: locked, follow-on WS12 work must implement that locked mechanism (no re-litigating transport per session).
      • SSH remains the break-glass fallback unless explicitly retired in the lock record.
    • Acceptance (v1):
      • Evaluation:
        • documented transport decision criteria and current status (evaluating/locked) in remote_transport_lock.md
      • Prototype:
        • a minimal proof-of-viability plan that can be executed by the appropriate role(s) without arch becoming the executor
      • Implementation:
        • documented invocation pattern(s) (no secrets) that preserve the repo boundary contract and keep thread_ref correlation usable across sessions and in parallel

Delegation Gate (required before remote-role execution)

Before executing any infra or runner workstream:

  1. assign concrete owner (infra:<id> or runner:<id>)
  2. verify role activation state is at least assigned
  3. log first role-context handoff entry
  4. record operator gate for the action:
    • operator_required=yes or operator_required=no
  5. record execution mechanism in the handoff summary:
    • executor=remote_codex (prompt remote Codex runtime), or
    • executor=ssh_direct (bootstrap/emergency only)

Handoff lines should include:

  • context: arch-local | infra-live | runner-live
  • action_type: scaffold | delegate | execute | replan
  • operator_required: yes | no
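A handoff line carrying those fields can be emitted mechanically so entries stay greppable. A sketch, assuming a key=value shape and field order that are illustrative rather than locked:

```shell
#!/usr/bin/env bash
# Sketch: emit a handoff line with the required fields plus the
# executor marker from the Delegation Gate. Field order is illustrative.
set -euo pipefail

handoff_line() {  # handoff_line <context> <action_type> <operator_required> <executor>
  printf '%s context=%s action_type=%s operator_required=%s executor=%s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    "$1" "$2" "$3" "$4"
}

# e.g. appended to documentation/short-term/coordination/handoff_log.md:
handoff_line runner-live delegate no remote_codex
```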

Ad-hoc request reflex:

  • ask user: Track this as a ticket? (yes/no)
  • if yes, create/update documentation/short-term/coordination/ad_hoc_ticket_queue.md
  • if no, continue and include ticket_tracking=declined in handoff summary

Evidence naming convention:

  • arch_*, infra_*, runner_*

WS1: Infrastructure SSH + Codex passthrough

Outcome

arch can reliably trigger/drive commands on the infra runtime over SSH.

Tasks

  1. Validate key-based SSH path and non-interactive command execution.
  2. Standardize connection variables in .lxd.env (host/user/key/path).
  3. Create a minimal remote control wrapper script (local side) for repeatable calls.
  4. Add heartbeat check command (hostname, uptime, whoami, disk availability).

Acceptance

  • One command from arch context can run remote shell on infra without manual interaction.
  • Failures return non-zero status and readable diagnostics.
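Tasks 2-4 above can be sketched as a small local-side wrapper. The `LXD_HOST`/`LXD_USER`/`LXD_KEY` names are assumptions for the `.lxd.env` keys pending WS1 standardization, and `SSH_CMD` is overridable so the wrapper can be exercised without a live host:

```shell
#!/usr/bin/env bash
# Sketch of the WS1 remote control wrapper + heartbeat check.
# LXD_HOST/LXD_USER/LXD_KEY are assumed names for the .lxd.env keys.
set -euo pipefail

if [ -f .lxd.env ]; then . ./.lxd.env; fi
LXD_HOST="${LXD_HOST:-infra.example}"   # placeholder host
LXD_USER="${LXD_USER:-codexsvc}"
LXD_KEY="${LXD_KEY:-$HOME/.ssh/id_ed25519}"
SSH_CMD="${SSH_CMD:-ssh -o BatchMode=yes -i $LXD_KEY}"

remote_run() {
  # Non-interactive remote command; non-zero status propagates to caller.
  $SSH_CMD "$LXD_USER@$LXD_HOST" "$@"
}

heartbeat() {
  # WS1 task 4: hostname, uptime, whoami, disk availability.
  remote_run 'hostname; uptime; whoami; df -h /'
}
```

Failures surface as the remote command's non-zero exit status, which satisfies the acceptance criterion without extra plumbing.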

WS2: Multi-agent operating model

Outcome

arch can delegate infrastructure operations to infra and CI/CD execution to runner, then collect structured status back without role ambiguity.

Tasks

  1. Define role-responsibility contract for arch/infra/runner including explicit no-product-repo boundary for infra.
  2. Define contract for remote job invocation:
    • target role (infra or runner)
    • job id
    • working directory
    • command bundle
    • expected artifacts/log paths
  3. Define report format (single machine-readable summary + human note) including agent_role, commit_sha, and artifact_paths.
  4. Add guardrails:
    • one remote job per lock file
    • timeout policy
    • cancellation behavior
  5. Define minimum required files for each agent-role repository:
    • AGENTS.md with startup path and discovery-first checks
    • workstream board/handoff format for cross-session continuation
  6. Smoke test one infra maintenance command through build-scripts/lxd_remote.sh and log evidence.

Acceptance

  • Remote execution contract documents all three agent roles and their boundaries.
  • infra instructions explicitly avoid assuming existing host/VM resources.
  • At least one infra smoke task is executed with evidence.

WS2 Contract Draft (v1)

Role Responsibility Contract

  1. arch (Architecture Agent, this repo):
    • prepares job request and selects target role
    • may edit product code
    • collects reports and updates coordination docs
  2. infra (Infrastructure Agent):
    • manages host/VM lifecycle and host maintenance only
    • must not edit product code repositories directly
    • must run discovery-first checks before mutating host state
  3. runner (Runner Agent inside VM):
    • executes CI/CD and hardware workflows in VM scope
    • may operate on product repo clones/worktrees inside VM
    • does not own host virtualization lifecycle

Invocation Schema (machine-readable)

{
  "schema_version": "ws2.invocation.v1",
  "job_id": "GOAL-001-WS2-0001",
  "target_role": "infra",
  "requested_by": "arch",
  "request_utc": "2026-02-12T15:30:00Z",
  "commit_sha": "<git sha or none>",
  "repo": "squeezelite-esp32",
  "workdir": "/home/codexsvc",
  "command": [
    "hostname",
    "whoami",
    "uptime"
  ],
  "timeout_sec": 600,
  "lock_key": "goal001-ws2-smoke",
  "expected_artifacts": [
    "test/build/log/lxd_ws2_smoke_YYYYMMDD.log"
  ],
  "notes": "WS2 role routing smoke test"
}

Required fields:

  • schema_version, job_id, target_role, requested_by, request_utc, command, timeout_sec, lock_key.

Allowed target_role values:

  • infra
  • runner
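The required-field list can be enforced with a preflight check before dispatch. The sketch below is a deliberately naive substring smoke check, not a JSON parser; on a provisioned runtime, jq from the core dependency list would be the proper tool:

```shell
#!/usr/bin/env bash
# Naive preflight for ws2.invocation.v1 requests: checks that every
# required key appears and that target_role is infra|runner. It assumes
# the pretty-printed spacing shown above; a real implementation would
# parse the JSON with jq.
set -euo pipefail

validate_invocation() {
  local json="$1" f
  for f in schema_version job_id target_role requested_by request_utc \
           command timeout_sec lock_key; do
    case "$json" in
      *"\"$f\""*) ;;
      *) echo "missing required field: $f" >&2; return 1 ;;
    esac
  done
  case "$json" in
    *'"target_role": "infra"'*|*'"target_role": "runner"'*) ;;
    *) echo "invalid target_role" >&2; return 1 ;;
  esac
  echo "invocation ok"
}
```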

Report Schema (machine-readable + human note)

{
  "schema_version": "ws2.report.v1",
  "job_id": "GOAL-001-WS2-0001",
  "agent_role": "infra",
  "status": "success",
  "start_utc": "2026-02-12T15:30:10Z",
  "end_utc": "2026-02-12T15:30:12Z",
  "duration_sec": 2,
  "exit_code": 0,
  "lock_key": "goal001-ws2-smoke",
  "executor": {
    "hostname": "hpi5-2",
    "user": "codexsvc"
  },
  "commit_sha": "<git sha or none>",
  "artifact_paths": [
    "test/build/log/lxd_ws2_smoke_YYYYMMDD.log"
  ],
  "human_summary": "Infra-role smoke command completed successfully.",
  "next_action": "Proceed with role-specific AGENT repo skeletons."
}

Required fields:

  • schema_version, job_id, agent_role, status, start_utc, end_utc, duration_sec, exit_code, lock_key, human_summary.

Allowed status values:

  • success
  • failed
  • timeout
  • cancelled

Guardrail Defaults

  1. Locking:
    • lock file path pattern: /tmp/codex-<target_role>-<lock_key>.lock
    • only one active job per (target_role, lock_key) tuple
  2. Timeout defaults:
    • infra: 600 seconds default, 3600 seconds max
    • runner: 1800 seconds default, 14400 seconds max
  3. Cancellation behavior:
    • cancellation request writes /tmp/codex-cancel-<job_id>.flag
    • executor checks cancellation between command steps and returns status cancelled
  4. Non-zero exits:
    • any non-zero command exit returns report status=failed
    • partial outputs are still published in artifact_paths
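Defaults 1-4 compose into a small executor skeleton. A sketch (the `run_step` helper and status lines are illustrative; lock and cancel paths follow the patterns above):

```shell
#!/usr/bin/env bash
# Sketch of the guardrail defaults: flock per (target_role, lock_key),
# per-step timeout, and a cancellation flag checked between steps.
set -euo pipefail

target_role="${TARGET_ROLE:-infra}"
lock_key="${LOCK_KEY:-goal001-ws2-smoke}"
job_id="${JOB_ID:-GOAL-001-WS2-0001}"
lock="/tmp/codex-${target_role}-${lock_key}.lock"
cancel_flag="/tmp/codex-cancel-${job_id}.flag"

run_step() {
  # Cancellation is honored between command steps (guardrail 3).
  if [ -e "$cancel_flag" ]; then
    echo "status=cancelled job_id=$job_id"
    exit 0
  fi
  # 600 s is the infra default timeout (guardrail 2); any non-zero
  # exit maps to status=failed (guardrail 4).
  timeout 600 "$@" || { echo "status=failed job_id=$job_id"; exit 1; }
}

# One active job per (target_role, lock_key) tuple (guardrail 1).
exec 9>"$lock"
flock -n 9 || { echo "status=failed reason=lock_held" >&2; exit 1; }

run_step uname
run_step date -u
echo "status=success job_id=$job_id"
```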

Minimum Files For Each Agent-Role Repository

  1. AGENTS.md
    • startup read order
    • role boundaries
    • discovery-first checks
  2. documentation/coordination/workstream_board.md
    • owner/status/blocker/next action table
  3. documentation/coordination/handoff_log.md
    • timestamped handoff lines with evidence pointers
  4. scripts/
    • executable wrappers for invocation, reporting, and locks

WS2 Evidence

  • Smoke command log:
    • test/build/log/lxd_ws2_smoke_20260212.log
  • Structured report:
    • test/build/log/lxd_ws2_report_20260212.json

WS3: Infrastructure agent bootstrap

Outcome

infra has a dedicated repository, baseline AGENTS.md, and coordination artifacts so host/VM operations can run independently from this product repository.

Tasks

  1. Create/initialize the infra repository.
  2. Add role-specific AGENTS.md with discovery-first startup checks.
  3. Add workstream_board.md and handoff_log.md in the infra repo.
  4. Document explicit boundary: infra must never clone squeezelite-esp32.

Acceptance

  • Infra repo exists with required files and startup path.
  • Role boundary rules are documented and testable.

WS3 Evidence

  • Infra repository path:
    • /workspaces/codex-infra-agent
  • Infra governance files:
    • /workspaces/codex-infra-agent/AGENTS.md
    • /workspaces/codex-infra-agent/documentation/coordination/workstream_board.md
    • /workspaces/codex-infra-agent/documentation/coordination/handoff_log.md

WS4: Runner VM provisioning

Outcome

infra can provision the runner VM with deterministic naming and baseline dependencies.

Tasks

  1. Provision VM instance for runner workloads.
  2. Record VM identity (name/IP/image/resources) in infra tracking docs.
  3. Install baseline packages required for remote management.

Acceptance

  • VM is created and reachable from host.
  • Provisioning commands and outputs are captured in infra evidence logs.

WS5: Runner SSH reachability

Outcome

arch can reach the runner VM over SSH using non-interactive key auth.

Tasks

  1. Configure SSH service/user/key on runner VM.
  2. Validate non-interactive SSH from arch context to runner VM.
  3. Capture heartbeat evidence (hostname, whoami, uptime, disk).

Acceptance

  • Non-interactive SSH to runner VM succeeds from arch.
  • Failed SSH attempts return clear non-zero diagnostics.

WS6: Runner Codex authentication (operator-assisted)

Outcome

Runner VM Codex runtime is authenticated and ready for delegated tasks.

Tasks

  1. Execute Codex auth bootstrap on runner VM.
  2. Request operator assistance for interactive auth step when needed.
  3. Record completion evidence and residual access risks.

Acceptance

  • Runner Codex auth is confirmed and timestamped.
  • Operator-assist step is logged when used.

WS7: Runner agent bootstrap

Outcome

runner has its own repository with AGENT contract and reporting format, ready to accept delegated CI/test tasks.

Tasks

  1. Initialize runner repository and baseline docs.
  2. Add AGENTS.md, workstream board, and handoff log.
  3. Define runner reporting contract back to arch.

Acceptance

  • Runner repo and governance docs exist.
  • A smoke report from runner to arch is captured.

WS8: Multi-HUT hardware CI topology

Outcome

Runner-based CI can schedule and run hardware jobs across multiple attached boards in parallel with deterministic ownership.

Lane Boundary Rules

  1. Hardware job definitions stay out of mandatory upstream checks.
  2. Upstream jobs must succeed without any hardware runner availability.
  3. Hardware jobs are triggered from local lane and mapped back to same commit/branch context.

Topology

  1. Runner VM runs orchestration:
    • GitHub self-hosted runners (or local runner service)
    • artifact store path
    • health checks
  2. One runner label per HUT slot, e.g.:
    • esp32-hut-01
    • esp32-hut-02
  3. Each HUT slot has stable serial identity:
    • /dev/serial/by-id/...
  4. Job routing uses labels and lock files to prevent slot collision.

Acceptance

  • At least two HUT slots can run independent jobs without resource conflict.

Required Runner Artifacts (WS8 completion evidence)

Produced in runner repository (by runner:codex on the runner VM):

  • topology documentation describing labels + lock strategy
  • per-slot lock wrapper (flock) and a self-test proving:
    • same-slot operations serialize
    • different-slot operations can run concurrently
  • slot inventory evidence capturing /dev/serial/by-id visibility
  • a local (ignored) slot mapping config with at least two slots defined (hut-01, hut-02) mapping to stable /dev/serial/by-id/...
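The lock wrapper plus self-test artifact can be as small as flock over a per-slot file. A sketch, with an illustrative lock directory and a timing-based self-test:

```shell
#!/usr/bin/env bash
# Sketch: per-slot flock wrapper and the WS8 self-test that same-slot
# operations serialize while different-slot operations run concurrently.
set -euo pipefail

LOCK_DIR="${LOCK_DIR:-/tmp}"   # /var/lock/hut-<id>.lock on the runner VM

with_slot() {
  local slot="$1"; shift
  flock "$LOCK_DIR/hut-$slot.lock" "$@"
}

# Same slot: two 2 s jobs must serialize (>= 4 s wall time).
start=$(date +%s)
with_slot 01 sleep 2 &
with_slot 01 sleep 2 &
wait
same_slot=$(( $(date +%s) - start ))

# Different slots: two 2 s jobs may overlap (~2 s wall time).
start=$(date +%s)
with_slot 01 sleep 2 &
with_slot 02 sleep 2 &
wait
cross_slot=$(( $(date +%s) - start ))

echo "same_slot=${same_slot}s cross_slot=${cross_slot}s"
```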

Current Status (WS8)

  • Runner VM: two stable /dev/serial/by-id/* devices are visible.
    • Evidence (runner repo): runner_hil_topo_inventory_20260213_215705_utc.log
    • Evidence (runner repo): runner_hil_topo_lock_selftest_20260213_215839_utc.log
  • GitLab: WS8 closeout pushed to runner/runner-agent@c03a9a0.

WS9: Hard power-cycle via Home Assistant relay

Outcome

Power control is scripted and lock-protected so recovery sequences are deterministic and safe.

Service Contract

  1. Inputs:
    • relay entity id
    • slot id
    • on/off durations
  2. Safety:
    • per-slot lock (/var/lock/hut-<id>.lock)
    • max retry count
    • cooldown interval
  3. Output:
    • structured result (ok, timeout, ha_error)
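The contract above can be sketched around the Home Assistant REST service API. `HA_URL`, `HA_TOKEN`, and the `ha_call` helper are assumptions about the deployment, and `ha_call` is overridable so the lock/retry logic can be exercised without a live HA instance; a production version would also map step timeouts to the `timeout` result:

```shell
#!/usr/bin/env bash
# Sketch of the WS9 power-cycle service: per-slot lock, bounded retries,
# cooldown, structured one-line result. ha_call wraps the assumed
# Home Assistant REST service endpoint; adjust to the real deployment.
set -euo pipefail

HA_URL="${HA_URL:-http://homeassistant.local:8123}"
LOCK_DIR="${LOCK_DIR:-/var/lock}"
MAX_RETRIES="${MAX_RETRIES:-3}"
COOLDOWN_SEC="${COOLDOWN_SEC:-5}"

ha_call() {  # ha_call <turn_on|turn_off> <entity_id>
  curl -fsS -X POST "$HA_URL/api/services/switch/$1" \
    -H "Authorization: Bearer ${HA_TOKEN:?}" \
    -H "Content-Type: application/json" \
    -d "{\"entity_id\": \"$2\"}" >/dev/null
}

power_cycle() {  # power_cycle <entity_id> <slot_id> <off_sec> <on_wait_sec>
  local entity="$1" slot="$2" off_sec="$3" on_wait="$4" try
  exec 8>"$LOCK_DIR/hut-$slot.lock"
  flock 8   # serialize concurrent requests for the same slot
  for try in $(seq 1 "$MAX_RETRIES"); do
    if ha_call turn_off "$entity" && sleep "$off_sec" &&
       ha_call turn_on "$entity" && sleep "$on_wait"; then
      echo "result=ok slot=$slot tries=$try"
      return 0
    fi
    sleep "$COOLDOWN_SEC"
  done
  echo "result=ha_error slot=$slot tries=$MAX_RETRIES"
  return 1
}
```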

Acceptance

  • CI job can request power-cycle for a slot and receive deterministic status.
  • Concurrent power-cycle requests for the same slot are serialized.

WS10: Branch/agent parallelism model

Outcome

High parallel throughput with low conflict/overhead across agents.

  1. Keep GitHub as source of truth (no mandatory local git server).
  2. Use local bare mirror/cache only as accelerator (optional).
  3. Spawn one worktree per agent/task:
    • ~/workspace/wt/<task-id>
  4. Run isolated build/test output folders per worktree.
  5. Standardize artifact naming by task id and commit hash.
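Items 3-5 can combine into one helper that spawns a task worktree with an isolated build directory and a SHA-keyed artifact prefix. A sketch with illustrative names:

```shell
#!/usr/bin/env bash
# Sketch: one worktree per agent/task with isolated build output and
# artifact naming keyed by task id + commit hash (WS10 items 3-5).
set -euo pipefail

spawn_task_worktree() {  # spawn_task_worktree <repo> <task_id> <wt_root>
  local repo="$1" task_id="$2" wt_root="$3" wt sha
  wt="$wt_root/$task_id"                      # ~/workspace/wt/<task-id>
  git -C "$repo" worktree add -q "$wt" HEAD   # detached, no branch contention
  mkdir -p "$wt/build"                        # isolated build/test output
  sha="$(git -C "$wt" rev-parse --short HEAD)"
  echo "artifact_prefix=${task_id}_${sha}"
}
```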

Seamless Lane Handoff

  1. Upstream lane publishes build artifacts keyed by commit SHA.
  2. Local lane pulls/uses artifact for that same SHA when possible.
  3. Hardware result summary references:
    • commit SHA
    • lane (upstream or local-hil)
    • HUT slot id
    • power-cycle count/recovery events
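The summary itself can then be a single greppable line. A sketch, with key names that are illustrative rather than locked:

```shell
#!/usr/bin/env bash
# Sketch: one-line hardware result summary carrying the references
# listed above (commit SHA, lane, HUT slot, power-cycle count).
set -euo pipefail

hil_summary() {  # hil_summary <sha> <lane> <slot> <power_cycles> <status>
  printf 'commit_sha=%s lane=%s hut_slot=%s power_cycles=%s status=%s\n' \
    "$1" "$2" "$3" "$4" "$5"
}

hil_summary 1a2b3c4 local-hil hut-01 2 success
```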

Acceptance

  • Multiple agents can run concurrently without workspace contamination.

Achievable "Dream" Environment (pragmatic target)

Base VM image

  • Ubuntu 24.04 LTS (LXD VM, not container, for better device/systemd behavior)

Core dependencies

  1. git, openssh-client, curl, jq, python3, pip
  2. Docker engine + buildx + compose plugin
  3. ESP-IDF toolchain usage via project container (espressif/idf:release-v5.5 derivative)
  4. Optional observability:
    • tmux, htop, nvme-cli/disk monitors

CI/CD strategy

  1. Primary CI remains GitHub-hosted + self-hosted hardware jobs.
  2. Optional local preflight pipeline on LXD for fast feedback before push.
  3. Trigger model:
    • on push to feature branches: local preflight + optional hardware smoke
    • on PR: gated hardware jobs by label/comment trigger

Why no mandatory local git server?

Local Git service adds operational overhead and split-source risk. Prefer:

  1. GitHub canonical origin
  2. optional local mirror for speed/caching only
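The optional mirror pattern keeps GitHub canonical while borrowing objects locally. A sketch (paths and URL arguments are placeholders; `git clone --reference-if-able` degrades to a plain clone when the mirror is absent):

```shell
#!/usr/bin/env bash
# Sketch: local bare mirror used purely as an accelerator. Clones
# borrow objects from the mirror but keep the canonical upstream
# as origin, avoiding split-source risk.
set -euo pipefail

refresh_mirror() {  # refresh_mirror <upstream_url> <mirror_path>
  local upstream="$1" mirror="$2"
  if [ -d "$mirror" ]; then
    git -C "$mirror" fetch -q --prune origin
  else
    git clone -q --mirror "$upstream" "$mirror"
  fi
}

clone_via_mirror() {  # clone_via_mirror <upstream_url> <mirror_path> <dest>
  git clone -q --reference-if-able "$2" "$1" "$3"
}
```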

Long-run autonomy target for role-based Codex

  1. infra has:
    • persistent auth
    • host/VM bootstrap scripts
    • health checks
    • clearly scoped sudo permissions
  2. runner has:
    • persistent auth
    • CI/CD task runner scripts
    • artifact/report emission to arch
  3. arch can delegate operations to infra/runner and receive structured reports without direct host-manual intervention.
  4. Human operator only needed for:
    • policy decisions
    • hardware maintenance
    • credential rotation

Deliverables Checklist

  • SSH passthrough and remote command wrapper verified
  • Multi-agent contract documented and smoke-tested
  • Infrastructure agent repository + governance bootstrap completed
  • Runner VM provisioned with reproducible baseline
  • Architecture-to-runner SSH connectivity verified
  • Runner Codex authentication completed (operator-assisted)
  • Runner agent repository + governance bootstrap completed
  • Multi-HUT runner topology implemented with label routing
  • Home Assistant relay power-cycle lock and retry logic implemented
  • Parallel worktree branch workflow documented and operational
  • Closure notes moved to archive when complete