In-shed build debugging¶
The publish-images workflow only runs on tag push to GitHub. When the
flow regresses (a stale --source-ref, a --platform mistake, a
buildx-driver compatibility break) the only feedback signal is a slow
CI failure. This page documents how to reproduce that flow locally
inside a shed so you can iterate in minutes instead of hours.
Why a shed (and not the host)?¶
The publish flow targets linux/arm64 rootfs images. Building those on
an Apple Silicon Mac through docker buildx --platform linux/arm64
works in theory, but the rest of the flow assumes a Linux host: the
shed CLI shells out to mkfs.ext4, mksquashfs, and other tools that
either don't exist or behave differently on macOS. Running everything
inside a Linux shed eliminates that drift; the only difference between
scripts/publish-images-local.sh and the GitHub Actions job is the
target registry.
The buildx network-host trick¶
docker buildx create defaults to the docker-container driver, which
runs BuildKit inside a container with its own network namespace. When
the build needs to reach a registry on the shed's loopback (the local
registry:2 we spin up to receive the push), the default driver
cannot: localhost:5050 inside BuildKit refers to BuildKit's own
container, not the shed.
The one-line fix is to create the builder with the network=host driver option (--driver-opt network=host), which asks the driver to share the shed's network namespace.
That option does two things:
- Sets --network host on the BuildKit container so it shares the shed's loopback with the registry.
- Boots BuildKit with --allow-insecure-entitlement=network.host, which is also needed for any RUN --network=host directives in the Dockerfile.
You can confirm both by running docker buildx inspect <name> and
looking for Driver Options: network="host" plus
BuildKit daemon flags: --allow-insecure-entitlement=network.host.
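The builder setup and the inspect check described above can be sketched as two small shell functions. This is a minimal sketch, not the script's actual code: the builder name publish-local is a hypothetical choice, and the two grep patterns are taken from the inspect output quoted above.

```shell
#!/usr/bin/env bash
# Sketch: create a buildx builder that shares the shed's network namespace.
# BUILDER_NAME is a hypothetical name chosen for this example.
BUILDER_NAME="${BUILDER_NAME:-publish-local}"

create_host_network_builder() {
  # With network=host, BuildKit runs in the shed's network namespace,
  # so localhost:5050 inside BuildKit is the shed's loopback.
  docker buildx create \
    --name "$BUILDER_NAME" \
    --driver docker-container \
    --driver-opt network=host \
    --use
}

verify_host_network_builder() {
  # Succeeds only if both the driver option and the daemon flag are reported.
  local info
  info="$(docker buildx inspect "$BUILDER_NAME")" || return 1
  printf '%s\n' "$info" | grep -q 'network="host"' &&
    printf '%s\n' "$info" | grep -q -- '--allow-insecure-entitlement=network.host'
}
```

Call create_host_network_builder once, then verify_host_network_builder; a non-zero status from the second call suggests the builder was created without the driver option.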
One-shot validation¶
The publish-flow simulation is wrapped up in
scripts/publish-images-local.sh. It boots a buildx builder and a
local registry:2, builds the shed-overlay initramfs, and then, for
each variant:
- Runs shed image build with --source-ref and --platform set exactly the way the publish workflow does.
- Asserts the on-disk manifest's io.shed.source-ref annotation matches the registry ref we're about to push to.
- Runs shed image push --local to stream the OCI layout to the in-shed registry.
- Fetches the manifest back from the registry over HTTP and asserts the annotation survived.
- For the first variant, runs shed image pull into a brand-new OCI store and re-checks the annotation; this is what a remote shed-server does when serving shed create foo --image base after a publish.
The exit code is the verdict: 0 for PASS, 1 for FAIL. The line
preceding FAIL: identifies which variant tripped.
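The registry-side check in the list above (fetch the manifest over HTTP, compare the annotation) might look roughly like the following. This is a sketch under assumptions: the repository and tag layout, the function names, and the error message are illustrative, and it assumes the annotation sits on an OCI image manifest. A JSON-aware tool such as jq would be more robust than grep and sed; they are used here only to keep the sketch dependency-free.

```shell
#!/usr/bin/env bash
# Sketch: fetch a manifest from the local registry and check one annotation.
# REGISTRY defaults to the loopback address used on this page.
REGISTRY="${REGISTRY:-localhost:5050}"

# Print the io.shed.source-ref value from a manifest read on stdin.
# Assumes the value contains no escaped double quotes.
extract_source_ref() {
  grep -o '"io\.shed\.source-ref"[[:space:]]*:[[:space:]]*"[^"]*"' |
    head -n 1 |
    sed -e 's/^"[^"]*"[[:space:]]*:[[:space:]]*"//' -e 's/"$//'
}

# Usage: assert_registry_source_ref <repo> <tag> <expected-ref>
assert_registry_source_ref() {
  local repo="$1" tag="$2" expected="$3" actual
  # GET /v2/<repo>/manifests/<tag> is the registry HTTP API's manifest endpoint.
  actual="$(curl -fsS \
    -H 'Accept: application/vnd.oci.image.manifest.v1+json' \
    "http://${REGISTRY}/v2/${repo}/manifests/${tag}" | extract_source_ref)"
  if [ "$actual" != "$expected" ]; then
    echo "source-ref mismatch for ${repo}:${tag}: got '${actual}', want '${expected}'" >&2
    return 1
  fi
}
```

The same extract_source_ref helper can be pointed at an on-disk manifest file to mirror the local check that runs before the push.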
Setup¶
- Boot a validation shed with enough disk for buildx + intermediate layers. The full image variant has docker + buildx pre-installed; --upper-size 40G is the minimum that comfortably fits a single variant build (a buildx builder image, registry:2, the ~3 GB intermediate ext4 rootfs, plus headroom). For all three variants in one run, use 80 GB.
- Open SSH to the shed:
- Neutralise the docker credential helper. The shed-agent inside the shed installs a credsStore that talks to the host over vsock; inside a validation shed there is no host to answer, so any docker pull / docker push hangs forever waiting for creds.
- Cross-build the in-VM binaries on the host and stage everything into the shed:
# On the host:
GOOS=linux GOARCH=arm64 go build -o vz/shed-agent ./cmd/shed-agent
GOOS=linux GOARCH=arm64 go build -o vz/shed-firstboot ./cmd/shed-firstboot
GOOS=linux GOARCH=arm64 go build -o /tmp/shed ./cmd/shed
scp -F /tmp/publish-test-ssh -r . shed-publish-test:/home/shed/work
scp -F /tmp/publish-test-ssh /tmp/shed shed-publish-test:/tmp/shed
ssh -F /tmp/publish-test-ssh shed-publish-test \
  'sudo install -m 0755 /tmp/shed /usr/local/bin/shed'
- Run the validation script inside the shed:
ssh -F /tmp/publish-test-ssh shed-publish-test \
  'cd /home/shed/work && ./scripts/publish-images-local.sh'
To exercise only one variant (faster on a small upper layer):
ssh -F /tmp/publish-test-ssh shed-publish-test \
  'cd /home/shed/work && VARIANTS=base ./scripts/publish-images-local.sh'
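For the credential-helper step above, one possible approach is to point the docker CLI at a throwaway config directory that has no credsStore entry. This is a sketch of that idea, not necessarily what the project does: DOCKER_CONFIG is the standard docker CLI variable for the config directory, while the function name and the use of a temporary directory are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Sketch: give the docker CLI a config that names no credential helper.
use_empty_docker_config() {
  # A fresh directory avoids touching the config that shed-agent installed.
  DOCKER_CONFIG="$(mktemp -d)" || return 1
  export DOCKER_CONFIG
  # An empty JSON object means no credsStore and no credHelpers.
  printf '{}\n' > "${DOCKER_CONFIG}/config.json"
}
```

The variable has to be set in the same shell session that later runs the validation script. Because buildx keeps its builder state under the docker config directory by default, set it before the builder is created rather than afterwards.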
Interpreting results¶
A PASS line means that every requested variant built successfully, the OCI
manifest written by shed image build carried the right
io.shed.source-ref annotation, the registry received the manifest
byte-for-byte, and a fresh shed image pull into a clean store still
sees the same annotation.
FAIL lines come in a few flavours:
- FAIL: base: local manifest source-ref mismatch ... - the bug PR #94 fixed has come back; shed image build is baking the wrong annotation into the manifest. Check --source-ref handling in cmd/shed/image.go and the OCI conversion path in internal/vmimage/convert.go.
- FAIL: base: registry-side source-ref mismatch ... - the on-disk manifest is right but shed image push is rewriting it on the way out. Check the push handler in internal/api/server.go and internal/vmimage/manager.go.
- FAIL: base: round-trip source-ref mismatch ... - the push succeeded but shed image pull is dropping or rewriting the annotation. Check the OCI ingest path in internal/vmimage/registry.go.
- FAIL: ... gzip: stdin: Input/output error or EXT4-fs error ... Remounting filesystem read-only - your shed ran out of disk in the writable upper layer. Delete the shed and create a new one with a larger --upper-size.
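When triaging a failed run, it can help to keep the full output and pull out only the FAIL lines with their preceding context. The wrapper below is a minimal sketch: the log path is a hypothetical choice, and it assumes FAIL: appears at the start of a line, as in the examples above.

```shell
#!/usr/bin/env bash
# Sketch: run the validation script, keep its output, and surface FAIL lines.
# LOG is a hypothetical location for the captured output.
LOG="${LOG:-/tmp/publish-images-local.log}"

# Print every FAIL: line together with the line that precedes it.
show_failures() {
  grep -B 1 '^FAIL:' "$1"
}

run_and_triage() {
  local status=0
  ./scripts/publish-images-local.sh >"$LOG" 2>&1 || status=$?
  if [ "$status" -eq 0 ]; then
    echo "PASS"
  else
    show_failures "$LOG"
  fi
  # Propagate the script's verdict: 0 for PASS, non-zero for FAIL.
  return "$status"
}
```

Run run_and_triage from /home/shed/work inside the shed; the complete log stays at the LOG path for closer inspection.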
Gotchas¶
- The buildx builder and registry:2 are torn down on script exit, even on failure. The OCI store at /tmp/publish-images-local-store/ is left in place so you can inspect blobs and manifests after a failed run.
- If you forget the network=host driver-opt, the symptom is dial tcp 127.0.0.1:5050: connect: connection refused during the shed image build phase - BuildKit reaches its own loopback, not the shed's.
- Docker daemon in the shed ships with "bridge": "none" in daemon.json, so docker run without --network host will not give the container network access. The script accounts for this by using --network host on the registry container as well.
- The 'full' image variant has docker + buildx, but it does NOT have Go installed. Cross-build the shed CLI on the host and scp it in rather than trying to go build inside the shed.
- Don't be tempted to point the script at a real ghcr.io path with a fake credential; the --local push path doesn't authenticate, but the assertion step does an unauthenticated HTTP GET on the manifest. Run it against a local registry:2 and only flip to ghcr.io once you actually want to publish.
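To reproduce the registry half of the setup by hand, for example while debugging a connection refused error, something like the following could be used. It is a sketch, not the script's own code: the container name is hypothetical, and REGISTRY_HTTP_ADDR is the registry image's environment override for its listen address, set here to the localhost:5050 address used on this page.

```shell
#!/usr/bin/env bash
# Sketch: start registry:2 on the shed's loopback and wait until it answers.
REGISTRY_ADDR="${REGISTRY_ADDR:-localhost:5050}"
REGISTRY_NAME="${REGISTRY_NAME:-publish-local-registry}"  # hypothetical name

start_local_registry() {
  # With "bridge": "none" in daemon.json, --network host is what gives
  # the container a usable network at all.
  docker run -d --rm \
    --name "$REGISTRY_NAME" \
    --network host \
    -e "REGISTRY_HTTP_ADDR=${REGISTRY_ADDR}" \
    registry:2
}

# Usage: wait_for_registry [max-tries]; polls once per second.
wait_for_registry() {
  local tries="${1:-20}"
  while [ "$tries" -gt 0 ]; do
    # GET /v2/ is the registry API's base endpoint; it succeeds once the server is up.
    curl -fsS "http://${REGISTRY_ADDR}/v2/" >/dev/null 2>&1 && return 0
    tries=$((tries - 1))
    sleep 1
  done
  return 1
}
```

If wait_for_registry succeeds from the shed but a build still fails with connection refused, the builder is the more likely culprit, since that points at BuildKit's own loopback rather than the registry.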