AI-Ready Homelab: Centralized IaC with Ansible & Docker

Most home labs are a patchwork of one-off scripts. A Bash file here, a manual docker compose up there, and a README note that says "run this first." It works until it doesn't. When something breaks at 11 p.m., you're left reconstructing what you did six months ago to get a service running.

There's a better way. Treat every service in your home lab the same: same deployment process, same security model, same observability. Not merely because it's elegant, but because consistency is what makes a home lab maintainable.

This post walks through how I built that system, and more importantly, why you should too.

What Is Infrastructure as Code?

Before getting into the specifics, it's worth establishing what infrastructure as code actually means in practice, because the term gets thrown around loosely.

The idea is simple: every configuration decision that describes your infrastructure is captured in a file, committed to version control, and applied by a machine rather than typed by hand. Instead of SSH-ing into a server and running commands, you write a playbook that describes the desired state, and a tool like Ansible makes it so.
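As a minimal sketch of what "desired state" means, the play below declares that a directory should exist with particular ownership and permissions, rather than running mkdir and chmod by hand. The host group `galaxy` and the path are hypothetical placeholders, not names taken from the actual setup.

```yaml
# Hypothetical example: the host group and path are placeholders.
- name: Describe desired state instead of typing commands
  hosts: galaxy
  become: true
  tasks:
    - name: Ensure the service directory exists
      ansible.builtin.file:
        path: /opt/example-service
        state: directory
        owner: root
        group: root
        mode: "0750"
```

Running it a second time reports no changes, because the directory already matches the description.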

This isn't just about automation. The deeper value is that your infrastructure becomes auditable, reproducible, and moveable. Anyone looking at your repository can understand exactly what a server does and why. Rebuilding a failed node is re-running a playbook, not piecing together notes from memory. Migrating a service to new hardware is changing one variable.

Done right, you should never have a server you're afraid to wipe.

Stop Writing Deployment Scripts Per Service

Here's the trap most people fall into: they write deployment logic inside each service's repository. Service A has its own deploy.sh. Service B has a different one. Service C has a GitHub Action. After a while, you have six services and six different deployment models. You fix a security bug in one and forget to apply it to the others.

The solution is a single, centralized pipeline repository that contains all the deployment logic for your entire cluster. Every service points at it. Every service benefits when it improves.

I call mine The Construct. It's a single Ansible roles library. When I bring a new service into the cluster, I don't write new deployment logic. I write a short configuration file that hands off to the same roles every other service uses.

When I fixed the secret handling to use RAM instead of disk, every service in the cluster got that fix on its next deploy. When I added automatic health checks, every service got health checks. One improvement, total coverage.

That's the leverage you get from centralizing.

The Architecture: Core & Satellite

The Construct follows a Core & Satellite model. The central Construct repository holds all the shared Ansible roles. Individual Code repos each contain a single deploy.yml that imports those roles and supplies service-specific configuration: the hostname, the project path, the secrets, the health check endpoints.

The Construct (shared roles, one place, one version)
└── roles/
    ├── docker-deploy            ← handles every Docker service
    ├── lxc-provision            ← provisions the host itself
    ├── nfs-mount                ← mounts storage
    ├── service-validation       ← health checks
    ├── mqtt-notify              ← telemetry
    ├── recordDeploymentMetadata ← audit trail
    └── image-prune              ← cleanup

The Code (e.g., Caliper)
└── deploy.yml                   ← ~50 lines of config, no logic
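To make the satellite side concrete, here is a rough sketch of what such a deploy.yml might look like. The role names come from the tree above, and mqtt_topic, envContent, and nfsMountPath appear in snippets later in this post. Everything else (the other variable names, the values, and the order and selection of roles) is an assumption for illustration.

```yaml
# deploy.yml in a Code repo: a sketch, not the actual file.
# Variable names other than mqtt_topic, envContent, and nfsMountPath
# are hypothetical, as are all of the values.
- name: Deploy caliper
  hosts: endor
  become: true
  vars:
    mqtt_topic: caliper
    projectPath: /opt/caliper            # hypothetical variable name
    nfsMountPath: /mnt/nfs               # only for services that use NFS
    envContent: "{{ caliper_env }}"      # hypothetical secret supplied by The Architect
    healthCheckUrls:                     # hypothetical variable name
      - "http://localhost:8080/health"
  roles:
    # The order and selection of roles here is a guess.
    - role: nfs-mount
    - role: docker-deploy
    - role: service-validation
    - role: recordDeploymentMetadata
    - role: image-prune
```

The point of the shape is that the file carries values only. Any conditional or task logic belongs in the shared roles.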

The Architect is the orchestrator, self-hosted and local with no cloud CI. When a deploy triggers, The Architect clones both The Construct and the Code repo into a workspace and hands control to Ansible. It also supplies the vault secrets (API keys, database passwords) that services need at runtime.

The execution targets live in The Galaxy: Alpine or Ubuntu LXC containers running on PVE nodes, each dedicated to a single service and named after a Star Wars planet. Ansible reaches them over SSH through the bastion host via ProxyJump. Nothing in the cluster is directly reachable from the management plane.
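One common way to express the bastion hop is through SSH arguments in the inventory. The sketch below assumes a YAML inventory; the group name, addresses, user, and bastion hostname are all hypothetical.

```yaml
# inventory.yml: hypothetical hosts, addresses, and user names.
all:
  children:
    galaxy:
      hosts:
        endor:
          ansible_host: 10.0.20.11
      vars:
        ansible_user: deploy
        # Route every SSH connection to these hosts through the bastion.
        ansible_ssh_common_args: "-o ProxyJump=deploy@bastion.example.internal"
```

With this in place, playbooks target `endor` as usual and SSH handles the jump transparently.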

Built to Move

The other discipline worth enforcing from day one: nothing in your infrastructure should be hard to rebuild or migrate.

This is where IaC pays off beyond just automation:

Hosts are code. Every node in The Galaxy is provisioned by the lxc-provision role. Its VMID, hostname, IP, OS template, and NFS configuration live in a playbook. Reprovisioning a failed node is one Ansible run.
The server holds a container. The repo holds the truth. Service configuration lives in Git, not in the server's filesystem. If the server disappears, the configuration doesn't.
Secrets are portable. They live in The Architect's variable groups, not on disk, not in the repo. Moving to a new Architect instance means exporting the vault. Nothing is trapped.
Every deploy is idempotent. Running the same playbook twice with no changes produces no changes. Re-deploying is always safe, which means rollbacks and re-runs are never scary.

If a PVE node failed today, I could rebuild every service it hosted by running playbooks against a new node. No tribal knowledge required. No reconstructing configuration from memory. The code is the documentation.
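As an illustration of "hosts are code", a node definition could be little more than a set of variables handed to the lxc-provision role. The variable names, values, and the choice to run the play from localhost are all assumptions here; the real role's interface may look different.

```yaml
# Sketch of a provisioning playbook. All variable names and values
# are hypothetical; only the role name comes from this post.
- name: Provision the endor node
  hosts: localhost
  gather_facts: false
  roles:
    - role: lxc-provision
      vars:
        lxcVmid: 211
        lxcHostname: endor
        lxcIpAddress: 10.0.20.11/24
        lxcOsTemplate: "local:vztmpl/alpine-template.tar.xz"
        lxcNfsMounts:
          - src: "nas.example.internal:/export/caliper"
            path: /mnt/nfs
```

Rebuilding on different hardware then means changing the target PVE node and re-running the same play.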

Aim for that. If you have a server you'd be afraid to wipe and rebuild, that's a problem to fix.

The Eight-Phase Deployment

Every service in the cluster runs through the same eight-phase sequence. Having a standard flow means there are no surprises when something goes wrong. You know exactly where in the pipeline a failure occurred.

1. Notify: MQTT status published as "Updating". Home Assistant sees this immediately.
2. NFS Validation: If the service uses network storage, the Black-Hole Check runs before anything else proceeds.
3. Alpine Dependencies: For Alpine nodes, rsync and nfs-utils are installed dynamically if needed.
4. Sync: The Code repo is rsynced to /opt/<service>/ on the node, excluding anything in .deployignore.
5. Secret Injection & Docker Up: The atomic block: write to RAM, launch the stack, wipe from RAM.
6. Health Checks: HTTP responses, DNS resolution, and database readiness (Postgres, MySQL, Redis) are probed against configured endpoints.
7. Audit Log: The Git SHA, deployer identity, branch, and a UTC timestamp are written to /etc/wizard/deployments/ on the node.
8. Image Prune & Final Notify: Dangling Docker layers are removed, and a success event publishes to MQTT.
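To give a feel for what a few of these phases involve, here is a sketch of phases 4, 6, 7, and 8 as plain Ansible tasks. It assumes the ansible.posix and community.docker collections are installed. The variable names (workspacePath, serviceName, healthCheckUrl, gitSha, deployer, gitBranch), the JSON file format, and the file name are hypothetical, and the actual roles may use different modules.

```yaml
# Sketch of phases 4, 6, 7, and 8. All variable names are hypothetical.
- name: Sync the Code repo to the node (phase 4)
  ansible.posix.synchronize:
    src: "{{ workspacePath }}/"
    dest: "/opt/{{ serviceName }}/"
    rsync_opts:
      # The exclude file is read on the controller, where rsync starts.
      - "--exclude-from={{ workspacePath }}/.deployignore"

- name: Probe an HTTP health endpoint until it answers (phase 6)
  ansible.builtin.uri:
    url: "{{ healthCheckUrl }}"
    status_code: 200
  register: health
  until: health is succeeded
  retries: 10
  delay: 5

- name: Write the audit record (phase 7)
  ansible.builtin.copy:
    dest: "/etc/wizard/deployments/{{ serviceName }}.json"
    content: "{{ auditRecord | to_nice_json }}"
    mode: "0644"
  vars:
    auditRecord:
      sha: "{{ gitSha }}"
      deployer: "{{ deployer }}"
      branch: "{{ gitBranch }}"
      timestamp: "{{ now(utc=true, fmt='%Y-%m-%dT%H:%M:%SZ') }}"

- name: Remove dangling image layers (phase 8)
  community.docker.docker_prune:
    images: true
    images_filters:
      dangling: true
```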

Every phase runs inside block/rescue. If anything breaks, MQTT publishes a failure event before the error propagates. The broker always reflects current state.
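The block/rescue pattern can be sketched as follows. This assumes the community.general collection (and its paho-mqtt dependency) is available on the host that publishes, which is taken to be the controller here. The mqttBroker variable and payload wording are hypothetical, and the topics mirror the examples shown later in this post on the assumption that mqtt_topic holds the service name.

```yaml
# Sketch of the block/rescue wrapper. mqttBroker is a hypothetical
# variable; publishing from localhost is an assumption.
- name: Deploy with failure notification
  block:
    - name: Publish "Updating" status
      community.general.mqtt:
        server: "{{ mqttBroker }}"
        topic: "/ansible/{{ mqtt_topic }}"
        payload: "Updating"
      delegate_to: localhost

    - name: Run the deployment phases
      ansible.builtin.include_role:
        name: docker-deploy

  rescue:
    - name: Publish failure before the error propagates
      community.general.mqtt:
        server: "{{ mqttBroker }}"
        topic: "/deployment/{{ mqtt_topic }}"
        payload: "Failed to deploy {{ mqtt_topic }} on {{ inventory_hostname }}"
      delegate_to: localhost

    - name: Re-raise so the run is still marked as failed
      ansible.builtin.fail:
        msg: "Deployment of {{ mqtt_topic }} failed in task '{{ ansible_failed_task.name }}'"
```

Without the final fail task, a rescue block that completes cleanly would mark the play as successful, which is why the failure is raised again after the notification goes out.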

Deep Dive: Phase 5 & The Secret Handling Problem

Here's a specific problem worth solving deliberately: where do your secrets go during a deploy?

The obvious answer is a .env file. Copy it to the server, docker compose picks it up. It works. It's also a quiet security failure. That file sits on disk indefinitely, shows up in backups, and travels wherever your rsync does. Most home labs have a graveyard of .env files in /opt/ that nobody remembers putting there.

The better answer: secrets should never touch disk at all.

The Construct implements a RAM-only secrets model. When a service needs secrets injected at deploy time, the docker-deploy role writes the .env content to /dev/shm (a tmpfs mount backed by memory rather than disk, as long as the host has no swap to page it out to), launches the stack, then immediately deletes the file in an always block that runs whether the deploy succeeded or failed.

always:
  - name: SECURE CLEANUP - Remove .env from RAM
    ansible.builtin.file:
      path: "/dev/shm/{{ mqtt_topic }}.env"
      state: absent
    when: envContent | default('') != ''

The file never touches the physical disk of the Galaxy node. If the deploy crashes mid-run, the next run sweeps orphaned files before doing anything else. Ansible logs never capture the secret content because no_log: true suppresses the task output entirely.
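For context, the cleanup task above could sit at the end of a block like the one sketched here. The projectPath variable is hypothetical, invoking Compose through the command module is an assumption about how the role launches the stack, and the sweep shown only covers the current service's file.

```yaml
# Sketch of the atomic block in phase 5. projectPath is a hypothetical
# variable; mqtt_topic and envContent appear in the cleanup task above.
- name: Sweep any orphaned env file left by a crashed run
  ansible.builtin.file:
    path: "/dev/shm/{{ mqtt_topic }}.env"
    state: absent

- name: Inject secrets and start the stack
  block:
    - name: Write .env to RAM
      ansible.builtin.copy:
        content: "{{ envContent }}"
        dest: "/dev/shm/{{ mqtt_topic }}.env"
        mode: "0600"
      no_log: true   # keep the secret content out of Ansible's output

    - name: Launch the stack with the RAM-backed env file
      ansible.builtin.command:
        cmd: "docker compose --env-file /dev/shm/{{ mqtt_topic }}.env up -d"
        chdir: "{{ projectPath }}"

  always:
    - name: Remove .env from RAM
      ansible.builtin.file:
        path: "/dev/shm/{{ mqtt_topic }}.env"
        state: absent
```

Note that the command module always reports a change, so a role that needs clean idempotency reporting would have to handle the launch step differently.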

If you're wondering how containers stay authenticated once that file vanishes: docker compose up -d reads the .env file once, when it creates the containers, and sets those values in each container's environment. Once the stack is initialized, the containers hold the values themselves, so the pipeline can safely delete the file from /dev/shm without interrupting the running services.
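A Compose file that consumes values this way might look like the sketch below. The image name and variable names are hypothetical.

```yaml
# docker-compose.yml: a sketch with a hypothetical image and variables.
services:
  caliper:
    image: registry.example.internal/caliper:latest
    restart: unless-stopped
    environment:
      # Interpolated from the env file when `docker compose up` runs.
      DATABASE_URL: ${DATABASE_URL}
      API_KEY: ${API_KEY}
```

The env file is only needed again the next time Compose creates or recreates the containers, which is exactly when the pipeline writes it back to RAM.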

This is the kind of thing that's easy to implement once and hard to remember to implement in every one-off script. A centralized pipeline means you make the right call once and it applies everywhere.

The Black-Hole NFS Check

If your services use network storage, add this check to your pipeline. When an NFS mount fails silently, Docker writes to the local root filesystem instead, and you won't notice until something fills up.

The trap with typical directory checks (like Bash's [ -d /mnt/nfs ]) is that if a remote share drops or fails to mount, the local mount-point directory still exists on the host's root disk. It's just empty. A standard script sees the directory, assumes everything is fine, and lets Docker write gigabytes of data directly onto the host's root filesystem.

By using Ansible's stat module to compare device IDs (stat.dev), the pipeline can verify that the mount point sits on a different filesystem from the system root:

- name: Stat NFS mount point
  ansible.builtin.stat:
    path: "{{ nfsMountPath }}"
  register: mount_stat

- name: Stat system root
  ansible.builtin.stat:
    path: "/"
  register: root_stat

- name: Fail if NFS not mounted separately
  ansible.builtin.fail:
    msg: "NFS mount check failed: {{ nfsMountPath }} is on the root filesystem."
  when: mount_stat.stat.dev == root_stat.stat.dev

If dev matches root, the NFS share isn't mounted. The deployment aborts before anything runs. It's a one-time addition to your pipeline that protects every service that uses shared storage.

MQTT as the Telemetry Bus

Every deployment phase in The Construct publishes to the internal MQTT broker, the same broker handling sensor state and automation events throughout the home. If you're already running Home Assistant with an MQTT broker, this is worth wiring up.

The benefit is that deployment events flow into the same system you're already watching. Deployment state appears in dashboards alongside temperature and presence data. But it goes further than visibility.

Because the broker is a universal event bus, any service that speaks MQTT can react to deployment events. n8n (a self-hosted workflow automation tool) can subscribe to deployment topics and trigger follow-on processes automatically: running database migrations after a successful deploy, posting a notification to a channel, or kicking off integration tests. The Construct publishes the event; n8n decides what to do with it.

The other side is failure alerting. When a deployment breaks, the rescue block fires an MQTT event before re-raising the failure. That event routes to a personal push notification with the service name, the host, and the timestamp, so I know something went wrong without log diving.

/ansible/caliper      → "Updating - 05/22/2026 14:32:01"
/deployment/caliper   → "05/22/2026 14:32:44 - ✅ Deployment of caliper completed on endor"
/deployment/caliper   → "05/22/2026 14:33:12 - 🛑 Failed to deploy caliper on endor"

The same system that tells you a door was left open can tell you a deployment failed.
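On the Home Assistant side, surfacing those topics can be as simple as two MQTT sensors. This is a sketch assuming the topics shown above; the sensor names are hypothetical.

```yaml
# configuration.yaml (Home Assistant): sensor names are hypothetical.
mqtt:
  sensor:
    - name: "Caliper deploy status"
      state_topic: "/ansible/caliper"
    - name: "Caliper last deployment"
      state_topic: "/deployment/caliper"
```

Each sensor's state is the latest payload on its topic, so dashboards and automations can key off it like any other entity.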

Testing: Every Role Has Molecule

If you build a shared pipeline library, test it. Each core Ansible role powering these deployment phases has a Molecule test suite that runs on every push via Gitea Actions.
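A workflow for this could look roughly like the sketch below. Gitea Actions uses a GitHub Actions-compatible syntax, but the runner label, the Python version, the role path, and the assumption that the runner can reach a Docker daemon are all placeholders to adapt.

```yaml
# .gitea/workflows/molecule.yml: a sketch. The runner label and role
# path are hypothetical, and the runner needs Docker access.
name: molecule
on: [push]
jobs:
  docker-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Install Ansible and Molecule
        run: pip install ansible-core molecule "molecule-plugins[docker]"
      - name: Run the role's Molecule scenario
        run: molecule test
        working-directory: roles/docker-deploy
```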

Infrastructure code has an honest testing problem: you can't fully simulate a real deployment in CI. You can't SSH into a PVE node, mount a real NFS share, or reach the MQTT broker. Some things can only be validated against real hardware.

What Molecule catches is everything else, and that's still a lot. Idempotency violations (running the same role twice should produce no changes on the second run), incorrect conditionals, broken variable references, logic that works on Ubuntu but fails on Alpine. These are the errors that would otherwise surface mid-deployment against a live service. For a shared library that every service in the cluster depends on, catching them early is worth the test setup.
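A scenario that exercises both distributions and the idempotence check might be configured along these lines. It assumes the Docker driver from molecule-plugins; the platform names and image tags are hypothetical.

```yaml
# roles/docker-deploy/molecule/default/molecule.yml
# Sketch only: platform names and image tags are hypothetical.
driver:
  name: docker
platforms:
  - name: ubuntu
    image: ubuntu:24.04
  - name: alpine
    image: alpine:3.20
provisioner:
  name: ansible
verifier:
  name: ansible
scenario:
  test_sequence:
    - destroy
    - create
    - converge
    - idempotence   # converges a second time and fails on any change
    - verify
    - destroy
```

Listing both platforms in one scenario means a task that only works on Ubuntu fails the same run that passes on Alpine, or the reverse.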

The real-hardware gaps get validated the old-fashioned way: against actual nodes before merging anything that touches those code paths.

The One Rule That Doesn't Bend

Whatever pipeline you build, establish one non-negotiable: secrets never touch persistent storage. Not /opt/. Not /etc/. Not /tmp/. RAM only.

Disk is durable and RAM isn't. A reboot clears /dev/shm. A forensic investigation of the filesystem finds nothing. A backup of /opt/caliper/ contains no credentials. This rule is easy to implement once in a centralized pipeline and impossible to enforce consistently across a dozen one-off scripts.

Onboarding with an AI Agent

The Construct is well-documented, but documentation still requires someone to read and apply it correctly. To close that gap, I built a Claude Code skill that fully understands The Construct: the role catalog, the variable schema, The Architect's configuration model, and the naming conventions.

When bringing a new service into the cluster, the workflow is now: describe what the service does, whether it needs NFS, what endpoints it exposes, and what secrets it requires. The skill generates a complete deploy.yml from those answers, correctly wired to The Construct's roles. It also walks through exactly what to configure in The Architect (which repository to register, how to structure the variable group, how to set up the template and inventory) so the first deploy runs without touching the documentation.

This works because the pipeline is consistent. Because every service is configured the same way, an agent that understands the pattern can onboard any service that follows it. The centralization that makes The Construct maintainable is the same thing that makes it teachable to an AI.

Build The Construct First

If there's one thing to take from this: create your centralized pipeline repo before you create your second service. Once deployment logic is scattered across individual repos, consolidating it is a project. Starting with a shared library means every service you add makes the whole system more valuable, not more complex.

Your pipeline should have one job: take a service from "code in a repo" to "running container on a node" in a repeatable, auditable, secure way. Everything else (observability, storage validation, health checks, audit trails) follows naturally once that foundation exists.

Build it once. Deploy everything with it.
