# Changelog

All notable changes to this project are documented here.

## [Unreleased]
## [v0.14.7] — Empirically-verified API key setup, hardening guide, metrics-server docs, regression guards
### Security / Documentation

- Rewrite API key setup after empirical verification against TrueNAS 25.10.1 — Prior docs told users to create the API key under Credentials > Local Users > root > API Keys, which ties the provider's audit trail to interactive root activity and can't be revoked without affecting root login. An earlier attempt (also in Unreleased) recommended a scoped-roles custom privilege with 13 roles instead; that recommendation was based on partial information and does not actually work, because the Talos ISO upload endpoint (`/_upload`) enforces the `SYS_ADMIN` account attribute on top of the role system, and `SYS_ADMIN` is granted only via `builtin_administrators` group membership. Replaced with the verified-working recipe: dedicated non-root user + `builtin_administrators` group membership. All doc surfaces updated (`README.md`, `AGENT.md`, `docs/truenas-setup.md`, `docs/quickstart.md`, `docs/getting-started.md`, `docs/index.md`, `docs/troubleshooting.md`, `llms-full.txt`, `.env.example`, `deploy/docker-compose.yaml`). The new `docs/truenas-setup.md#5-api-key` documents the empirical findings, including why scoped privileges alone don't work, with cross-links to the two upstream bug reports.
### Documentation

- New `docs/hardening.md` — Practical security hardening guide for the provider, organized as eight rungs from highest-feasibility-today to aspirational. Covers: dedicated non-root TrueNAS user (with the `builtin_administrators` requirement explained), API key rotation flow, scoped privilege caveats with cross-link to the upstream bug, network-level controls (management VLAN, firewall allow-list), secret storage for Kubernetes / Docker Compose / standalone, TLS hygiene, audit log + Prometheus alert ingestion, and per-zvol ZFS encryption. Includes a Mermaid threat model diagram, a `SecurityContext` snippet for Kubernetes, a `security_opt` snippet for Compose, a cosign verification one-liner, and a printable hardening checklist. Linked from `docs/truenas-setup.md` and added to mkdocs nav under Operations.
- Metrics Server guide for Talos clusters — New `docs/getting-started.md` Step 7 (plus pointer from Step 4) and matching blocks in `AGENT.md`, `llms-full.txt`, and `llms.txt` document the Talos-specific install recipe: cluster config patch `machine.kubelet.extraArgs.rotate-server-certificates: true` plus the `kubelet-serving-cert-approver` and `metrics-server` manifests delivered via Omni Extra Manifests. Covers both bootstrap-at-cluster-creation (preferred) and patch-existing-cluster paths. Follows the upstream Sidero guide.
- Grafana dashboard marketplace descriptions — New `deploy/observability/dashboards/README.md` with ready-to-submit entries (Name, Summary, Description, Panels, Tags, Required data sources) for the four bundled dashboards: Overview, VM Provisioning, TrueNAS API Performance, and Cleanup & Maintenance. Intended as the Description field when uploading to grafana.com/grafana/dashboards.
### Tools

- `scripts/verify-api-key-roles` — New Go probe that exercises every JSON-RPC method and the `/_upload` endpoint the provider calls, using an API key you supply, and prints a pass/fail matrix for the 13 recommended roles (or `FULL_ADMIN`). The probe creates and tears down a throw-away dataset, 1 MB test zvol, and a stopped test VM — no persistent state on success, no VMs are started, no existing data is touched. Lets operators verify a scoped privilege before assigning it to the provider. Cross-referenced from `docs/truenas-setup.md` and `docs/hardening.md`.
### Upstream bug reports

- `docs/upstream-bugs/truenas-role-recursion.md` (NEW) — TrueNAS 25.10.1 `middlewared/role.py:362-363` has no cycle detection in `RoleManager.roles_for_role()`. Saving a custom privilege with a meta-role (e.g. `FULL_ADMIN`, `READONLY_ADMIN`, `FILESYSTEM_FULL_CONTROL`) alongside its transitively-included child roles triggers `RecursionError: maximum recursion depth exceeded` on every subsequent `auth.login_*` call for any user bound to that privilege. Middleware restart doesn't fix it because the bad privilege is persisted in the config DB. Recovery requires editing the privilege via `midclt` from another admin account. Report includes full stack trace, minimal reproduction, proposed fix (visited-set guard), and user-side workarounds. File this upstream at iXsystems.
- `docs/upstream-bugs/truenas-upload-role-gap.md` (NEW) — The TrueNAS 25.10.1 `/_upload` HTTP endpoint ignores the `FILESYSTEM_DATA_WRITE` role and returns HTTP 403 unless the user is in `builtin_administrators`. Inconsistent with the JSON-RPC `filesystem.put` method, which the role is documented to cover. Report includes a pass/fail matrix showing every other filesystem operation authorized by the same roles succeeding for the same user, an `auth.me` diff between working (admin) and failing (scoped) users isolating `SYS_ADMIN` as the only differing attribute, a reproduction script path, and proposed fix options. File this upstream at iXsystems.
### Tests

- Regression guards — New tests pinning invariants that were found missing during the v0.14.3–v0.14.6 investigation: `TestCreateConfigPatch_AlwaysUsesPatchNameHelper` (AST walk failing any bare-string-literal `CreateConfigPatch` call), `TestStepCreateVM_WiresAllExpectedPatches` (4 patch kinds present in `stepCreateVM`), `TestDefaultExtensions_RequiredEntries` (iscsi-tools / util-linux-tools / qemu-guest-agent), `TestBuildOTLPExporters_ProtocolSelection` (4 cases covering gRPC/HTTP selection), `TestBuildOTLPExporters_UnsupportedProtocolFailsFast`, `TestBuildHTTPExporters_UsesSignalEndpointWiring` (source-grep that `signalEndpoint` is actually called), `TestChangelog_VersionEntriesUseBracketFormat` (release-workflow awk extractor compat), `TestChangelog_EveryVersionHasReferenceLink`, `TestEnvDefaults_SafetyCriticalSettings` (6 sub-cases: `PROVIDER_SINGLETON_ENABLED=true`, `TRUENAS_INSECURE_SKIP_VERIFY=false`, `OMNI_INSECURE_SKIP_VERIFY=false`, `OTEL_EXPORTER_OTLP_PROTOCOL=grpc`, `GRACEFUL_SHUTDOWN_TIMEOUT=30`, `MAX_ERROR_RECOVERIES=5`).
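A minimal sketch of the last of those guards, assuming a hypothetical `defaultFor` lookup standing in for however the provider actually exposes its compiled-in defaults:

```go
package config_test

import "testing"

// defaultFor is a stand-in for however the provider exposes its compiled-in
// defaults; in the real test this would be wired to the config package.
func defaultFor(key string) string { return compiledDefaults[key] }

var compiledDefaults = map[string]string{} // populated by the real config package

func TestEnvDefaults_SafetyCriticalSettings(t *testing.T) {
	want := map[string]string{
		"PROVIDER_SINGLETON_ENABLED":   "true",
		"TRUENAS_INSECURE_SKIP_VERIFY": "false",
		"OMNI_INSECURE_SKIP_VERIFY":    "false",
		"OTEL_EXPORTER_OTLP_PROTOCOL":  "grpc",
		"GRACEFUL_SHUTDOWN_TIMEOUT":    "30",
		"MAX_ERROR_RECOVERIES":         "5",
	}
	for key, expected := range want {
		if got := defaultFor(key); got != expected {
			t.Errorf("%s: default = %q, want %q", key, got, expected)
		}
	}
}
```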
### CI

- Release workflow asserts Dockerfile + image invariants — New "Verify image and Dockerfile invariants" step between the smoke test and the multi-arch push: asserts `Config.User == 65534:65534` on the built image (catches silent base-image default drift), and greps the `Dockerfile` for `COPY --chmod=0755` and `^USER 65534:65534` (catches refactor drift that the runtime smoke test alone wouldn't detect in isolation). Any drift fails the build before anything reaches GHCR.
## [v0.14.6] — Fix every storage-side gap that made Longhorn silently broken

Storage-side hardening release. Three independent bugs in v0.13.0–v0.14.5 left users with non-functional or silently-broken Longhorn deployments. This release fixes all three and adds the Talos-side operational config the `install-longhorn.sh` script used to apply, so `storage_disk_size: 100` in a MachineClass is now sufficient for a Longhorn-ready worker. Migration required for any existing cluster — see the end of this entry.
### Fixes

- Drop `maxSize: 0` from emitted `UserVolumeConfig` — The patch builder added in v0.14.3 emitted `maxSize: 0` intending "unbounded", but Talos parses 0 as a literal byte count and rejects the document with `UserVolumeConfig/longhorn: min size is greater than max size`. Any worker that received the patch was stuck at Talos `stage: 3` with `configuptodate: false` and never finished joining the cluster. Per the Talos v1.12 docs, the correct way to express "fill the disk" is to omit `maxSize` and rely on `grow: true`. Fixed in `buildUserVolumePatch`. Pinned by `TestBuildUserVolumePatch_SingleDisk_Longhorn` (now also asserts `maxSize` is absent from the YAML).
- Fix `CreateConfigPatch` name collision across `MachineRequest`s — The Omni SDK's `provision.Context.CreateConfigPatch(ctx, name, data)` uses the literal `name` as the resource ID and upserts on every call. Every MachineRequest reconciling with the same unqualified name (e.g. `"data-volumes"`) wrote to the SAME `ConfigPatchRequest` resource — last writer wins, and the other 5 of 6 machines silently went without their patch. Verified on a real cluster: 6 MachineRequests, 1 surviving `data-volumes` `ConfigPatchRequest` labeled for whichever request reconciled last. The same bug applied to the `nic-mtu` and `advertised-subnets` patches. Fixed by introducing a `patchName(kind, requestID)` helper (see the sketch after this list) and threading the request ID into all 4 call sites in `stepCreateVM`. Pinned by `TestPatchName_IncludesRequestID`, `TestPatchName_DistinctAcrossRequests`, `TestPatchName_DistinctAcrossKinds`.
- Auto-emit Longhorn operational patch when a disk is named `longhorn` — From v0.13.0 to v0.14.5 the provider attached the Longhorn data disk and (from v0.14.3) mounted it at `/var/mnt/longhorn`, but the Talos-side bits that make the node Longhorn-ready had to be applied by `scripts/install-longhorn.sh` — which most users either forgot to run or ran with the broken self-bind from v0.13.0–v0.14.2. The provider now emits a `longhorn-ops-<requestID>` patch alongside the `UserVolumeConfig` whenever any disk is named `longhorn` (set implicitly by `storage_disk_size`, explicitly by `additional_disks: [{name: longhorn, ...}]`). The patch loads the `iscsi_tcp` kernel module (without it, Longhorn iSCSI replica attachment fails and PVCs stay Pending forever), binds `/var/mnt/longhorn` → `/var/lib/longhorn` with `bind,rshared,rw` (without it, Longhorn writes replicas to Talos's ephemeral root partition — silent data loss on node replace), and sets `vm.overcommit_memory: "1"` (recommended for replica process stability). After v0.14.6, `helm install longhorn` is the only remaining user step. Pinned by 5 new test cases asserting source ≠ destination on the bind mount, the `rshared` option present, the `iscsi_tcp` module loaded, `vm.overcommit_memory=1` set, and the `LonghornVolumeName` constant equal to `"longhorn"`.
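A minimal sketch of the collision fix referenced above, assuming the helper simply joins the patch kind and the MachineRequest ID (the provider's real naming scheme may differ):

```go
package provisioner

import "fmt"

// patchName qualifies a config patch kind with the MachineRequest ID so that
// each request maps to its own ConfigPatchRequest resource instead of every
// request upserting the same one.
func patchName(kind, requestID string) string {
	return fmt.Sprintf("%s-%s", kind, requestID)
}
```

`stepCreateVM` would then pass something like `patchName("data-volumes", <request ID>)` to `CreateConfigPatch` instead of the bare literal.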
### Migration

Existing clusters provisioned on v0.13.0–v0.14.5 need cleanup before v0.14.6 starts emitting the new patches:

```bash
# 1. Delete the stuck data-volumes ConfigPatchRequest (collision artifact, has bad maxSize: 0)
omnictl delete configpatchrequest data-volumes --namespace=infra-provider

# 2. If you have a manual Longhorn patch (e.g. longhorn-data-disk), delete it —
#    the provider now emits an equivalent patch automatically.
#    Two UserVolumeConfigs both named "longhorn" applied to the same machine
#    will be rejected by Talos.
omnictl delete configpatch longhorn-data-disk

# 3. Reprovision worker VMs so they pick up the per-request patches and the
#    operational patch on first boot. Easiest path: scale the worker
#    MachineRequestSet down to 0 then back up to N.

# 4. After workers come back up, install/upgrade Longhorn via Helm.
#    scripts/install-longhorn.sh is now optional — it still works (the Talos
#    patch it applies is a superset of what the provider emits, so it's a
#    no-op merge), but the only step that matters going forward is the Helm
#    install itself.
helm install longhorn longhorn/longhorn -n longhorn-system --create-namespace \
  --set defaultSettings.defaultDataPath=/var/lib/longhorn
```
## [v0.14.5] — Fix Grafana Cloud OTLP 404s (for real this time) + run as uid 65534
### Fixes

- Fix OTLP 404s on Grafana Cloud (the v0.14.1 fix was wrong) — v0.14.1 claimed to honor `OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf` by forwarding `OTEL_EXPORTER_OTLP_ENDPOINT` through `otlptracehttp.WithEndpointURL(url)`, under the (incorrect) assumption that the SDK would append `/v1/traces`, `/v1/metrics`, `/v1/logs` to the path. It doesn't: `WithEndpointURL` in the Go OTEL SDK uses the URL path verbatim — it implements the `OTEL_EXPORTER_OTLP_TRACES_ENDPOINT` per-signal-URL semantic, not the `OTEL_EXPORTER_OTLP_ENDPOINT` base-URL semantic. So when a user set `OTEL_EXPORTER_OTLP_ENDPOINT=https://otlp-gateway-prod-us-east-3.grafana.net/otlp`, every OTLP request went to `.../otlp` (no signal suffix) and Grafana Cloud returned `404 Not Found`. Observed as repeating `failed to send logs to https://.../otlp: 404 Not Found` / `traces export: ... 404` lines with no telemetry reaching the gateway. Fixed by introducing `signalEndpoint(base, "/v1/<signal>")`, which appends the per-signal path before calling `WithEndpointURL`. Covered by `TestSignalEndpoint_AppendsPath` (6 cases including the Grafana Cloud base URL, trailing slash, host-only, root path, and invalid-URL fallback) and `TestSignalEndpoint_InvalidURL_PassesThrough`.
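A minimal sketch of what such a helper can look like using only the standard library; the real function may differ in details such as per-signal option plumbing:

```go
package telemetry

import (
	"net/url"
	"path"
)

// signalEndpoint appends the per-signal OTLP path (e.g. "/v1/traces") to the
// configured base endpoint before it is handed to WithEndpointURL. Values that
// fail to parse are passed through unchanged so the SDK surfaces its own error.
func signalEndpoint(base, signal string) string {
	u, err := url.Parse(base)
	if err != nil {
		return base
	}
	u.Path = path.Join(u.Path, signal)
	return u.String()
}
```

With the Grafana Cloud base URL, `signalEndpoint("https://otlp-gateway-prod-us-east-3.grafana.net/otlp", "/v1/traces")` yields `.../otlp/v1/traces`, which is what the gateway expects.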
### Behavior Changes

- Container runs as uid/gid 65534 (`nobody`) instead of 65532 — The Dockerfile now sets `USER 65534:65534` explicitly, overriding the distroless `:nonroot` tag's default uid 65532. On TrueNAS hosts, `nobody` is uid 65534 by default, so bind-mounted volumes from the host now align with the container user without needing a `chown`. Container-only installs (pure Docker Compose, Kubernetes PVCs) are unaffected as long as volume ownership matches 65534 (most default volume plugins create volumes owned by the container's uid). Manual migration may be required for existing deployments where volumes were pre-created and chown'd to 65532 (the old default): either `chown -R 65534:65534 <volume-path>` on the host, or override with `docker run --user 65532:65532` to keep the old behavior. The binary is statically linked Go — no username lookups — so the fact that uid 65534 has no `/etc/passwd` entry in the distroless image is harmless.
## [v0.14.4] — Fix container permission denied + add image smoke test; yank v0.14.3

v0.14.4 = v0.14.3 + permission-denied fix + pipeline smoke test. All of v0.14.3's fixes ship here (`UserVolumeConfig` auto-emission for additional disks, the `install-longhorn.sh` bind-mount correction) — the v0.14.3 release was yanked because its Docker image failed to start. Upgrading from v0.14.2 to v0.14.4 gives you every v0.14.3 fix plus a working binary. See the v0.14.3 entry below for the full storage fix details.
### Fixes

- Fix `exec: permission denied` on container startup (v0.14.1–v0.14.3 images are broken) — The parallelize-builds refactor in v0.14.1 introduced a silent regression: `actions/upload-artifact@v4` packages files as ZIP and strips the execute bit on upload; `actions/download-artifact@v4` restores them without `+x`. The Dockerfile's `COPY` then preserved the zero-permission file, so every Docker image published for v0.14.1, v0.14.2, and v0.14.3 fails immediately on startup with `OCI runtime create failed: exec: "/usr/local/bin/omni-infra-provider-truenas": permission denied`. Two-part fix: (1) the Dockerfile now uses `COPY --chmod=0755` to set the execute bit at build time regardless of source file mode, and (2) the release workflow runs `chmod +x _out/omni-infra-provider-truenas-*` right after `download-artifact` so the signed binaries uploaded to the release page are also directly executable for users downloading them outside the container. Users on v0.14.1–v0.14.3 must upgrade to v0.14.4. Pinning to `v0.13.x` also works as a fallback (pre-regression), but v0.13.x is missing the v0.14.x fixes (WebSocket-only transport, Longhorn iscsi-tools extension, OTEL protocol honoring, boot-order fix, `UserVolumeConfig` auto-emission).
### CI

- Add image smoke test to the release pipeline — New step in the release workflow builds the image for `linux/amd64` into the local Docker daemon before the multi-arch push, then runs `docker run --rm smoke-test:<tag> --version` and asserts the output matches the tag. A broken binary (missing execute bit, corrupted cross-compile, failed `ldflags`) fails the workflow before anything reaches GHCR, cosign, or the GitHub release page. The multi-arch push only runs if the smoke test passes, so users cannot pull a broken image even transiently. Also adds a `--version` / `-v` / `version` flag to the CLI itself (prints the version and exits 0) — separate from `run()` so no Omni/TrueNAS config is required, which makes it safe to invoke from CI with no env.
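A minimal sketch of handling such a flag ahead of any configuration loading; the function name and structure here are illustrative, not the provider's actual code:

```go
package main

import (
	"fmt"
	"os"
)

// version is stamped at build time, e.g. via -ldflags "-X main.version=v0.14.4".
var version = "dev"

// handleVersionFlag prints the version and exits before any Omni or TrueNAS
// configuration is loaded, so CI can call it with an empty environment.
func handleVersionFlag() {
	for _, arg := range os.Args[1:] {
		switch arg {
		case "--version", "-v", "version":
			fmt.Println(version)
			os.Exit(0)
		}
	}
}
```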
## [v0.14.3] — Fix additional disks never reaching Talos (Longhorn was running on the root disk) — YANKED

⚠️ Yanked — do not use v0.14.3. The published Docker image fails at startup with `OCI runtime create failed: exec: "/usr/local/bin/omni-infra-provider-truenas": permission denied` because `actions/upload-artifact` stripped the execute bit from the compiled binary. The same regression affects the v0.14.1 and v0.14.2 images. The GitHub Release page for v0.14.3 has been removed; users on v0.14.3 (or v0.14.1/v0.14.2) should upgrade to v0.14.4, which carries all v0.14.3 fixes plus the permission-denied repair and a pipeline smoke test that prevents this class of regression.
### Fixes

- Auto-emit Talos `UserVolumeConfig` for additional disks — Setting `additional_disks` (or the `storage_disk_size` shorthand) attached the disk as a VM device on TrueNAS but never emitted the Talos config patch needed to format and mount it inside the guest. The disk showed up as a raw unformatted block device (`/dev/vdb`, `/dev/vdc`, ...) invisible to Longhorn, local-path-provisioner, and every other Kubernetes storage driver. Users had to apply a custom `UserVolumeConfig` patch manually for every MachineClass. Fixed by emitting a `UserVolumeConfig` patch per additional disk in `stepCreateVM` — filesystem `xfs` (default) or `ext4`, mounted at `/var/mnt/<name>`, with a CEL selector keyed to each zvol's exact byte size (±1 MiB tolerance for block alignment) so multiple same-sized disks assign 1:1 to volumes in discovery order (see the sketch after this list). Two new `AdditionalDisk` fields: `name` (defaults to `data-N`, 1-indexed) and `filesystem` (defaults to `xfs`). `storage_disk_size` expansion now auto-sets `name: longhorn` so the volume mounts at `/var/mnt/longhorn` to match Longhorn's `defaultDataPath`. Validation rejects duplicate volume names (two disks can't mount at the same path) and unknown filesystems. Added `TestBuildUserVolumePatch_*`, `TestStorageDiskSize_ExpandsWithLonghornVolumeName`, `TestAdditionalDisks_DefaultsFillNameAndFilesystem`, and three new validation tests.
- Fix `install-longhorn.sh` bind mount — Longhorn was silently running on the ephemeral root disk — The Talos config patch in `scripts/install-longhorn.sh` declared `source: /var/lib/longhorn` and `destination: /var/lib/longhorn`: a self-bind that was effectively a no-op. It exposed the path under Talos's read-only `/var` overlay without mounting the attached data disk, so Longhorn had been writing replica data to Talos's ephemeral root partition instead of the `storage_disk_size` zvol since v0.13.0. Every `storage_disk_size` zvol on every existing Longhorn cluster has been attached, unformatted, and unused for two releases. Fixed to `source: /var/mnt/longhorn` to bind the provider's now-auto-emitted `UserVolumeConfig` mount into the path Longhorn's pods expect. Combined with the `UserVolumeConfig` auto-emission above, new clusters provisioned on this release get Longhorn running on the intended data disk out of the box. Existing clusters need to re-run the script (idempotent — the config patch gets replaced) after reprovisioning their worker VMs on this release so the `UserVolumeConfig` mount exists before the bind references it. Migrating data off the ephemeral root is Longhorn's problem: drain replicas to new nodes, remove old nodes, rebalance.
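A minimal sketch of the size-window idea behind that selector; the exact CEL expression the provider emits may differ, and the function name is illustrative:

```go
package provisioner

import "fmt"

// diskSelector builds a match expression that allows ±1 MiB of block-alignment
// slack around a zvol's exact byte size, so same-sized disks pair 1:1 with
// volumes in discovery order.
func diskSelector(sizeBytes uint64) string {
	const slack = uint64(1 << 20) // 1 MiB
	return fmt.Sprintf("disk.size >= %du && disk.size <= %du", sizeBytes-slack, sizeBytes+slack)
}
```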
## [v0.14.2] — Fix UEFI boot order trapping Talos in halt_if_installed

### Fixes

- Boot order: root disk before CDROM — Provisioned VMs set CDROM `order=1000` and root disk `order=1001`, which in bhyve's UEFI boot manager means "CDROM first, disk second". The initial install worked because Talos installs from the ISO, reboots, and the disk then has a bootloader — but any subsequent reboot where UEFI re-entered the CDROM caused the VM to halt with `task haltIfInstalled: Talos is already installed to disk but booted from another media and talos.halt_if_installed kernel parameter is set`. Re-ordered to root disk `1000`, additional disks `1001+`, CDROM `1500`, NIC `2001` (see the sketch below). Now UEFI tries the disk first and only falls through to the CDROM on a fresh VM where the disk is empty. Added `TestBootOrder_DiskBeforeCDROM` to pin the invariant. Migration required for VMs provisioned on v0.14.1 or earlier — bump each CDROM's `order` from `1000` to `1500` (TrueNAS UI: VM → Devices → CDROM → Device Order; or `midclt call vm.device.update <id> '{"order": 1500}'`). New VMs provisioned on v0.14.2 and later are unaffected. See Troubleshooting and Upgrading.
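A minimal sketch of the new ordering as Go constants; the constant names are chosen here for illustration, the values are the ones listed above:

```go
package provisioner

// Device boot order passed to vm.device.create; bhyve's UEFI boot manager tries
// lower values first, so the root disk must sort ahead of the install CDROM.
const (
	bootOrderRootDisk        = 1000
	bootOrderAdditionalDisks = 1001 // 1001, 1002, ... one per extra disk
	bootOrderCDROM           = 1500
	bootOrderNIC             = 2001
)
```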
### Removed

- TrueNAS app catalog packaging — Deleted the `truenas-app/` directory (app.yaml, questions.yaml, ix_values.yaml, docker-compose template, migrations stub). The provider is no longer being submitted to the TrueNAS community apps catalog. Installation on TrueNAS is still supported via Apps > Discover > Install via YAML with the compose YAML documented in `README.md` and `docs/quickstart.md` — the removed files were only used for catalog-format submission. Affected doc language was updated from "TrueNAS App (Recommended)" to "Docker Compose on TrueNAS (Recommended)" in `README.md`, `docs/index.md`, `docs/quickstart.md`, `AGENT.md`, `llms.txt`, and `llms-full.txt`. The bug report template's deployment-method field was updated accordingly.
### Documentation

- New control plane sizing guide (`docs/sizing.md`) — When to bump CP VM resources, with concrete observable triggers (apiserver p99 > 1s, etcd `apply request took too long` warnings, kube-apiserver OOMKilled, `kubectl top` CPU/mem > 70% sustained, etcd DB > 2 GiB, heavy operator installs like ArgoCD / Crossplane / service meshes). Includes a sizing table from homelab (2 vCPU / 2 GiB) up to 50+ node clusters, an HA rolling-replace procedure (drain → delete → scale up → repeat) with a Mermaid sequence diagram, single-CP in-place resize via `midclt`, and a note that etcd fsync latency is a ZFS/SLOG problem — bumping CPU/RAM won't fix it. Linked from `index.md`, `getting-started.md`, the `quickstart.md` MachineClass config table, and mkdocs nav under Operations.
### CI

- Restore Grafana dashboards + alert rules as release assets — The parallelize-builds refactor in v0.14.1 inadvertently dropped the dashboard bundling step added for v0.14.0 discoverability. Re-added: the release workflow now uploads `overview.json`, `provisioning.json`, `api-performance.json`, `cleanup.json`, a combined `grafana-dashboards.zip`, and `truenas-provider.rules.yml` as release assets on every tag. Users can grab them directly from the GitHub release page for import into Grafana Cloud / self-hosted.
## [v0.14.1] — Fix OTEL_EXPORTER_OTLP_PROTOCOL for Grafana Cloud

### Fixes

- Honor `OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf` — The `OTELProtocol` config field was declared but `initOTEL` only wired up the gRPC exporters, so setting `http/protobuf` silently fell back to gRPC. When users pointed `OTEL_EXPORTER_OTLP_ENDPOINT` at a Grafana Cloud OTLP gateway URL (`https://otlp-gateway-...grafana.net/otlp`), the gRPC name resolver rejected the `https://` scheme and logged `failed to upload metrics: exporter export timeout: rpc error: code = Unavailable desc = name resolver error: produced zero addresses` on repeat. Fixed by branching on `OTEL_EXPORTER_OTLP_PROTOCOL`: `grpc` (default) uses the existing gRPC exporters; `http/protobuf` (or `http`) uses the OTLP/HTTP exporters via `WithEndpointURL`, which accepts full URLs and appends `/v1/traces`, `/v1/metrics`, `/v1/logs` to the base path as the spec requires. Unknown protocol values now fail fast with a clear error instead of silently defaulting.
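A minimal sketch of that branch; `exporterKind` is an illustrative name, and the real code constructs the exporters rather than returning a label:

```go
package telemetry

import "fmt"

// exporterKind maps OTEL_EXPORTER_OTLP_PROTOCOL onto an exporter family:
// gRPC stays the default, http/protobuf selects the OTLP/HTTP exporters,
// and anything else fails fast instead of silently falling back to gRPC.
func exporterKind(protocol string) (string, error) {
	switch protocol {
	case "", "grpc":
		return "grpc", nil
	case "http", "http/protobuf":
		return "http", nil
	default:
		return "", fmt.Errorf("unsupported OTEL_EXPORTER_OTLP_PROTOCOL %q", protocol)
	}
}
```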
### Internal

- Update Grafana dashboard title assertions in `TestGrafanaDashboards_ValidJSON` to match the grafana.com-ready names shipped in v0.14.0.
- Add multi-size logo assets (128/256/512) for the grafana.com plugin catalog upload.
## [v0.14.0] — WebSocket-Only Transport, Longhorn Default

### Breaking / Behavior Changes

- Drop Unix socket transport — WebSocket + API key required — TrueNAS 25.10 removed implicit authentication on the `middlewared.sock` Unix socket. Every JSON-RPC call now returns `ENOTAUTHENTICATED` unless the client has authenticated first, which means the "zero-auth Unix socket" path is no longer possible. The transport auto-detection logic, the `socketTransport`, the `TRUENAS_SOCKET_PATH` env var, and the socket mount have all been removed. `TRUENAS_HOST` and `TRUENAS_API_KEY` are now required in all deployments. When running as a TrueNAS app, set `TRUENAS_HOST=localhost` and `TRUENAS_INSECURE_SKIP_VERIFY=true`.
### Features

- Console OTEL exporters (opt-in) — Set `OTEL_CONSOLE_EXPORT=true` to emit traces, metrics, and logs to stdout in addition to the configured gRPC endpoint. Off by default to avoid log spam in production. Traces and logs use pretty-printed JSON; metrics print every 60s. Useful for local debugging without wiring up a collector.
- Startup log includes TrueNAS host and TLS verify status — The `TrueNAS client connected` log line now shows `host=<truenas-host>` and `tls_verify=<bool>` to make misconfiguration easier to spot.
- Add `siderolabs/iscsi-tools` to default extensions — Longhorn (the default storage) uses iSCSI internally to attach replicas to pods. Previously users had to manually add `iscsi-tools` to their MachineClass `extensions` list or PVCs would sit in Pending forever. Now it's baked in alongside `qemu-guest-agent` and `util-linux-tools`.
- Longhorn install script loads the `iscsi_tcp` kernel module — `scripts/install-longhorn.sh` now includes `machine.kernel.modules: [iscsi_tcp]` in the Talos config patch. Required for Longhorn to establish iSCSI sessions between replicas and pods.
### Removed

- `socketTransport` implementation and all Unix-socket-specific code paths
- `TRUENAS_SOCKET_PATH` environment variable
- `SocketPath` field on `client.Config`
- Unix socket host mount from the TrueNAS app definition
- `siderolabs/nfs-utils` from default Talos extensions — the provider no longer manages NFS storage, so the NFS client is no longer needed in every VM. Users who want democratic-csi NFS mode or manual NFS mounts can add `siderolabs/nfs-utils` to their MachineClass `extensions` field.
### CI

- Parallelize release binary builds via matrix strategy — The release workflow now cross-compiles the four target platforms (`linux/amd64`, `linux/arm64`, `darwin/amd64`, `darwin/arm64`) on separate runners in parallel via GitHub Actions `strategy.matrix`, instead of sequentially on a single runner. Each matrix job uploads its binary as an artifact; the release job downloads all four before signing and publishing. Cuts wall-clock time on the build stage roughly 4x.
- Drop duplicate compile in release gate — The `test` job in `release.yaml` no longer runs `make build`. `go test` already compiles the packages, so the separate build step was pure duplication. Saves ~30s per release.
## [v0.13.2] — Fix Unix Socket Transport for TrueNAS 25.10 (SUPERSEDED — use v0.14.0+)
⚠️ KNOWN BROKEN. The Unix socket fix in v0.13.2 was incomplete. TrueNAS 25.10's middleware requires authentication on every JSON-RPC call, so the "zero-auth Unix socket" path is no longer viable. Upgrade to v0.14.0, which uses WebSocket with mandatory API key authentication.
### Bug Fixes

- Fix Unix socket transport for TrueNAS 25.10+ — TrueNAS 25.10 (Goldeye) changed the middleware Unix socket from raw JSON-RPC to JSON-RPC 2.0 over WebSocket. The provider now uses WebSocket-over-Unix with pure JSON-RPC 2.0 framing (no DDP handshake), matching `midclt`'s `JSONRPCClient`. Without this fix, the provider crash-loops with `invalid character 'H' looking for beginning of value` or `i/o timeout` when deployed as a TrueNAS app.
### CI

- Eliminate QEMU from Docker builds — The Dockerfile no longer compiles Go inside the container. Pre-built binaries from Go's native cross-compilation are `COPY`ed directly into distroless, removing the QEMU emulation bottleneck for arm64. Release builds that took 10+ minutes now complete in under 30 seconds.
### Housekeeping
- Remove unused raw JSON-RPC request/response types (superseded by WebSocket protocol)
- Add reconnect with exponential backoff to Unix socket transport (matches WebSocket transport behavior)
## [v0.13.1] — Grafana Cloud Observability
⚠️ Incompatible with TrueNAS SCALE 25.10 (Goldeye). Upgrade to v0.14.0 if you're on 25.10+.
### Features

- Grafana Cloud observability support — OTEL exporters now accept `OTEL_EXPORTER_OTLP_HEADERS` for authenticated endpoints (e.g., the Grafana Cloud OTLP gateway). The Pyroscope client supports `PYROSCOPE_BASIC_AUTH_USER` and `PYROSCOPE_BASIC_AUTH_PASSWORD` for Grafana Cloud Profiles. Both local dev stacks and Grafana Cloud work with the same provider binary — just different env vars.
### Housekeeping

- Reserve removed proto field `nfs_dataset_path` (field 10) to prevent accidental reuse
- Remove stale `configureStorage` and NFS panels from the Grafana provisioning dashboard
## [v0.13.0] — Multi-Disk VMs, Singleton Lease, Deterministic MACs, Circuit Breaker & Storage
⚠️ Incompatible with TrueNAS SCALE 25.10 (Goldeye). Upgrade to v0.14.0 if you're on 25.10+.
### Breaking / Behavior Changes

- Longhorn is now the only supported storage path — NFS auto-storage has been fully removed (see the Removed section below). Add a dedicated data disk via `storage_disk_size` in your MachineClass, then install Longhorn via Helm. See `docs/storage.md` for setup steps.
- Deterministic MAC addresses are now always on for additional NICs — the per-NIC `deterministic_mac` opt-in field on `additional_nics` has been removed. All NICs (primary and additional) now unconditionally receive a stable MAC derived from the machine request ID so DHCP reservations survive reprovisioning on every interface, not just the primary. Existing `MachineClass` configs with `deterministic_mac: true` still work (the field is ignored and surfaces the unknown-field warning); configs with `deterministic_mac: false` will start getting deterministic MACs on the next reprovision.
### Bug Fixes

- Drop `mtu` from NIC device create — TrueNAS 25.10 rejects `mtu` on `vm.device.create` with `[EINVAL] vm_device_create.attributes.NIC.mtu: Extra inputs are not permitted`, which blocked provisioning of any additional NIC whose MachineClass set an `mtu` value (typical for jumbo-frame storage networks). `NICConfig.MTU` is now ignored on the hypervisor call — MTU is still applied inside the guest via the existing MAC-matched Talos config patch (`buildMTUPatch`), which is the correct layer for it. Same shape as the v0.12.0 `vlan` attribute removal.
### Features

- Singleton enforcement via distributed lease — The provider now claims an exclusive lease on startup via annotations on the `infra.ProviderStatus` resource, preventing two processes with the same `PROVIDER_ID` from racing on VM creation, zvol creation, and ISO upload. The Omni SDK has no built-in leader election, so two instances with the same ID would both receive every `MachineRequest` and execute provisioning steps concurrently against TrueNAS — typically resulting in duplicate VM names, failed zvol creates, and half-provisioned machines. The lease fails fast when a fresh heartbeat is observed from another instance (surfacing duplicate-provider misconfigurations loudly) and takes over automatically when the prior holder is ungracefully killed and its heartbeat goes stale (default: 45s). Opt out via `PROVIDER_SINGLETON_ENABLED=false` for debugging or advanced sharding. Tunable via `PROVIDER_SINGLETON_REFRESH_INTERVAL` (default 15s) and `PROVIDER_SINGLETON_STALE_AFTER` (default 45s). See `docs/architecture.md#singleton-enforcement` and `docs/troubleshooting.md` for operational details. Kubernetes rolling deploys should use `strategy.type=Recreate` or `maxSurge=0` to avoid overlap windows.
- Additional disk support (multi-disk VMs) — Attach extra data disks beyond the root disk via `additional_disks` in MachineClass config. Each disk can target a different ZFS pool and independently toggle encryption. Enables dedicated etcd disks on fast SSD pools and bulk data disks on HDD pools, and is a prerequisite for node-local distributed storage (Longhorn). Max 16 additional disks per VM. Paths are tracked in protobuf state for automatic cleanup on deprovision.
- Additional disk resize — Additional disks grow automatically when the `size` in `additional_disks` config increases, matching the root disk resize behavior. Shrinking is prevented (ZFS limitation).
- `storage_disk_size` convenience field — New MachineClass schema field that adds a dedicated data disk for persistent storage (Longhorn). Setting `storage_disk_size: 100` is equivalent to `additional_disks: [{size: 100}]` but simpler in the Omni UI.
- MTU / jumbo frames for additional NICs — Optional `mtu` field on `additional_nics` items. Applied as a Talos machine config patch using MAC-based interface matching. Set to 9000 for jumbo frames on storage networks.
- Deterministic MAC addresses — All NICs (primary and additional) get a stable MAC derived from the machine request ID, so DHCP reservations survive reprovision (see the sketch after this list). Collision detection queries the same network segment before attaching.
- Node auto-replace circuit breaker — VMs stuck in ERROR state are automatically deprovisioned after exceeding `MAX_ERROR_RECOVERIES` (default: 5) consecutive failed recoveries. Omni's reconciliation loop then provisions a fresh replacement. Configurable via env var; set to `-1` to disable.
- Longhorn install script — `scripts/install-longhorn.sh <cluster>` is one-command Longhorn setup: applies the Talos config patch via omnictl, Helm-installs Longhorn, sets the default StorageClass, and verifies with a test PVC. Idempotent.
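The entry doesn't spell out the MAC derivation; a minimal sketch of one way a stable, locally-administered address can be derived from the request ID (the function name and hashing scheme are illustrative, not the provider's actual code):

```go
package provisioner

import (
	"crypto/sha256"
	"fmt"
	"net"
)

// deterministicMAC hashes the machine request ID and NIC index into a stable,
// locally-administered unicast MAC so DHCP reservations survive reprovisioning.
func deterministicMAC(requestID string, nicIndex int) string {
	sum := sha256.Sum256([]byte(fmt.Sprintf("%s/%d", requestID, nicIndex)))
	mac := make(net.HardwareAddr, 6)
	copy(mac, sum[:6])
	mac[0] = (mac[0] | 0x02) &^ 0x01 // locally administered, unicast
	return mac.String()
}
```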
### Observability

- Add `truenas.vms.auto_replaced` metric — counts VMs deprovisioned by the circuit breaker
- Add "VMs Auto-Replaced" stat panel to the provisioning Grafana dashboard
- Add `TrueNASVMAutoReplaced` Prometheus alert rule — fires when the circuit breaker triggers, severity: warning
### Removed

- Remove NFS auto-storage — The `configureStorage` provision step, the `auto_storage` MachineClass field, the `AUTO_STORAGE_ENABLED` / `NFS_HOST` env vars, the NFS client methods (`CreateNFSShare`, `GetNFSShareByPath`, `DeleteNFSShare`, `EnsureNFSService`, `SetDatasetPermissions`), the NFS config patch builder, and all related tests have been fully removed. NFS had too many issues in Kubernetes: networking complexity (port 2049 reachability, firewall rules), broad application incompatibility (PostgreSQL, Redis, Elasticsearch, and any WAL/Raft-based system corrupt data on NFS), no support for Kubernetes-native VolumeSnapshots, and the underlying provisioner (nfs-subdir-external-provisioner) has been unmaintained since 2022. Use Longhorn with `storage_disk_size` instead — it's self-contained, supports snapshots, and works in any network topology.
- Remove ZFS snapshot/rollback code — Talos nodes are immutable; the correct recovery path is to replace a failed VM (Omni reprovisions automatically), not to roll back a zvol. Removed: the `CreateSnapshot`, `ListSnapshots`, `DeleteSnapshot`, `RollbackSnapshot` client methods, the `snapshotBeforeUpgrade` and `enforceSnapshotRetention` provisioner logic, the `last_upgrade_snapshot` protobuf field, snapshot telemetry counters, and all related tests. The `Snapshot` type and the pre-upgrade snapshot workflow introduced in v0.6.0–v0.8.0 are fully removed.
### Documentation

- Rewrite storage guide (`docs/storage.md`) — Longhorn as the recommended default, NFS removed as a provider-managed option, democratic-csi as an advanced alternative
- Add Velero CSI snapshot integration to the backup guide (`docs/backup.md`) — VolumeSnapshotClass setup for Longhorn and democratic-csi, CSI Snapshot Data Movement for off-site S3
- Add disaster recovery runbook to the backup guide — 5 scenarios with step-by-step procedures and a recovery time table
- Add backup & disaster recovery guide (`docs/backup.md`) — control plane backup via Omni, workload/PVC backup via Velero to remote S3
- Add jumbo frames / MTU guide to the networking docs (`docs/networking.md`)
- Remove snapshot rollback documentation from the upgrading guide
## [v0.12.0] — VM Identity Fix, Per-Zvol Encryption, Health Endpoint & Hardening
### Bug Fixes

- Fix VM identity duplication — VMs now get a provider-generated SMBIOS UUID passed to `vm.create`, ensuring the bhyve UUID matches what the provider reports to Omni. Previously, bhyve assigned a random UUID, causing Talos to register as a separate machine and resulting in ghost "Provisioned/Waiting" entries alongside the real nodes.
- Fix pool free space reporting — now queries the root dataset (`pool.dataset.query`) for usable space that matches the TrueNAS UI, instead of raw pool stats that ignore ZFS overhead/parity/metadata.
- Fix ZFS encryption API compatibility — use `AES-256-GCM` (uppercase) and set `inherit_encryption: false` for TrueNAS 25.04+ compatibility.
- Fix `UserProperties` format — use list-of-objects (`[{key, value}]`) instead of a map for TrueNAS 25.10+ compatibility.
- Fix pool validation errors — suggest `dataset_prefix` when the user passes a dataset path as the pool name.
- Fix `checkExistingVM` — reset `CdromDeviceId` alongside `VmId` when the VM is deleted externally.
- Keep CDROM attached after provisioning — removing it required stopping the VM, which killed Talos mid-install. The CDROM is now cleaned up only on deprovision.
- Remove the `vlan` attribute from NIC device creation — TrueNAS 25.10 rejects VM-level VLAN tagging via `vm.device.create`. VLAN tagging is handled at the host level by attaching to VLAN interfaces (e.g., `vlan666`).
- Switch UUID generation from hand-rolled v4 to `google/uuid` v7.
- Fix orphan cleanup deleting all VMs after provider restart — replaced in-memory VM tracking (lost on restart) with TrueNAS state queries. Orphan VMs are now detected by checking whether their backing zvol (tagged with `org.omni:managed`) still exists; orphan zvols are detected by checking whether their corresponding VM still exists. No in-memory state is needed, so the check is safe across restarts.
### Features

- Add multiple NIC support via `additional_nics` in MachineClass config
- Add `advertised_subnets` config patch support — automatically generates and applies Talos machine config patches for etcd `advertisedSubnets` and kubelet `nodeIP.validSubnets` when set in MachineClass config
- Auto-detect the primary NIC subnet when `advertised_subnets` is not set but additional NICs are configured — queries TrueNAS `interface.query` for the primary NIC's IPv4 CIDR and applies the config patch automatically
- Add per-zvol auto-generated encryption passphrases — replaces the global `ENCRYPTION_PASSPHRASE` env var. Each encrypted zvol gets a unique cryptographically random passphrase stored as a ZFS user property (`org.omni:passphrase`), enabling auto-unlock after TrueNAS reboots without a shared secret.
- Add graceful VM shutdown on deprovision (ACPI signal with configurable timeout before force-stop)
- Add HTTP health endpoint (`/healthz`, `/readyz`) for Kubernetes liveness/readiness probes — verifies actual TrueNAS connectivity instead of just process liveness (see the sketch after this list). Configurable via `HEALTH_LISTEN_ADDR` (default: 8081)
- Add VM existence health check step — replaces the `removeCDROM` step with a `healthCheck` step that verifies VMs still exist on TrueNAS and resets state for re-provision if deleted externally
- Add TrueNAS version check at startup — fails with a clear error on versions below 25.04
- Add memory overcommit pre-check — blocks VMs requesting >80% of host RAM
- Add unknown field detection in MachineClass config — warns when unrecognized fields are present (typos, removed fields)
- Add `dataset_prefix` support for organizing VM storage under nested ZFS datasets
- Add `GetDatasetUserProperty()` client method for reading ZFS user properties
- Add CDROM swap logic for Talos version upgrades — note: currently non-functional because the Omni SDK does not re-run provision steps after a machine reaches the `PROVISIONED` stage (siderolabs/omni#2646)
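A minimal sketch of how such probes are commonly wired, assuming the readiness endpoint carries the TrueNAS connectivity check while liveness only reports process health; the `Pinger` interface and handler shape are illustrative, not the provider's actual code:

```go
package health

import (
	"context"
	"net/http"
	"time"
)

// Pinger is the minimal TrueNAS-client surface the probes need.
type Pinger interface {
	Ping(ctx context.Context) error
}

// Handler serves /healthz (process is up) and /readyz (TrueNAS is reachable),
// the endpoints the Kubernetes liveness and readiness probes point at.
func Handler(tn Pinger) http.Handler {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 5*time.Second)
		defer cancel()
		if err := tn.Ping(ctx); err != nil {
			http.Error(w, err.Error(), http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	return mux
}
```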
### Observability
- Add 17 new OTEL metrics: per-step provision/deprovision durations, error categorization, ISO cache hits/misses, cleanup counters, WebSocket reconnects, rate limit queue depth, graceful shutdown outcomes
- Add OTEL log-trace correlation via otelzap bridge (trace_id/span_id in structured logs)
- Split monolithic Grafana dashboard into 4 focused dashboards (overview, provisioning, API performance, cleanup)
- Add 4 new Prometheus alerting rules (health check failures, WebSocket reconnects, forced shutdowns, orphan VMs)
- Add Loki log aggregation config to observability stack
### Security & Hardening

- Pin Docker base images to SHA256 digest to prevent supply chain tag mutation
- Switch Docker runtime from Alpine to distroless/static-debian12 (no shell, smaller attack surface)
- Inject version into Docker image via build arg (was always "dev")
- Add OCI LABEL metadata (title, vendor, source, license)
- Add a `SecretString` type that redacts API keys from logs and fmt output (see the sketch after this list)
- Default `TRUENAS_INSECURE_SKIP_VERIFY` to `false` (was `true`)
- Add security comments to the TrueNAS app template and the Kubernetes secret manifest
- Replace the placeholder API key in `.env.test.example` with a non-secret value
- Add betterleaks secret scanning: pre-push hook, CI job with pinned version + checksum
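A minimal sketch of the redaction idea behind `SecretString`; the accessor name used here (`Reveal`) is illustrative, not necessarily the provider's API:

```go
package config

// SecretString wraps sensitive values such as API keys so that logging them,
// or formatting them with fmt, never leaks the secret.
type SecretString string

// String implements fmt.Stringer, which is what log encoders and %v/%s see.
func (s SecretString) String() string { return "[redacted]" }

// Reveal returns the underlying value for the one call site that needs it.
func (s SecretString) Reveal() string { return string(s) }
```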
### Deployment

- Replace the `pgrep` liveness probe with HTTP health checks in the Kubernetes deployment manifest
- Add a readiness probe to the Kubernetes deployment
- Remove `ENCRYPTION_PASSPHRASE` from env config, secrets, and deployment manifests
### Quality

- 314 tests (up from 196)
- Replace `go vet` + `gofmt` in CI with golangci-lint v2.11.4 via the official action
- Fix all golangci-lint v2 issues (errcheck, gocritic, gofmt, staticcheck, unused)
- Update `.golangci.yml` for v2 (`gofmt` moved to formatters, `gosimple` merged into `staticcheck`)
- Add protobuf compatibility test suite (`api/specs/compat_test.go`)
- Add config patch tests, unknown fields tests, VM lifecycle tests, step sequence tests, step integration tests
- Add WebSocket chaos tests (`internal/client/ws_chaos_test.go`)
- Add health endpoint tests (`internal/health/health_test.go`)
- Add E2E CI workflow (`.github/workflows/e2e.yaml`)
- Add UUID integration test verifying TrueNAS accepts and persists the `uuid` field on `vm.create`
- Add 27 cleanup tests, including an integration test with mixed active/orphan/non-omni resources and crash recovery scenarios
- Tune log levels (routine operations Info→Debug, NVRAM failures Warn→Error)
- Add `make scan` and `make setup-hooks` targets
### Upstream Discussions
- Opened discussion on pressure-based autoscaling patterns with infrastructure providers (siderolabs/omni#2647)
### Documentation & SEO

- Add multi-homing guide (`docs/multihoming.md`): Traefik with internal + DMZ subnets, MetalLB DMZ pool, firewall rules, DHCP reservations, storage network variation
- Add MkDocs Material docs site with GitHub Pages deployment
- Add CITATION.cff, FAQ page, FUNDING.yml
- Expand llms.txt and llms-full.txt with Q&A pairs for AI/answer engine optimization
- Add 7 GitHub topics (homelab, self-hosted, bare-metal, etc.)
- Backfill CHANGELOG.md with all releases from v0.1.0 through v0.10.0
- Restructure release workflow for immutable releases (single atomic upload, CHANGELOG.md-sourced notes)
## [v0.11.1] — Pool Validation, MAC Address Logging, Networking Guide

- Add `validatePool()` with clear errors for missing pools and dataset-path-as-pool mistakes
- Log the VM NIC MAC address after creation for DHCP reservation setup
- Add comprehensive networking guide (`docs/networking.md`): bridge setup, DHCP reservations (UniFi, pfSense, OPNsense, Mikrotik), MetalLB, VIP, VLAN isolation
- Add CNI selection guide (`docs/cni.md`): Flannel, Cilium, Calico with Talos-specific setup
- Add integration test CI feasibility analysis (`docs/integration-test-ci.md`)
- Update troubleshooting guide with "stuck on Provisioning" debug steps
- 196 tests
## [v0.10.0] — ZFS Encryption, Zvol Tagging & Supply Chain Hardening

- Add ZFS native AES-256-GCM encryption at rest for VM disks (`encrypted: true` in MachineClass)
- Add automatic unlock of encrypted zvols on provider restart
- Tag all provider-managed zvols with ZFS user properties (`org.omni:managed`, `org.omni:provider`, `org.omni:request-id`)
- Release pipeline now triggers only on manual tag push
- SBOM cryptographically attested to the Docker image digest
- Release binaries signed with cosign (`.sig` + `.cert`)
- SLSA provenance in Docker images
- 191 tests
## [v0.9.4] — Supply Chain Signing Fix
- Fix release pipeline to include SBOM attestation, binary signing, and SLSA provenance in a single workflow run
## [v0.9.3] — Supply Chain Hardening

- Attest the SBOM to the Docker image digest via `cosign attest`
- Sign all release binaries with cosign (`.sig` + `.cert` per binary)
- Add SLSA provenance metadata to Docker images via buildx
## [v0.9.2] — Docker Tag Fix

- Add `v`-prefixed Docker image tags alongside bare version tags (`v0.9.2` and `0.9.2`)
## [v0.9.1] — Container Image Signing & SBOM
- Sign all Docker images with cosign via Sigstore keyless signing (GitHub OIDC)
- Generate SPDX SBOM for every release, attached as release asset
## [v0.9.0] — Observability & Operations
- Add host health monitoring: CPU cores, memory, pool free/used space, pool health, disk count, running VMs (OTEL gauges every 30s)
- Add automatic pool selection — picks the healthy pool with the most free space when MachineClass doesn't specify one
- Add 7 Prometheus alerting rules (VM errors, API latency, pool space, pool health, no VMs, ISO slow, provision slow)
- Add 12-panel Grafana dashboard with auto-provisioning
- 179 tests (up from 147)
## [v0.8.0] — Talos Upgrade Orchestration & Documentation
- Add Talos upgrade orchestration and NVRAM recovery
- Add beginner getting-started tutorial (NAS to running cluster, no prior experience)
- Add upgrade guide, CNI selection guide, storage guide, networking guide
- Add comprehensive documentation, AI discoverability files (llms.txt, AGENT.md), and community health files
## [v0.7.0] — Production-Grade Test Suite
- Comprehensive QA overhaul with 147 tests and full E2E coverage
- Full provision/deprovision E2E against real TrueNAS hardware
- WebSocket auto-reconnect verified against real connection
- 8 TrueNAS API contract tests
- Chaos, failure injection, and load/stress tests
- Fix: `filesystem.stat` returns `realpath`, not `name`
## [v0.6.0] — Disk Resize
- Add disk resize support
- Add tests for extension merge (defaults only, custom additions, duplicates)
## [v0.5.0] — Rate Limiting & Pre-checks
- Add API rate limiting to prevent TrueNAS overload (default: 8 concurrent calls, configurable via `TRUENAS_MAX_CONCURRENT_CALLS`)
- Add resource pre-checks before provisioning (pool space validation)
- Add `SystemMemoryAvailable()` for future host memory checks
- 72 tests (up from 63)
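A minimal sketch of the classic buffered-channel semaphore this kind of concurrency cap is usually built on; the type and method names are illustrative, not the provider's actual code:

```go
package client

// limiter is a counting semaphore built on a buffered channel: each JSON-RPC
// call takes a slot before talking to TrueNAS and releases it when done,
// capping in-flight calls at TRUENAS_MAX_CONCURRENT_CALLS (default 8).
type limiter chan struct{}

func newLimiter(maxConcurrent int) limiter { return make(limiter, maxConcurrent) }

func (l limiter) acquire() { l <- struct{}{} }
func (l limiter) release() { <-l }
```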
## [v0.4.0] — Cleanup & Reliability
- Add background cleanup for stale ISOs and orphan VMs/zvols
- Add human-readable error mapping for TrueNAS API errors in Omni UI
- Wire cleanup loop into main with active resource tracking
- Add exported `MockClient` for cross-package testing
- 63 tests (up from 36)
## [v0.3.0] — WebSocket Reconnect & Graceful Shutdown
- Add WebSocket auto-reconnect on connection loss (exponential backoff, max 30s, 3 attempts)
- Add graceful shutdown on SIGTERM/SIGINT (10s drain timeout for in-flight API calls)
- Reduce cognitive complexity across main.go, ws.go, steps.go, deprovision.go
- Extract JSON-RPC method string literals into constants
- Add `Data.ApplyDefaults()` to centralize default value logic
- Update recommended MachineClass sizes (10 GiB control plane, 100 GiB worker)
## [v0.2.0] — Observability & Auto CDROM Removal
- Add OpenTelemetry tracing for every provision step and TrueNAS API call
- Add OpenTelemetry metrics (`truenas.vms.provisioned`, `truenas.provision.duration`, etc.)
- Add Pyroscope continuous profiling (CPU, memory, goroutine flame graphs)
- Add local dev observability stack (Grafana, Tempo, Prometheus, Pyroscope, OTEL Collector)
- Automatically detach ISO CDROM after Talos installs to disk (eliminates 7s GRUB delay)
- Add default storage extensions (`nfs-utils`, `util-linux-tools`) alongside `qemu-guest-agent`
## [v0.1.0] — Initial Release
- TrueNAS SCALE JSON-RPC 2.0 client with Unix socket and WebSocket transports
- 3-step provision flow: schematic generation, ISO upload, VM creation
- Deprovision with full cleanup (stop VM, delete VM, delete zvol)
- MachineClass config with per-class overrides (pool, NIC, boot method, arch)
- Default Talos extensions (qemu-guest-agent, nfs-utils, util-linux-tools)
- TrueNAS app packaging with custom questions.yaml
- CI/CD pipeline with GitHub Actions (test, lint, multi-arch Docker build, GitHub Release)
- Kubernetes and Docker Compose deployment manifests
- HOST-PASSTHROUGH CPU mode for full host CPU features
- ISO caching with SHA-256 deduplication
- 36 unit tests + 10 integration tests