# Troubleshooting
Common issues and their solutions when running the Omni TrueNAS provider.
## Startup Failures

### "TrueNAS API unreachable"
The provider cannot connect to TrueNAS on startup.
- Verify `TRUENAS_HOST` is reachable from the provider container: `curl -k https://<host>/websocket`
- Confirm `TRUENAS_API_KEY` is valid — generate one for a dedicated non-root user with scoped roles; see TrueNAS Setup > API Key for the setup and minimum role list
- If using a self-signed cert, ensure `TRUENAS_INSECURE_SKIP_VERIFY=true`
- When running the container on the TrueNAS host itself, set `TRUENAS_HOST=localhost` and `TRUENAS_INSECURE_SKIP_VERIFY=true`
- Check that the TrueNAS middleware is running: `midclt call core.ping` on the TrueNAS host
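The two connectivity checks above, as commands (the hostname is a placeholder):

```
# From the provider container: can we reach the TrueNAS WebSocket endpoint?
# -k skips certificate verification, as with a self-signed cert
curl -k https://truenas.example.com/websocket

# On the TrueNAS host itself: is the middleware answering?
midclt call core.ping
```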
### "pool not found on TrueNAS"
The configured DEFAULT_POOL or MachineClass pool doesn't exist.
- Common mistake: using a dataset path (e.g., `tank/my-vms` or `default/previewk8`) instead of the pool name (e.g., `tank` or `default`). The `pool` field must be a top-level ZFS pool, not a dataset.
- If you want VMs under an existing dataset, use `dataset_prefix`. For example, if your layout is `default/previewk8`, set `pool: "default"` and `dataset_prefix: "previewk8"`.
- List available pools: `zpool list` or `midclt call pool.query | jq '.[].name'` (on TrueNAS)
- Update `DEFAULT_POOL` or the MachineClass `pool` field to match an existing pool name
- Pool names are case-sensitive
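If your VMs should live under an existing dataset, the MachineClass provider config could look like this sketch (the `pool` and `dataset_prefix` field names come from this provider; the surrounding layout is illustrative):

```yaml
# Sketch: VM zvols land under default/previewk8/<vm-name>
pool: "default"             # top-level ZFS pool only, never "default/previewk8"
dataset_prefix: "previewk8" # existing dataset under the pool to nest VMs in
```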
### "network interface target not found"

The interface configured in DEFAULT_NETWORK_INTERFACE doesn't exist.

- List available choices: `midclt call vm.device.nic_attach_choices` (on TrueNAS)
- Common values: `br0`, `br100`, `vlan100`, `enp5s0`
- Bridge interfaces must be created manually in the TrueNAS UI under Network > Interfaces before use
### "OMNI_ENDPOINT is required"
The OMNI_ENDPOINT environment variable is not set.
- Set it to your Omni instance URL (e.g., `https://omni.example.com`)
- If using `.env`, make sure the file is in the working directory or mounted into the container
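A minimal `.env` sketch (the values are placeholders; the variable names are the ones used elsewhere on this page):

```
OMNI_ENDPOINT=https://omni.example.com
TRUENAS_HOST=truenas.local
TRUENAS_API_KEY=<api-key>
```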
### "singleton lease acquire failed" / "another provider instance holds the singleton lease"
Two processes are trying to run as the same PROVIDER_ID. The provider refuses
to start when it detects a fresh heartbeat from another instance because
running two provisioners in parallel causes races on VM creation, zvol
creation, and ISO upload.
Expected fields in the error:
- `instance %q` — the random UUID of the process that currently holds the lease
- `heartbeat N ago at ...` — how long ago the other instance last refreshed
- `provider %q` — your `PROVIDER_ID`
Diagnosis:
- Find the other instance. Match the `singleton_instance_id` log field across your running processes (`kubectl logs`, `journalctl -u`, `docker logs`, etc.). If no other process has that instance ID, it is almost certainly a stale pod that was `kill -9`'d before it could release.
- If the other instance is legitimate, stop it cleanly (`SIGTERM`) — the outgoing process clears the lease annotation so the successor can acquire immediately without waiting for `PROVIDER_SINGLETON_STALE_AFTER` to elapse.
- If the other instance was killed ungracefully and the heartbeat is frozen, the successor will take over automatically once `PROVIDER_SINGLETON_STALE_AFTER` (default 45s) passes.
Kubernetes rolling deploys: use strategy.type=Recreate or
strategy.rollingUpdate.{maxSurge: 0, maxUnavailable: 1} so the old pod is
fully terminated before the new one starts. With the default maxSurge=25%
strategy the new pod can start while the old pod is still in its
terminationGracePeriodSeconds, and the new pod will crashloop on the
preflight check.
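As a sketch, the relevant fragment of a Deployment manifest (the metadata names are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: omni-truenas-provider   # illustrative name
spec:
  replicas: 1
  strategy:
    type: Recreate   # old pod terminates fully before the new one starts
  # alternatively, keep RollingUpdate but forbid overlap:
  # strategy:
  #   type: RollingUpdate
  #   rollingUpdate:
  #     maxSurge: 0
  #     maxUnavailable: 1
```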
Debugging / advanced sharding: to bypass the check entirely, set
PROVIDER_SINGLETON_ENABLED=false. Only do this when you are certain that no
two instances are servicing the same provider ID — the provider will log a
warning on startup as a reminder.
## Provisioning Issues

### Omni shows "Provisioning" forever with no error
Omni's UI shows the machine stuck in "Provisioning" state but no error message. This happens because the provider retries failed steps every 60 seconds, and each retry clears the error briefly.
How to see the actual error:
- Check provider logs — the error is always logged.
- Check `MachineRequestStatus` via omnictl — catches the error between retries.
- Common causes:
    - Pool doesn't exist: `pool "previewk8" not found` — you specified a dataset name instead of a pool name. Use the top-level pool (e.g., `default`, `tank`), not a dataset path.
    - Network interface invalid: the bridge or VLAN doesn't exist on TrueNAS.
    - Pool full: no space for the zvol.
    - TrueNAS unreachable: WebSocket connection dropped.
The provider will keep retrying until the issue is fixed. Once you correct the MachineClass config (e.g., fix the pool name), the next retry will succeed automatically.
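For example, to surface the error (the container name is an assumption, and the exact `omnictl` resource name may differ by version):

```
# Tail provider logs for the retried error
docker logs -f omni-truenas-provider 2>&1 | grep -i error

# Inspect the MachineRequestStatus resource between retries
omnictl get machinerequeststatus -o yaml
```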
### VMs are created but don't join Omni
The VM boots but never appears in Omni.
- Check VM console in TrueNAS UI — is Talos booting? Look for kernel output.
- Network connectivity — the VM needs outbound internet access to reach Omni via SideroLink (WireGuard on port 443). Verify:
- The network interface target has internet access
- No firewall blocking outbound WireGuard traffic
- DNS resolution works from the VM's network
- Wrong boot method — if the VM shows a BIOS/UEFI shell instead of booting, try switching `boot_method` between `UEFI` and `BIOS`
- ISO not attached — check the VM devices in the TrueNAS UI. There should be a CDROM device with the Talos ISO.
### "schematic generation failed"
The provider failed to generate a Talos image schematic.
- Verify internet access from the provider container (it needs to reach `factory.talos.dev`)
- Check if a custom extension name is misspelled in the MachineClass `extensions` field
- Set `LOG_LEVEL=debug` for detailed error output
### ISO download hangs or fails
- The provider downloads ISOs from `factory.talos.dev` — ensure outbound HTTPS access
- Large ISOs (~100 MB) may take time on slow connections
- Check available disk space on the TrueNAS pool (ISOs are stored at `<pool>/talos-iso/`)
### VM creation succeeds but VM won't start
- Insufficient resources — TrueNAS needs enough free memory for the VM. Check `memory` in the MachineClass config vs. available RAM.
- zvol allocation — ensure the pool has enough free space for the `disk_size` specified
- CPU mode — the provider uses `HOST-PASSTHROUGH` CPU mode. Verify the host CPU supports virtualization (VT-x/AMD-V)
### VM halts on reboot with "Talos is already installed to disk but booted from another media"
Log output (repeats every 30s until the VM is shut down):
```
[talos] task haltIfInstalled (1/1): Talos is already installed to disk but booted
from another media and talos.halt_if_installed kernel parameter is set. Please
reboot from the disk.
```
Cause. On provider versions ≤ v0.14.1, the VM's CDROM was attached with UEFI
boot order=1000 and the root disk with order=1001. bhyve's UEFI boot manager
tries the entry with the lowest order first, so on every reboot UEFI re-entered
the Talos ISO instead of the disk. The initial install worked because the disk
was empty on first boot, but once Talos was installed any reboot (manual stop,
TrueNAS restart, host reboot) re-entered the ISO — and the ISO's
talos.halt_if_installed=1 kernel parameter halts the boot as a safeguard
against overwriting an existing installation.
Fix for existing VMs — bump the CDROM order above the root disk's
order (1001). Recommended value: 1500.
TrueNAS UI:
- Virtualization > Virtual Machines > your VM > Devices
- Edit the CDROM device
- Change Device Order from `1000` to `1500`
- Save, then start the VM
TrueNAS shell (faster if you have many VMs):
```
# Find the CDROM device ID for a VM (replace <VM_ID>)
midclt call vm.device.query '[["vm","=",<VM_ID>]]' | jq '.[] | select(.attributes.dtype=="CDROM") | {id, order}'

# Update the order
midclt call vm.device.update <CDROM_DEVICE_ID> '{"order": 1500}'

# Start the VM
midclt call vm.start <VM_ID>
```
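If many VMs are affected, a loop along these lines applies the fix everywhere (a sketch; it assumes `jq` is available and that every CDROM still at order 1000 should move to 1500):

```
# Bump every CDROM device still at the old boot order
for dev_id in $(midclt call vm.device.query \
    | jq -r '.[] | select(.attributes.dtype=="CDROM" and .order==1000) | .id'); do
  midclt call vm.device.update "$dev_id" '{"order": 1500}'
done
```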
Fix for new VMs — upgrade the provider to a version that includes the correct boot order (root disk = 1000, additional disks = 1001+, CDROM = 1500, NIC = 2001). New VMs provisioned after the upgrade boot correctly without manual intervention.
Do not detach the CDROM. Talos may still be mid-install when you notice the issue, and detaching a device requires stopping the VM — which interrupts the install. Reordering is always safe.
## Deprovision Issues

### Orphan VMs or zvols after deletion
The background cleanup process handles this automatically. If you see stale resources:
- Check provider logs for cleanup errors
- Manually remove via TrueNAS UI: Virtualization > VMs (delete VM) and Storage > Datasets (delete zvol)
- ISOs are cleaned up automatically when no longer referenced by active VMs
## Debugging

### Enable debug logging
This logs all JSON-RPC calls, provision step progress, and transport-level details.
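For example, in a Docker Compose service definition (the service name is illustrative; `LOG_LEVEL` is the provider's variable):

```yaml
services:
  omni-truenas-provider:   # illustrative service name
    environment:
      - LOG_LEVEL=debug
```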
### Check provider health
The provider reports health to Omni. If it shows as unhealthy in the Omni UI:
- Check provider logs for health check errors
- Verify TrueNAS is reachable (the health check pings the API, checks the pool, and validates the NIC)
- Restart the provider container
### View raw JSON-RPC calls
With LOG_LEVEL=debug, every JSON-RPC request and response is logged. This is useful for diagnosing TrueNAS API issues.
## Common Mistakes
| Mistake | Fix |
|---|---|
| Using TrueNAS SCALE < 25.10 | Upgrade to 25.10+ (Goldeye) — v0.13.2+ requires the JSON-RPC 2.0 WebSocket API |
| Omitting `TRUENAS_HOST` / `TRUENAS_API_KEY` when running on TrueNAS | Set `TRUENAS_HOST=localhost`, create an API key, and set `TRUENAS_INSECURE_SKIP_VERIFY=true` |
| Missing `network_mode: host` in Docker | Add `network_mode: host` — required for the provider to reach `localhost:443` |
| Pool name mismatch | Pool names are case-sensitive — check with pool.query |
| No bridge interface created | Create one in TrueNAS UI under Network > Interfaces first |