Troubleshooting

Common issues and their solutions when running the Omni TrueNAS provider.

Startup Failures

"TrueNAS API unreachable"

The provider cannot connect to TrueNAS on startup.

  • Verify TRUENAS_HOST is reachable from the provider container: curl -k https://<host>/websocket
  • Confirm TRUENAS_API_KEY is valid — generate one for a dedicated non-root user with scoped roles; see TrueNAS Setup > API Key for the setup and minimum role list
  • If using a self-signed cert, ensure TRUENAS_INSECURE_SKIP_VERIFY=true
  • When running the container on the TrueNAS host itself, set TRUENAS_HOST=localhost and TRUENAS_INSECURE_SKIP_VERIFY=true
  • Check that TrueNAS middleware is running: midclt call core.ping on the TrueNAS host
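
A quick way to run the checks above in sequence (replace <host> with your TRUENAS_HOST; the container name matches the docker logs example later in this page and may differ in your setup):

# Any HTTP response, even a 400, means the WebSocket endpoint is up
curl -k -s -o /dev/null -w "%{http_code}\n" https://<host>/websocket

# On the TrueNAS host: confirm the middleware answers
midclt call core.ping

# Confirm the provider container actually received its configuration
docker inspect omni-infra-provider-truenas | grep TRUENAS_HOST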

"pool not found on TrueNAS"

The configured DEFAULT_POOL or MachineClass pool doesn't exist.

  • Common mistake: Using a dataset path (e.g., tank/my-vms or default/previewk8) instead of the pool name (e.g., tank or default). The pool field must be a top-level ZFS pool, not a dataset.
  • If you want VMs under an existing dataset, use dataset_prefix. For example, if your layout is default/previewk8, set pool: "default" and dataset_prefix: "previewk8" (see the sketch after this list).
  • List available pools: zpool list or midclt call pool.query | jq '.[].name' (on TrueNAS)
  • Update DEFAULT_POOL or the MachineClass pool field to match an existing pool name
  • Pool names are case-sensitive
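
For the default/previewk8 layout above, the relevant MachineClass fields look like this (a minimal sketch; the surrounding MachineClass structure is omitted and field placement may differ between provider versions):

pool: "default"             # top-level ZFS pool only, never a dataset path
dataset_prefix: "previewk8" # zvols are created under default/previewk8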

"network interface target not found"

The interface configured in DEFAULT_NETWORK_INTERFACE doesn't exist on TrueNAS.

  • List available choices: midclt call vm.device.nic_attach_choices (on TrueNAS)
  • Common values: br0, br100, vlan100, enp5s0
  • Bridge interfaces must be created manually in TrueNAS UI under Network > Interfaces before use
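
For example (the jq filter is optional and assumes the call returns a JSON object keyed by interface name):

# List the NIC attach targets the middleware will accept
midclt call vm.device.nic_attach_choices | jq 'keys'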

"OMNI_ENDPOINT is required"

The OMNI_ENDPOINT environment variable is not set.

  • Set it to your Omni instance URL (e.g., https://omni.example.com)
  • If using .env, make sure the file is in the working directory or mounted into the container
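
A minimal sketch of the .env route (the image reference is a placeholder; use the image or binary you normally run):

# .env in the working directory
OMNI_ENDPOINT=https://omni.example.com

# pass the file explicitly so there is no ambiguity about where it is read from
docker run --rm --env-file .env --network host <provider-image>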

"singleton lease acquire failed" / "another provider instance holds the singleton lease"

Two processes are trying to run as the same PROVIDER_ID. The provider refuses to start when it detects a fresh heartbeat from another instance because running two provisioners in parallel causes races on VM creation, zvol creation, and ISO upload.

Expected fields in the error:

  • instance %q — the random UUID of the process that currently holds the lease
  • heartbeat N ago at ... — how long ago the other instance last refreshed
  • provider %q — your PROVIDER_ID

Diagnosis:

  1. Find the other instance. Match the singleton_instance_id log field across your running processes (kubectl logs, journalctl -u, docker logs, etc.). If no other process has that instance-id, it is almost certainly a stale pod that was kill -9'd before it could release.
  2. If the other instance is legitimate, stop it cleanly (SIGTERM) — the outgoing process clears the lease annotation so the successor can acquire immediately without waiting for PROVIDER_SINGLETON_STALE_AFTER to elapse.
  3. If the other instance was killed ungracefully and the heartbeat is frozen, the successor will take over automatically once PROVIDER_SINGLETON_STALE_AFTER (default 45s) passes.

Kubernetes rolling deploys: use strategy.type=Recreate or strategy.rollingUpdate.{maxSurge: 0, maxUnavailable: 1} so the old pod is fully terminated before the new one starts. With the default maxSurge=25% strategy the new pod can start while the old pod is still in its terminationGracePeriodSeconds, and the new pod will crashloop on the preflight check.
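
The corresponding Deployment fragments (standard Kubernetes fields; pick one):

# Option A: old pod is fully terminated before the new one starts
strategy:
  type: Recreate

# Option B: rolling update that never runs two pods at once
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 0
    maxUnavailable: 1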

Debugging / advanced sharding: to bypass the check entirely, set PROVIDER_SINGLETON_ENABLED=false. Only do this when you are certain that no two instances are servicing the same provider ID — the provider will log a warning on startup as a reminder.

Provisioning Issues

Omni shows "Provisioning" forever with no error

Omni's UI shows the machine stuck in the "Provisioning" state with no error message. This happens because the provider retries failed steps every 60 seconds, and each retry briefly clears the error status, so the UI rarely catches it.

How to see the actual error:

  1. Check provider logs — the error is always logged:

    # If running locally
    docker logs omni-infra-provider-truenas 2>&1 | grep "provision failed"
    
    # If running via the binary
    grep "provision failed" /path/to/provider/output
    

  2. Check MachineRequestStatus via omnictl — catches the error between retries:

    omnictl get machinerequeststatus -o yaml | grep -A2 "error:"
    

  3. Common causes:

     • Pool doesn't exist: pool "previewk8" not found — you specified a dataset name instead of a pool name. Use the top-level pool (e.g., default, tank), not a dataset path.
     • Network interface invalid: the bridge or VLAN doesn't exist on TrueNAS.
     • Pool full: no space for the zvol.
     • TrueNAS unreachable: the WebSocket connection dropped.

The provider will keep retrying until the issue is fixed. Once you correct the MachineClass config (e.g., fix the pool name), the next retry will succeed automatically.

VMs are created but don't join Omni

The VM boots but never appears in Omni.

  1. Check the VM console in TrueNAS UI — is Talos booting? Look for kernel output.
  2. Network connectivity — the VM needs outbound internet access to reach Omni via SideroLink (WireGuard on port 443). Verify that (a quick check sequence follows this list):
     • the network interface target has internet access
     • no firewall blocks outbound WireGuard traffic
     • DNS resolution works from the VM's network
  3. Wrong boot method — if the VM shows a BIOS/UEFI shell instead of booting, try switching boot_method between UEFI and BIOS.
  4. ISO not attached — check the VM devices in TrueNAS UI. There should be a CDROM device with the Talos ISO attached.
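
Connectivity sketch, run from a Linux test host attached to the same bridge or VLAN as the VM (omni.example.com is a placeholder for your Omni instance):

# Outbound reachability, DNS, and HTTPS from the VM's network segment
ping -c 3 1.1.1.1
nslookup omni.example.com
curl -sI https://omni.example.com | head -n1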

"schematic generation failed"

The provider failed to generate a Talos image schematic.

  • Verify internet access from the provider container (it needs to reach factory.talos.dev)
  • Check if a custom extension name is misspelled in the MachineClass extensions field
  • Set LOG_LEVEL=debug for detailed error output
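
To confirm the image factory is reachable, run this from wherever the provider runs (if the container image ships no curl, test from the host):

# Any HTTP response here rules out network/DNS problems toward the factory
curl -sI https://factory.talos.dev | head -n1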

ISO download hangs or fails

  • The provider downloads ISOs from factory.talos.dev — ensure outbound HTTPS access
  • Large ISOs (~100 MB) may take time on slow connections
  • Check available disk space on the TrueNAS pool (ISOs are stored at <pool>/talos-iso/)
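
To check the ISO dataset (replace <pool>; the dataset only exists after the first successful download):

zfs list -o name,used,avail <pool>/talos-iso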

VM creation succeeds but VM won't start

  • Insufficient resources — TrueNAS needs enough free memory for the VM. Check memory in MachineClass config vs. available RAM.
  • zvol allocation — ensure the pool has enough free space for the disk_size specified
  • CPU mode — the provider uses HOST-PASSTHROUGH CPU mode. Verify the host CPU supports virtualization (VT-x/AMD-V)
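
These checks, run on the TrueNAS host, cover all three bullets (replace <pool>):

# Free memory available for new VMs
free -h

# Free space on the pool backing the zvol
zpool list -o name,size,alloc,free <pool>

# A non-zero count means the CPU advertises VT-x/AMD-V
grep -cE '(vmx|svm)' /proc/cpuinfo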

VM halts on reboot with "Talos is already installed to disk but booted from another media"

Log output (repeats every 30s until the VM is shut down):

[talos] task haltIfInstalled (1/1): Talos is already installed to disk but booted
from another media and talos.halt_if_installed kernel parameter is set. Please
reboot from the disk.

Cause. On provider versions ≤ v0.14.1, the VM's CDROM was attached with UEFI boot order=1000 and the root disk with order=1001. The UEFI boot manager tries the entry with the lowest order first, so on every reboot it re-entered the Talos ISO instead of the disk. The initial install worked because the disk was empty on first boot, but once Talos was installed any reboot (manual stop, TrueNAS restart, host reboot) re-entered the ISO — and the ISO's talos.halt_if_installed=1 kernel parameter halts the boot as a safeguard against overwriting an existing installation.

Fix for existing VMs — bump the CDROM order above the root disk's order (1001). Recommended value: 1500.

TrueNAS UI:

  1. Virtualization > Virtual Machines > your VM > Devices
  2. Edit the CDROM device
  3. Change Device Order from 1000 to 1500
  4. Save, then start the VM

TrueNAS shell (faster if you have many VMs):

# Find the CDROM device ID for a VM (replace <VM_ID>)
midclt call vm.device.query '[["vm","=",<VM_ID>]]' | jq '.[] | select(.attributes.dtype=="CDROM") | {id, order}'

# Update the order
midclt call vm.device.update <CDROM_DEVICE_ID> '{"order": 1500}'

# Start the VM
midclt call vm.start <VM_ID>
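
If many VMs are affected, a loop over every CDROM device still at order 1000 (a sketch; review the matched IDs before running the update):

# Bump every CDROM at the old default order 1000 to 1500
for id in $(midclt call vm.device.query | jq '.[] | select(.attributes.dtype=="CDROM" and .order==1000) | .id'); do
  midclt call vm.device.update "$id" '{"order": 1500}'
done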

Fix for new VMs — upgrade the provider to a version that includes the correct boot order (root disk = 1000, additional disks = 1001+, CDROM = 1500, NIC = 2001). New VMs provisioned after the upgrade boot correctly without manual intervention.

Do not detach the CDROM. Talos may still be mid-install when you notice the issue, and detaching a device requires stopping the VM — which interrupts the install. Reordering is always safe.

Deprovision Issues

Orphan VMs or zvols after deletion

The background cleanup process handles this automatically. If you see stale resources:

  1. Check provider logs for cleanup errors
  2. Manually remove via TrueNAS UI: Virtualization > VMs (delete VM) and Storage > Datasets (delete zvol)
  3. ISOs are cleaned up automatically when no longer referenced by active VMs
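
To cross-check what exists on TrueNAS against what Omni knows about (replace <pool>; VM and zvol names are whatever the provider created):

# VMs known to TrueNAS
midclt call vm.query | jq '.[].name'

# zvols on the pool (type volume = VM disks)
zfs list -t volume -r -o name,used <pool>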

Debugging

Enable debug logging

LOG_LEVEL=debug

This logs all JSON-RPC calls, provision step progress, and transport-level details.

Check provider health

The provider reports health to Omni. If it shows as unhealthy in the Omni UI:

  1. Check provider logs for health check errors
  2. Verify TrueNAS is reachable (the health check pings the API, checks the pool, and validates the NIC)
  3. Restart the provider container

View raw JSON-RPC calls

With LOG_LEVEL=debug, every JSON-RPC request and response is logged. This is useful for diagnosing TrueNAS API issues.

Common Mistakes

  • Mistake: Using TrueNAS SCALE < 25.10. Fix: upgrade to 25.10+ (Goldeye); provider v0.13.2+ requires the JSON-RPC 2.0 WebSocket API.
  • Mistake: Omitting TRUENAS_HOST / TRUENAS_API_KEY when running on TrueNAS. Fix: set TRUENAS_HOST=localhost, create an API key, and set TRUENAS_INSECURE_SKIP_VERIFY=true.
  • Mistake: Missing network_mode: host in Docker. Fix: add network_mode: host; the provider needs it to reach localhost:443.
  • Mistake: Pool name mismatch. Fix: pool names are case-sensitive; check with pool.query.
  • Mistake: No bridge interface created. Fix: create one in TrueNAS UI under Network > Interfaces before use.