
Backup & Disaster Recovery

This guide covers backup strategies for Kubernetes clusters running on TrueNAS via the Omni infrastructure provider. Backups split into two concerns: the control plane (managed by Omni) and workload data (PVCs, backed up by Velero).

Control Plane Backup

Omni handles control plane backup automatically. Cluster state, machine configs, and etcd data are managed by Omni and can be restored through the Omni UI.

See the Omni backup and disaster recovery documentation for:

  • Enabling automatic etcd backups
  • Restoring a cluster from an etcd snapshot
  • Recovering from control plane node failures

Talos nodes are immutable and declarative. If a VM fails, the correct recovery path is to remove it from the cluster in Omni and let Omni reprovision a fresh replacement automatically.

Key point: You do not need to back up VMs or zvols. Omni recreates nodes from scratch using the machine config it stores. VM-level ZFS snapshots are unnecessary for cluster recovery.

Workload Backup with Velero

Omni cannot protect your PersistentVolumeClaim data. Application databases, uploaded files, message queues — anything stored in PVCs lives on your storage backend, not in Omni. If you lose that data, Omni can rebuild the cluster but your application state is gone.

Velero runs inside the cluster and backs up Kubernetes resources and PVC data to a remote S3 bucket, giving you complete workload recovery independent of the infrastructure.

Prerequisites

  • A running Kubernetes cluster managed by Omni
  • A remote S3-compatible backup target (AWS S3, Backblaze B2, Wasabi, MinIO, etc.)
  • kubectl access to the cluster (via omnictl kubeconfig; see the example after this list)
  • Persistent storage configured (see Storage Guide)
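
If you don't already have a kubeconfig for the cluster, you can pull one with omnictl. This is a hedged example -- the cluster name and output path are placeholders, and flag syntax may vary between omnictl versions (check omnictl kubeconfig --help):

omnictl kubeconfig --cluster <cluster-name> ./kubeconfig
export KUBECONFIG=$PWD/kubeconfig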

1. Install the Velero CLI

# macOS
brew install velero

# Linux -- release assets are versioned, so pick a release tag from the Velero releases page
VERSION=<velero-version>   # a tag of the form v1.x.y
curl -fsSL "https://github.com/vmware-tanzu/velero/releases/download/${VERSION}/velero-${VERSION}-linux-amd64.tar.gz" | \
  tar xz && sudo mv velero-*/velero /usr/local/bin/

2. Create S3 credentials

Create a credentials file for your S3 bucket:

cat > /tmp/velero-credentials <<EOF
[default]
aws_access_key_id=<your-access-key>
aws_secret_access_key=<your-secret-key>
EOF

3. Install Velero into the cluster

velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.11.0 \
  --bucket <your-bucket-name> \
  --secret-file /tmp/velero-credentials \
  --backup-location-config \
    region=<your-region>,s3ForcePathStyle=true,s3Url=<your-s3-endpoint> \
  --use-node-agent \
  --default-volumes-to-fs-backup

Key flags:

  • --use-node-agent enables file-system-level PV backups (replaces the deprecated restic integration)
  • --default-volumes-to-fs-backup backs up all PVs by default without requiring per-pod annotations
  • --plugins must match your storage provider (AWS shown; see Velero supported providers for GCP, Azure, etc.)

Example with AWS S3:

velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.11.0 \
  --bucket my-cluster-backups \
  --secret-file /tmp/velero-credentials \
  --backup-location-config region=us-east-1 \
  --use-node-agent \
  --default-volumes-to-fs-backup

Example with Backblaze B2:

velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.11.0 \
  --bucket my-cluster-backups \
  --secret-file /tmp/velero-credentials \
  --backup-location-config \
    region=us-west-004,s3ForcePathStyle=true,s3Url=https://s3.us-west-004.backblazeb2.com \
  --use-node-agent \
  --default-volumes-to-fs-backup

4. Verify the installation

velero get backup-locations
kubectl get pods -n velero
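
If velero get backup-locations reports the location as Available, the S3 configuration is working. You can also query the phase directly (this assumes the default install namespace and location name):

kubectl -n velero get backupstoragelocation default -o jsonpath='{.status.phase}'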

Backup Operations

On-demand backup

# Back up everything in the cluster
velero backup create full-backup

# Back up a specific namespace
velero backup create app-backup --include-namespaces my-app

# Back up with a TTL (auto-delete after 30 days)
velero backup create weekly-backup --ttl 720h

Scheduled backups

# Daily backups at 2 AM, retained for 7 days
velero schedule create daily --schedule="0 2 * * *" --ttl 168h

# Weekly full backup
velero schedule create weekly --schedule="0 3 * * 0" --ttl 720h
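
To run a one-off backup using a schedule's template (for example, right before a risky change), trigger it manually:

# Create an immediate backup from the "daily" schedule's template
velero backup create --from-schedule daily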

Check backup status

velero backup get
velero backup describe <backup-name> --details
velero backup logs <backup-name>

Restore Operations

Full cluster restore

After Omni reprovisions fresh VMs and the cluster is healthy:

velero restore create --from-backup full-backup

Namespace restore

velero restore create --from-backup full-backup --include-namespaces my-app

Restore to a different namespace

velero restore create --from-backup full-backup \
  --include-namespaces my-app \
  --namespace-mappings my-app:my-app-restored

Check restore status

velero restore get
velero restore describe <restore-name> --details

Disaster Recovery Scenarios

Your data, your responsibility. The provider manages VM lifecycle and cluster infrastructure. Everything below -- Velero configuration, backup schedules, restore procedures, and data integrity -- is entirely your responsibility. We document these scenarios as a reference, not a guarantee. Test your restores regularly.

Scenario 1: Single Worker Node Failure

A worker VM crashes, gets stuck in ERROR, or is removed from the cluster.

  1. Omni handles it automatically. The provider deprovisions the broken VM and Omni provisions a fresh replacement.
  2. Longhorn re-replicates. If you're using Longhorn with replica count >= 2, data is already replicated on other nodes. The new node joins and Longhorn rebuilds replicas automatically. No restore needed.
  3. NFS is unaffected. NFS data lives on TrueNAS, not the worker. Pods reschedule to healthy nodes and remount.

Action required: None -- wait for the cluster to stabilize.

Scenario 2: Control Plane Failure

The control plane VM crashes or is corrupted.

  1. Omni restores etcd. Omni manages etcd backups and can restore the control plane from a snapshot. See the Omni backup and disaster recovery docs.
  2. Worker data is intact. PVs on Longhorn or NFS are not affected by control plane failures.
  3. Verify workloads. Once the control plane is back, check that all Deployments, StatefulSets, and pods are running: kubectl get pods -A.

Action required: Follow the Omni etcd restore procedure if needed.

Scenario 3: Total Cluster Loss

All VMs are destroyed (e.g., ZFS pool failure, accidental deletion, TrueNAS hardware failure).

  1. Rebuild the cluster in Omni. Create a new cluster with the same MachineClasses. Omni provisions fresh VMs.
  2. Reinstall storage. Run scripts/install-longhorn.sh <cluster>.
  3. Reinstall Velero. Repeat the Velero install command pointing to the same S3 bucket.
  4. List available backups:
    velero backup get
    
  5. Restore everything:
    velero restore create --from-backup <latest-backup>
    
  6. Verify the restore:
    velero restore describe <restore-name> --details
    kubectl get pods -A
    kubectl get pvc -A  # All PVCs should be Bound
    

Action required: Full rebuild + Velero restore. This is why off-site S3 backups are critical.

Scenario 4: Application Data Corruption

A bad deploy, migration, or bug corrupts your database.

  1. Do not delete the PVC. The corrupted data is still recoverable.
  2. Scale down the affected workload:
    kubectl scale deployment my-app --replicas=0
    
  3. Restore from the last good backup:
    # Restore only the affected namespace
    velero restore create --from-backup <pre-corruption-backup> \
      --include-namespaces my-app \
      --existing-resource-policy update
    
  4. Scale back up and verify:
    kubectl scale deployment my-app --replicas=1
    

Action required: Identify the last good backup and restore selectively.

Scenario 5: TrueNAS NFS Outage (NFS Storage Only)

TrueNAS goes offline or the NFS service stops. Pods using NFS volumes hang.

  1. Fix TrueNAS. Restart the NFS service or restore the NAS.
  2. Pods recover automatically. Once NFS is reachable again, hung pods resume. You may need to restart pods that timed out:
    kubectl delete pods --field-selector=status.phase=Failed -A
    
  3. If NFS data is lost, fall back to Velero restore (Scenario 3).

This is why Longhorn is recommended -- it has no TrueNAS dependency and survives NAS outages.

Recovery Time Expectations

| Scenario | Downtime | Data Loss |
| --- | --- | --- |
| Single worker failure | Minutes (auto-replace) | None (Longhorn replicas) |
| Control plane failure | Minutes (Omni etcd restore) | None |
| Total cluster loss | 30-60 min (rebuild + restore) | Since last Velero backup |
| Data corruption | Minutes (selective restore) | Since last good backup |
| TrueNAS NFS outage | Until NFS recovers | None (data on NAS) |

What Each Layer Protects

| Component | Protected By | Notes |
| --- | --- | --- |
| Node lifecycle (VMs) | Omni | Automatic reprovision — no backup needed |
| Talos machine config | Omni | Declarative — Omni stores and applies it |
| etcd / control plane | Omni | Built-in etcd backup — see Omni docs |
| Kubernetes resources | Velero | Deployments, Services, ConfigMaps, Secrets, CRDs, etc. |
| PersistentVolume data | Velero | Via file-system backup (node-agent) or CSI snapshots |

CSI Snapshots with Velero

The default Velero setup (above) uses file-system backup via the node-agent -- it copies files from mounted PVs. This works with any storage backend but is slow for large volumes because it copies every file.

If your CSI driver supports the Kubernetes VolumeSnapshot API, Velero can take CSI snapshots instead -- instant, crash-consistent, point-in-time captures at the block level. This is significantly faster and more reliable for large databases.
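
Under the hood, each CSI snapshot is represented by a VolumeSnapshot object. A minimal sketch, assuming a Longhorn-backed PVC and the longhorn-snapshot class defined later in this guide (all names are placeholders):

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: my-data-snapshot        # placeholder name
  namespace: my-app             # placeholder namespace
spec:
  volumeSnapshotClassName: longhorn-snapshot
  source:
    persistentVolumeClaimName: my-data   # placeholder PVC to snapshot

Velero creates these objects for you during a backup; you normally don't author them by hand.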

Which Storage Drivers Support CSI Snapshots?

| Driver | VolumeSnapshot Support | Notes |
| --- | --- | --- |
| Longhorn | Yes | Native CSI snapshot support. Recommended. |
| democratic-csi | Yes | ZFS-native snapshots exposed as VolumeSnapshots |
| NFS (nfs-subdir-external-provisioner) | No | NFS has no snapshot capability -- file-system backup only |

Prerequisites

  • Velero v1.14+ -- CSI plugin is built-in (no separate plugin install needed)
  • VolumeSnapshot CRDs installed in the cluster
  • CSI snapshot controller running in the cluster (quick check after this list)
  • A CSI driver that supports VolumeSnapshots (Longhorn or democratic-csi)
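
A quick way to check for the snapshot controller -- its deployment name and namespace vary by distribution, so this is a loose match:

# Look for a running snapshot controller anywhere in the cluster
kubectl get pods -A | grep -i snapshot-controller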

Setup with Longhorn

1. Verify VolumeSnapshot CRDs exist:

kubectl get crd | grep volumesnapshot
# Should show:
#   volumesnapshotclasses.snapshot.storage.k8s.io
#   volumesnapshotcontents.snapshot.storage.k8s.io
#   volumesnapshots.snapshot.storage.k8s.io

If missing, install them:

kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/client/config/crd/snapshot.storage.k8s.io_volumesnapshotclasses.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/client/config/crd/snapshot.storage.k8s.io_volumesnapshotcontents.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-csi/external-snapshotter/master/client/config/crd/snapshot.storage.k8s.io_volumesnapshots.yaml

2. Create a VolumeSnapshotClass for Longhorn:

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: longhorn-snapshot
  labels:
    velero.io/csi-volumesnapshot-class: "true"
  annotations:
    snapshot.storage.kubernetes.io/is-default-class: "true"
driver: driver.longhorn.io
deletionPolicy: Delete
parameters:
  type: snap

The velero.io/csi-volumesnapshot-class: "true" label tells Velero to use this class automatically for Longhorn volumes during backup.
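
You can confirm the class exists and carries the label with:

kubectl get volumesnapshotclass longhorn-snapshot --show-labels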

3. Install Velero with CSI snapshots enabled:

velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.11.0 \
  --bucket <your-bucket-name> \
  --secret-file /tmp/velero-credentials \
  --backup-location-config \
    region=<your-region>,s3ForcePathStyle=true,s3Url=<your-s3-endpoint> \
  --features=EnableCSI \
  --use-node-agent \
  --default-volumes-to-fs-backup

The key addition is --features=EnableCSI. The --default-volumes-to-fs-backup flag acts as a fallback for any volumes that don't support CSI snapshots (e.g., if you also have NFS volumes).
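
To confirm the server actually picked up the feature flag, inspect the Velero deployment's arguments (this assumes the default velero namespace):

kubectl -n velero get deploy velero -o yaml | grep -- "--features"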

4. Verify CSI snapshot integration:

# Create a test backup
velero backup create csi-test --include-namespaces default --wait

# Check the backup used CSI snapshots
velero backup describe csi-test --details | grep -A5 "CSI Snapshots"

Setup with democratic-csi

democratic-csi exposes ZFS snapshots as Kubernetes VolumeSnapshots. Create a VolumeSnapshotClass for your driver:

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: zfs-snapshot
  labels:
    velero.io/csi-volumesnapshot-class: "true"
driver: org.democratic-csi.nfs  # or org.democratic-csi.iscsi
deletionPolicy: Delete

The rest of the Velero setup is identical to the Longhorn steps above.

By default, CSI snapshots stay on the same storage backend as the original volume. If TrueNAS fails, you lose both the volume and its snapshots. CSI Snapshot Data Movement copies snapshot data to your remote S3 bucket, giving you true off-site protection.

Data movement builds on --features=EnableCSI and --use-node-agent: Velero creates a temporary PVC from the snapshot, mounts it read-only, and uploads the data to S3 via the node-agent. Depending on your Velero version, you may need to request it explicitly per backup rather than relying on it happening automatically.
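
Recent Velero releases expose a --snapshot-move-data flag for this; a hedged example (verify the flag against your installed version):

# Ask Velero to move CSI snapshot data to the S3 bucket for this backup
velero backup create offsite-backup --include-namespaces my-app --snapshot-move-data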

See the Velero CSI Snapshot Data Movement docs for advanced configuration.

File-System Backup vs CSI Snapshots

| | File-System Backup | CSI Snapshots |
| --- | --- | --- |
| Speed | Slow (copies every file) | Fast (block-level snapshot) |
| Consistency | Application must be quiesced | Crash-consistent automatically |
| Works with NFS | Yes | No |
| Works with Longhorn | Yes | Yes (recommended) |
| Off-site copy | Direct to S3 | Via data movement to S3 |

For most users running Longhorn, enable both: CSI snapshots for Longhorn volumes (fast, consistent) and file-system backup as the fallback.

Recommendations

  • Schedule daily backups with a 7-day TTL as a baseline
  • Test restores regularly — a backup you've never restored is a backup you can't trust
  • Use a remote S3 target — backups on the same NAS as your VMs won't survive a hardware failure
  • Back up before cluster upgrades — run velero backup create pre-upgrade before changing Talos or Kubernetes versions in Omni
  • Exclude ephemeral volumes if needed — use the backup.velero.io/backup-volumes-excludes pod annotation for volumes that don't need backup (caches, temp dirs)
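
A minimal sketch of the exclusion annotation, assuming a pod with a cache volume (all names are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: my-app
  annotations:
    # Comma-separated list of volume names to skip during file-system backup
    backup.velero.io/backup-volumes-excludes: cache
spec:
  containers:
    - name: app
      image: my-app:latest          # placeholder image
      volumeMounts:
        - name: cache
          mountPath: /var/cache/app
  volumes:
    - name: cache
      emptyDir: {}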