
Managed Kubernetes etcd Backup & Restore

How to back up and restore the control-plane etcd of tenant Kubernetes clusters that Kube-DC manages via KdcCluster. This page is for operators and SREs running a Kube-DC cluster — not for tenants.

Both flows are operated through standard kubectl against the management cluster. Tenants have no direct interaction with the backup or restore pipeline.


What's protected, what isn't

This pipeline backs up the control-plane etcd of each managed tenant cluster. That's the API state — Nodes, Deployments, Pods, ConfigMaps, Secrets, CRDs — everything kubectl get on the tenant would return.

It does not back up:

  • Tenant workload data (PVC contents, application state). For that, use Velero or a workload-specific tool deployed on the tenant cluster (a minimal sketch follows this list).
  • Cluster CAs or kubeconfigs. Those live in <datastore>-etcd-certs Secrets in the tenant's project namespace and are managed by the KdcClusterDatastore certificate lifecycle. A control-plane restore preserves them as-is — which is what you want; a snapshot from before a CA rotation will work fine after restore.
  • Kube-DC platform state (the management cluster's own etcd, MetalLB, Rook-Ceph, monitoring). Those are protected separately via Velero on the management cluster.
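
For the first bullet (tenant workload data), a minimal sketch of a file-system backup with Velero, assuming Velero and its CLI are already set up on the tenant cluster and /tmp/kc.yaml is the tenant kubeconfig; the namespace name app-prod is illustrative:

# Back up one namespace, including PVC contents via file-system backup (Velero >= 1.10)
velero --kubeconfig /tmp/kc.yaml backup create app-prod-$(date +%Y%m%d) \
  --include-namespaces app-prod \
  --default-volumes-to-fs-backup
velero --kubeconfig /tmp/kc.yaml backup get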

The recovery pattern is disaster recovery, not "rollback the last 10 minutes of changes." Restoring rolls a cluster's API state back to the snapshot moment. Pods that the restored controllers no longer expect may be deleted or evicted; pods already running on worker nodes keep running until the restored control plane re-evaluates them.


How backups work

For every KdcCluster whose namespace has a working managed-k8s-backups S3 ObjectBucketClaim (auto-provisioned when Rook-Ceph is available on the platform), the controller creates a <cluster>-etcd-backup CronJob in the project namespace.

Defaults, and where each is configured:

  • Schedule: 0 2 * * * (02:00 UTC daily). Configurable via KdcCluster.spec.backup.schedule.
  • Retention: 7 days, enforced by an S3 lifecycle policy. Configurable via KdcCluster.spec.backup.retentionDays.
  • Bucket: <projectNamespace>-managed-k8s-backups. Configurable via KdcCluster.spec.backup.destinationPath.
  • Object key: <cluster>/<cluster>-<ts>.db (not configurable).
  • S3 endpoint: S3_ENDPOINT controller env (e.g. https://s3.stage.kube-dc.com). Configurable via KdcCluster.spec.backup.s3Endpoint.
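
For example, to move one cluster to a 03:30 UTC schedule with 14-day retention, a merge patch over the fields listed above might look like this (a sketch; check the values against your KdcCluster CRD schema):

kubectl -n <project-ns> patch kdccluster <cluster-name> --type merge -p '
spec:
  backup:
    enabled: true
    schedule: "30 3 * * *"
    retentionDays: 14
'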

A snapshot is roughly 20 MB per cluster (etcd 3.5, default quota-backend-bytes=8GB), and the upload to Rook-Ceph S3 takes ~10–25s inside the management cluster. The CronJob is gated on the OBC being Bound and on KdcCluster.status.dataStoreName being set, so newly provisioned clusters don't try to take backups before their etcd exists.
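
To confirm both gates for a single cluster before expecting its CronJob to appear:

kubectl -n <project-ns> get obc managed-k8s-backups -o jsonpath='{.status.phase}{"\n"}'   # expect Bound
kubectl -n <project-ns> get kdccluster <cluster-name> -o jsonpath='{.status.dataStoreName}{"\n"}'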

Quick checks

# Are backups configured for every tenant cluster?
kubectl get kdccluster -A -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}: {.status.conditions[?(@.type=="BackupReady")].status} {.status.conditions[?(@.type=="BackupReady")].reason}{"\n"}{end}'

# Last 24h of CronJob runs across the cluster
kubectl get jobs -A --sort-by=.metadata.creationTimestamp | grep etcd-backup | tail -20

# Snapshot files for one tenant
NS=<project-ns>
SECRET=$(kubectl -n $NS get obc managed-k8s-backups -o jsonpath='{.spec.secretName}')
BUCKET=$(kubectl -n $NS get obc managed-k8s-backups -o jsonpath='{.spec.bucketName}')
ENDPOINT=$(kubectl -n kube-dc get cm cluster-config -o jsonpath='{.data.S3_HOSTNAME}' \
| sed 's|^|https://|')

kubectl -n $NS run s3 --rm -i --restart=Never \
--image=amazon/aws-cli:2.15.0 \
--overrides="{\"spec\":{\"containers\":[{\"name\":\"s3\",\"image\":\"amazon/aws-cli:2.15.0\",\"command\":[\"sh\",\"-c\",\"aws --endpoint-url $ENDPOINT s3 ls s3://$BUCKET/<cluster>/\"],\"envFrom\":[{\"secretRef\":{\"name\":\"$SECRET\"}}]}]}}"

Symptoms / fixes

  • BackupReady=False reason=S3NotAvailable. Likely cause: the Rook-Ceph OBC is not Bound yet. Fix: wait for kubectl -n $NS get obc managed-k8s-backups to reach Bound; check Rook health on the management cluster.
  • BackupReady=False reason=DataStoreUnknown. Likely cause: the cluster was just created and its etcd is not ready yet. Fix: resolves itself once KdcCluster.status.dataStoreName is set.
  • Job stuck in Failed. Likely cause: etcd unreachable (TLS, DNS) or S3 unreachable from the project namespace. Fix: kubectl logs job/<cluster>-etcd-backup-<unix-time>; exit code 2 means the snapshot was too small (etcd write failed), any other exit code needs the container output.
  • No CronJob exists. Likely cause: KdcCluster.spec.backup.enabled: false, or the project lacks the OBC. Fix: re-enable in the spec; provision the OBC.
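
A quick way to grab the most recent backup Job for one cluster and read its output, assuming the <cluster>-etcd-backup naming used above:

NS=<project-ns>; CLUSTER=<cluster-name>
JOB=$(kubectl -n $NS get jobs --sort-by=.metadata.creationTimestamp -o name \
  | grep ${CLUSTER}-etcd-backup | tail -1)
kubectl -n $NS describe $JOB
kubectl -n $NS logs $JOB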

To inspect the full CronJob the controller generated for a cluster:

kubectl -n <project-ns> get cronjob <cluster-name>-etcd-backup -o yaml
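
If you need a snapshot right now (for example, just before a risky change on the tenant), you can run the CronJob on demand; this is plain kubectl, not a separate Kube-DC feature:

kubectl -n <project-ns> create job <cluster-name>-etcd-backup-manual \
  --from=cronjob/<cluster-name>-etcd-backup
kubectl -n <project-ns> wait --for=condition=Complete \
  job/<cluster-name>-etcd-backup-manual --timeout=5m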

Performing a restore

You trigger a restore by setting one annotation on the KdcCluster:

kubectl -n <project-ns> annotate kdccluster <cluster-name> \
kube-dc.com/restore-from="<cluster-name>/<cluster-name>-<timestamp>.db" \
--overwrite

The controller then walks an eight-step state machine. Total runtime is typically 60–120 seconds for a snapshot under ~30 MB.

What happens, step by step (operator view)

Each phase, its typical wait, and what you'll see:

  • Validating (1 reconcile): snapshot key prefix matched; etcd replica count == 1.
  • DrainingControlPlane (~10s): <cluster>-cp Deployment scaled to 0 (Kamaji may immediately recreate apiserver pods; that's fine, they can't talk to etcd).
  • StoppingEtcd (~10–30s): <datastore>-etcd StatefulSet scaled to 0; the etcd-0 Pod terminates.
  • RestoringSnapshot (~25s for 19 MB): Job <cluster>-etcd-restore runs: pulls the .db from S3, runs etcdutl snapshot restore, and replaces the contents of /var/run/etcd/default.etcd.
  • StartingEtcd (~10–20s): the StatefulSet is scaled back to 1; etcd-0 boots from the restored data.
  • StartingControlPlane (~30s): <cluster>-cp Deployment scaled back to spec replicas; the apiserver becomes Ready.
  • Succeeded (terminal): annotations cleared; status.latestRestoreKey and status.latestRestoreTime stamped; the restore Job deleted.

The tenant cluster's API is unreachable for the entire restore window. Workload pods on tenant worker nodes keep running on disk during this — only the API plane is offline. Once the apiserver comes back, kubelets re-register their node and the cluster catches up.

What the SRE sees in kubectl get

kubectl -n <project-ns> get kdccluster <cluster-name> \
-o jsonpath='{range .status.conditions[?(@.type=="RestoreReady")]}{.status} {.reason}: {.message}{"\n"}{end}'

# Live progress (state annotation tracks the current step):
kubectl -n <project-ns> get kdccluster <cluster-name> \
-o jsonpath='{.metadata.annotations.kube-dc\.com/restore-state}{"\n"}'
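
To follow the whole walk from a terminal, a small poll over the same annotation works; a sketch (start it once the restore-state annotation has appeared; on success the annotations are cleared, so the loop exits and RestoreReady is printed):

NS=<project-ns>; CLUSTER=<cluster-name>
while STATE=$(kubectl -n $NS get kdccluster $CLUSTER \
    -o jsonpath='{.metadata.annotations.kube-dc\.com/restore-state}'); [ -n "$STATE" ]; do
  echo "$(date -u +%H:%M:%S) $STATE"
  sleep 5
done
kubectl -n $NS get kdccluster $CLUSTER \
  -o jsonpath='{range .status.conditions[?(@.type=="RestoreReady")]}{.status} {.reason}: {.message}{"\n"}{end}'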

Final state after success

  • metadata.annotations[kube-dc.com/restore-from]: (cleared)
  • metadata.annotations[kube-dc.com/restore-state]: (cleared)
  • status.conditions[type=RestoreReady]: True / Succeeded / Restore from "<key>" complete
  • status.latestRestoreKey: <cluster>/<cluster>-<timestamp>.db
  • status.latestRestoreTime: <UTC timestamp>

A second restore can be requested by re-applying the annotation; the state machine re-validates and walks the steps again.

Failure handling

If a restore fails (validation rejected the request, the Job didn't complete, or scaling errored), the controller writes:

RestoreReady=False reason=Failed message="Restore from \"<key>\" failed (<reason>): <detail>"

and clears the trigger annotations. The tenant control plane is left in whatever state it was in when the failure occurred: if etcd was already wiped (step 4 onward), the cluster needs another restore attempt, or a fall-back to the manual runbook (see "When the controller is unavailable" below).

Common failures:

  • MultiReplicaUnsupported: the cluster is running HA etcd (replicas > 1). Phase 1 only supports single-replica; use the manual runbook or wait for Phase 2.
  • ForeignKey: the snapshot key doesn't start with this cluster's name. Use <cluster-name>/... paths only.
  • RestoreJobFailed: inspect kubectl logs job/<cluster>-etcd-restore. Common causes: invalid S3 credentials, a failed snapshot integrity check, or the network path to S3 blocked from the project namespace.
  • EtcdStartFailed / ControlPlaneStartFailed: the restored etcd didn't come up. Check kubectl logs <datastore>-etcd-0; it's usually a cert mismatch (the peer URL changed since the snapshot) or a corrupted snapshot. For a cert mismatch, re-trigger Kamaji's certificate reconcile by annotating the TenantControlPlane.

Verifying a restore is complete

# 1. Condition + status fields
kubectl -n <project-ns> get kdccluster <cluster-name> -o yaml | grep -A2 RestoreReady
kubectl -n <project-ns> get kdccluster <cluster-name> \
-o jsonpath='lastKey={.status.latestRestoreKey}{"\n"}lastTime={.status.latestRestoreTime}{"\n"}'

# 2. Tenant API responds
kubectl -n <project-ns> get secret <cluster-name>-kubeconfig \
-o jsonpath='{.data.value}' | base64 -d > /tmp/kc.yaml
kubectl --kubeconfig /tmp/kc.yaml get nodes
kubectl --kubeconfig /tmp/kc.yaml get pods -A | head

# 3. Pod ages preserved (sanity — not 0m old which would mean fresh-bootstrap, not restore)
kubectl --kubeconfig /tmp/kc.yaml -n kube-system get pods -o wide

If the tenant pods show ages similar to the cluster's ORIGINAL ages (e.g. several days, matching when they were last redeployed), the restore preserved data. If they're all freshly-created (within minutes), something went wrong — etcd booted on empty data, not on the snapshot.


Constraints

Phase 1 (today):

  • Etcd replica count: 1 only.
  • Cross-cluster restore: rejected (the snapshot key must belong to the target cluster's bucket).
  • Concurrent restores on the same cluster: the second annotation overwrites the first; the controller doesn't queue. Admission-webhook rejection is on the Phase 2 roadmap.
  • RPO: 24 hours (daily 02:00 UTC backup).
  • RTO: 60–120 seconds (controller-driven); 30–60 minutes (manual runbook).

If a tenant runs HA etcd (KdcClusterDatastore.spec.etcd.replicas>1) and you need to restore, use the manual runbook below. Phase 2 will add multi-replica support.
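
To check the replica count up front (assuming the usual singular resource name for the KdcClusterDatastore CRD):

kubectl -n <project-ns> get kdcclusterdatastore <datastore-name> \
  -o jsonpath='{.spec.etcd.replicas}{"\n"}'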

If a tenant needs sub-24h RPO, the path forward is delta snapshots — also planned for Phase 2/3.


When the controller is unavailable

Sometimes you can't use the annotation flow — the management cluster's kube-dc-k8-manager is down, or the controller has a bug, or you're recovering during a dependency outage. Use the manual runbook in kube-dc-k8-manager/docs/RESTORE.md. It covers the same 8-step flow but with kubectl scale + a debug Pod that runs etcdutl snapshot restore against the etcd member-0 PVC.

In short:

NS=<project-ns>
CLUSTER=<cluster-name>
DS=<datastore-name> # usually <CLUSTER>-etcd

# 1. Pause the control plane and etcd
kubectl -n $NS scale deploy ${CLUSTER}-cp --replicas=0
kubectl -n $NS scale sts ${DS}-etcd --replicas=0

# 2. Run a one-shot Pod that mounts the etcd-data-${DS}-etcd-0 PVC
# and runs etcdutl snapshot restore against /var/run/etcd/default.etcd
# (full Pod spec in the runbook)

# 3. Bring etcd back up
kubectl -n $NS scale sts ${DS}-etcd --replicas=1
kubectl -n $NS wait --for=condition=Ready pod/${DS}-etcd-0 --timeout=120s

# 4. Bring the control plane back up
kubectl -n $NS scale deploy ${CLUSTER}-cp --replicas=1

The trickiest part is the PVC mount path: the restore data dir must be /var/run/etcd/default.etcd (the etcd container's --data-dir), NOT the PVC mount root.
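
A minimal sketch of that one-shot Pod (step 2), assuming the OBC credentials Secret and bucket are resolved the same way as in the quick checks, an etcd image that ships etcdutl 3.5+, and a snapshot key you've already chosen; the runbook's Pod spec is the authoritative version:

NS=<project-ns>; DS=<datastore-name>
SECRET=$(kubectl -n $NS get obc managed-k8s-backups -o jsonpath='{.spec.secretName}')
BUCKET=$(kubectl -n $NS get obc managed-k8s-backups -o jsonpath='{.spec.bucketName}')
ENDPOINT=$(kubectl -n kube-dc get cm cluster-config -o jsonpath='{.data.S3_HOSTNAME}' | sed 's|^|https://|')
KEY=<cluster-name>/<cluster-name>-<timestamp>.db

cat <<EOF | kubectl -n $NS apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: ${DS}-manual-restore
spec:
  restartPolicy: Never
  initContainers:
  - name: fetch                # pull the snapshot from S3 into an emptyDir
    image: amazon/aws-cli:2.15.0
    command: ["sh", "-c", "aws --endpoint-url ${ENDPOINT} s3 cp s3://${BUCKET}/${KEY} /snapshot/snap.db"]
    envFrom:
    - secretRef:
        name: ${SECRET}
    volumeMounts:
    - {name: snapshot, mountPath: /snapshot}
  - name: wipe                 # etcdutl refuses to restore into an existing data dir
    image: busybox:1.36
    command: ["rm", "-rf", "/var/run/etcd/default.etcd"]
    volumeMounts:
    - {name: etcd-data, mountPath: /var/run/etcd}
  containers:
  - name: restore              # image is an assumption: any image shipping etcdutl 3.5+ works
    image: quay.io/coreos/etcd:v3.5.9
    command: ["etcdutl", "snapshot", "restore", "/snapshot/snap.db", "--data-dir", "/var/run/etcd/default.etcd"]
    volumeMounts:
    - {name: etcd-data, mountPath: /var/run/etcd}
    - {name: snapshot, mountPath: /snapshot}
  volumes:
  - name: etcd-data
    persistentVolumeClaim:
      claimName: etcd-data-${DS}-etcd-0
  - name: snapshot
    emptyDir: {}
EOF

kubectl -n $NS wait --for=jsonpath='{.status.phase}'=Succeeded pod/${DS}-manual-restore --timeout=5m
kubectl -n $NS delete pod ${DS}-manual-restore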


Architecture context

For the bigger picture of how managed Kubernetes clusters fit into Kube-DC's networking, control-plane, and storage layers: