Upgrading a Management Cluster (RKE2)

This guide describes how to upgrade the Kubernetes version of a Kube-DC management cluster (the RKE2 cluster that runs the controllers, KubeVirt, Kube-OVN, Keycloak, and tenant workloads) using Rancher's system-upgrade-controller (SUC).

It is written to be cluster-agnostic. Substitute your own node names and target versions.

Why SUC

SUC turns a node-by-node RKE2 upgrade into a declarative Plan: you label the nodes, set a target version, and the controller cordons → upgrades → uncordons each node in a controlled, serialized order. This replaces error-prone manual curl … | INSTALL_RKE2_VERSION=… sh runs and keeps the fleet consistent.

Hard constraints — read first

| Constraint | What it means for you |
| --- | --- |
| No minor-version skipping | Kubernetes does not support jumping the control plane more than one minor at a time. Upgrade one minor per pass (e.g. `1.32 → 1.33 → 1.34`), not straight to the target. |
| No downgrade | RKE2 cannot be rolled back. If a pass breaks the API server or CNI, there is no easy recovery, so validate every pass before the next. |
| CNI / KubeVirt support windows | Kube-OVN and KubeVirt are each certified against a range of Kubernetes versions. Before targeting a brand-new minor, confirm your installed Kube-OVN and KubeVirt versions support it. Don't outrun your data-plane. |
| Single control-plane | If the cluster has one server node, its RKE2 restart is a brief (~1–2 min) full API-server outage per pass. Plan for it. |
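The one-minor-per-pass rule can be sketched as a small helper that lists the minors that each need their own pass. The function name minor_hops is hypothetical and not part of RKE2 or SUC; it only does string and integer arithmetic and refuses downgrades.

```shell
# minor_hops CURRENT TARGET
# Prints every minor that needs its own upgrade pass, one per line.
# Accepts "1.32", "v1.32.5" or "v1.32.5+rke2r1" style versions.
# The body runs in a subshell, so its variables stay local and
# "exit" only leaves the function.
minor_hops() (
  cur="${1#v}"; tgt="${2#v}"
  cur_major="${cur%%.*}"; tgt_major="${tgt%%.*}"
  cur_minor="$(printf '%s\n' "$cur" | cut -d. -f2)"
  tgt_minor="$(printf '%s\n' "$tgt" | cut -d. -f2)"
  if [ "$cur_major" != "$tgt_major" ]; then
    echo "major versions differ: $1 vs $2" >&2; exit 1
  fi
  if [ "$tgt_minor" -lt "$cur_minor" ]; then
    echo "downgrade is not supported: $1 -> $2" >&2; exit 1
  fi
  m=$((cur_minor + 1))
  while [ "$m" -le "$tgt_minor" ]; do
    echo "${cur_major}.${m}"
    m=$((m + 1))
  done
)
```

For example, `minor_hops v1.32.5+rke2r1 v1.34.8+rke2r2` prints `1.33` and `1.34`, i.e. two passes. The helper only names the minors; the exact patch release for each pass still has to be chosen by you.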

Picking a target. Use the RKE2 stable channel, not latest (which is bleeding edge). Check current versions at update.rke2.io/v1-release/channels.
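A minimal sketch for reading a channel from the command line. It assumes curl and jq are installed and that the channel endpoint returns JSON with a data array of objects carrying id and latest fields; the function name channel_latest is hypothetical.

```shell
# channel_latest CHANNEL
# Prints the newest release published in the given RKE2 channel,
# for example "stable".
# Assumption: the endpoint answers with {"data":[{"id":...,"latest":...},...]}.
channel_latest() {
  curl -fsSL https://update.rke2.io/v1-release/channels |
    jq -r --arg id "$1" '.data[] | select(.id == $id) | .latest'
}
```

`channel_latest stable` prints the current stable release. If the server also publishes per-minor channels (an assumption worth checking in the raw output), passing an id such as `v1.33` would give the newest patch of that minor, which is what an intermediate pass needs.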

Pre-flight checklist

Run these before starting — they prevent the most common mid-upgrade stalls:

Node disk headroom. Each upgrade pulls the new RKE2 release (~1–2 GiB) onto every node. Ensure each node has comfortable free space; a node near its disk limit can trip DiskPressure, which evicts pods and blocks the upgrade job from scheduling. Storage-dense nodes (large local-path PVCs) are the usual offenders.
max-pods consistency. A node that is at its max-pods cap cannot schedule the SUC upgrade pod (FailedScheduling … Too many pods). Confirm every node has the same, adequate max-pods (Kube-DC's bootstrap sets it by memory tier; nodes installed by older tooling may be on the default 110).
Eviction thresholds. Never set a disk-eviction threshold higher than a node's current free space: on a legitimately full node the kubelet then reports DiskPressure immediately (for example, nodefs.available<10% on a node with 8% free). Keep RKE2's default (nodefs.available<5%), or go lower on storage-dense nodes.
OpenBao re-seal. OpenBao seals on every pod restart. If a node hosting the OpenBao pod is upgraded (or the pod reschedules), OpenBao will need to be unsealed again. Have your unseal procedure ready.
Quiesce / snapshot. Take an etcd snapshot and note current component health so you have a clean baseline to compare against.
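Parts of the checklist above can be read straight from the API. The helpers below are a sketch with hypothetical function names; they only read cluster state and assume kubectl is pointed at the management cluster.

```shell
# Nodes whose kubelet currently reports DiskPressure=True.
nodes_with_disk_pressure() {
  kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="DiskPressure")].status}{"\n"}{end}' |
    awk -F'\t' '$2 == "True" { print $1 }'
}

# Distinct max-pods values across the cluster. More than one line of
# output means the nodes are not configured consistently.
distinct_max_pods() {
  kubectl get nodes --no-headers -o custom-columns=PODS:.status.capacity.pods | sort -un
}

# Number of pods currently scheduled on a node, to compare with its cap.
pods_on_node() {
  kubectl get pods -A --no-headers --field-selector "spec.nodeName=$1" |
    wc -l | tr -d ' '
}
```

Free disk space is not exposed this way, so check it on each node (for example with `df -h` on the RKE2 data directory). The etcd snapshot is taken on a server node, for example with `rke2 etcd-snapshot save --name pre-upgrade`; the snapshot name is arbitrary.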

Procedure

1. Install the controller (once per cluster)

Vendor the SUC release manifests (crd.yaml + system-upgrade-controller.yaml) into your GitOps repo and apply them — in a Kube-DC fleet this is an infrastructure/system-upgrade-controller/ kustomization wired as its own Flux Kustomization. The controller runs in the system-upgrade namespace and does nothing until a Plan exists.
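Once the manifests have reconciled, a quick read-only check confirms the controller is in place. This sketch assumes the upstream manifests were vendored unmodified, so the CRD is plans.upgrade.cattle.io and the Deployment is named system-upgrade-controller; the function name suc_ready is hypothetical.

```shell
# Succeeds when the Plan CRD exists and the controller Deployment
# has finished rolling out.
suc_ready() {
  kubectl get crd plans.upgrade.cattle.io >/dev/null &&
    kubectl -n system-upgrade rollout status \
      deployment/system-upgrade-controller --timeout=120s
}
```

Run `suc_ready` before applying any Plan; a non-zero exit means the controller is not ready to act on one.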

Keep the controller in GitOps but apply the upgrade Plans manually. A GitOps-managed Plan is re-applied on every reconcile, which could trigger an upgrade pass again when you do not intend one.

2. Opt nodes in

kubectl label node <all-nodes> rke2-upgrade=true --overwrite
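If every node in the cluster should take part, the placeholder can be replaced by kubectl's --all flag. A small sketch (the function name opt_in_all_nodes is hypothetical) that labels the nodes and then prints the label as a column so a missed node is easy to spot:

```shell
# Labels every node, then lists nodes with the rke2-upgrade label value.
opt_in_all_nodes() {
  kubectl label nodes --all rke2-upgrade=true --overwrite &&
    kubectl get nodes -L rke2-upgrade
}
```

To hold a node back from a pass, leave it unlabeled or remove the label with `kubectl label node <node> rke2-upgrade-`.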

3. Apply the Plans (first minor)

Two Plans — one for control-plane (rke2-server) and one for workers (rke2-agent). The agent Plan's prepare step blocks until the server Plan finishes, so the control plane always upgrades first. Both use concurrency: 1 (one node at a time).

apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata: { name: rke2-server, namespace: system-upgrade }
spec:
  concurrency: 1
  cordon: true
  nodeSelector:
    matchExpressions:
      - { key: rke2-upgrade, operator: In, values: ["true"] }
      - { key: node-role.kubernetes.io/control-plane, operator: In, values: ["true"] }
  serviceAccountName: system-upgrade
  tolerations: [ { operator: Exists } ]
  upgrade: { image: rancher/rke2-upgrade }
  version: v1.33.12+rke2r2          # <-- next minor, not the final target
---
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata: { name: rke2-agent, namespace: system-upgrade }
spec:
  concurrency: 1
  cordon: true
  nodeSelector:
    matchExpressions:
      - { key: rke2-upgrade, operator: In, values: ["true"] }
      - { key: node-role.kubernetes.io/control-plane, operator: NotIn, values: ["true"] }
  serviceAccountName: system-upgrade
  tolerations: [ { operator: Exists } ]
  prepare: { args: ["prepare", "rke2-server"], image: rancher/rke2-upgrade }
  upgrade: { image: rancher/rke2-upgrade }
  version: v1.33.12+rke2r2

kubectl apply -f rke2-upgrade-plans.yaml
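To see which nodes each Plan's nodeSelector matches, the same expressions can be written as kubectl label selectors. This is a sketch with hypothetical function names; it mirrors the selectors in the two Plans above and only reads node labels.

```shell
# Nodes the rke2-server Plan selects (opted in and control-plane).
server_plan_nodes() {
  kubectl get nodes -o name \
    -l 'rke2-upgrade=true,node-role.kubernetes.io/control-plane=true'
}

# Nodes the rke2-agent Plan selects (opted in, not control-plane).
# "!=" also matches nodes that lack the label, as NotIn does.
agent_plan_nodes() {
  kubectl get nodes -o name \
    -l 'rke2-upgrade=true,node-role.kubernetes.io/control-plane!=true'
}
```

Every opted-in node should appear in exactly one of the two lists. A node that appears in neither is not labeled rke2-upgrade=true and will be skipped.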

4. Watch the pass

kubectl -n system-upgrade get plans
kubectl -n system-upgrade get jobs,pods
kubectl get nodes -o wide        # kubeletVersion flips per node as each finishes

Nodes cycle Ready,SchedulingDisabled → NotReady (brief) → Ready. The controller uncordons each node when its job succeeds.
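Instead of watching by hand, the pass can be polled until every kubelet reports the new version. The sketch below uses hypothetical names (wait_for_version, POLL_INTERVAL) and treats an unreachable API server as "not done yet", since a single control-plane node drops the API briefly during its restart.

```shell
# wait_for_version VERSION [TIMEOUT_SECONDS]
# Polls until every node's kubelet reports VERSION, e.g. "v1.33.12+rke2r2".
# Returns 1 on timeout. The body runs in a subshell, so "exit" only
# leaves the function.
wait_for_version() (
  want="$1"; timeout="${2:-3600}"; interval="${POLL_INTERVAL:-15}"
  waited=0
  while :; do
    if nodes="$(kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.kubeletVersion}{"\n"}{end}')" &&
       [ -n "$nodes" ]; then
      # Collect the names of nodes that are not yet on the wanted version.
      behind="$(printf '%s\n' "$nodes" |
        awk -F'\t' -v want="$want" '$2 != want { print $1 }')"
      [ -z "$behind" ] && exit 0
    else
      behind="(API server unreachable)"
    fi
    if [ "$waited" -ge "$timeout" ]; then
      echo "not on $want yet: $behind" >&2
      exit 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
)
```

For example, `wait_for_version v1.33.12+rke2r2 5400` blocks for up to 90 minutes. It only compares version strings; node readiness and workload health still need the checks in the next step.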

5. Validate before the next minor

kubectl get nodes                                   # all Ready, all on the new minor
kubectl -n kube-system get pods | grep -E 'ovn|ovs|multus'        # CNI healthy (also matches ovn-central)
kubectl get vmi -A                                  # VMs still Running
kubectl get pods -A | grep -vE 'Running|Completed'  # nothing stuck
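The node and pod checks above can be turned into a gate that returns non-zero when something is off, so the next version patch is not applied by accident. This is a sketch with a hypothetical function name; it does not replace the CNI and VM checks, and a pod in the Running phase with crash-looping containers is not caught by it.

```shell
# pass_is_healthy
# Fails if any node is not plain "Ready" (cordoned nodes count as not
# ready here) or any pod is outside the Running/Succeeded phases.
pass_is_healthy() (
  nodes="$(kubectl get nodes --no-headers)" || exit 1
  not_ready="$(printf '%s\n' "$nodes" | awk '$2 != "Ready" { print $1 }')"
  stuck="$(kubectl get pods -A --no-headers \
    --field-selector=status.phase!=Running,status.phase!=Succeeded)" || exit 1
  if [ -n "$not_ready" ]; then
    echo "nodes not Ready: $not_ready" >&2
  fi
  if [ -n "$stuck" ]; then
    echo "pods not Running/Succeeded:" >&2
    echo "$stuck" >&2
  fi
  [ -z "$not_ready" ] && [ -z "$stuck" ]
)
```

A typical use is `pass_is_healthy && echo "safe to advance"`, run after the manual checks have also been reviewed.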

If healthy, advance both Plans to the next minor — SUC re-runs because the version hash changes:

NEXT=v1.34.8+rke2r2
kubectl -n system-upgrade patch plan rke2-server --type=merge -p "{\"spec\":{\"version\":\"$NEXT\"}}"
kubectl -n system-upgrade patch plan rke2-agent  --type=merge -p "{\"spec\":{\"version\":\"$NEXT\"}}"

Repeat until you reach the final target. When done, you can delete both Plans so that no upgrade target remains active:

kubectl -n system-upgrade delete plan rke2-server rke2-agent

KubeVirt VMs: cordon-only vs drain

The example Plans above are cordon-only (no drain:). This is the right default when VMs are not live-migratable (evictionStrategy: None). A drain would evict their virt-launcher pods and stop the VMs, whereas cordon-only leaves the containerd-managed VM pods running across the in-place RKE2 restart: rke2-upgrade swaps the binary and does not reboot the host.

If your VMs are live-migratable, add a drain to the agent Plan for a cleaner roll:

  drain: { force: true, skipWaitForDeleteTimeout: 60 }
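Before enabling the drain, it helps to see which VMs can actually migrate. The sketch below uses hypothetical function names; it assumes the KubeVirt LiveMigratable condition is present on each VMI, and note that the per-VMI evictionStrategy column stays empty when a VM inherits a cluster-wide default.

```shell
# Lists each VMI with its node, its own evictionStrategy and its
# LiveMigratable condition status.
vmi_migratability() {
  kubectl get vmi -A -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,NODE:.status.nodeName,EVICTION:.spec.evictionStrategy,MIGRATABLE:.status.conditions[?(@.type=="LiveMigratable")].status'
}

# One way to add the drain stanza to an agent Plan that already exists.
enable_agent_drain() {
  kubectl -n system-upgrade patch plan rke2-agent --type=merge \
    -p '{"spec":{"drain":{"force":true,"skipWaitForDeleteTimeout":60}}}'
}
```

If `vmi_migratability` shows any VM that is not migratable on a worker, keep that node's upgrade cordon-only.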

If a node's upgrade job won't schedule

Symptom: the node stays on the old version and kubectl -n system-upgrade get pods shows the apply pod Pending or Evicted.

FailedScheduling … Too many pods → the node is at its max-pods cap. Raise max-pods (kubelet arg) and restart the agent, or free a slot.
Pod was rejected … DiskPressure → the node is out of disk. Free space (an image prune helps only marginally if the bulk is PVC data) and/or lower the disk eviction threshold so that it sits below the node's current free space.
A previous attempt left a Failed Job that SUC is backing off from. Delete the stale Job so the controller recreates it: kubectl -n system-upgrade delete job <apply-…-job>.
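The three cases above can usually be told apart from events and Job status. A sketch with hypothetical function names, read-only against the system-upgrade namespace:

```shell
# Scheduling failures recorded in the namespace, oldest first.
# The event message carries the reason (e.g. "Too many pods").
upgrade_scheduling_events() {
  kubectl -n system-upgrade get events \
    --field-selector reason=FailedScheduling --sort-by=.lastTimestamp
}

# Pods that ended in the Failed phase, which includes evicted pods.
failed_upgrade_pods() {
  kubectl -n system-upgrade get pods --field-selector=status.phase=Failed
}

# Names of Jobs that have at least one failed pod. A missing
# .status.failed prints as "<none>", which awk evaluates to 0.
failed_upgrade_jobs() {
  kubectl -n system-upgrade get jobs --no-headers \
    -o custom-columns=NAME:.metadata.name,FAILED:.status.failed |
    awk '$2 + 0 > 0 { print $1 }'
}
```

Review the output of `failed_upgrade_jobs` before deleting anything; once the underlying cause is fixed, deleting the stale Job lets the controller recreate it.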

Post-upgrade

Confirm all nodes report the target version and are Ready.
Re-check CNI, KubeVirt VMs, and every operator pod.
Re-unseal OpenBao if its pod restarted.
The OIDC-webhook API-server flag lives in /etc/rancher/rke2/config.yaml and survives the binary swap — no re-cutover needed.
