Metal3 Bare-Metal Worker Nodes

This guide covers deploying and managing bare-metal worker nodes for Kube-DC using Metal3 — the Cluster API infrastructure provider for bare metal. Metal3 automates server discovery, OS provisioning, Kubernetes node joining, and lifecycle management through standard CAPI resources.

Architecture

┌─────────────────────────────────────────────────────────────────────────────────┐
│                     KUBE-DC MANAGEMENT CLUSTER                                  │
│                                                                                 │
│   ┌─────────────┐  ┌─────────────┐  ┌─────────────┐                             │
│   │  master-1   │  │  master-2   │  │  master-3   │  ← Control-plane nodes      │
│   │  RKE2 server│  │  RKE2 server│  │  RKE2 server│    (already installed)      │
│   │  OVN-DB     │  │  OVN-DB     │  │  OVN-DB     │                             │
│   └──────┬──────┘  └──────┬──────┘  └──────┬──────┘                             │
│          │                │                │                                    │
│   ┌──────┴────────────────┴────────────────┴──────┐                             │
│   │          Metal3 Control Plane Components      │                             │
│   │                                               │                             │
│   │  ┌──────────────────┐  ┌───────────────────┐  │                             │
│   │  │  Bare Metal      │  │  Ironic           │  │                             │
│   │  │  Operator (BMO)  │  │  (Provisioning)   │  │                             │
│   │  │                  │  │                   │  │                             │
│   │  │  Manages BMH CRs │  │  PXE/Virtual Media│  │                             │
│   │  └────────┬─────────┘  └────────┬──────────┘  │                             │
│   │           │                     │             │                             │
│   │  ┌────────┴─────────────────────┴──────────┐  │                             │
│   │  │  CAPM3 (Cluster API Provider Metal3)    │  │                             │
│   │  │  + Metal3 IPAM Controller               │  │                             │
│   │  └─────────────────────────────────────────┘  │                             │
│   └─────────────────────────┬─────────────────────┘                             │
│                             │                                                   │
│                             │  BMC (IPMI/Redfish/iDRAC)                         │
│                             ▼                                                   │
│   ┌─────────────────────────────────────────────────────────────────────────┐   │
│   │                     BARE-METAL WORKER POOL                              │   │
│   │                                                                         │   │
│   │   ┌───────────┐ ┌───────────┐  ┌───────────┐  ┌───────────┐             │   │
│   │   │ worker-1  │ │ worker-2  │  │ worker-3  │  │ worker-N  │             │   │
│   │   │ RKE2 agent│ │ RKE2 agent│  │ RKE2 agent│  │ RKE2 agent│             │   │
│   │   │ KubeVirt  │ │ KubeVirt  │  │ KubeVirt  │  │ KubeVirt  │             │   │
│   │   └───────────┘ └───────────┘  └───────────┘  └───────────┘             │   │
│   └─────────────────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────────────────┘

Network Connectivity:
─────────────────────
  Management VLAN ──── All nodes (masters + workers): Kubernetes API, etcd, SSH
  Cloud VLAN      ──── All nodes: Kube-OVN underlay, project VPCs, VM traffic
  Provider VLAN   ──── All nodes: Public IPs (EIPs, FIPs, Service LoadBalancers)
  BMC Network     ──── Masters → Worker BMCs: IPMI/Redfish for power management

How It Works

1. Enroll: Register each bare-metal server as a BareMetalHost (BMH) CR with its BMC address and credentials.
2. Inspect: Ironic powers on the server via BMC, boots an inspection ramdisk (over PXE or virtual media), and collects hardware inventory (CPUs, RAM, disks, NICs, MAC addresses).
3. Provision: When a MachineDeployment scales up, CAPM3 selects an available BMH, writes an OS image to disk via Ironic, and injects cloud-init user and network data.
4. Join: The provisioned server boots into Ubuntu with the RKE2 agent pre-configured, joins the management cluster, and becomes a schedulable worker node.
5. Heal: MachineHealthCheck monitors node health. If a node becomes unhealthy, the Metal3 remediation controller power-cycles it via BMC; if that does not help, the Machine is replaced and a host is reprovisioned.

Prerequisites

Hardware Requirements

Requirement	Details
BMC access	Each worker server must have a Baseboard Management Controller (IPMI, Redfish, iDRAC, iLO) reachable from the management cluster
PXE or Virtual Media boot	Servers must support network boot (PXE) or Redfish Virtual Media for OS provisioning
Boot mode	UEFI recommended (legacy BIOS supported but not recommended)
Network interfaces	Minimum 2 NICs: one for management, one trunk for cloud/provider VLANs
Storage	At least one disk for OS installation (SSD recommended, 100 GB+)

Network Requirements

The management cluster nodes must be able to reach:

Target	Protocol	Port	Purpose
Worker BMCs	IPMI/Redfish	623 (IPMI), 443 (Redfish)	Power management, virtual media
Worker PXE NICs	DHCP + TFTP	67-69, 6180	PXE boot (if not using virtual media)
Worker management NICs	SSH	22	Post-provisioning verification

Workers must have the same VLAN trunk access as the master nodes — they need connectivity to the management, cloud, and provider VLANs. See Network Architecture for VLAN details.

Software Requirements

Component	Status	Notes
Kube-DC management cluster	Required	3-node HA cluster per Installation Guide
cert-manager	Already installed	Deployed by the kube-dc installer (Flux)
Cluster API core	Already installed	Deployed by the kube-dc installer (Flux)
CAPM3 + BMO + Ironic	To be installed	This guide covers installation
OS disk image	To be prepared	Ubuntu 24.04 with RKE2 agent pre-baked

Information to Collect

Before proceeding, gather the following for each worker server:

Info	Example	How to obtain
BMC IP address	`192.168.1.101`	Check BMC/iDRAC web UI or server documentation
BMC protocol	`redfish-virtualmedia`	Depends on hardware vendor (see supported hardware)
BMC credentials	`admin` / `password`	Set via BMC web interface
Boot NIC MAC address	`aa:bb:cc:dd:ee:01`	Check `ip link` output or BMC hardware inventory
Management NIC name	`eth0` or `eno1`	Varies by hardware; inspect after first boot
Trunk NIC name	`eth1` or `enp94s0f0np0`	The VLAN-capable NIC connected to cloud/provider switch
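
Typos in BMC addresses or MAC addresses tend to surface only later as hosts stuck in registering or inspecting, so it can help to sanity-check the collected values first. The following is a minimal sketch, not part of Metal3 or Kube-DC: the function names and the whitespace-separated inventory format (name, BMC IP, boot MAC) are assumptions made for illustration.

```shell
# Hypothetical pre-flight helpers for the inventory collected above.

# Succeeds if $1 is a colon-separated MAC address such as aa:bb:cc:dd:ee:01.
valid_mac() {
  printf '%s\n' "$1" | grep -Eq '^([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}$'
}

# Succeeds if $1 is a dotted-quad IPv4 address with every octet in 0-255.
valid_ipv4() {
  printf '%s\n' "$1" | awk -F. '
    NF != 4 { exit 1 }
    { for (i = 1; i <= 4; i++) if ($i !~ /^[0-9]+$/ || $i + 0 > 255) exit 1 }'
}

# Reads "name bmc_ip boot_mac" lines on stdin, reports bad entries on stderr,
# and returns non-zero if any entry is malformed.
check_inventory() {
  bad=0
  while read -r name bmc_ip mac; do
    [ -z "$name" ] && continue
    valid_ipv4 "$bmc_ip" || { echo "$name: bad BMC IP '$bmc_ip'" >&2; bad=1; }
    valid_mac "$mac" || { echo "$name: bad MAC '$mac'" >&2; bad=1; }
  done
  return "$bad"
}
```

With the inventory saved in a file (here a hypothetical workers.txt), `check_inventory < workers.txt` returns success only when every line passes both checks.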

Phase 1 — Install Metal3 Components

1.1 Initialize CAPM3

The Kube-DC installer already deploys Cluster API core components. Add the Metal3 infrastructure provider and IPAM:

clusterctl init --infrastructure metal3 --ipam metal3

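This installs:

CAPM3: Cluster API Provider Metal3 (manages Metal3Machine, Metal3MachineTemplate)
Metal3 IPAM: IP address management for static IP assignment during provisioning

The Bare Metal Operator (BMO), which manages the BareMetalHost lifecycle, is decoupled from CAPM3 in recent releases and may not be installed by clusterctl. If the verification in step 1.3 shows no BMO pods in baremetal-operator-system, install BMO separately by following the upstream Metal3 documentation.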

1.2 Deploy Ironic

Ironic is the provisioning engine that Metal3 uses to interact with hardware via BMC protocols. Deploy it using the Ironic Standalone Operator:

# Install Ironic Standalone Operator
kubectl apply -k https://github.com/metal3-io/ironic-standalone-operator/config/default

kubectl -n ironic-standalone-operator-system wait \
  --for=condition=Available --timeout=300s \
  deploy/ironic-standalone-operator-controller-manager

Create the Ironic deployment. Adjust the network settings to match your BMC network:

# ironic.yaml
apiVersion: ironic.metal3.io/v1alpha1    # API group served by the Ironic Standalone Operator
kind: Ironic
metadata:
  name: ironic
  namespace: baremetal-operator-system
spec:
  networking:
    interface: eth0                       # Management interface on master nodes
    ipAddress: 192.168.0.1               # Management IP of the node running Ironic
    dhcp:
      networkCIDR: 192.168.0.0/18        # Management network CIDR
      rangeBegin: 192.168.10.1           # DHCP range for PXE boot (avoid conflicts)
      rangeEnd: 192.168.10.254
  databaseRef:                           # Must point at a database resource that already exists (not created in this guide)
    name: ironic-mariadb
    namespace: baremetal-operator-system

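# Create the namespace only if it does not exist yet (safe to re-run)
kubectl create ns baremetal-operator-system --dry-run=client -o yaml | kubectl apply -f -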
kubectl apply -f ironic.yaml

PXE vs Virtual Media

If your hardware supports Redfish Virtual Media (most modern servers do), you can skip the DHCP configuration entirely. Virtual Media mounts the boot ISO directly via BMC, avoiding PXE network complexity. Use BMC addresses like redfish-virtualmedia://192.168.1.101/redfish/v1/Systems/1 in your BareMetalHost specs.
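
The address format depends on the BMC driver, so a small helper can keep the strings consistent across many hosts. This is a sketch under assumptions: the function name is hypothetical, the scheme list is only a subset of what the Bare Metal Operator accepts, and the default system path /redfish/v1/Systems/1 is simply the one used in this guide (it varies by vendor, so check your BMC).

```shell
# Build a BMC address string for a BareMetalHost spec.
# $1 = scheme, $2 = BMC host or IP, $3 = optional Redfish system path
bmc_address() {
  scheme=$1
  host=$2
  path=${3:-/redfish/v1/Systems/1}   # assumption: vendor-specific, adjust as needed
  case "$scheme" in
    ipmi)
      # IPMI addresses carry no Redfish system path
      printf '%s://%s\n' "$scheme" "$host" ;;
    redfish|redfish-virtualmedia|idrac-virtualmedia)
      printf '%s://%s%s\n' "$scheme" "$host" "$path" ;;
    *)
      echo "unsupported scheme: $scheme" >&2
      return 1 ;;
  esac
}
```

For example, `bmc_address redfish-virtualmedia 192.168.1.101` prints the same address used in the BareMetalHost examples in this guide.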

1.3 Verify Metal3 Stack

# Check all Metal3 components are running
kubectl get pods -n capm3-system
kubectl get pods -n baremetal-operator-system
kubectl get pods -n ironic-standalone-operator-system

# Verify CRDs are installed
kubectl api-resources | grep metal3
# Expected: baremetalhosts, metal3machines, metal3machinetemplates, etc.
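
Scanning the grep output by eye is easy to get wrong, so the check can be made explicit. The sketch below assumes the resource names listed in it are the ones this guide relies on; the function name is hypothetical, and the list should be adjusted if your Metal3 version ships different CRDs.

```shell
# Reads `kubectl api-resources -o name` output on stdin and prints every
# expected Metal3 resource that is absent. Prints nothing when all exist.
missing_resources() {
  found=$(cat)
  for r in baremetalhosts.metal3.io \
           metal3machines.infrastructure.cluster.x-k8s.io \
           metal3machinetemplates.infrastructure.cluster.x-k8s.io \
           metal3datatemplates.infrastructure.cluster.x-k8s.io \
           ippools.ipam.metal3.io; do
    # -x matches the whole line, -F treats the name as a fixed string
    printf '%s\n' "$found" | grep -qxF "$r" || printf '%s\n' "$r"
  done
}

# Usage (requires cluster access):
#   kubectl api-resources -o name | missing_resources
```

An empty result means every resource type used later in this guide is registered.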

Phase 2 — Prepare the Worker OS Image

Metal3 provisions servers by writing a disk image. This image must contain:

Ubuntu 24.04 LTS base system
RKE2 agent binaries (pre-installed but not started)
cloud-init for first-boot configuration (network, hostname, cluster join)
Kernel modules required by Kube-OVN (openvswitch, nf_conntrack)

2.1 Build the Image

Use a tool such as image-builder or Packer, or customize the stock Ubuntu cloud image with virt-customize as shown here:

# Example: Download base Ubuntu 24.04 cloud image and customize
wget https://cloud-images.ubuntu.com/noble/current/noble-server-cloudimg-amd64.img

# Customize with virt-customize (libguestfs)
virt-customize -a noble-server-cloudimg-amd64.img \
  --install curl,iptables,linux-headers-generic,nfs-common,open-iscsi \
  --run-command 'curl -sfL https://get.rke2.io | INSTALL_RKE2_VERSION=v1.35.0+rke2r1 INSTALL_RKE2_TYPE=agent sh -' \
  --run-command 'systemctl enable rke2-agent.service' \
  --run-command 'printf "openvswitch\nnf_conntrack\n" >> /etc/modules' \
  --run-command 'echo "fs.inotify.max_user_watches=1524288" >> /etc/sysctl.conf' \
  --run-command 'echo "fs.inotify.max_user_instances=4024" >> /etc/sysctl.conf' \
  --run-command 'echo "net.ipv4.ip_forward=1" >> /etc/sysctl.conf' \
  --run-command 'systemctl disable systemd-resolved' \
  --run-command 'rm -f /etc/resolv.conf && printf "nameserver 8.8.8.8\nnameserver 8.8.4.4\n" > /etc/resolv.conf'

2.2 Host the Image

Make the image available via HTTP from a server reachable by Ironic:

# Compute checksum
sha256sum noble-server-cloudimg-amd64.img > noble-server-cloudimg-amd64.img.sha256sum

# Serve via nginx, Apache, or any HTTP server
# Example URL: http://192.168.0.1:8080/images/noble-server-cloudimg-amd64.img
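
Before pointing Metal3 at the hosted files, it is worth confirming that the served image still matches its checksum file, for example after copying both to the web server. A minimal sketch, assuming GNU coreutils sha256sum is available; the function name is hypothetical.

```shell
# Succeeds when the SHA-256 of file $1 equals the first field of the
# checksum file $2 (the format written by `sha256sum file > file.sha256sum`).
verify_sha256() {
  actual=$(sha256sum "$1" | awk '{print $1}')
  expected=$(awk 'NR == 1 {print $1}' "$2")
  [ -n "$actual" ] && [ "$actual" = "$expected" ]
}

# Usage:
#   verify_sha256 noble-server-cloudimg-amd64.img \
#                 noble-server-cloudimg-amd64.img.sha256sum && echo "checksum OK"
```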

Phase 3 — Enroll Bare-Metal Hosts

3.1 Create BMC Credentials

Create a Kubernetes secret for each worker server's BMC credentials:

# bmh-secrets.yaml
apiVersion: v1
kind: Secret
metadata:
  name: worker-1-bmc
  namespace: baremetal-operator-system
type: Opaque
stringData:
  username: admin
  password: your-bmc-password
---
apiVersion: v1
kind: Secret
metadata:
  name: worker-2-bmc
  namespace: baremetal-operator-system
type: Opaque
stringData:
  username: admin
  password: your-bmc-password

kubectl apply -f bmh-secrets.yaml

3.2 Create BareMetalHost Resources

# baremetalhosts.yaml
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: worker-1
  namespace: baremetal-operator-system
spec:
  online: true
  bootMACAddress: "aa:bb:cc:dd:ee:01"    # MAC of the PXE/management NIC
  bootMode: UEFI
  bmc:
    address: redfish-virtualmedia://192.168.1.101/redfish/v1/Systems/1
    credentialsName: worker-1-bmc
    disableCertificateVerification: true
  automatedCleaningMode: metadata          # Clean disk metadata between provisions
  rootDeviceHints:
    minSizeGigabytes: 100                  # Select disk ≥100 GB for OS install
---
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: worker-2
  namespace: baremetal-operator-system
spec:
  online: true
  bootMACAddress: "aa:bb:cc:dd:ee:02"
  bootMode: UEFI
  bmc:
    address: redfish-virtualmedia://192.168.1.102/redfish/v1/Systems/1
    credentialsName: worker-2-bmc
    disableCertificateVerification: true
  automatedCleaningMode: metadata
  rootDeviceHints:
    minSizeGigabytes: 100

kubectl apply -f baremetalhosts.yaml
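
With more than a handful of servers, writing each BareMetalHost by hand becomes error-prone. The sketch below generates the same manifest shape from an inventory list. The function name and the input format (name, BMC IP, boot MAC per line) are assumptions for illustration, and it hardcodes the Redfish Virtual Media address and settings used in the examples above.

```shell
# Reads "name bmc_ip boot_mac" lines on stdin and prints one BareMetalHost
# document per line, mirroring the fields used in baremetalhosts.yaml.
render_bmh() {
  while read -r name bmc_ip mac; do
    [ -z "$name" ] && continue   # skip blank lines
    cat <<EOF
---
apiVersion: metal3.io/v1alpha1
kind: BareMetalHost
metadata:
  name: $name
  namespace: baremetal-operator-system
spec:
  online: true
  bootMACAddress: "$mac"
  bootMode: UEFI
  bmc:
    address: redfish-virtualmedia://$bmc_ip/redfish/v1/Systems/1
    credentialsName: $name-bmc
    disableCertificateVerification: true
  automatedCleaningMode: metadata
  rootDeviceHints:
    minSizeGigabytes: 100
EOF
  done
}

# Usage (workers.txt is a hypothetical inventory file):
#   render_bmh < workers.txt | kubectl apply -f -
```

The matching `<name>-bmc` Secrets from section 3.1 still have to exist before the hosts can register.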

3.3 Wait for Inspection

Watch the BMH resources progress through registering → inspecting → available:

kubectl get bmh -n baremetal-operator-system -w

# NAME       STATE        CONSUMER   ONLINE   ERROR   AGE
# worker-1   registering             true             10s
# worker-1   inspecting              true             30s
# worker-1   available               true             5m
# worker-2   available               true             5m

Once a BMH reaches available, Ironic has successfully:

Powered on the server via BMC
Booted the inspection ramdisk (over PXE or virtual media)
Collected hardware inventory (CPUs, RAM, disks, NICs with MAC addresses)
Powered the server back off

Inspect the discovered hardware:

kubectl get bmh worker-1 -n baremetal-operator-system -o jsonpath='{.status.hardware}' | jq .

This shows all discovered NICs, disks, CPU, and RAM — essential for configuring network data templates.

Phase 4 — Configure CAPI Resources for Worker Provisioning

4.1 Create Metal3 IPAM Pool

Define an IP pool for worker management network addresses:

# ippool-mgmt.yaml
apiVersion: ipam.metal3.io/v1alpha1
kind: IPPool
metadata:
  name: worker-mgmt-pool
  namespace: baremetal-operator-system
spec:
  clusterName: kube-dc-mgmt
  namePrefix: worker-mgmt
  pools:
    - start: 192.168.0.50
      end: 192.168.0.99
      prefix: 18
      gateway: 192.168.0.254

kubectl apply -f ippool-mgmt.yaml
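
The pool has to be large enough for the planned worker count, sit inside the management CIDR, and stay clear of the Ironic DHCP range from section 1.2. The arithmetic can be checked with a few helpers. This is a sketch with hypothetical function names; it assumes a 64-bit shell and that the network argument is the aligned base address of the CIDR.

```shell
# Convert a dotted-quad IPv4 address to an integer (runs in a subshell so
# the IFS change does not leak).
ip_to_int() (
  set -f; IFS=.
  set -- $1
  echo $(( $1 * 16777216 + $2 * 65536 + $3 * 256 + $4 ))
)

# Number of addresses between start $1 and end $2, inclusive.
pool_size() {
  echo $(( $(ip_to_int "$2") - $(ip_to_int "$1") + 1 ))
}

# Succeeds if address $1 lies inside network $2 with prefix length $3.
in_cidr() {
  ip=$(ip_to_int "$1")
  net=$(ip_to_int "$2")
  size=$(( 1 << (32 - $3) ))
  [ "$ip" -ge "$net" ] && [ "$ip" -lt $(( net + size )) ]
}

# Succeeds if range $1..$2 overlaps range $3..$4.
ranges_overlap() {
  [ "$(ip_to_int "$1")" -le "$(ip_to_int "$4")" ] &&
    [ "$(ip_to_int "$3")" -le "$(ip_to_int "$2")" ]
}
```

For the values in this guide, `pool_size 192.168.0.50 192.168.0.99` prints 50, both pool ends pass `in_cidr ... 192.168.0.0 18`, and `ranges_overlap 192.168.0.50 192.168.0.99 192.168.10.1 192.168.10.254` fails, meaning the static pool and the PXE DHCP range do not collide.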

4.2 Create Metal3DataTemplate

The Metal3DataTemplate defines how network data and metadata are generated for each provisioned worker. This is critical — it tells cloud-init how to configure the server's network interfaces.

# metal3datatemplate.yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: Metal3DataTemplate
metadata:
  name: worker-data-template
  namespace: baremetal-operator-system
spec:
  clusterName: kube-dc-mgmt
  metaData:
    objectNames:
      - key: local-hostname
        object: machine                   # Use the CAPI Machine name as the hostname
  networkData:
    links:
      ethernets:
        # Management NIC — carries Kubernetes API, SSH, node-to-node traffic
        - id: mgmt-nic
          macAddress:
            fromHostInterface: eth0       # Matched against BMH hardware inventory
          type: phy
          mtu: 1500
        # Trunk NIC — carries cloud and provider VLANs
        # Do NOT assign an IP; Kube-OVN manages this via OVS bridges
        - id: trunk-nic
          macAddress:
            fromHostInterface: eth1       # Matched against BMH hardware inventory
          type: phy
          mtu: 9000
    networks:
      ipv4:
        - id: mgmt-network
          ipAddressFromIPPool: worker-mgmt-pool
          link: mgmt-nic
          routes:
            - network: 0.0.0.0
              prefix: 0
              gateway:
                fromIPPool: worker-mgmt-pool
    services:
      dns:
        - 8.8.8.8
        - 8.8.4.4

kubectl apply -f metal3datatemplate.yaml

NIC Name Matching

The fromHostInterface values (eth0, eth1) must match the NIC names discovered during BMH inspection. Run kubectl get bmh worker-1 -n baremetal-operator-system -o jsonpath='{.status.hardware.nics}' to see the actual interface names on your hardware. If NIC names differ across servers, either set the MAC address explicitly (macAddress.string in place of fromHostInterface, which means one data template per host) or enforce consistent naming via udev rules in the OS image.

4.3 Create Metal3MachineTemplate

# metal3machinetemplate.yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: Metal3MachineTemplate
metadata:
  name: worker-machine-template
  namespace: baremetal-operator-system
spec:
  template:
    spec:
      image:
        url: http://192.168.0.1:8080/images/noble-server-cloudimg-amd64.img
        checksum: http://192.168.0.1:8080/images/noble-server-cloudimg-amd64.img.sha256sum
        checksumType: sha256
        format: qcow2
      dataTemplate:
        name: worker-data-template
      hostSelector: {}                    # Selects any available BMH (or use matchLabels)

kubectl apply -f metal3machinetemplate.yaml

4.4 Create Cloud-Init UserData

The cloud-init user data configures RKE2 agent to join the management cluster and applies required system settings:

# worker-userdata-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: worker-userdata
  namespace: baremetal-operator-system
type: Opaque
stringData:
  userData: |
    ## template: jinja
    #cloud-config
    hostname: '{{ ds.meta_data.local_hostname }}'
    manage_etc_hosts: true

    write_files:
      - path: /etc/rancher/rke2/config.yaml
        owner: root:root
        permissions: '0600'
        content: |
          token: <RKE2_JOIN_TOKEN>
          server: https://192.168.0.1:9345
          cni: none
          # node-ip is omitted: it must be an IP address, not a hostname, and
          # RKE2 detects the management NIC address (the only NIC with an IP)

      - path: /etc/sysctl.d/99-kube-dc.conf
        owner: root:root
        permissions: '0644'
        content: |
          fs.inotify.max_user_watches=1524288
          fs.inotify.max_user_instances=4024
          net.ipv4.ip_forward=1

    runcmd:
      - sysctl --system
      - modprobe nf_conntrack
      - echo "nf_conntrack" >> /etc/modules
      - systemctl stop systemd-resolved || true
      - systemctl disable systemd-resolved || true
      - rm -f /etc/resolv.conf
      - printf 'nameserver 8.8.8.8\nnameserver 8.8.4.4\n' > /etc/resolv.conf
      - systemctl enable rke2-agent.service
      - systemctl start rke2-agent.service

Security

Replace <RKE2_JOIN_TOKEN> with the actual token from master-1:

sudo cat /var/lib/rancher/rke2/server/node-token

For production, consider using a Kubernetes Secret reference or sealed secret instead of embedding the token directly.

kubectl apply -f worker-userdata-secret.yaml
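
One way to avoid saving the real token in worker-userdata-secret.yaml is to keep the placeholder in the file and substitute it only at apply time. The following is a sketch, not a Kube-DC feature: the function name is hypothetical, and it assumes the token contains no '|', '&', or backslash characters, which would break the sed expression.

```shell
# Replaces the <RKE2_JOIN_TOKEN> placeholder in the text read on stdin with
# the token given as $1. Refuses empty tokens and tokens that sed would
# misinterpret.
inject_token() {
  case "$1" in
    ''|*'|'*|*'&'*|*\\*)
      echo "refusing empty or unsafe token" >&2
      return 1 ;;
  esac
  sed "s|<RKE2_JOIN_TOKEN>|$1|g"
}

# Usage (TOKEN holds the value read from node-token on master-1):
#   inject_token "$TOKEN" < worker-userdata-secret.yaml | kubectl apply -f -
```

The rendered manifest goes straight to kubectl over a pipe, so the token is never written to the working directory.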

4.5 Create MachineDeployment

The MachineDeployment controls how many worker nodes to provision and links all the templates together:

# machinedeployment.yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: kube-dc-workers
  namespace: baremetal-operator-system
  labels:
    cluster.x-k8s.io/cluster-name: kube-dc-mgmt
    nodepool: kube-dc-worker-pool
spec:
  clusterName: kube-dc-mgmt
  replicas: 2                             # Number of worker nodes to provision
  selector:
    matchLabels:
      cluster.x-k8s.io/cluster-name: kube-dc-mgmt
      nodepool: kube-dc-worker-pool
  template:
    metadata:
      labels:
        cluster.x-k8s.io/cluster-name: kube-dc-mgmt
        nodepool: kube-dc-worker-pool
    spec:
      clusterName: kube-dc-mgmt
      bootstrap:
        dataSecretName: worker-userdata
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: Metal3MachineTemplate
        name: worker-machine-template
      version: v1.35.0

kubectl apply -f machinedeployment.yaml

Watch the provisioning:

# Watch BMH state changes
kubectl get bmh -n baremetal-operator-system -w

# Watch Machine status
kubectl get machines -n baremetal-operator-system

# Watch nodes joining the cluster
kubectl get nodes -w

Provisioning typically takes 10–20 minutes per server (depending on hardware, network speed, and image size).
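
Because provisioning runs for many minutes, a scripted rollout usually needs to block until a node is Ready rather than watch interactively. A minimal polling sketch follows; both function names are hypothetical, and node_ready requires kubectl access to the management cluster.

```shell
# Re-runs a command until it succeeds or the deadline passes.
# $1 = timeout in seconds, $2 = poll interval in seconds, rest = command
wait_until() {
  timeout=$1
  interval=$2
  shift 2
  elapsed=0
  while ! "$@"; do
    elapsed=$((elapsed + interval))
    [ "$elapsed" -gt "$timeout" ] && return 1
    sleep "$interval"
  done
  return 0
}

# Succeeds when node $1 reports the Ready condition with status True.
node_ready() {
  status=$(kubectl get node "$1" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}' 2>/dev/null)
  [ "$status" = "True" ]
}

# Usage: wait up to 30 minutes, polling every 30 seconds
#   wait_until 1800 30 node_ready worker-1 || echo "worker-1 did not become Ready"
```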

Phase 5 — Post-Provisioning Network Configuration

After worker nodes join the cluster, configure Kube-OVN networking to include them.

5.1 Update ProviderNetwork

The Kube-OVN ProviderNetwork must include the worker nodes so they can participate in cloud and provider VLAN traffic. If workers have the same trunk NIC name as the masters, they are automatically included via defaultInterface. If NICs differ, add customInterfaces:

# Check what NIC names the workers have
kubectl get bmh worker-1 -n baremetal-operator-system \
  -o jsonpath='{.status.hardware.nics[*].name}'

Patch the ProviderNetwork:

# provider-network-patch.yaml
apiVersion: kubeovn.io/v1
kind: ProviderNetwork
metadata:
  name: ext-cloud
spec:
  defaultInterface: eth1                  # Default trunk NIC (most nodes)
  customInterfaces:
    - interface: eno2                     # Override for workers with different NIC names
      nodes:
        - worker-1
        - worker-2
  autoCreateVlanSubinterfaces: true
  preserveVlanInterfaces: true

kubectl apply -f provider-network-patch.yaml

Verify all nodes (masters + workers) are ready in the ProviderNetwork:

kubectl get provider-networks ext-cloud -o jsonpath='{.status.readyNodes}' | jq .
# Expected: ["master-1", "master-2", "master-3", "worker-1", "worker-2"]

5.2 Node Labels

Worker nodes provisioned by Metal3 must not carry the kube-ovn/role=master or kube-dc-manager=true labels. Both are reserved for control-plane nodes: the first marks nodes that run the OVN Northbound/Southbound databases, the second schedules Kube-DC control-plane pods. Verify that both are absent on workers:

# Workers should NOT have these labels:
kubectl get node worker-1 --show-labels | grep -E 'kube-ovn/role|kube-dc-manager'
# Expected: no output

Label	Masters	Workers	Purpose
`kube-ovn/role=master`	✅ Yes	❌ No	Runs OVN central databases
`kube-dc-manager=true`	✅ Yes	❌ No	Schedules Kube-DC control-plane pods
`node-role.kubernetes.io/worker`	❌ No	✅ Yes (auto)	Standard Kubernetes worker role

5.3 Verify Worker Networking

After the ProviderNetwork is updated, Kube-OVN creates OVS bridges on the worker nodes:

# Check OVS bridges on a worker
kubectl exec -n kube-system -it $(kubectl get pod -n kube-system -l app=ovs-ovn \
  --field-selector spec.nodeName=worker-1 -o name) -- ovs-vsctl show

# Check that VLAN subinterfaces were created
kubectl get provider-networks ext-cloud -o jsonpath='{.status.vlans}'
# Expected: ["vlan200","vlan300"] (your cloud and provider VLANs)

Phase 6 — Health Checks and Auto-Remediation

Metal3 supports automated health checking and remediation of worker nodes through CAPI MachineHealthCheck and Metal3RemediationTemplate resources.

6.1 Create Metal3RemediationTemplate

The remediation template defines the strategy for handling unhealthy nodes. The reboot strategy is recommended for bare metal — it power-cycles the server via BMC rather than reprovisioning from scratch:

# metal3remediationtemplate.yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: Metal3RemediationTemplate
metadata:
  name: worker-remediation
  namespace: baremetal-operator-system
spec:
  template:
    spec:
      strategy:
        type: Reboot
        retryLimit: 2                     # Retry power-cycle up to 2 times
        timeout: 600s                     # Wait 10 minutes for node to recover

6.2 Create MachineHealthCheck

# machinehealthcheck.yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: worker-healthcheck
  namespace: baremetal-operator-system
spec:
  clusterName: kube-dc-mgmt
  # Match all machines in the worker pool
  selector:
    matchLabels:
      nodepool: kube-dc-worker-pool
  # Safety valve: don't remediate if >40% of nodes are unhealthy
  maxUnhealthy: 40%
  # Time to wait for a new node to join before considering it unhealthy
  nodeStartupTimeout: 30m                 # Bare metal is slow — allow 30 minutes
  # Conditions that trigger remediation
  unhealthyConditions:
    - type: Ready
      status: Unknown
      timeout: 300s                       # Node is unreachable for 5 minutes
    - type: Ready
      status: "False"
      timeout: 300s                       # Node reports NotReady for 5 minutes
  # Use Metal3 remediation (power-cycle via BMC)
  remediationTemplate:
    kind: Metal3RemediationTemplate
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    name: worker-remediation

kubectl apply -f metal3remediationtemplate.yaml
kubectl apply -f machinehealthcheck.yaml
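
The maxUnhealthy percentage interacts with pool size in a way that is easy to miss. The sketch below works through the arithmetic, assuming Cluster API rounds the percentage down and remediates only while the number of unhealthy machines is at or below the result, as the upstream MachineHealthCheck documentation describes; verify this against your CAPI version. The function name is hypothetical.

```shell
# Largest number of unhealthy machines that still allows remediation.
# $1 = maxUnhealthy percentage (without the % sign), $2 = machines in the pool
max_unhealthy_allowed() {
  echo $(( $1 * $2 / 100 ))   # integer division rounds down
}

max_unhealthy_allowed 40 2    # prints 0
max_unhealthy_allowed 40 3    # prints 1
max_unhealthy_allowed 40 5    # prints 2
```

Under that assumption, the two-replica pool from section 4.5 combined with maxUnhealthy: 40% yields 0, so a single failed worker would not be remediated. For small pools, consider a higher percentage or an absolute value such as maxUnhealthy: 1.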

6.3 How Remediation Works

When a worker node becomes unhealthy:

1. MachineHealthCheck detects unhealthy condition (Ready=Unknown for 5 min)
         │
         ▼
2. Creates Metal3Remediation request
         │
         ▼
3. CAPM3 Remediation Controller:
   a. Powers OFF the server via BMC
   b. Applies Out-of-Service taint on the Node
      (triggers StatefulSet/PV rescheduling to healthy nodes)
   c. Powers ON the server via BMC
         │
         ▼
4. Server reboots → RKE2 agent reconnects → Node becomes Ready
         │
         ▼
5. If still unhealthy after retryLimit, Machine is deleted and
   a new one is provisioned from the available BMH pool

6.4 Monitor Health Checks

# Check MachineHealthCheck status
kubectl get machinehealthcheck -n baremetal-operator-system

# Check for active remediations
kubectl get metal3remediation -n baremetal-operator-system

# Check Machine health conditions
kubectl get machines -n baremetal-operator-system -o wide

Scaling Workers

Scale Up

Increase replicas in the MachineDeployment:

kubectl scale machinedeployment kube-dc-workers \
  -n baremetal-operator-system --replicas=4

CAPM3 will select available BareMetalHosts and provision them. Remember to:

Update the ProviderNetwork if new workers have different NIC names
Ensure enough BareMetalHosts are enrolled and in available state
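
Whether enough hosts are free can be checked before scaling. A minimal sketch, assuming the STATE column is the second field of `kubectl get bmh --no-headers` output, as in the listing in section 3.3; the function name is hypothetical.

```shell
# Counts hosts in the `available` state from `kubectl get bmh --no-headers`
# output read on stdin.
count_available() {
  awk '$2 == "available" { n++ } END { print n + 0 }'
}

# Usage (requires cluster access), before adding 2 replicas:
#   free=$(kubectl get bmh -n baremetal-operator-system --no-headers | count_available)
#   [ "$free" -ge 2 ] || echo "only $free BareMetalHosts available"
```

If the count is lower than the number of replicas being added, the extra Machines will wait until more hosts are enrolled and inspected.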

Scale Down

kubectl scale machinedeployment kube-dc-workers \
  -n baremetal-operator-system --replicas=1

CAPM3 will:

Cordon and drain the selected worker
Power off the server via BMC
Clean the disk (per automatedCleaningMode)
Return the BMH to available state for future use

Best Practices

Disk Management

Set rootDeviceHints on BareMetalHosts to ensure the OS is installed on the correct disk, especially on servers with multiple drives
Use automatedCleaningMode: metadata to wipe partition tables between provisions without full disk erase (saves time)
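The automatedCleaningMode field accepts only metadata and disabled. For sensitive environments that require a full disk erase, plan a separate wipe procedure (for example through Ironic's cleaning configuration) instead of relying on this field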

BMC Security

Use Redfish Virtual Media over IPMI when possible — it's more secure and reliable
Enable TLS on BMC interfaces and avoid disableCertificateVerification: true in production
Rotate BMC credentials regularly and store them as Kubernetes Secrets

Image Management

Pre-bake RKE2 agent, kernel modules, and system packages into the OS image to reduce first-boot time
Maintain versioned images (e.g., ubuntu-24.04-rke2-v1.35.0.qcow2) for reproducible deployments
Host images on a local HTTP server within the management network — downloading from the internet during provisioning is slow and unreliable

Node Reuse

Metal3 supports node reuse during rolling upgrades — instead of provisioning a fresh BMH, it reprovisions the same server with a new image
Enable this by setting nodeReuse: true on the Metal3MachineTemplate and using a scale-in rollout (maxSurge: 0) on the MachineDeployment
This significantly reduces upgrade time for bare-metal clusters

Monitoring and Alerting

Set up Prometheus alerts for BMH state changes (e.g., error, provisioning failed)
Monitor the MachineHealthCheck targets count and current healthy/unhealthy ratios
Alert on Metal3Remediation objects being created — they indicate node failures

ProviderNetwork Consistency

If all worker servers are the same hardware model, they will have consistent NIC names — use defaultInterface in the ProviderNetwork
For mixed hardware, use customInterfaces to map each node to its correct trunk NIC
Always verify workers appear in provider-networks ext-cloud status after joining

Troubleshooting

BMH Stuck in "registering"

kubectl get bmh worker-1 -n baremetal-operator-system -o yaml | grep -A5 errorMessage

Common causes:

BMC IP unreachable from management cluster — check network/firewall
Wrong BMC credentials — verify the Secret
Unsupported BMC protocol — check Metal3 supported hardware

BMH Stuck in "inspecting"

Server failed to PXE boot — check BIOS boot order, PXE NIC settings
Ironic ramdisk didn't start — check Ironic logs: kubectl logs -n baremetal-operator-system -l app=ironic
Virtual Media mount failed — ensure BMC firmware supports the protocol

Worker Node Not Joining Cluster

# Check RKE2 agent logs on the worker (via BMC console or SSH)
sudo journalctl -u rke2-agent -f

# Common issues:
# - Wrong join token in cloud-init userData
# - master-1 unreachable on management network (check routing)
# - DNS resolution failing (check /etc/resolv.conf)

Kube-OVN Not Working on Worker

kubectl get pods -n kube-system -l app=kube-ovn-cni --field-selector spec.nodeName=worker-1
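# kubectl logs has no --field-selector flag, so resolve the pod name first
kubectl logs -n kube-system $(kubectl get pod -n kube-system -l app=kube-ovn-cni \
  --field-selector spec.nodeName=worker-1 -o name)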

Common cause: Worker's trunk NIC not matched in ProviderNetwork. Fix by adding a customInterfaces entry.

Related Documentation

Installation Overview — Reference architecture and network prerequisites
Installation Guide — Management cluster deployment
Networking Architecture — Kube-OVN, VLANs, VPCs, service exposure
Deploy MetalLB HA — Floating IP for Envoy Gateway
Metal3 User Guide — Upstream Metal3 documentation
CAPM3 Remediation — Health check and remediation details
