Common Checks & Troubleshooting

Daily-driver health checks first, then a catalogue of errors with fixes.

Common checks

"Is my fleet healthy?"

kube-dc bootstrap
# expect: every row Ready (or Reconciling during a deploy)

"Can I reach this cluster?"

kube-dc bootstrap kubeconfig <cluster>    # writes the context
kube-dc login --domain <domain> --admin   # mints OIDC tokens
kubectl --context kube-dc/<domain>/admin get nodes

"Who am I on this cluster?"

kube-dc bootstrap context
# select the row, read the right pane: email + groups + expiry

Or one-shot (Kubernetes ≥ 1.28):

kubectl auth whoami

"Is my admin login wired up correctly?"

# 1. JWT issuer present in auth-conf?
kubectl get configmap kube-dc-auth-config -n kube-dc -o yaml | grep -A1 "realms/master"

# 2. ClusterRoleBinding present?
kubectl get clusterrolebinding platform-admin -o yaml | grep -A3 subjects

# 3. The kube-dc-admin OIDC client exists?
KEYCLOAK_URL=https://login.<domain>
# Manual: open ${KEYCLOAK_URL}/admin → master → Clients → kube-dc-admin
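
Or non-interactively (a sketch, assuming kcadm.sh, the Keycloak admin CLI, is on your path and you have master-realm admin credentials):

kcadm.sh config credentials --server https://login.<domain> --realm master --user admin
kcadm.sh get clients -r master -q clientId=kube-dc-admin --fields clientId,enabled
# empty output = client missing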

"What contexts has the CLI written?"

kubectl config get-contexts | grep kube-dc/
# Or visually:
kube-dc bootstrap context

"How do I drop everything kube-dc and start clean?"

kube-dc logout --all --remove-contexts

This deletes every cached token in ~/.kube-dc/credentials/ and every kube-dc/* context in ~/.kube/config. Non-kube-dc contexts (kubectx-managed, vendor exec plugins) are untouched.
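
A quick way to confirm the slate is actually clean (read-only, nothing mutates):

ls ~/.kube-dc/credentials/ 2>/dev/null                       # should list nothing
kubectl config get-contexts -o name | grep '^kube-dc/' || echo "no kube-dc contexts left"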


Troubleshooting

"not logged in" / "session expired"

The exec plugin can't find a valid cached token. Run the matching login again:

# tenant
kube-dc login --domain <domain> --org <org>

# admin
kube-dc login --domain <domain> --admin

The error message always contains the right command — copy-paste it.

kube-dc bootstrap says all clusters are Unreachable

You haven't logged in to any of them yet. The probe needs an OIDC bearer token to query the apiserver. Run kube-dc login --admin for one cluster, hit r in the fleet view, and that row should turn Ready.
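
The full sequence for one cluster:

kube-dc login --domain <domain> --admin
kube-dc bootstrap    # press r to re-probe; the logged-in row should flip to Ready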

kube-dc login --admin fails with "user is authenticated but NOT in the 'admin' group"

The OAuth flow worked but Keycloak says you're not a platform admin. Ask someone with Keycloak access to add you to the master realm's admin group (see Adding a new admin).
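
For whoever holds those Keycloak credentials, a non-interactive sketch with kcadm.sh (assumes the group is literally named admin; adjust the search term and email to your realm):

kcadm.sh config credentials --server https://login.<domain> --realm master --user admin
GROUP_ID=$(kcadm.sh get groups -r master -q search=admin --fields id --format csv --noquotes)
USER_ID=$(kcadm.sh get users -r master -q email=<their-email> --fields id --format csv --noquotes)
kcadm.sh update users/$USER_ID/groups/$GROUP_ID -r master \
  -s realm=master -s userId=$USER_ID -s groupId=$GROUP_ID -n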

Browser shows "We are sorry... Client not found" (and the CLI hangs)

The cluster's master realm doesn't have the kube-dc-admin PKCE OIDC client yet. Run the setup script — it's idempotent and won't disturb the existing flux-web client:

cd <fleet-repo-path>
git pull
export KUBECONFIG=~/.kube/<cluster>_config
bash bootstrap/setup-keycloak-oidc.sh <cluster>

Then retry kube-dc login --domain <domain> --admin. The script auto-fixes a known stale-config case where early versions registered the client with port-less localhost redirects (Keycloak silently accepts these but rejects them at auth time). It's safe to re-run anytime.

Verify the client now exists by probing the auth endpoint:

curl -s -o /dev/null -w "%{http_code}\n" \
"https://login.<domain>/realms/master/protocol/openid-connect/auth?response_type=code&client_id=kube-dc-admin&redirect_uri=http%3A%2F%2Flocalhost%3A55432%2Fcallback&state=t&scope=openid&code_challenge=abc&code_challenge_method=S256"
# 302 → client exists, redirect accepted
# 400 → client missing OR redirect rejected

Browser shows "Invalid parameter: redirect_uri"

The kube-dc-admin client exists but its redirectUris registration is too narrow. Re-run setup-keycloak-oidc.sh <cluster> — recent versions PUT the canonical config (redirectUris: ["*"]) over any stale entry.

Why universal *? Keycloak's wildcard matching is path-only, not port — http://localhost/* doesn't match http://localhost:55432/callback. PKCE's code-verifier (which never leaves the CLI process) is the real security boundary, so * is acceptable for native CLIs. Same pattern the tenant kube-dc client has shipped since v0.1.
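
To see what's actually registered, query the client (same kcadm.sh login as above):

kcadm.sh get clients -r master -q clientId=kube-dc-admin --fields clientId,redirectUris
# expect: "redirectUris" : [ "*" ]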

kubectl get nodes says forbidden under --admin

The OIDC chain is fine but the cluster-side RBAC isn't wired. Check "Is my admin login wired up correctly?" above — usually the platform-admin ClusterRoleBinding hasn't reconciled yet.
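
A quick end-to-end RBAC probe once the binding looks right:

kubectl --context kube-dc/<domain>/admin auth can-i '*' '*' --all-namespaces
# yes → RBAC is wired; no → the binding is missing or its subject doesn't match your groups claim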

A cluster row shows Drifted

The image tag pinned in cluster-config.env differs from what's actually running. The right pane shows which Deployment is drifted and what tag is expected (a manual check is sketched after this list). Either:

  • The cluster-config.env is stale (an operator forgot to bump it after a kubectl set image) — bump and commit.
  • Flux hasn't reconciled yet — flux reconcile kustomization platform --with-source.
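
The manual check (a sketch; assumes the pinned tag lives in cluster-config.env under a key like IMAGE_TAG, so adjust to your repo):

grep IMAGE_TAG cluster-config.env        # what's pinned
kubectl get deploy -n kube-dc -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[0].image}{"\n"}{end}'    # what's running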

My ~/.kube/config got broken

The CLI never overwrites or removes contexts it didn't create. If you see a kube-dc bug here, restore from your most recent kubeconfig backup and file an issue with the diff.

That said, your kubectx-managed contexts and any vendor exec plugins are safe by design — only kube-dc/*, kube-dc-*, and kube-dc@* entries are touched.
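
To preview exactly which entries match those patterns before touching anything:

kubectl config get-contexts -o name | grep -E '^kube-dc[/@-]' || echo "nothing kube-dc-owned"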

"I logged in but kubectx doesn't show the new context"

The likely cause is $KUBECONFIG leaking from a previous step. If you ran something like:

export KUBECONFIG=~/.kube/<cluster>_kubeconfig_tunnel    # for a fleet-bootstrap step
bash bootstrap/setup-keycloak-oidc.sh <cluster>
kube-dc login --domain <domain> --admin # ← context lands in <cluster>_kubeconfig_tunnel, NOT ~/.kube/config

…then the new context is in whatever file $KUBECONFIG pointed at, not in ~/.kube/config. kubectx reads ~/.kube/config by default.
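
One line tells you where the next login will land:

echo "${KUBECONFIG:-~/.kube/config (default)}"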

Recent versions of kube-dc login print a banner + confirmation prompt when $KUBECONFIG points at anything other than ~/.kube/config:

┌─ kubeconfig destination ───────────────────────────────────────────────
│ $KUBECONFIG = /home/<you>/.kube/<cluster>_kubeconfig_tunnel
│ → writing to: /home/<you>/.kube/<cluster>_kubeconfig_tunnel
│ (default would be /home/<you>/.kube/config — kubectx reads from there)
└────────────────────────────────────────────────────────────────────────
Continue writing to this file? [y/N]

In a non-interactive shell (CI, IDE-launched processes) the banner still prints but the command proceeds without prompting.

Recovery path if you ended up here without seeing the prompt:

# 1. Back up
cp ~/.kube/config ~/.kube/config.before-recovery.$(date +%Y%m%dT%H%M%S)

# 2. Merge the stray contexts back in
KUBECONFIG=~/.kube/config:~/.kube/<the-stray-file> \
kubectl config view --raw --flatten > /tmp/merged.config
mv /tmp/merged.config ~/.kube/config
chmod 0600 ~/.kube/config

# 3. Unset KUBECONFIG so future logins go to the default
unset KUBECONFIG

"I want to debug what kube-dc is doing"

# Show the cached creds + expiry for every server
kube-dc config show

# Print the ExecCredential the plugin emits (without going through kubectl)
kube-dc credential --server https://kube-api.<domain>:6443 --realm master
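
Since that output is a standard client.authentication.k8s.io ExecCredential, you can pull out just the expiry (assuming the plugin sets one):

kube-dc credential --server https://kube-api.<domain>:6443 --realm master \
  | python3 -c "import json,sys; print(json.load(sys.stdin)['status']['expirationTimestamp'])"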

Decode a cached JWT to see what the apiserver actually receives

When admin login succeeds but kubectl get nodes returns 401, decode the token and look at the actual claims:

TOKEN_FILE=$(ls -t ~/.kube-dc/credentials/*-master.json 2>/dev/null | head -1)
python3 -c "
import json, base64
t = json.load(open('$TOKEN_FILE'))['access_token']
payload = t.split('.')[1]                # the JWT's claims segment
payload += '=' * (-len(payload) % 4)     # restore stripped base64 padding
c = json.loads(base64.urlsafe_b64decode(payload))
print('iss:    ', c['iss'])
print('aud:    ', c.get('aud'))
print('azp:    ', c.get('azp'))
print('groups: ', c.get('groups'))
print('email:  ', c.get('email'))
"

The most common 401 cause: aud doesn't include kube-dc-admin. That means the audience mapper wasn't attached to the client in Keycloak — re-run setup-keycloak-oidc.sh <cluster> to add it, then kube-dc login --domain <domain> --admin again to mint a token with the new audience.
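
To cross-check what the apiserver side expects, grep the same auth config from the checks section (the field name assumes a structured AuthenticationConfiguration; adjust if yours differs):

kubectl get configmap kube-dc-auth-config -n kube-dc -o yaml | grep -A2 audiences
# the list should include kube-dc-admin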

kubelet image cache trap (when kubectl set image doesn't actually update the pod)

Symptom: you push a new image, run kubectl set image, the pod rolls — but the new pod is still running the OLD binary. Verify by comparing image digests:

# What the pod actually pulled:
kubectl get pod -n kube-dc -l app.kubernetes.io/name=kube-dc-manager \
-o jsonpath='{.items[0].status.containerStatuses[0].imageID}'

# What's in the registry NOW:
docker manifest inspect --verbose <registry>/<image>:<tag> \
| python3 -c "import json,sys; print(json.load(sys.stdin)['Descriptor']['digest'])"
# (multi-arch images print a JSON list here; index it, e.g. [0]['Descriptor']['digest'])

If they differ, the deployment has imagePullPolicy: IfNotPresent and a node had a stale image cached against that tag. The fix is to pin the deployment to the digest, which always forces a fresh pull:

kubectl set image -n kube-dc deployment/kube-dc-manager \
manager=<registry>/<image>@sha256:<digest-from-registry>

Bumping the tag (e.g. vX.Y.Z-devN+1) and pushing again works too — but digest-pinning is cheaper and more reliable when the tag was reused.
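
If you'd rather prevent the trap than work around it, switching the pull policy also works, at the cost of a registry round-trip on every pod start (a sketch; note Flux will revert this on its next reconcile unless the change lands in the manifest):

kubectl patch deployment kube-dc-manager -n kube-dc --type=json \
  -p='[{"op":"replace","path":"/spec/template/spec/containers/0/imagePullPolicy","value":"Always"}]'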