Description of problem:
-----------------------
During a scale-down procedure the MCP got into a degraded state. Logs from the machine-config-daemon-* pod showed:
```
I1007 08:38:11.762043 16232 rpm-ostree.go:261] Running captured: journalctl --list-boots
I1007 08:38:11.960350 16232 daemon.go:869] journalctl --list-boots:
-4 5d7e9d3de6f44b8180ceb54dcf27f676 Mon 2020-10-05 20:37:41 UTC—Wed 2020-10-07 07:30:19 UTC
-3 56b4b00596844bb8945bc232f316ba29 Wed 2020-10-07 07:34:29 UTC—Wed 2020-10-07 07:51:18 UTC
-2 ec5b27f98d8b4b6baace0f92fd342cb9 Wed 2020-10-07 07:55:00 UTC—Wed 2020-10-07 08:12:19 UTC
-1 be9c17e727f34d4583e0ac9e83f1363c Wed 2020-10-07 08:16:01 UTC—Wed 2020-10-07 08:33:18 UTC
 0 ec017cfddb434df1ac20158c21a96c92 Wed 2020-10-07 08:36:59 UTC—Wed 2020-10-07 08:38:11 UTC
I1007 08:38:11.960388 16232 rpm-ostree.go:261] Running captured: systemctl list-units --state=failed --no-legend
I1007 08:38:11.973053 16232 daemon.go:882] systemctl --failed:
NetworkManager-wait-online.service loaded failed failed Network Manager Wait Online
I1007 08:38:11.973072 16232 daemon.go:616] Starting MachineConfigDaemon
I1007 08:38:11.973210 16232 daemon.go:623] Enabling Kubelet Healthz Monitor
I1007 08:38:41.194987 16232 trace.go:205] Trace[1106410694]: "Reflector ListAndWatch" name:github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101 (07-Oct-2020 08:38:11.194) (total time: 30000ms):
Trace[1106410694]: [30.000777255s] [30.000777255s] END
E1007 08:38:41.195411 16232 reflector.go:127] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to watch *v1.MachineConfig: failed to list *v1.MachineConfig: Get "https://172.30.0.1:443/apis/machineconfiguration.openshift.io/v1/machineconfigs?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: i/o timeout
I1007 08:38:41.194998 16232 trace.go:205] Trace[1747278511]: "Reflector ListAndWatch" name:k8s.io/client-go/informers/factory.go:134 (07-Oct-2020 08:38:11.194) (total time: 30000ms):
Trace[1747278511]: [30.000801842s] [30.000801842s] END
E1007 08:38:41.195448 16232 reflector.go:127] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Node: failed to list *v1.Node: Get "https://172.30.0.1:443/api/v1/nodes?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: i/o timeout
I1007 08:38:43.322848 16232 daemon.go:403] Node openshift-worker-5 is not labeled node-role.kubernetes.io/master
E1007 08:38:43.358219 16232 writer.go:135] Marking Degraded due to: machineconfiguration.openshift.io/desiredConfig annotation not found on node 'openshift-worker-5'
E1007 08:38:45.387975 16232 writer.go:135] Marking Degraded due to: machineconfiguration.openshift.io/desiredConfig annotation not found on node 'openshift-worker-5'
E1007 08:38:49.417490 16232 writer.go:135] Marking Degraded due to: machineconfiguration.openshift.io/desiredConfig annotation not found on node 'openshift-worker-5'
E1007 08:38:57.446143 16232 writer.go:135] Marking Degraded due to: machineconfiguration.openshift.io/desiredConfig annotation not found on node 'openshift-worker-5'
E1007 08:39:13.475158 16232 writer.go:135] Marking Degraded due to: machineconfiguration.openshift.io/desiredConfig annotation not found on node 'openshift-worker-5'
E1007 08:39:45.502864 16232 writer.go:135] Marking Degraded due to: machineconfiguration.openshift.io/desiredConfig annotation not found on node 'openshift-worker-5'
E1007 08:40:45.531140 16232 writer.go:135] Marking Degraded due to: machineconfiguration.openshift.io/desiredConfig annotation not found on node 'openshift-worker-5'
```

Adding the annotation manually resulted in:
```
E1007 09:07:38.313438 13937 writer.go:135] Marking Degraded due to: machineconfiguration.openshift.io/desiredConfig annotation not found on node 'openshift-worker-5'
I1007 09:08:38.340832 13937 daemon.go:808] Current+desired config: rendered-worker-61fbb0947e09c761cb036c96ab321807
I1007 09:08:38.353132 13937 daemon.go:1085] Validating against current config rendered-worker-61fbb0947e09c761cb036c96ab321807
I1007 09:08:38.369501 13937 daemon.go:1093] Validated on-disk state
I1007 09:08:38.386479 13937 daemon.go:1150] In desired config rendered-worker-61fbb0947e09c761cb036c96ab321807
I1007 09:12:14.860336 13937 daemon.go:641] Shutting down MachineConfigDaemon
```

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
4.6.0-0.nightly-2020-10-02-065738

How reproducible:

Steps to Reproduce:
-------------------
1. Annotate the machine:
   oc annotate machine worker-0-hb974 machine.openshift.io/cluster-api-delete-machine=yes -n openshift-machine-api
2. Delete the BMH (BareMetalHost) linked to the aforementioned machine:
   oc delete bmh worker-X -n openshift-machine-api
3. Scale down the machineset by one (consolidated in the sketch below):
   oc scale machineset worker-0 --replicas=Y -n openshift-machine-api
4. Wait for the MCP to get into a degraded state.

Additional info:
----------------
BM IPI deployment with 3 masters and 8 workers.
Not clear why worker-5 got redeployed (it was a scale down); a separate BZ will be filed for that.
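For quick copy/paste, a consolidated sketch of the reproduction steps above. The machine name (worker-0-hb974), the BMH name (worker-X) and the replica count (Y) are the placeholders used in this report and need to be adjusted per cluster:

```
# Reproduction sketch (names and replica count are placeholders from the steps above).

# 1. Mark the machine backing the host we want to remove for deletion on scale-down.
oc annotate machine worker-0-hb974 \
   machine.openshift.io/cluster-api-delete-machine=yes -n openshift-machine-api

# 2. Delete the BareMetalHost linked to that machine.
oc delete bmh worker-X -n openshift-machine-api

# 3. Scale the machineset down by one (Y = current replica count minus one).
oc scale machineset worker-0 --replicas=Y -n openshift-machine-api

# 4. Watch the worker pool for the Degraded condition.
oc get mcp worker -w
```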
So there are a couple of weird things going on in your system. From your node objects:

worker 3 (normal)
- creationTimestamp: "2020-10-02T13:17:48Z"

worker 4 (weird)
- creationTimestamp: "2020-10-02T13:17:40Z"
- deletionGracePeriodSeconds: 0
- deletionTimestamp: "2020-10-07T07:36:49Z"

worker 5 (weird)
- creationTimestamp: "2020-10-07T07:35:27Z"

So it seems to me that worker 5 was recreated at some point, after which an attempt was made to remove worker 4?

Reading https://docs.openshift.com/container-platform/4.5/machine_management/manually-scaling-machineset.html and https://docs.openshift.com/container-platform/4.5/machine_management/deleting-machine.html, it seems that you double-deleted machines. That is, you should only be deleting the machine object or scaling the machineset, not both.

I also tried scaling down on an AWS cluster (oc scale machineset xxx --replicas=Y -n openshift-machine-api) and the scale-down completed fine without the MCP degrading. So in this particular case I'd suspect that the scale-down operation itself had issues. Could you check whether the scale-down operation you attempted is supported?

If you think my assessment makes sense, I'd suggest reassigning to machine-api, since they would have more insight into what exactly went wrong with the scaling and how the node object was recreated when you scaled up/down.

As for the MCP degrade, I do think we should be more bulletproof there: the MCC should be writing the desiredConfig annotation upon syncing the MCP. I can check that, but even if that's fixed I don't think your scale-down would succeed.
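In case it helps anyone hitting the same state, a minimal sketch of how to inspect the node objects and MCO annotations discussed above, plus the manual-annotation workaround from the report. The node name and rendered config come from this report; the exact annotations the reporter set are an assumption on my part:

```
# Compare creation/deletion timestamps across nodes (the worker 4/5 oddities above).
oc get nodes -o custom-columns=NAME:.metadata.name,CREATED:.metadata.creationTimestamp,DELETING:.metadata.deletionTimestamp

# Check which machineconfiguration.openshift.io annotations are present on the node.
oc get node openshift-worker-5 -o yaml | grep machineconfiguration.openshift.io

# Manual workaround as described in the report (assumed form): set the missing
# desiredConfig (and currentConfig) annotations to the rendered config from the MCD log.
oc annotate node openshift-worker-5 \
   machineconfiguration.openshift.io/desiredConfig=rendered-worker-61fbb0947e09c761cb036c96ab321807 \
   machineconfiguration.openshift.io/currentConfig=rendered-worker-61fbb0947e09c761cb036c96ab321807
```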
Also, the MCD log you snapshotted is not the initial one:

2020-10-07T07:36:10.374082881Z W1007 07:36:10.373946 15529 daemon.go:644] Got an error from auxiliary tools: error: cannot apply annotation for SSH access due to: unable to update node "nil": node "openshift-worker-5" not found
2020-10-07T07:36:11.373602989Z I1007 07:36:11.373470 15529 daemon.go:403] Node openshift-worker-5 is not labeled node-role.kubernetes.io/master
2020-10-07T07:36:11.373936249Z I1007 07:36:11.373605 15529 node.go:24] No machineconfiguration.openshift.io/currentConfig annotation on node openshift-worker-5
...
2020-10-07T07:36:11.376719396Z I1007 07:36:11.376626 15529 node.go:34] Setting initial node config based on current configuration on disk: rendered-worker-61fbb0947e09c761cb036c96ab321807

which looks to me like the node object itself had problems upon recreation (again, not sure what recreated it).
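If it helps the machine-api side, this is roughly how I'd gather the relevant state. The namespaces are the standard ones; the MCD pod name is a placeholder to be filled in from the first command:

```
# MCD pod for the affected node (pick the pod scheduled on openshift-worker-5).
oc -n openshift-machine-config-operator get pods -o wide | grep openshift-worker-5
oc -n openshift-machine-config-operator logs <machine-config-daemon-pod> -c machine-config-daemon

# machine-api side: machines and BareMetalHosts, to see what deleted/recreated the node.
oc -n openshift-machine-api get machines -o wide
oc -n openshift-machine-api get bmh
```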
Yurii, can you reproduce this consistently? I'm leaning towards pushing this out of 4.6 if a) this isn't consistent and b) there's a workaround - we'll keep debugging in the meantime.
I'm going to reassign this BZ to machine-api to see what happened during the scale-down. The root cause is not the MCO; the scale operation appears to have gone wrong somewhere. Machine-api can tell you whether your operations were correct and, hopefully, why the nodes are flipping like that.
I believe this has the same cause as bug 1886028 (the Node not being deleted due to a finalizer left behind), so closing as a duplicate. *** This bug has been marked as a duplicate of bug 1886028 ***