Description of problem:

I hit this issue while testing what happens when attempting to scale beyond the capacity of my cluster, but there are other realistic scenarios where it could happen. It is most likely non-deterministic, but my specific reproducer was:

* Start with 3 running workers
* Request scale up to 30 workers
  ... after some time I had 8 running workers, 1 machine in an error state, and the rest in a non-error state but not yet running
* Request scale down to 3 workers

All 8 running workers were deleted. I was left with 3 machines in a 'Provisioned' state which did not come up (it looks like their CSRs timed out, but that's a separate issue), and my last running worker in a Deleting state, which couldn't drain because it had the last remaining router-default pod.

The behaviour I expected was that it would have preferred to delete machines which were not running any workload over machines which were.

Version-Release number of selected component (if applicable):

machine-api 4.6.4
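The reproducer steps above could be driven with something like the following. This is only a sketch: the helper name `scale_workers` and the MachineSet name `my-cluster-worker-0` are hypothetical placeholders, and the namespace is assumed to be `openshift-machine-api`, as in the machine listings in this report.

```shell
# Sketch only: scale a worker MachineSet to a given replica count.
# Arguments: $1 = MachineSet name (hypothetical placeholder in the examples below)
#            $2 = desired replica count
scale_workers() {
    oc scale machineset "$1" --replicas="$2" -n openshift-machine-api
}
```

For example, `scale_workers my-cluster-worker-0 30` requests the scale up, and `scale_workers my-cluster-worker-0 3` requests the scale down once some, but not all, of the new machines are running.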
PR here: https://github.com/openshift/machine-api-operator/pull/772
Checked with 4.7.0-0.nightly-2020-12-21-131655, and can not reproduce the original issue, so moved to verified.

# Before scaleup:
$ oc get nodes -o wide
NAME                                 STATUS   ROLES    AGE     VERSION           INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                CONTAINER-RUNTIME
wj47ios0104az-wvddr-master-0         Ready    master   165m    v1.20.0+87544c5   192.168.0.123   <none>        Red Hat Enterprise Linux CoreOS 47.83.202012190438-0 (Ootpa)   4.18.0-240.8.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
wj47ios0104az-wvddr-master-1         Ready    master   165m    v1.20.0+87544c5   192.168.2.87    <none>        Red Hat Enterprise Linux CoreOS 47.83.202012190438-0 (Ootpa)   4.18.0-240.8.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
wj47ios0104az-wvddr-master-2         Ready    master   165m    v1.20.0+87544c5   192.168.1.13    <none>        Red Hat Enterprise Linux CoreOS 47.83.202012190438-0 (Ootpa)   4.18.0-240.8.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
wj47ios0104az-wvddr-worker-0-p5bzs   Ready    worker   18m     v1.20.0+87544c5   192.168.0.93    <none>        Red Hat Enterprise Linux CoreOS 47.83.202012190438-0 (Ootpa)   4.18.0-240.8.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
wj47ios0104az-wvddr-worker-0-v7crt   Ready    worker   6m18s   v1.20.0+87544c5   192.168.3.83    <none>        Red Hat Enterprise Linux CoreOS 47.83.202012190438-0 (Ootpa)   4.18.0-240.8.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
wj47ios0104az-wvddr-worker-0-wrbxd   Ready    worker   17m     v1.20.0+87544c5   192.168.0.21    <none>        Red Hat Enterprise Linux CoreOS 47.83.202012190438-0 (Ootpa)   4.18.0-240.8.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39

$ oc get machine -A -o wide
NAMESPACE               NAME                                 PHASE     TYPE        REGION      ZONE   AGE     NODE                                 PROVIDERID   STATE
openshift-machine-api   wj47ios0104az-wvddr-master-0         Running   m1.xlarge   regionOne   nova   166m    wj47ios0104az-wvddr-master-0                      ACTIVE
openshift-machine-api   wj47ios0104az-wvddr-master-1         Running   m1.xlarge   regionOne   nova   166m    wj47ios0104az-wvddr-master-1                      ACTIVE
openshift-machine-api   wj47ios0104az-wvddr-master-2         Running   m1.xlarge   regionOne   nova   166m    wj47ios0104az-wvddr-master-2                      ACTIVE
openshift-machine-api   wj47ios0104az-wvddr-worker-0-p5bzs   Running   m1.large    regionOne   nova   22m     wj47ios0104az-wvddr-worker-0-p5bzs                ACTIVE
openshift-machine-api   wj47ios0104az-wvddr-worker-0-v7crt   Running   m1.large    regionOne   nova   11m     wj47ios0104az-wvddr-worker-0-v7crt                ACTIVE
openshift-machine-api   wj47ios0104az-wvddr-worker-0-wrbxd   Running   m1.large    regionOne   nova   22m     wj47ios0104az-wvddr-worker-0-wrbxd                ACTIVE

# After scaleup:
$ oc get machine -A -o wide
NAMESPACE               NAME                                 PHASE     TYPE        REGION      ZONE   AGE     NODE                                 PROVIDERID   STATE
openshift-machine-api   wj47ios0104az-wvddr-master-0         Running   m1.xlarge   regionOne   nova   173m    wj47ios0104az-wvddr-master-0                      ACTIVE
openshift-machine-api   wj47ios0104az-wvddr-master-1         Running   m1.xlarge   regionOne   nova   173m    wj47ios0104az-wvddr-master-1                      ACTIVE
openshift-machine-api   wj47ios0104az-wvddr-master-2         Running   m1.xlarge   regionOne   nova   173m    wj47ios0104az-wvddr-master-2                      ACTIVE
openshift-machine-api   wj47ios0104az-wvddr-worker-0-6xxcd   Running   m1.large    regionOne   nova   6m42s   wj47ios0104az-wvddr-worker-0-6xxcd                ACTIVE
openshift-machine-api   wj47ios0104az-wvddr-worker-0-bb8pl   Running   m1.large    regionOne   nova   6m42s   wj47ios0104az-wvddr-worker-0-bb8pl                ACTIVE
openshift-machine-api   wj47ios0104az-wvddr-worker-0-clfp6   Running   m1.large    regionOne   nova   6m42s   wj47ios0104az-wvddr-worker-0-clfp6                ACTIVE
openshift-machine-api   wj47ios0104az-wvddr-worker-0-p5bzs   Running   m1.large    regionOne   nova   29m     wj47ios0104az-wvddr-worker-0-p5bzs                ACTIVE
openshift-machine-api   wj47ios0104az-wvddr-worker-0-v7crt   Running   m1.large    regionOne   nova   18m     wj47ios0104az-wvddr-worker-0-v7crt                ACTIVE
openshift-machine-api   wj47ios0104az-wvddr-worker-0-wrbxd   Running   m1.large    regionOne   nova   29m     wj47ios0104az-wvddr-worker-0-wrbxd                ACTIVE

$ oc get nodes -o wide
NAME                                 STATUS   ROLES    AGE     VERSION           INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                CONTAINER-RUNTIME
wj47ios0104az-wvddr-master-0         Ready    master   173m    v1.20.0+87544c5   192.168.0.123   <none>        Red Hat Enterprise Linux CoreOS 47.83.202012190438-0 (Ootpa)   4.18.0-240.8.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
wj47ios0104az-wvddr-master-1         Ready    master   173m    v1.20.0+87544c5   192.168.2.87    <none>        Red Hat Enterprise Linux CoreOS 47.83.202012190438-0 (Ootpa)   4.18.0-240.8.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
wj47ios0104az-wvddr-master-2         Ready    master   173m    v1.20.0+87544c5   192.168.1.13    <none>        Red Hat Enterprise Linux CoreOS 47.83.202012190438-0 (Ootpa)   4.18.0-240.8.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
wj47ios0104az-wvddr-worker-0-6xxcd   Ready    worker   4m25s   v1.20.0+87544c5   192.168.2.22    <none>        Red Hat Enterprise Linux CoreOS 47.83.202012190438-0 (Ootpa)   4.18.0-240.8.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
wj47ios0104az-wvddr-worker-0-bb8pl   Ready    worker   77s     v1.20.0+87544c5   192.168.3.197   <none>        Red Hat Enterprise Linux CoreOS 47.83.202012190438-0 (Ootpa)   4.18.0-240.8.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
wj47ios0104az-wvddr-worker-0-clfp6   Ready    worker   3m26s   v1.20.0+87544c5   192.168.2.90    <none>        Red Hat Enterprise Linux CoreOS 47.83.202012190438-0 (Ootpa)   4.18.0-240.8.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
wj47ios0104az-wvddr-worker-0-p5bzs   Ready    worker   26m     v1.20.0+87544c5   192.168.0.93    <none>        Red Hat Enterprise Linux CoreOS 47.83.202012190438-0 (Ootpa)   4.18.0-240.8.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
wj47ios0104az-wvddr-worker-0-v7crt   Ready    worker   14m     v1.20.0+87544c5   192.168.3.83    <none>        Red Hat Enterprise Linux CoreOS 47.83.202012190438-0 (Ootpa)   4.18.0-240.8.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
wj47ios0104az-wvddr-worker-0-wrbxd   Ready    worker   25m     v1.20.0+87544c5   192.168.0.21    <none>        Red Hat Enterprise Linux CoreOS 47.83.202012190438-0 (Ootpa)   4.18.0-240.8.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39

# After scaledown:
$ oc get machine -A -o wide
NAMESPACE               NAME                                 PHASE     TYPE        REGION      ZONE   AGE     NODE                                 PROVIDERID   STATE
openshift-machine-api   wj47ios0104az-wvddr-master-0         Running   m1.xlarge   regionOne   nova   177m    wj47ios0104az-wvddr-master-0                      ACTIVE
openshift-machine-api   wj47ios0104az-wvddr-master-1         Running   m1.xlarge   regionOne   nova   177m    wj47ios0104az-wvddr-master-1                      ACTIVE
openshift-machine-api   wj47ios0104az-wvddr-master-2         Running   m1.xlarge   regionOne   nova   177m    wj47ios0104az-wvddr-master-2                      ACTIVE
openshift-machine-api   wj47ios0104az-wvddr-worker-0-p5bzs   Running   m1.large    regionOne   nova   33m     wj47ios0104az-wvddr-worker-0-p5bzs                ACTIVE
openshift-machine-api   wj47ios0104az-wvddr-worker-0-v7crt   Running   m1.large    regionOne   nova   22m     wj47ios0104az-wvddr-worker-0-v7crt                ACTIVE
openshift-machine-api   wj47ios0104az-wvddr-worker-0-wrbxd   Running   m1.large    regionOne   nova   33m     wj47ios0104az-wvddr-worker-0-wrbxd                ACTIVE

$ oc get nodes -o wide
NAME                                 STATUS   ROLES    AGE     VERSION           INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                CONTAINER-RUNTIME
wj47ios0104az-wvddr-master-0         Ready    master   176m    v1.20.0+87544c5   192.168.0.123   <none>        Red Hat Enterprise Linux CoreOS 47.83.202012190438-0 (Ootpa)   4.18.0-240.8.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
wj47ios0104az-wvddr-master-1         Ready    master   176m    v1.20.0+87544c5   192.168.2.87    <none>        Red Hat Enterprise Linux CoreOS 47.83.202012190438-0 (Ootpa)   4.18.0-240.8.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
wj47ios0104az-wvddr-master-2         Ready    master   176m    v1.20.0+87544c5   192.168.1.13    <none>        Red Hat Enterprise Linux CoreOS 47.83.202012190438-0 (Ootpa)   4.18.0-240.8.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
wj47ios0104az-wvddr-worker-0-p5bzs   Ready    worker   29m     v1.20.0+87544c5   192.168.0.93    <none>        Red Hat Enterprise Linux CoreOS 47.83.202012190438-0 (Ootpa)   4.18.0-240.8.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
wj47ios0104az-wvddr-worker-0-v7crt   Ready    worker   17m     v1.20.0+87544c5   192.168.3.83    <none>        Red Hat Enterprise Linux CoreOS 47.83.202012190438-0 (Ootpa)   4.18.0-240.8.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
wj47ios0104az-wvddr-worker-0-wrbxd   Ready    worker   28m     v1.20.0+87544c5   192.168.0.21    <none>        Red Hat Enterprise Linux CoreOS 47.83.202012190438-0 (Ootpa)   4.18.0-240.8.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
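The check made by eye in the listings above (every remaining worker machine is still Running after the scale down) could be scripted roughly as follows. This is a sketch, not part of the original verification: the helper name `count_workers_in_phase` is hypothetical, it assumes the `-A` column layout shown above (NAMESPACE, NAME, PHASE, ...), and it assumes worker machine names contain `-worker-`, as they do in this cluster.

```shell
# Sketch only: count worker machines in a given phase.
# Reads `oc get machine -A -o wide` output on stdin; $1 is the phase to count.
# Column 2 is the machine NAME and column 3 is its PHASE when -A is used.
count_workers_in_phase() {
    awk -v phase="$1" '$2 ~ /-worker-/ && $3 == phase { n++ } END { print n + 0 }'
}
```

Run as `oc get machine -A -o wide | count_workers_in_phase Running`. Applied to the "After scaledown" listing above it would report 3, while in the originally reported failure the same check would have shown fewer Running workers than requested.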
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633