Bug 1903733
Summary: | Scale up followed by scale down can delete all running workers | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Matthew Booth <mbooth> |
Component: | Cloud Compute | Assignee: | Matthew Booth <mbooth> |
Cloud Compute sub component: | OpenStack Provider | QA Contact: | weiwei jiang <wjiang> |
Status: | CLOSED ERRATA | Docs Contact: | |
Severity: | high | ||
Priority: | high | CC: | adduarte, egarcia, m.andre, mfedosin, pprinett |
Version: | 4.6 | ||
Target Milestone: | --- | ||
Target Release: | 4.7.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: |
Cause: The default machineset delete priority, random, did not prioritise Ready nodes over nodes which were still building.
Consequence: Attempting to scale up a machineset followed immediately by scaling down again, especially if the scale up was to a very large number, could result in the scale down deleting all Ready nodes, leaving the cluster unavailable.
Fix: The random delete priority now ranks machines which have not yet become Ready ahead of Ready machines when choosing which machines to delete.
Result: A large scale up followed immediately by scale down will delete machines which are not yet Ready before deleting machines running workloads.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2021-02-24 15:37:28 UTC | Type: | Bug |
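The Fix and Result in the Doc Text above can be illustrated with a small sketch. This is not the machine-api implementation: `Machine`, its `ready` flag, and `pick_for_deletion` are hypothetical names made up for this example, and the only behaviour assumed is what the Doc Text states, namely that selection stays random but machines which are not yet Ready are chosen first.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Machine:
    # Hypothetical model of a machine; not the real Machine API type.
    name: str
    ready: bool  # True once the backing node has become Ready


def pick_for_deletion(machines, count, rng=None):
    """Pick `count` machines to delete, preferring those not yet Ready.

    Order is random within each group, which mirrors the intent of a
    "random" delete priority that still protects Ready machines.
    """
    if count <= 0:
        return []
    rng = rng or random.Random()
    candidates = list(machines)
    rng.shuffle(candidates)  # random order among equals
    # Stable sort: not-Ready (False) sorts before Ready (True),
    # and the shuffled order is preserved inside each group.
    candidates.sort(key=lambda m: m.ready)
    return candidates[:count]


if __name__ == "__main__":
    # Three Ready workers plus three still building, then scale down by three.
    fleet = [Machine(f"ready-{i}", True) for i in range(3)]
    fleet += [Machine(f"building-{i}", False) for i in range(3)]
    victims = pick_for_deletion(fleet, 3)
    print(sorted(m.name for m in victims))
```

With three Ready and three building machines, scaling down by three always selects the three building machines, whatever the shuffle order. Without the sort step the selection would be purely random, which is the situation described under Cause.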
Description
Matthew Booth
2020-12-02 17:32:35 UTC
Checked with 4.7.0-0.nightly-2020-12-21-131655, and can not reproduce the original issue, so moved to verified.

# Before scaleup:
$ oc get nodes -o wide
NAME                                 STATUS   ROLES    AGE     VERSION           INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                CONTAINER-RUNTIME
wj47ios0104az-wvddr-master-0         Ready    master   165m    v1.20.0+87544c5   192.168.0.123   <none>        Red Hat Enterprise Linux CoreOS 47.83.202012190438-0 (Ootpa)   4.18.0-240.8.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
wj47ios0104az-wvddr-master-1         Ready    master   165m    v1.20.0+87544c5   192.168.2.87    <none>        Red Hat Enterprise Linux CoreOS 47.83.202012190438-0 (Ootpa)   4.18.0-240.8.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
wj47ios0104az-wvddr-master-2         Ready    master   165m    v1.20.0+87544c5   192.168.1.13    <none>        Red Hat Enterprise Linux CoreOS 47.83.202012190438-0 (Ootpa)   4.18.0-240.8.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
wj47ios0104az-wvddr-worker-0-p5bzs   Ready    worker   18m     v1.20.0+87544c5   192.168.0.93    <none>        Red Hat Enterprise Linux CoreOS 47.83.202012190438-0 (Ootpa)   4.18.0-240.8.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
wj47ios0104az-wvddr-worker-0-v7crt   Ready    worker   6m18s   v1.20.0+87544c5   192.168.3.83    <none>        Red Hat Enterprise Linux CoreOS 47.83.202012190438-0 (Ootpa)   4.18.0-240.8.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
wj47ios0104az-wvddr-worker-0-wrbxd   Ready    worker   17m     v1.20.0+87544c5   192.168.0.21    <none>        Red Hat Enterprise Linux CoreOS 47.83.202012190438-0 (Ootpa)   4.18.0-240.8.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39

$ oc get machine -A -o wide
NAMESPACE               NAME                                 PHASE     TYPE        REGION      ZONE   AGE     NODE                                 PROVIDERID   STATE
openshift-machine-api   wj47ios0104az-wvddr-master-0         Running   m1.xlarge   regionOne   nova   166m    wj47ios0104az-wvddr-master-0                      ACTIVE
openshift-machine-api   wj47ios0104az-wvddr-master-1         Running   m1.xlarge   regionOne   nova   166m    wj47ios0104az-wvddr-master-1                      ACTIVE
openshift-machine-api   wj47ios0104az-wvddr-master-2         Running   m1.xlarge   regionOne   nova   166m    wj47ios0104az-wvddr-master-2                      ACTIVE
openshift-machine-api   wj47ios0104az-wvddr-worker-0-p5bzs   Running   m1.large    regionOne   nova   22m     wj47ios0104az-wvddr-worker-0-p5bzs                ACTIVE
openshift-machine-api   wj47ios0104az-wvddr-worker-0-v7crt   Running   m1.large    regionOne   nova   11m     wj47ios0104az-wvddr-worker-0-v7crt                ACTIVE
openshift-machine-api   wj47ios0104az-wvddr-worker-0-wrbxd   Running   m1.large    regionOne   nova   22m     wj47ios0104az-wvddr-worker-0-wrbxd                ACTIVE

# After scaleup:
$ oc get machine -A -o wide
NAMESPACE               NAME                                 PHASE     TYPE        REGION      ZONE   AGE     NODE                                 PROVIDERID   STATE
openshift-machine-api   wj47ios0104az-wvddr-master-0         Running   m1.xlarge   regionOne   nova   173m    wj47ios0104az-wvddr-master-0                      ACTIVE
openshift-machine-api   wj47ios0104az-wvddr-master-1         Running   m1.xlarge   regionOne   nova   173m    wj47ios0104az-wvddr-master-1                      ACTIVE
openshift-machine-api   wj47ios0104az-wvddr-master-2         Running   m1.xlarge   regionOne   nova   173m    wj47ios0104az-wvddr-master-2                      ACTIVE
openshift-machine-api   wj47ios0104az-wvddr-worker-0-6xxcd   Running   m1.large    regionOne   nova   6m42s   wj47ios0104az-wvddr-worker-0-6xxcd                ACTIVE
openshift-machine-api   wj47ios0104az-wvddr-worker-0-bb8pl   Running   m1.large    regionOne   nova   6m42s   wj47ios0104az-wvddr-worker-0-bb8pl                ACTIVE
openshift-machine-api   wj47ios0104az-wvddr-worker-0-clfp6   Running   m1.large    regionOne   nova   6m42s   wj47ios0104az-wvddr-worker-0-clfp6                ACTIVE
openshift-machine-api   wj47ios0104az-wvddr-worker-0-p5bzs   Running   m1.large    regionOne   nova   29m     wj47ios0104az-wvddr-worker-0-p5bzs                ACTIVE
openshift-machine-api   wj47ios0104az-wvddr-worker-0-v7crt   Running   m1.large    regionOne   nova   18m     wj47ios0104az-wvddr-worker-0-v7crt                ACTIVE
openshift-machine-api   wj47ios0104az-wvddr-worker-0-wrbxd   Running   m1.large    regionOne   nova   29m     wj47ios0104az-wvddr-worker-0-wrbxd                ACTIVE

$ oc get nodes -o wide
NAME                                 STATUS   ROLES    AGE     VERSION           INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                CONTAINER-RUNTIME
wj47ios0104az-wvddr-master-0         Ready    master   173m    v1.20.0+87544c5   192.168.0.123   <none>        Red Hat Enterprise Linux CoreOS 47.83.202012190438-0 (Ootpa)   4.18.0-240.8.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
wj47ios0104az-wvddr-master-1         Ready    master   173m    v1.20.0+87544c5   192.168.2.87    <none>        Red Hat Enterprise Linux CoreOS 47.83.202012190438-0 (Ootpa)   4.18.0-240.8.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
wj47ios0104az-wvddr-master-2         Ready    master   173m    v1.20.0+87544c5   192.168.1.13    <none>        Red Hat Enterprise Linux CoreOS 47.83.202012190438-0 (Ootpa)   4.18.0-240.8.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
wj47ios0104az-wvddr-worker-0-6xxcd   Ready    worker   4m25s   v1.20.0+87544c5   192.168.2.22    <none>        Red Hat Enterprise Linux CoreOS 47.83.202012190438-0 (Ootpa)   4.18.0-240.8.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
wj47ios0104az-wvddr-worker-0-bb8pl   Ready    worker   77s     v1.20.0+87544c5   192.168.3.197   <none>        Red Hat Enterprise Linux CoreOS 47.83.202012190438-0 (Ootpa)   4.18.0-240.8.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
wj47ios0104az-wvddr-worker-0-clfp6   Ready    worker   3m26s   v1.20.0+87544c5   192.168.2.90    <none>        Red Hat Enterprise Linux CoreOS 47.83.202012190438-0 (Ootpa)   4.18.0-240.8.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
wj47ios0104az-wvddr-worker-0-p5bzs   Ready    worker   26m     v1.20.0+87544c5   192.168.0.93    <none>        Red Hat Enterprise Linux CoreOS 47.83.202012190438-0 (Ootpa)   4.18.0-240.8.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
wj47ios0104az-wvddr-worker-0-v7crt   Ready    worker   14m     v1.20.0+87544c5   192.168.3.83    <none>        Red Hat Enterprise Linux CoreOS 47.83.202012190438-0 (Ootpa)   4.18.0-240.8.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
wj47ios0104az-wvddr-worker-0-wrbxd   Ready    worker   25m     v1.20.0+87544c5   192.168.0.21    <none>        Red Hat Enterprise Linux CoreOS 47.83.202012190438-0 (Ootpa)   4.18.0-240.8.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39

# After scaledown:
$ oc get machine -A -o wide
NAMESPACE               NAME                                 PHASE     TYPE        REGION      ZONE   AGE     NODE                                 PROVIDERID   STATE
openshift-machine-api   wj47ios0104az-wvddr-master-0         Running   m1.xlarge   regionOne   nova   177m    wj47ios0104az-wvddr-master-0                      ACTIVE
openshift-machine-api   wj47ios0104az-wvddr-master-1         Running   m1.xlarge   regionOne   nova   177m    wj47ios0104az-wvddr-master-1                      ACTIVE
openshift-machine-api   wj47ios0104az-wvddr-master-2         Running   m1.xlarge   regionOne   nova   177m    wj47ios0104az-wvddr-master-2                      ACTIVE
openshift-machine-api   wj47ios0104az-wvddr-worker-0-p5bzs   Running   m1.large    regionOne   nova   33m     wj47ios0104az-wvddr-worker-0-p5bzs                ACTIVE
openshift-machine-api   wj47ios0104az-wvddr-worker-0-v7crt   Running   m1.large    regionOne   nova   22m     wj47ios0104az-wvddr-worker-0-v7crt                ACTIVE
openshift-machine-api   wj47ios0104az-wvddr-worker-0-wrbxd   Running   m1.large    regionOne   nova   33m     wj47ios0104az-wvddr-worker-0-wrbxd                ACTIVE

$ oc get nodes -o wide
NAME                                 STATUS   ROLES    AGE     VERSION           INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                CONTAINER-RUNTIME
wj47ios0104az-wvddr-master-0         Ready    master   176m    v1.20.0+87544c5   192.168.0.123   <none>        Red Hat Enterprise Linux CoreOS 47.83.202012190438-0 (Ootpa)   4.18.0-240.8.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
wj47ios0104az-wvddr-master-1         Ready    master   176m    v1.20.0+87544c5   192.168.2.87    <none>        Red Hat Enterprise Linux CoreOS 47.83.202012190438-0 (Ootpa)   4.18.0-240.8.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
wj47ios0104az-wvddr-master-2         Ready    master   176m    v1.20.0+87544c5   192.168.1.13    <none>        Red Hat Enterprise Linux CoreOS 47.83.202012190438-0 (Ootpa)   4.18.0-240.8.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
wj47ios0104az-wvddr-worker-0-p5bzs   Ready    worker   29m     v1.20.0+87544c5   192.168.0.93    <none>        Red Hat Enterprise Linux CoreOS 47.83.202012190438-0 (Ootpa)   4.18.0-240.8.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
wj47ios0104az-wvddr-worker-0-v7crt   Ready    worker   17m     v1.20.0+87544c5   192.168.3.83    <none>        Red Hat Enterprise Linux CoreOS 47.83.202012190438-0 (Ootpa)   4.18.0-240.8.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
wj47ios0104az-wvddr-worker-0-wrbxd   Ready    worker   28m     v1.20.0+87544c5   192.168.0.21    <none>        Red Hat Enterprise Linux CoreOS 47.83.202012190438-0 (Ootpa)   4.18.0-240.8.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633