Description of problem (please be as detailed as possible and provide log snippets):
After shutting down 2 OCS nodes, both provisioner pods are running on the same worker node.

Version of all relevant components (if applicable):
Provider: AWS_IPI
OCP version: 4.5.0-0.nightly-2020-08-06-102404
OCS version: ocs-operator.v4.5.0-515.ci

sh-4.4# ceph version
ceph version 14.2.8-81.el8cp (0336e23b7404496341b988c8057538b8185ca5ec) nautilus (stable)

sh-4.4# rook version
rook: 4.5-38.e7a77d32.release_4.5
go: go1.13.4

sh-4.4# ceph versions
{
    "mon": {
        "ceph version 14.2.8-81.el8cp (0336e23b7404496341b988c8057538b8185ca5ec) nautilus (stable)": 3
    },
    "mgr": {
        "ceph version 14.2.8-81.el8cp (0336e23b7404496341b988c8057538b8185ca5ec) nautilus (stable)": 1
    },
    "osd": {
        "ceph version 14.2.8-81.el8cp (0336e23b7404496341b988c8057538b8185ca5ec) nautilus (stable)": 3
    },
    "mds": {
        "ceph version 14.2.8-81.el8cp (0336e23b7404496341b988c8057538b8185ca5ec) nautilus (stable)": 2
    },
    "overall": {
        "ceph version 14.2.8-81.el8cp (0336e23b7404496341b988c8057538b8185ca5ec) nautilus (stable)": 9
    }
}

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Can this issue be reproduced? Yes

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Get nodes:
$ oc get nodes
NAME  STATUS  ROLES  AGE  VERSION
ip-10-0-132-167.us-east-2.compute.internal  Ready  worker  68m  v1.18.3+08c38ef
ip-10-0-142-171.us-east-2.compute.internal  Ready  master  77m  v1.18.3+08c38ef
ip-10-0-168-36.us-east-2.compute.internal  Ready  master  77m  v1.18.3+08c38ef
ip-10-0-177-178.us-east-2.compute.internal  Ready  worker  68m  v1.18.3+08c38ef
ip-10-0-216-231.us-east-2.compute.internal  Ready  worker  67m  v1.18.3+08c38ef
ip-10-0-223-41.us-east-2.compute.internal  Ready  master  77m  v1.18.3+08c38ef

2. Shut down 2 worker nodes via the Amazon UI:
* ip-10-0-132-167.us-east-2.compute.internal
* ip-10-0-177-178.us-east-2.compute.internal

3. Get nodes:
$ oc get nodes
NAME  STATUS  ROLES  AGE  VERSION
ip-10-0-132-167.us-east-2.compute.internal  NotReady  worker  73m  v1.18.3+08c38ef
ip-10-0-142-171.us-east-2.compute.internal  Ready  master  82m  v1.18.3+08c38ef
ip-10-0-168-36.us-east-2.compute.internal  Ready  master  82m  v1.18.3+08c38ef
ip-10-0-177-178.us-east-2.compute.internal  NotReady  worker  73m  v1.18.3+08c38ef
ip-10-0-216-231.us-east-2.compute.internal  Ready  worker  72m  v1.18.3+08c38ef
ip-10-0-223-41.us-east-2.compute.internal  Ready  master  82m  v1.18.3+08c38ef

4. Start the same nodes via the Amazon UI:
* ip-10-0-132-167.us-east-2.compute.internal
* ip-10-0-177-178.us-east-2.compute.internal

5. Check node status:
$ oc get nodes
NAME  STATUS  ROLES  AGE  VERSION
ip-10-0-132-167.us-east-2.compute.internal  Ready  worker  85m  v1.18.3+08c38ef
ip-10-0-142-171.us-east-2.compute.internal  Ready  master  94m  v1.18.3+08c38ef
ip-10-0-168-36.us-east-2.compute.internal  Ready  master  94m  v1.18.3+08c38ef
ip-10-0-177-178.us-east-2.compute.internal  Ready  worker  85m  v1.18.3+08c38ef
ip-10-0-216-231.us-east-2.compute.internal  Ready  worker  84m  v1.18.3+08c38ef
ip-10-0-223-41.us-east-2.compute.internal  Ready  master  94m  v1.18.3+08c38ef

6. Get all pods in openshift-storage:
$ oc get pods -n openshift-storage -o wide
NAME  READY  STATUS  AGE  IP  NODE
csi-cephfsplugin-d249c  3/3  Running  62m  10.0.216.231  ip-10-0-216-231.us-east-2.compute.internal
csi-cephfsplugin-kf7rc  3/3  Running  62m  10.0.132.167  ip-10-0-132-167.us-east-2.compute.internal
csi-cephfsplugin-p982p  3/3  Running  62m  10.0.177.178  ip-10-0-177-178.us-east-2.compute.internal
csi-cephfsplugin-provisioner-745957785f-qz7l7  5/5  Running  62m  10.129.2.16  ip-10-0-216-231.us-east-2.compute.internal
csi-cephfsplugin-provisioner-745957785f-z9zdz  5/5  Running  11m  10.129.2.37  ip-10-0-216-231.us-east-2.compute.internal
csi-rbdplugin-kqxqm  3/3  Running  62m  10.0.177.178  ip-10-0-177-178.us-east-2.compute.internal
csi-rbdplugin-p8rgl  3/3  Running  62m  10.0.132.167  ip-10-0-132-167.us-east-2.compute.internal
csi-rbdplugin-provisioner-7d4596b7d6-7ds28  5/5  Running  11m  10.129.2.43  ip-10-0-216-231.us-east-2.compute.internal
csi-rbdplugin-provisioner-7d4596b7d6-l5pb9  5/5  Running  62m  10.129.2.15  ip-10-0-216-231.us-east-2.compute.internal
csi-rbdplugin-srnsr  3/3  Running  62m  10.0.216.231  ip-10-0-216-231.us-east-2.compute.internal
noobaa-core-0  1/1  Running  4m52s  10.131.0.8  ip-10-0-177-178.us-east-2.compute.internal
noobaa-db-0  1/1  Running  4m52s  10.131.0.12  ip-10-0-177-178.us-east-2.compute.internal
noobaa-endpoint-d4bccf9d5-dhnzx  1/1  Running  11m  10.129.2.34  ip-10-0-216-231.us-east-2.compute.internal
noobaa-operator-7df6dc9b74-rgd5h  1/1  Running  11m  10.129.2.42  ip-10-0-216-231.us-east-2.compute.internal
ocs-operator-6c4cbb75d8-w5kqz  1/1  Running  11m  10.129.2.40  ip-10-0-216-231.us-east-2.compute.internal
rook-ceph-crashcollector-ip-10-0-132-167-5fb86ccc4b-694f8  1/1  Running  11m  10.128.2.5  ip-10-0-132-167.us-east-2.compute.internal
rook-ceph-crashcollector-ip-10-0-177-178-554ddc4b69-qtjrj  1/1  Running  4m52s  10.131.0.7  ip-10-0-177-178.us-east-2.compute.internal
rook-ceph-crashcollector-ip-10-0-216-231-6469f797c5-qkc7r  1/1  Running  59m  10.129.2.22  ip-10-0-216-231.us-east-2.compute.internal
rook-ceph-drain-canary-57e3b57cc42ecf0d1b5ce0470ce1c9a3-58ql45n  1/1  Running  59m  10.129.2.21  ip-10-0-216-231.us-east-2.compute.internal
rook-ceph-drain-canary-5c4fe7c2d0fd0ce702064d89daab3bff-78rcbfz  1/1  Running  11m  10.128.2.6  ip-10-0-132-167.us-east-2.compute.internal
rook-ceph-drain-canary-ffddf166409fdafc40a4e743896c1a5d-6cknf9w  1/1  Running  11m  10.131.0.6  ip-10-0-177-178.us-east-2.compute.internal
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-866fd967pngwm  1/1  Running  11m  10.129.2.30  ip-10-0-216-231.us-east-2.compute.internal
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-56b48bb6tfhcg  1/1  Running  11m  10.131.0.5  ip-10-0-177-178.us-east-2.compute.internal
rook-ceph-mgr-a-577f465cf8-98zwt  1/1  Running  11m  10.129.2.33  ip-10-0-216-231.us-east-2.compute.internal
rook-ceph-mon-a-67948d76bc-twslw  1/1  Running  11m  10.128.2.8  ip-10-0-132-167.us-east-2.compute.internal
rook-ceph-mon-b-857fbdd86f-cbrjn  1/1  Running  11m  10.131.0.10  ip-10-0-177-178.us-east-2.compute.internal
rook-ceph-mon-c-7975ddfdb9-p86sr  1/1  Running  60m  10.129.2.18  ip-10-0-216-231.us-east-2.compute.internal
rook-ceph-operator-6546fc9ccc-twjp4  1/1  Running  11m  10.129.2.32  ip-10-0-216-231.us-east-2.compute.internal
rook-ceph-osd-0-54dc779b9f-759pb  1/1  Running  11m  10.128.2.7  ip-10-0-132-167.us-east-2.compute.internal
rook-ceph-osd-1-5b846dc49d-v42tz  1/1  Running  59m  10.129.2.23  ip-10-0-216-231.us-east-2.compute.internal
rook-ceph-osd-2-7669fbfdd5-sjggb  1/1  Running  11m  10.131.0.9  ip-10-0-177-178.us-east-2.compute.internal
rook-ceph-osd-prepare-ocs-deviceset-1-data-0-5sf9x-qtnnn  0/1  Completed  60m  10.129.2.20  ip-10-0-216-231.us-east-2.compute.internal
rook-ceph-tools-cb97b47d6-cmgwd  1/1  Running  11m  10.0.216.231  ip-10-0-216-231.us-east-2.compute.internal
* 'csi-cephfsplugin-provisioner-745957785f-qz7l7' and 'csi-cephfsplugin-provisioner-745957785f-z9zdz' are located on the same node (bug)
* 'csi-rbdplugin-provisioner-7d4596b7d6-7ds28' and 'csi-rbdplugin-provisioner-7d4596b7d6-l5pb9' are located on the same node (bug)

Actual results:
Provisioner pods are running on the same worker node.

Expected results:
Provisioner pods are running on separate worker nodes.

Additional info:
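The expected spreading is driven by whatever pod anti-affinity the two provisioner deployments carry. As a quick check of what the scheduler is actually being told, the affinity section of each deployment can be dumped directly (a sketch; the deployment names are assumed from the pod names in step 6, and the exact anti-affinity rules depend on the rook-ceph/OCS build):

# Assumed deployment names, inferred from the pod names listed above.
# An empty result would mean nothing prevents the replicas from being co-located.
$ oc -n openshift-storage get deployment csi-rbdplugin-provisioner -o jsonpath='{.spec.template.spec.affinity}{"\n"}'
$ oc -n openshift-storage get deployment csi-cephfsplugin-provisioner -o jsonpath='{.spec.template.spec.affinity}{"\n"}'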
@Madhu Looks like the bot didn't pick up the backport to 1.3 (4.5), so we will need to open a backport PR manually for that. There must have been a merge conflict with the backport, but the bot isn't showing the details. The backport risk is low, so I'll ack it with that assumption. @Oded, please mark this as a blocker if you consider it one.
Please add the blocker flag so that we get all the acks.
Bug not reproduced.

Setup:
Provider: VMware
OCP version: 4.5.0-0.nightly-2020-08-15-052753
OCS version: 4.5.0-54.ci

sh-4.4# rook version
rook: 4.5-43.884c3eee.release_4.5
go: go1.13.4

sh-4.4# ceph versions
{
    "mon": {
        "ceph version 14.2.8-91.el8cp (75b4845da7d469665bd48d1a49badcc3677bf5cd) nautilus (stable)": 3
    },
    "mgr": {
        "ceph version 14.2.8-91.el8cp (75b4845da7d469665bd48d1a49badcc3677bf5cd) nautilus (stable)": 1
    },
    "osd": {
        "ceph version 14.2.8-91.el8cp (75b4845da7d469665bd48d1a49badcc3677bf5cd) nautilus (stable)": 2
    },
    "mds": {
        "ceph version 14.2.8-91.el8cp (75b4845da7d469665bd48d1a49badcc3677bf5cd) nautilus (stable)": 2
    },
    "rgw": {
        "ceph version 14.2.8-91.el8cp (75b4845da7d469665bd48d1a49badcc3677bf5cd) nautilus (stable)": 2
    },
    "overall": {
        "ceph version 14.2.8-91.el8cp (75b4845da7d469665bd48d1a49badcc3677bf5cd) nautilus (stable)": 10
    }
}

Test process:
1. Get the 'csi-cephfsplugin-provisioner' and 'csi-rbdplugin-provisioner' pods (openshift-storage):
$ oc get pods -n openshift-storage -o wide | grep csi-rbdplugin-provisioner
csi-rbdplugin-provisioner-8c87b76ff-8b25m  5/5  Running  0  7d6h  10.128.2.8  compute-0  <none>  <none>
csi-rbdplugin-provisioner-8c87b76ff-wfkhk  5/5  Running  0  7d6h  10.131.0.15  compute-1  <none>  <none>

$ oc get pods -n openshift-storage -o wide | grep csi-cephfsplugin-provisioner
csi-cephfsplugin-provisioner-c748c89bf-cdgpp  5/5  Running  0  7d6h  10.131.0.16  compute-1  <none>  <none>
csi-cephfsplugin-provisioner-c748c89bf-hhk2j  5/5  Running  0  7d6h  10.128.2.12  compute-0  <none>  <none>

2. Shut down nodes [compute-0, compute-1].

3. Check node status:
$ oc get nodes
NAME  STATUS  ROLES  AGE  VERSION
compute-0  NotReady  worker  7d6h  v1.18.3+2cf11e2
compute-1  NotReady  worker  7d6h  v1.18.3+2cf11e2
compute-2  Ready  worker  7d6h  v1.18.3+2cf11e2
control-plane-0  Ready  master  7d6h  v1.18.3+2cf11e2
control-plane-1  Ready  master  7d6h  v1.18.3+2cf11e2
control-plane-2  Ready  master  7d6h  v1.18.3+2cf11e2

4. Wait 10 minutes.

5. Get the 'csi-cephfsplugin-provisioner' and 'csi-rbdplugin-provisioner' pods (openshift-storage):
$ oc get pods -n openshift-storage -o wide | grep provisioner
csi-cephfsplugin-provisioner-c748c89bf-cdgpp  5/5  Terminating  0  7d6h  10.131.0.16  compute-1  <none>  <none>
csi-cephfsplugin-provisioner-c748c89bf-hhk2j  5/5  Terminating  0  7d6h  10.128.2.12  compute-0  <none>  <none>
csi-cephfsplugin-provisioner-c748c89bf-qk6mp  0/5  Pending  0  4m8s  <none>  <none>  <none>  <none>
csi-cephfsplugin-provisioner-c748c89bf-s54rl  5/5  Running  0  4m28s  10.129.2.69  compute-2  <none>  <none>
csi-rbdplugin-provisioner-8c87b76ff-475kk  0/5  Pending  0  4m8s  <none>  <none>  <none>  <none>
csi-rbdplugin-provisioner-8c87b76ff-8b25m  5/5  Terminating  0  7d6h  10.128.2.8  compute-0  <none>  <none>
csi-rbdplugin-provisioner-8c87b76ff-q4mq7  5/5  Running  0  4m28s  10.129.2.70  compute-2  <none>  <none>
csi-rbdplugin-provisioner-8c87b76ff-wfkhk  5/5  Terminating  0  7d6h  10.131.0.15  compute-1  <none>  <none>

6. Power up nodes [compute-0, compute-1].

7. Check node status:
$ oc get nodes
NAME  STATUS  ROLES  AGE  VERSION
compute-0  Ready  worker  7d6h  v1.18.3+2cf11e2
compute-1  Ready  worker  7d6h  v1.18.3+2cf11e2
compute-2  Ready  worker  7d6h  v1.18.3+2cf11e2
control-plane-0  Ready  master  7d7h  v1.18.3+2cf11e2
control-plane-1  Ready  master  7d7h  v1.18.3+2cf11e2
control-plane-2  Ready  master  7d7h  v1.18.3+2cf11e2

8. Get the 'csi-cephfsplugin-provisioner' and 'csi-rbdplugin-provisioner' pods (openshift-storage). The provisioner pods are not running on the same worker node:
$ oc get pods -n openshift-storage -o wide | grep provisioner
csi-cephfsplugin-provisioner-c748c89bf-qk6mp  5/5  Running  0  7m42s  10.131.0.7  compute-1  <none>  <none>
csi-cephfsplugin-provisioner-c748c89bf-s54rl  5/5  Running  0  8m2s  10.129.2.69  compute-2  <none>  <none>
csi-rbdplugin-provisioner-8c87b76ff-475kk  5/5  Running  0  7m42s  10.128.2.4  compute-0  <none>  <none>
csi-rbdplugin-provisioner-8c87b76ff-q4mq7  5/5  Running  0  8m2s  10.129.2.70  compute-2  <none>  <none>
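For reference, pod-to-node placement can also be listed more compactly than with the wide output (a sketch; the 'app=...' labels are assumed to be the ones rook-ceph puts on the provisioner pods and may differ between releases):

$ oc -n openshift-storage get pods -l 'app in (csi-rbdplugin-provisioner,csi-cephfsplugin-provisioner)' -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName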
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Container Storage 4.5.0 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:3754
The automation test can be found here:
https://github.com/red-hat-storage/ocs-ci/blob/master/tests/manage/z_cluster/nodes/test_check_pod_status_after_two_nodes_shutdown_recovery.py

Polarion link:
https://polarion.engineering.redhat.com/polarion/#/project/OpenShiftContainerStorage/workitem?id=OCS-2315
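The check at the heart of such a test is simply that no node hosts more than one Running replica of the same provisioner. A minimal shell sketch of that assertion (not the ocs-ci code; the 'app=...' labels are assumed and a configured kubeconfig is required):

# Prints nothing when every provisioner replica runs on a distinct node;
# any node name printed hosts more than one replica of that provisioner (failure).
for app in csi-rbdplugin-provisioner csi-cephfsplugin-provisioner; do
  oc -n openshift-storage get pods -l "app=$app" --field-selector=status.phase=Running \
    -o custom-columns=:.spec.nodeName --no-headers | sort | uniq -d
done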