Bug 1867092
| Summary: | After 2 OCS node shutdown, both provisioner pods running on the same worker node | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Container Storage | Reporter: | Oded <oviner> |
| Component: | rook | Assignee: | Madhu Rajanna <mrajanna> |
| Status: | CLOSED ERRATA | QA Contact: | Oded <oviner> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.5 | CC: | assingh, hnallurv, ikave, madam, mrajanna, muagarwa, ocs-bugs, ratamir, sostapov, tdesala, tnielsen |
| Target Milestone: | --- | Keywords: | Automation |
| Target Release: | OCS 4.5.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | 4.5.0-526.ci | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-09-15 10:18:38 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
@Madhu Looks like the bot didn't pick up the backport to 1.3 (4.5), so we will need to open a backport PR manually for that. There must have been a merge conflict with the backport, but the bot isn't showing the details.

The risk for the backport is low, so I'll ack it with that assumption. @Oded, please mark this as a blocker if you consider it one, and add the blocker flag so that we get all the acks.

Bug not reproduced on the fixed build.
Setup:
Provider: VMware
OCP version: 4.5.0-0.nightly-2020-08-15-052753
OCS version: 4.5.0-54.ci
sh-4.4# rook version
rook: 4.5-43.884c3eee.release_4.5
go: go1.13.4
sh-4.4# ceph versions
{
"mon": {
"ceph version 14.2.8-91.el8cp (75b4845da7d469665bd48d1a49badcc3677bf5cd) nautilus (stable)": 3
},
"mgr": {
"ceph version 14.2.8-91.el8cp (75b4845da7d469665bd48d1a49badcc3677bf5cd) nautilus (stable)": 1
},
"osd": {
"ceph version 14.2.8-91.el8cp (75b4845da7d469665bd48d1a49badcc3677bf5cd) nautilus (stable)": 2
},
"mds": {
"ceph version 14.2.8-91.el8cp (75b4845da7d469665bd48d1a49badcc3677bf5cd) nautilus (stable)": 2
},
"rgw": {
"ceph version 14.2.8-91.el8cp (75b4845da7d469665bd48d1a49badcc3677bf5cd) nautilus (stable)": 2
},
"overall": {
"ceph version 14.2.8-91.el8cp (75b4845da7d469665bd48d1a49badcc3677bf5cd) nautilus (stable)": 10
}
}
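The version output above comes from in-cluster pods (the report does not say which ones). Below is a minimal sketch of one common way to collect the same data non-interactively, assuming the usual app=rook-ceph-operator and app=rook-ceph-tools labels; both label selectors are assumptions, not taken from this report.

# Sketch only: gather the rook and Ceph version output shown above.
# Label selectors are assumed, not confirmed by this report.
OPERATOR_POD=$(oc -n openshift-storage get pod -l app=rook-ceph-operator -o jsonpath='{.items[0].metadata.name}')
TOOLS_POD=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o jsonpath='{.items[0].metadata.name}')
oc -n openshift-storage exec "$OPERATOR_POD" -- rook version   # rook and go versions
oc -n openshift-storage exec "$TOOLS_POD" -- ceph versions     # per-daemon Ceph versions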
Test Process:
1. Get 'csi-cephfsplugin-provisioner' and 'csi-rbdplugin-provisioner' pods (openshift-storage):
$ oc get pods -n openshift-storage -o wide | grep csi-rbdplugin-provisioner
csi-rbdplugin-provisioner-8c87b76ff-8b25m 5/5 Running 0 7d6h 10.128.2.8 compute-0 <none> <none>
csi-rbdplugin-provisioner-8c87b76ff-wfkhk 5/5 Running 0 7d6h 10.131.0.15 compute-1 <none> <none>
$ oc get pods -n openshift-storage -o wide | grep csi-cephfsplugin-provisioner
csi-cephfsplugin-provisioner-c748c89bf-cdgpp 5/5 Running 0 7d6h 10.131.0.16 compute-1 <none> <none>
csi-cephfsplugin-provisioner-c748c89bf-hhk2j 5/5 Running 0 7d6h 10.128.2.12 compute-0 <none> <none>
2. Shut down nodes [compute-0, compute-1].
3. Check node status:
$ oc get nodes
NAME STATUS ROLES AGE VERSION
compute-0 NotReady worker 7d6h v1.18.3+2cf11e2
compute-1 NotReady worker 7d6h v1.18.3+2cf11e2
compute-2 Ready worker 7d6h v1.18.3+2cf11e2
control-plane-0 Ready master 7d6h v1.18.3+2cf11e2
control-plane-1 Ready master 7d6h v1.18.3+2cf11e2
control-plane-2 Ready master 7d6h v1.18.3+2cf11e2
4. Wait 10 minutes.
5. Get 'csi-cephfsplugin-provisioner' and 'csi-rbdplugin-provisioner' pods (openshift-storage):
$ oc get pods -n openshift-storage -o wide | grep provisioner
csi-cephfsplugin-provisioner-c748c89bf-cdgpp 5/5 Terminating 0 7d6h 10.131.0.16 compute-1 <none> <none>
csi-cephfsplugin-provisioner-c748c89bf-hhk2j 5/5 Terminating 0 7d6h 10.128.2.12 compute-0 <none> <none>
csi-cephfsplugin-provisioner-c748c89bf-qk6mp 0/5 Pending 0 4m8s <none> <none> <none> <none>
csi-cephfsplugin-provisioner-c748c89bf-s54rl 5/5 Running 0 4m28s 10.129.2.69 compute-2 <none> <none>
csi-rbdplugin-provisioner-8c87b76ff-475kk 0/5 Pending 0 4m8s <none> <none> <none> <none>
csi-rbdplugin-provisioner-8c87b76ff-8b25m 5/5 Terminating 0 7d6h 10.128.2.8 compute-0 <none> <none>
csi-rbdplugin-provisioner-8c87b76ff-q4mq7 5/5 Running 0 4m28s 10.129.2.70 compute-2 <none> <none>
csi-rbdplugin-provisioner-8c87b76ff-wfkhk 5/5 Terminating 0 7d6h 10.131.0.15 compute-1 <none> <none>
6. Power up nodes [compute-0, compute-1].
7. Check node status:
$ oc get nodes
NAME STATUS ROLES AGE VERSION
compute-0 Ready worker 7d6h v1.18.3+2cf11e2
compute-1 Ready worker 7d6h v1.18.3+2cf11e2
compute-2 Ready worker 7d6h v1.18.3+2cf11e2
control-plane-0 Ready master 7d7h v1.18.3+2cf11e2
control-plane-1 Ready master 7d7h v1.18.3+2cf11e2
control-plane-2 Ready master 7d7h v1.18.3+2cf11e2
8. Get 'csi-cephfsplugin-provisioner' and 'csi-rbdplugin-provisioner' pods (openshift-storage):
The provisioner pods are no longer running on the same worker node (a scripted form of this check is sketched after this procedure).
$ oc get pods -n openshift-storage -o wide | grep provisioner
csi-cephfsplugin-provisioner-c748c89bf-qk6mp 5/5 Running 0 7m42s 10.131.0.7 compute-1 <none> <none>
csi-cephfsplugin-provisioner-c748c89bf-s54rl 5/5 Running 0 8m2s 10.129.2.69 compute-2 <none> <none>
csi-rbdplugin-provisioner-8c87b76ff-475kk 5/5 Running 0 7m42s 10.128.2.4 compute-0 <none> <none>
csi-rbdplugin-provisioner-8c87b76ff-q4mq7 5/5 Running 0 8m2s 10.129.2.70 compute-2 <none> <none>
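The same placement check can be scripted. A minimal sketch, assuming the provisioner pods carry the usual app=csi-rbdplugin-provisioner and app=csi-cephfsplugin-provisioner labels (the label selectors are assumptions, not shown in the output above):

# Sketch: flag any provisioner whose running replicas share a node.
# Label selectors are assumed; adjust them to match the actual pod labels.
for app in csi-rbdplugin-provisioner csi-cephfsplugin-provisioner; do
  nodes=$(oc -n openshift-storage get pods -l app="$app" \
    --field-selector=status.phase=Running \
    -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}')
  dup=$(echo "$nodes" | sort | uniq -d)
  if [ -n "$dup" ]; then
    echo "BUG: $app replicas share node(s): $dup"
  else
    echo "OK: $app replicas run on distinct nodes"
  fi
done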
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Container Storage 4.5.0 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2020:3754

The automation test can be found here: https://github.com/red-hat-storage/ocs-ci/blob/master/tests/manage/z_cluster/nodes/test_check_pod_status_after_two_nodes_shutdown_recovery.py
Polarion link: https://polarion.engineering.redhat.com/polarion/#/project/OpenShiftContainerStorage/workitem?id=OCS-2315
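One way to confirm that a given build actually constrains provisioner scheduling is to inspect the deployments directly. The sketch below assumes the fix is expressed as pod anti-affinity on the pod template (the report itself does not state the mechanism); an empty result would mean no anti-affinity is configured on that deployment.

# Sketch: print any pod anti-affinity configured on the provisioner deployments.
# Assumes the fix takes the form of pod anti-affinity; not confirmed by this report.
for d in csi-rbdplugin-provisioner csi-cephfsplugin-provisioner; do
  echo "== $d =="
  oc -n openshift-storage get deployment "$d" \
    -o jsonpath='{.spec.template.spec.affinity.podAntiAffinity}'
  echo
done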
Description of problem (please be detailed as possible and provide log snippets):
After 2 OCS node shutdown, both provisioner pods are running on the same worker node.

Version of all relevant components (if applicable):
Provider: AWS_IPI
OCP version: 4.5.0-0.nightly-2020-08-06-102404
OCS version: ocs-operator.v4.5.0-515.ci
sh-4.4# ceph version
ceph version 14.2.8-81.el8cp (0336e23b7404496341b988c8057538b8185ca5ec) nautilus (stable)
sh-4.4# rook version
rook: 4.5-38.e7a77d32.release_4.5
go: go1.13.4
sh-4.4# ceph versions
{
"mon": {
"ceph version 14.2.8-81.el8cp (0336e23b7404496341b988c8057538b8185ca5ec) nautilus (stable)": 3
},
"mgr": {
"ceph version 14.2.8-81.el8cp (0336e23b7404496341b988c8057538b8185ca5ec) nautilus (stable)": 1
},
"osd": {
"ceph version 14.2.8-81.el8cp (0336e23b7404496341b988c8057538b8185ca5ec) nautilus (stable)": 3
},
"mds": {
"ceph version 14.2.8-81.el8cp (0336e23b7404496341b988c8057538b8185ca5ec) nautilus (stable)": 2
},
"overall": {
"ceph version 14.2.8-81.el8cp (0336e23b7404496341b988c8057538b8185ca5ec) nautilus (stable)": 9
}
}

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)? 1

Can this issue be reproduced? Yes.

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Get nodes:
$ oc get nodes
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-132-167.us-east-2.compute.internal   Ready    worker   68m   v1.18.3+08c38ef
ip-10-0-142-171.us-east-2.compute.internal   Ready    master   77m   v1.18.3+08c38ef
ip-10-0-168-36.us-east-2.compute.internal    Ready    master   77m   v1.18.3+08c38ef
ip-10-0-177-178.us-east-2.compute.internal   Ready    worker   68m   v1.18.3+08c38ef
ip-10-0-216-231.us-east-2.compute.internal   Ready    worker   67m   v1.18.3+08c38ef
ip-10-0-223-41.us-east-2.compute.internal    Ready    master   77m   v1.18.3+08c38ef
2. Shut down 2 worker nodes via the Amazon UI (a CLI equivalent is sketched after this description):
* ip-10-0-132-167.us-east-2.compute.internal
* ip-10-0-177-178.us-east-2.compute.internal
3. Get nodes:
$ oc get nodes
NAME                                         STATUS     ROLES    AGE   VERSION
ip-10-0-132-167.us-east-2.compute.internal   NotReady   worker   73m   v1.18.3+08c38ef
ip-10-0-142-171.us-east-2.compute.internal   Ready      master   82m   v1.18.3+08c38ef
ip-10-0-168-36.us-east-2.compute.internal    Ready      master   82m   v1.18.3+08c38ef
ip-10-0-177-178.us-east-2.compute.internal   NotReady   worker   73m   v1.18.3+08c38ef
ip-10-0-216-231.us-east-2.compute.internal   Ready      worker   72m   v1.18.3+08c38ef
ip-10-0-223-41.us-east-2.compute.internal    Ready      master   82m   v1.18.3+08c38ef
4. Start the relevant nodes via the Amazon UI:
* ip-10-0-132-167.us-east-2.compute.internal
* ip-10-0-177-178.us-east-2.compute.internal
5. Check node status:
$ oc get nodes
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-132-167.us-east-2.compute.internal   Ready    worker   85m   v1.18.3+08c38ef
ip-10-0-142-171.us-east-2.compute.internal   Ready    master   94m   v1.18.3+08c38ef
ip-10-0-168-36.us-east-2.compute.internal    Ready    master   94m   v1.18.3+08c38ef
ip-10-0-177-178.us-east-2.compute.internal   Ready    worker   85m   v1.18.3+08c38ef
ip-10-0-216-231.us-east-2.compute.internal   Ready    worker   84m   v1.18.3+08c38ef
ip-10-0-223-41.us-east-2.compute.internal    Ready    master   94m   v1.18.3+08c38ef
6. Get all pods (openshift-storage):
$ oc get pods -n openshift-storage -o wide
NAME  READY  STATUS  AGE  IP  NODE
csi-cephfsplugin-d249c  3/3  Running  62m  10.0.216.231  ip-10-0-216-231.us-east-2.compute.internal
csi-cephfsplugin-kf7rc  3/3  Running  62m  10.0.132.167  ip-10-0-132-167.us-east-2.compute.internal
csi-cephfsplugin-p982p  3/3  Running  62m  10.0.177.178  ip-10-0-177-178.us-east-2.compute.internal
csi-cephfsplugin-provisioner-745957785f-qz7l7  5/5  Running  62m  10.129.2.16  ip-10-0-216-231.us-east-2.compute.internal
csi-cephfsplugin-provisioner-745957785f-z9zdz  5/5  Running  11m  10.129.2.37  ip-10-0-216-231.us-east-2.compute.internal
csi-rbdplugin-kqxqm  3/3  Running  62m  10.0.177.178  ip-10-0-177-178.us-east-2.compute.internal
csi-rbdplugin-p8rgl  3/3  Running  62m  10.0.132.167  ip-10-0-132-167.us-east-2.compute.internal
csi-rbdplugin-provisioner-7d4596b7d6-7ds28  5/5  Running  11m  10.129.2.43  ip-10-0-216-231.us-east-2.compute.internal
csi-rbdplugin-provisioner-7d4596b7d6-l5pb9  5/5  Running  62m  10.129.2.15  ip-10-0-216-231.us-east-2.compute.internal
csi-rbdplugin-srnsr  3/3  Running  62m  10.0.216.231  ip-10-0-216-231.us-east-2.compute.internal
noobaa-core-0  1/1  Running  4m52s  10.131.0.8  ip-10-0-177-178.us-east-2.compute.internal
noobaa-db-0  1/1  Running  4m52s  10.131.0.12  ip-10-0-177-178.us-east-2.compute.internal
noobaa-endpoint-d4bccf9d5-dhnzx  1/1  Running  11m  10.129.2.34  ip-10-0-216-231.us-east-2.compute.internal
noobaa-operator-7df6dc9b74-rgd5h  1/1  Running  11m  10.129.2.42  ip-10-0-216-231.us-east-2.compute.internal
ocs-operator-6c4cbb75d8-w5kqz  1/1  Running  11m  10.129.2.40  ip-10-0-216-231.us-east-2.compute.internal
rook-ceph-crashcollector-ip-10-0-132-167-5fb86ccc4b-694f8  1/1  Running  11m  10.128.2.5  ip-10-0-132-167.us-east-2.compute.internal
rook-ceph-crashcollector-ip-10-0-177-178-554ddc4b69-qtjrj  1/1  Running  4m52s  10.131.0.7  ip-10-0-177-178.us-east-2.compute.internal
rook-ceph-crashcollector-ip-10-0-216-231-6469f797c5-qkc7r  1/1  Running  59m  10.129.2.22  ip-10-0-216-231.us-east-2.compute.internal
rook-ceph-drain-canary-57e3b57cc42ecf0d1b5ce0470ce1c9a3-58ql45n  1/1  Running  59m  10.129.2.21  ip-10-0-216-231.us-east-2.compute.internal
rook-ceph-drain-canary-5c4fe7c2d0fd0ce702064d89daab3bff-78rcbfz  1/1  Running  11m  10.128.2.6  ip-10-0-132-167.us-east-2.compute.internal
rook-ceph-drain-canary-ffddf166409fdafc40a4e743896c1a5d-6cknf9w  1/1  Running  11m  10.131.0.6  ip-10-0-177-178.us-east-2.compute.internal
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-866fd967pngwm  1/1  Running  11m  10.129.2.30  ip-10-0-216-231.us-east-2.compute.internal
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-56b48bb6tfhcg  1/1  Running  11m  10.131.0.5  ip-10-0-177-178.us-east-2.compute.internal
rook-ceph-mgr-a-577f465cf8-98zwt  1/1  Running  11m  10.129.2.33  ip-10-0-216-231.us-east-2.compute.internal
rook-ceph-mon-a-67948d76bc-twslw  1/1  Running  11m  10.128.2.8  ip-10-0-132-167.us-east-2.compute.internal
rook-ceph-mon-b-857fbdd86f-cbrjn  1/1  Running  11m  10.131.0.10  ip-10-0-177-178.us-east-2.compute.internal
rook-ceph-mon-c-7975ddfdb9-p86sr  1/1  Running  60m  10.129.2.18  ip-10-0-216-231.us-east-2.compute.internal
rook-ceph-operator-6546fc9ccc-twjp4  1/1  Running  11m  10.129.2.32  ip-10-0-216-231.us-east-2.compute.internal
rook-ceph-osd-0-54dc779b9f-759pb  1/1  Running  11m  10.128.2.7  ip-10-0-132-167.us-east-2.compute.internal
rook-ceph-osd-1-5b846dc49d-v42tz  1/1  Running  59m  10.129.2.23  ip-10-0-216-231.us-east-2.compute.internal
rook-ceph-osd-2-7669fbfdd5-sjggb  1/1  Running  11m  10.131.0.9  ip-10-0-177-178.us-east-2.compute.internal
rook-ceph-osd-prepare-ocs-deviceset-1-data-0-5sf9x-qtnnn  0/1  Completed  60m  10.129.2.20  ip-10-0-216-231.us-east-2.compute.internal
rook-ceph-tools-cb97b47d6-cmgwd  1/1  Running  11m  10.0.216.231  ip-10-0-216-231.us-east-2.compute.internal

* 'csi-cephfsplugin-provisioner-745957785f-qz7l7' and 'csi-cephfsplugin-provisioner-745957785f-z9zdz' are located on the same node (bug).
* 'csi-rbdplugin-provisioner-7d4596b7d6-7ds28' and 'csi-rbdplugin-provisioner-7d4596b7d6-l5pb9' are located on the same node (bug).

Actual results:
Provisioner pods are running on the same worker node.

Expected results:
Provisioner pods run on separate worker nodes.

Additional info:
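The node power operations in steps 2 and 4 above were done through the Amazon console. For completeness, a hedged CLI equivalent is sketched below; the instance IDs are hypothetical placeholders for the two worker instances, not values from this report.

# Sketch only: stop and later restart the two worker instances via the AWS CLI.
# i-0123456789abcdef0 and i-0fedcba9876543210 are placeholder IDs.
aws ec2 stop-instances --instance-ids i-0123456789abcdef0 i-0fedcba9876543210
aws ec2 wait instance-stopped --instance-ids i-0123456789abcdef0 i-0fedcba9876543210
aws ec2 start-instances --instance-ids i-0123456789abcdef0 i-0fedcba9876543210
aws ec2 wait instance-running --instance-ids i-0123456789abcdef0 i-0fedcba9876543210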