Bug 1867092
| Summary: | After 2 OCS node shutdown, both provisioner pods running on the same worker node | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Container Storage | Reporter: | Oded <oviner> |
| Component: | rook | Assignee: | Madhu Rajanna <mrajanna> |
| Status: | CLOSED ERRATA | QA Contact: | Oded <oviner> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.5 | CC: | assingh, hnallurv, ikave, madam, mrajanna, muagarwa, ocs-bugs, ratamir, sostapov, tdesala, tnielsen |
| Target Milestone: | --- | Keywords: | Automation |
| Target Release: | OCS 4.5.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | 4.5.0-526.ci | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-09-15 10:18:38 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
@Madhu Looks like the bot didn't pick up the backport to 1.3 (4.5), so we will need to open a backport PR manually for that. There must have been a merge conflict with the backport, but the bot isn't showing the details.

The risk for the backport is low, so I'll ack it with that assumption. @Oded, please mark this as a blocker if you consider it one, and add the blocker flag so that we get all the acks.

Bug not reproduced on the fixed build.
Setup:
Provider: VMware
OCP version: 4.5.0-0.nightly-2020-08-15-052753
OCS version: 4.5.0-54.ci
sh-4.4# rook version
rook: 4.5-43.884c3eee.release_4.5
go: go1.13.4
sh-4.4# ceph versions
{
"mon": {
"ceph version 14.2.8-91.el8cp (75b4845da7d469665bd48d1a49badcc3677bf5cd) nautilus (stable)": 3
},
"mgr": {
"ceph version 14.2.8-91.el8cp (75b4845da7d469665bd48d1a49badcc3677bf5cd) nautilus (stable)": 1
},
"osd": {
"ceph version 14.2.8-91.el8cp (75b4845da7d469665bd48d1a49badcc3677bf5cd) nautilus (stable)": 2
},
"mds": {
"ceph version 14.2.8-91.el8cp (75b4845da7d469665bd48d1a49badcc3677bf5cd) nautilus (stable)": 2
},
"rgw": {
"ceph version 14.2.8-91.el8cp (75b4845da7d469665bd48d1a49badcc3677bf5cd) nautilus (stable)": 2
},
"overall": {
"ceph version 14.2.8-91.el8cp (75b4845da7d469665bd48d1a49badcc3677bf5cd) nautilus (stable)": 10
}
}
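The version output above comes from in-cluster pods (the report does not say which ones). Below is a minimal sketch of one common way to collect the same data non-interactively, assuming the usual app=rook-ceph-operator and app=rook-ceph-tools labels; both label selectors are assumptions, not taken from this report.

# Sketch only: gather the rook and Ceph version output shown above.
# Label selectors are assumed, not confirmed by this report.
OPERATOR_POD=$(oc -n openshift-storage get pod -l app=rook-ceph-operator -o jsonpath='{.items[0].metadata.name}')
TOOLS_POD=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o jsonpath='{.items[0].metadata.name}')
oc -n openshift-storage exec "$OPERATOR_POD" -- rook version   # rook and go versions
oc -n openshift-storage exec "$TOOLS_POD" -- ceph versions     # per-daemon Ceph versions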
Test Process:
1. Get 'csi-cephfsplugin-provisioner' and 'csi-rbdplugin-provisioner' pods (openshift-storage):
$ oc get pods -n openshift-storage -o wide | grep csi-rbdplugin-provisioner
csi-rbdplugin-provisioner-8c87b76ff-8b25m 5/5 Running 0 7d6h 10.128.2.8 compute-0 <none> <none>
csi-rbdplugin-provisioner-8c87b76ff-wfkhk 5/5 Running 0 7d6h 10.131.0.15 compute-1 <none> <none>
$ oc get pods -n openshift-storage -o wide | grep csi-cephfsplugin-provisioner
csi-cephfsplugin-provisioner-c748c89bf-cdgpp 5/5 Running 0 7d6h 10.131.0.16 compute-1 <none> <none>
csi-cephfsplugin-provisioner-c748c89bf-hhk2j 5/5 Running 0 7d6h 10.128.2.12 compute-0 <none> <none>
2. Shut down nodes [compute-0, compute-1].
3. Check node status:
$ oc get nodes
NAME STATUS ROLES AGE VERSION
compute-0 NotReady worker 7d6h v1.18.3+2cf11e2
compute-1 NotReady worker 7d6h v1.18.3+2cf11e2
compute-2 Ready worker 7d6h v1.18.3+2cf11e2
control-plane-0 Ready master 7d6h v1.18.3+2cf11e2
control-plane-1 Ready master 7d6h v1.18.3+2cf11e2
control-plane-2 Ready master 7d6h v1.18.3+2cf11e2
4. Wait 10 minutes.
5. Get 'csi-cephfsplugin-provisioner' and 'csi-rbdplugin-provisioner' pods (openshift-storage):
$ oc get pods -n openshift-storage -o wide | grep provisioner
csi-cephfsplugin-provisioner-c748c89bf-cdgpp 5/5 Terminating 0 7d6h 10.131.0.16 compute-1 <none> <none>
csi-cephfsplugin-provisioner-c748c89bf-hhk2j 5/5 Terminating 0 7d6h 10.128.2.12 compute-0 <none> <none>
csi-cephfsplugin-provisioner-c748c89bf-qk6mp 0/5 Pending 0 4m8s <none> <none> <none> <none>
csi-cephfsplugin-provisioner-c748c89bf-s54rl 5/5 Running 0 4m28s 10.129.2.69 compute-2 <none> <none>
csi-rbdplugin-provisioner-8c87b76ff-475kk 0/5 Pending 0 4m8s <none> <none> <none> <none>
csi-rbdplugin-provisioner-8c87b76ff-8b25m 5/5 Terminating 0 7d6h 10.128.2.8 compute-0 <none> <none>
csi-rbdplugin-provisioner-8c87b76ff-q4mq7 5/5 Running 0 4m28s 10.129.2.70 compute-2 <none> <none>
csi-rbdplugin-provisioner-8c87b76ff-wfkhk 5/5 Terminating 0 7d6h 10.131.0.15 compute-1 <none> <none>
6. Power up nodes [compute-0, compute-1].
7. Check node status:
$ oc get nodes
NAME STATUS ROLES AGE VERSION
compute-0 Ready worker 7d6h v1.18.3+2cf11e2
compute-1 Ready worker 7d6h v1.18.3+2cf11e2
compute-2 Ready worker 7d6h v1.18.3+2cf11e2
control-plane-0 Ready master 7d7h v1.18.3+2cf11e2
control-plane-1 Ready master 7d7h v1.18.3+2cf11e2
control-plane-2 Ready master 7d7h v1.18.3+2cf11e2
8. Get 'csi-cephfsplugin-provisioner' and 'csi-rbdplugin-provisioner' pods (openshift-storage):
The provisioner pods are no longer running on the same worker node (a scripted form of this check is sketched after this procedure).
$ oc get pods -n openshift-storage -o wide | grep provisioner
csi-cephfsplugin-provisioner-c748c89bf-qk6mp 5/5 Running 0 7m42s 10.131.0.7 compute-1 <none> <none>
csi-cephfsplugin-provisioner-c748c89bf-s54rl 5/5 Running 0 8m2s 10.129.2.69 compute-2 <none> <none>
csi-rbdplugin-provisioner-8c87b76ff-475kk 5/5 Running 0 7m42s 10.128.2.4 compute-0 <none> <none>
csi-rbdplugin-provisioner-8c87b76ff-q4mq7 5/5 Running 0 8m2s 10.129.2.70 compute-2 <none> <none>
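The same placement check can be scripted. A minimal sketch, assuming the provisioner pods carry the usual app=csi-rbdplugin-provisioner and app=csi-cephfsplugin-provisioner labels (the label selectors are assumptions, not shown in the output above):

# Sketch: flag any provisioner whose running replicas share a node.
# Label selectors are assumed; adjust them to match the actual pod labels.
for app in csi-rbdplugin-provisioner csi-cephfsplugin-provisioner; do
  nodes=$(oc -n openshift-storage get pods -l app="$app" \
    --field-selector=status.phase=Running \
    -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}')
  dup=$(echo "$nodes" | sort | uniq -d)
  if [ -n "$dup" ]; then
    echo "BUG: $app replicas share node(s): $dup"
  else
    echo "OK: $app replicas run on distinct nodes"
  fi
done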
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Container Storage 4.5.0 bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2020:3754

The automation test can be found here: https://github.com/red-hat-storage/ocs-ci/blob/master/tests/manage/z_cluster/nodes/test_check_pod_status_after_two_nodes_shutdown_recovery.py
Polarion link: https://polarion.engineering.redhat.com/polarion/#/project/OpenShiftContainerStorage/workitem?id=OCS-2315
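One way to confirm that a given build actually constrains provisioner scheduling is to inspect the deployments directly. The sketch below assumes the fix is expressed as pod anti-affinity on the pod template (the report itself does not state the mechanism); an empty result would mean no anti-affinity is configured on that deployment.

# Sketch: print any pod anti-affinity configured on the provisioner deployments.
# Assumes the fix takes the form of pod anti-affinity; not confirmed by this report.
for d in csi-rbdplugin-provisioner csi-cephfsplugin-provisioner; do
  echo "== $d =="
  oc -n openshift-storage get deployment "$d" \
    -o jsonpath='{.spec.template.spec.affinity.podAntiAffinity}'
  echo
done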
Description of problem (please be detailed as possible and provide log snippets):
After 2 OCS node shutdown, both provisioner pods are running on the same worker node.

Version of all relevant components (if applicable):
Provider: AWS_IPI
OCP version: 4.5.0-0.nightly-2020-08-06-102404
OCS version: ocs-operator.v4.5.0-515.ci
sh-4.4# ceph version
ceph version 14.2.8-81.el8cp (0336e23b7404496341b988c8057538b8185ca5ec) nautilus (stable)
sh-4.4# rook version
rook: 4.5-38.e7a77d32.release_4.5
go: go1.13.4
sh-4.4# ceph versions
{
"mon": {
"ceph version 14.2.8-81.el8cp (0336e23b7404496341b988c8057538b8185ca5ec) nautilus (stable)": 3
},
"mgr": {
"ceph version 14.2.8-81.el8cp (0336e23b7404496341b988c8057538b8185ca5ec) nautilus (stable)": 1
},
"osd": {
"ceph version 14.2.8-81.el8cp (0336e23b7404496341b988c8057538b8185ca5ec) nautilus (stable)": 3
},
"mds": {
"ceph version 14.2.8-81.el8cp (0336e23b7404496341b988c8057538b8185ca5ec) nautilus (stable)": 2
},
"overall": {
"ceph version 14.2.8-81.el8cp (0336e23b7404496341b988c8057538b8185ca5ec) nautilus (stable)": 9
}
}

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)? 1

Can this issue be reproduced? Yes.

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Get nodes:
$ oc get nodes
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-132-167.us-east-2.compute.internal   Ready    worker   68m   v1.18.3+08c38ef
ip-10-0-142-171.us-east-2.compute.internal   Ready    master   77m   v1.18.3+08c38ef
ip-10-0-168-36.us-east-2.compute.internal    Ready    master   77m   v1.18.3+08c38ef
ip-10-0-177-178.us-east-2.compute.internal   Ready    worker   68m   v1.18.3+08c38ef
ip-10-0-216-231.us-east-2.compute.internal   Ready    worker   67m   v1.18.3+08c38ef
ip-10-0-223-41.us-east-2.compute.internal    Ready    master   77m   v1.18.3+08c38ef
2. Shut down 2 worker nodes via the Amazon UI (a CLI equivalent is sketched after this description):
* ip-10-0-132-167.us-east-2.compute.internal
* ip-10-0-177-178.us-east-2.compute.internal
3. Get nodes:
$ oc get nodes
NAME                                         STATUS     ROLES    AGE   VERSION
ip-10-0-132-167.us-east-2.compute.internal   NotReady   worker   73m   v1.18.3+08c38ef
ip-10-0-142-171.us-east-2.compute.internal   Ready      master   82m   v1.18.3+08c38ef
ip-10-0-168-36.us-east-2.compute.internal    Ready      master   82m   v1.18.3+08c38ef
ip-10-0-177-178.us-east-2.compute.internal   NotReady   worker   73m   v1.18.3+08c38ef
ip-10-0-216-231.us-east-2.compute.internal   Ready      worker   72m   v1.18.3+08c38ef
ip-10-0-223-41.us-east-2.compute.internal    Ready      master   82m   v1.18.3+08c38ef
4. Start the relevant nodes via the Amazon UI:
* ip-10-0-132-167.us-east-2.compute.internal
* ip-10-0-177-178.us-east-2.compute.internal
5. Check node status:
$ oc get nodes
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-132-167.us-east-2.compute.internal   Ready    worker   85m   v1.18.3+08c38ef
ip-10-0-142-171.us-east-2.compute.internal   Ready    master   94m   v1.18.3+08c38ef
ip-10-0-168-36.us-east-2.compute.internal    Ready    master   94m   v1.18.3+08c38ef
ip-10-0-177-178.us-east-2.compute.internal   Ready    worker   85m   v1.18.3+08c38ef
ip-10-0-216-231.us-east-2.compute.internal   Ready    worker   84m   v1.18.3+08c38ef
ip-10-0-223-41.us-east-2.compute.internal    Ready    master   94m   v1.18.3+08c38ef
6. Get all pods (openshift-storage):
$ oc get pods -n openshift-storage -o wide
NAME  READY  STATUS  AGE  IP  NODE
csi-cephfsplugin-d249c  3/3  Running  62m  10.0.216.231  ip-10-0-216-231.us-east-2.compute.internal
csi-cephfsplugin-kf7rc  3/3  Running  62m  10.0.132.167  ip-10-0-132-167.us-east-2.compute.internal
csi-cephfsplugin-p982p  3/3  Running  62m  10.0.177.178  ip-10-0-177-178.us-east-2.compute.internal
csi-cephfsplugin-provisioner-745957785f-qz7l7  5/5  Running  62m  10.129.2.16  ip-10-0-216-231.us-east-2.compute.internal
csi-cephfsplugin-provisioner-745957785f-z9zdz  5/5  Running  11m  10.129.2.37  ip-10-0-216-231.us-east-2.compute.internal
csi-rbdplugin-kqxqm  3/3  Running  62m  10.0.177.178  ip-10-0-177-178.us-east-2.compute.internal
csi-rbdplugin-p8rgl  3/3  Running  62m  10.0.132.167  ip-10-0-132-167.us-east-2.compute.internal
csi-rbdplugin-provisioner-7d4596b7d6-7ds28  5/5  Running  11m  10.129.2.43  ip-10-0-216-231.us-east-2.compute.internal
csi-rbdplugin-provisioner-7d4596b7d6-l5pb9  5/5  Running  62m  10.129.2.15  ip-10-0-216-231.us-east-2.compute.internal
csi-rbdplugin-srnsr  3/3  Running  62m  10.0.216.231  ip-10-0-216-231.us-east-2.compute.internal
noobaa-core-0  1/1  Running  4m52s  10.131.0.8  ip-10-0-177-178.us-east-2.compute.internal
noobaa-db-0  1/1  Running  4m52s  10.131.0.12  ip-10-0-177-178.us-east-2.compute.internal
noobaa-endpoint-d4bccf9d5-dhnzx  1/1  Running  11m  10.129.2.34  ip-10-0-216-231.us-east-2.compute.internal
noobaa-operator-7df6dc9b74-rgd5h  1/1  Running  11m  10.129.2.42  ip-10-0-216-231.us-east-2.compute.internal
ocs-operator-6c4cbb75d8-w5kqz  1/1  Running  11m  10.129.2.40  ip-10-0-216-231.us-east-2.compute.internal
rook-ceph-crashcollector-ip-10-0-132-167-5fb86ccc4b-694f8  1/1  Running  11m  10.128.2.5  ip-10-0-132-167.us-east-2.compute.internal
rook-ceph-crashcollector-ip-10-0-177-178-554ddc4b69-qtjrj  1/1  Running  4m52s  10.131.0.7  ip-10-0-177-178.us-east-2.compute.internal
rook-ceph-crashcollector-ip-10-0-216-231-6469f797c5-qkc7r  1/1  Running  59m  10.129.2.22  ip-10-0-216-231.us-east-2.compute.internal
rook-ceph-drain-canary-57e3b57cc42ecf0d1b5ce0470ce1c9a3-58ql45n  1/1  Running  59m  10.129.2.21  ip-10-0-216-231.us-east-2.compute.internal
rook-ceph-drain-canary-5c4fe7c2d0fd0ce702064d89daab3bff-78rcbfz  1/1  Running  11m  10.128.2.6  ip-10-0-132-167.us-east-2.compute.internal
rook-ceph-drain-canary-ffddf166409fdafc40a4e743896c1a5d-6cknf9w  1/1  Running  11m  10.131.0.6  ip-10-0-177-178.us-east-2.compute.internal
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-866fd967pngwm  1/1  Running  11m  10.129.2.30  ip-10-0-216-231.us-east-2.compute.internal
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-56b48bb6tfhcg  1/1  Running  11m  10.131.0.5  ip-10-0-177-178.us-east-2.compute.internal
rook-ceph-mgr-a-577f465cf8-98zwt  1/1  Running  11m  10.129.2.33  ip-10-0-216-231.us-east-2.compute.internal
rook-ceph-mon-a-67948d76bc-twslw  1/1  Running  11m  10.128.2.8  ip-10-0-132-167.us-east-2.compute.internal
rook-ceph-mon-b-857fbdd86f-cbrjn  1/1  Running  11m  10.131.0.10  ip-10-0-177-178.us-east-2.compute.internal
rook-ceph-mon-c-7975ddfdb9-p86sr  1/1  Running  60m  10.129.2.18  ip-10-0-216-231.us-east-2.compute.internal
rook-ceph-operator-6546fc9ccc-twjp4  1/1  Running  11m  10.129.2.32  ip-10-0-216-231.us-east-2.compute.internal
rook-ceph-osd-0-54dc779b9f-759pb  1/1  Running  11m  10.128.2.7  ip-10-0-132-167.us-east-2.compute.internal
rook-ceph-osd-1-5b846dc49d-v42tz  1/1  Running  59m  10.129.2.23  ip-10-0-216-231.us-east-2.compute.internal
rook-ceph-osd-2-7669fbfdd5-sjggb  1/1  Running  11m  10.131.0.9  ip-10-0-177-178.us-east-2.compute.internal
rook-ceph-osd-prepare-ocs-deviceset-1-data-0-5sf9x-qtnnn  0/1  Completed  60m  10.129.2.20  ip-10-0-216-231.us-east-2.compute.internal
rook-ceph-tools-cb97b47d6-cmgwd  1/1  Running  11m  10.0.216.231  ip-10-0-216-231.us-east-2.compute.internal

* 'csi-cephfsplugin-provisioner-745957785f-qz7l7' and 'csi-cephfsplugin-provisioner-745957785f-z9zdz' are located on the same node (bug).
* 'csi-rbdplugin-provisioner-7d4596b7d6-7ds28' and 'csi-rbdplugin-provisioner-7d4596b7d6-l5pb9' are located on the same node (bug).

Actual results:
Provisioner pods are running on the same worker node.

Expected results:
Provisioner pods run on separate worker nodes.

Additional info:
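The node power operations in steps 2 and 4 above were done through the Amazon console. For completeness, a hedged CLI equivalent is sketched below; the instance IDs are hypothetical placeholders for the two worker instances, not values from this report.

# Sketch only: stop and later restart the two worker instances via the AWS CLI.
# i-0123456789abcdef0 and i-0fedcba9876543210 are placeholder IDs.
aws ec2 stop-instances --instance-ids i-0123456789abcdef0 i-0fedcba9876543210
aws ec2 wait instance-stopped --instance-ids i-0123456789abcdef0 i-0fedcba9876543210
aws ec2 start-instances --instance-ids i-0123456789abcdef0 i-0fedcba9876543210
aws ec2 wait instance-running --instance-ids i-0123456789abcdef0 i-0fedcba9876543210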