Bug 1867092 - After shutdown of 2 OCS nodes, both provisioner pods run on the same worker node
Summary: After shutdown of 2 OCS nodes, both provisioner pods run on the same worker node
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: rook
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: OCS 4.5.0
Assignee: Madhu Rajanna
QA Contact: Oded
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-08-07 10:43 UTC by Oded
Modified: 2021-06-02 12:34 UTC
CC: 11 users

Fixed In Version: 4.5.0-526.ci
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-09-15 10:18:38 UTC
Embargoed:




Links:
GitHub openshift/rook pull 99 (closed): BUG 1867092: ceph: set pod anti affinity to provisioner pod (last updated 2020-10-12 08:57:47 UTC)
Red Hat Product Errata RHBA-2020:3754 (last updated 2020-09-15 10:19:02 UTC)

Description Oded 2020-08-07 10:43:53 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
After shutting down 2 OCS nodes, both provisioner pods end up running on the same worker node.


Version of all relevant components (if applicable):
Provider: AWS_IPI
OCP version: 4.5.0-0.nightly-2020-08-06-102404
OCS version: ocs-operator.v4.5.0-515.ci

sh-4.4# ceph version
ceph version 14.2.8-81.el8cp (0336e23b7404496341b988c8057538b8185ca5ec) nautilus (stable)


sh-4.4# rook version
rook: 4.5-38.e7a77d32.release_4.5
go: go1.13.4


sh-4.4# ceph versions
{
    "mon": {
        "ceph version 14.2.8-81.el8cp (0336e23b7404496341b988c8057538b8185ca5ec) nautilus (stable)": 3
    },
    "mgr": {
        "ceph version 14.2.8-81.el8cp (0336e23b7404496341b988c8057538b8185ca5ec) nautilus (stable)": 1
    },
    "osd": {
        "ceph version 14.2.8-81.el8cp (0336e23b7404496341b988c8057538b8185ca5ec) nautilus (stable)": 3
    },
    "mds": {
        "ceph version 14.2.8-81.el8cp (0336e23b7404496341b988c8057538b8185ca5ec) nautilus (stable)": 2
    },
    "overall": {
        "ceph version 14.2.8-81.el8cp (0336e23b7404496341b988c8057538b8185ca5ec) nautilus (stable)": 9
    }
}


Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Can this issue be reproduced?
yes

Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:

1. Get nodes:
$ oc get nodes
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-132-167.us-east-2.compute.internal   Ready    worker   68m   v1.18.3+08c38ef
ip-10-0-142-171.us-east-2.compute.internal   Ready    master   77m   v1.18.3+08c38ef
ip-10-0-168-36.us-east-2.compute.internal    Ready    master   77m   v1.18.3+08c38ef
ip-10-0-177-178.us-east-2.compute.internal   Ready    worker   68m   v1.18.3+08c38ef
ip-10-0-216-231.us-east-2.compute.internal   Ready    worker   67m   v1.18.3+08c38ef
ip-10-0-223-41.us-east-2.compute.internal    Ready    master   77m   v1.18.3+08c38ef


2. Shut down 2 worker nodes via the Amazon UI (an illustrative AWS CLI equivalent is shown after the list):
*ip-10-0-132-167.us-east-2.compute.internal
*ip-10-0-177-178.us-east-2.compute.internal
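
If the shutdown is scripted rather than done in the console, a roughly equivalent AWS CLI call would look like the following; the instance IDs are placeholders, not values taken from this cluster:

$ aws ec2 stop-instances --instance-ids <worker-1-instance-id> <worker-2-instance-id>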

3. Get nodes:
$ oc get nodes
NAME                                         STATUS     ROLES    AGE   VERSION
ip-10-0-132-167.us-east-2.compute.internal   NotReady   worker   73m   v1.18.3+08c38ef
ip-10-0-142-171.us-east-2.compute.internal   Ready      master   82m   v1.18.3+08c38ef
ip-10-0-168-36.us-east-2.compute.internal    Ready      master   82m   v1.18.3+08c38ef
ip-10-0-177-178.us-east-2.compute.internal   NotReady   worker   73m   v1.18.3+08c38ef
ip-10-0-216-231.us-east-2.compute.internal   Ready      worker   72m   v1.18.3+08c38ef
ip-10-0-223-41.us-east-2.compute.internal    Ready      master   82m   v1.18.3+08c38ef

4. Start the relevant nodes via the Amazon UI:
*ip-10-0-132-167.us-east-2.compute.internal
*ip-10-0-177-178.us-east-2.compute.internal


5. Check node status:
$ oc get nodes
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-132-167.us-east-2.compute.internal   Ready    worker   85m   v1.18.3+08c38ef
ip-10-0-142-171.us-east-2.compute.internal   Ready    master   94m   v1.18.3+08c38ef
ip-10-0-168-36.us-east-2.compute.internal    Ready    master   94m   v1.18.3+08c38ef
ip-10-0-177-178.us-east-2.compute.internal   Ready    worker   85m   v1.18.3+08c38ef
ip-10-0-216-231.us-east-2.compute.internal   Ready    worker   84m   v1.18.3+08c38ef
ip-10-0-223-41.us-east-2.compute.internal    Ready    master   94m   v1.18.3+08c38ef


6. Get all pods in openshift-storage:
$ oc get pods -n openshift-storage -o wide
NAME                                                              READY   STATUS     AGE     IP             NODE                                      
csi-cephfsplugin-d249c                                            3/3     Running    62m     10.0.216.231   ip-10-0-216-231.us-east-2.compute.internal  
csi-cephfsplugin-kf7rc                                            3/3     Running    62m     10.0.132.167   ip-10-0-132-167.us-east-2.compute.internal  
csi-cephfsplugin-p982p                                            3/3     Running    62m     10.0.177.178   ip-10-0-177-178.us-east-2.compute.internal  
csi-cephfsplugin-provisioner-745957785f-qz7l7                     5/5     Running    62m     10.129.2.16    ip-10-0-216-231.us-east-2.compute.internal  
csi-cephfsplugin-provisioner-745957785f-z9zdz                     5/5     Running    11m     10.129.2.37    ip-10-0-216-231.us-east-2.compute.internal  
csi-rbdplugin-kqxqm                                               3/3     Running    62m     10.0.177.178   ip-10-0-177-178.us-east-2.compute.internal  
csi-rbdplugin-p8rgl                                               3/3     Running    62m     10.0.132.167   ip-10-0-132-167.us-east-2.compute.internal  
csi-rbdplugin-provisioner-7d4596b7d6-7ds28                        5/5     Running    11m     10.129.2.43    ip-10-0-216-231.us-east-2.compute.internal  
csi-rbdplugin-provisioner-7d4596b7d6-l5pb9                        5/5     Running    62m     10.129.2.15    ip-10-0-216-231.us-east-2.compute.internal  
csi-rbdplugin-srnsr                                               3/3     Running    62m     10.0.216.231   ip-10-0-216-231.us-east-2.compute.internal  
noobaa-core-0                                                     1/1     Running    4m52s   10.131.0.8     ip-10-0-177-178.us-east-2.compute.internal  
noobaa-db-0                                                       1/1     Running    4m52s   10.131.0.12    ip-10-0-177-178.us-east-2.compute.internal  
noobaa-endpoint-d4bccf9d5-dhnzx                                   1/1     Running    11m     10.129.2.34    ip-10-0-216-231.us-east-2.compute.internal  
noobaa-operator-7df6dc9b74-rgd5h                                  1/1     Running    11m     10.129.2.42    ip-10-0-216-231.us-east-2.compute.internal  
ocs-operator-6c4cbb75d8-w5kqz                                     1/1     Running    11m     10.129.2.40    ip-10-0-216-231.us-east-2.compute.internal  
rook-ceph-crashcollector-ip-10-0-132-167-5fb86ccc4b-694f8         1/1     Running    11m     10.128.2.5     ip-10-0-132-167.us-east-2.compute.internal  
rook-ceph-crashcollector-ip-10-0-177-178-554ddc4b69-qtjrj         1/1     Running    4m52s   10.131.0.7     ip-10-0-177-178.us-east-2.compute.internal  
rook-ceph-crashcollector-ip-10-0-216-231-6469f797c5-qkc7r         1/1     Running    59m     10.129.2.22    ip-10-0-216-231.us-east-2.compute.internal  
rook-ceph-drain-canary-57e3b57cc42ecf0d1b5ce0470ce1c9a3-58ql45n   1/1     Running    59m     10.129.2.21    ip-10-0-216-231.us-east-2.compute.internal  
rook-ceph-drain-canary-5c4fe7c2d0fd0ce702064d89daab3bff-78rcbfz   1/1     Running    11m     10.128.2.6     ip-10-0-132-167.us-east-2.compute.internal  
rook-ceph-drain-canary-ffddf166409fdafc40a4e743896c1a5d-6cknf9w   1/1     Running    11m     10.131.0.6     ip-10-0-177-178.us-east-2.compute.internal  
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-866fd967pngwm   1/1     Running    11m     10.129.2.30    ip-10-0-216-231.us-east-2.compute.internal  
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-56b48bb6tfhcg   1/1     Running    11m     10.131.0.5     ip-10-0-177-178.us-east-2.compute.internal  
rook-ceph-mgr-a-577f465cf8-98zwt                                  1/1     Running    11m     10.129.2.33    ip-10-0-216-231.us-east-2.compute.internal  
rook-ceph-mon-a-67948d76bc-twslw                                  1/1     Running    11m     10.128.2.8     ip-10-0-132-167.us-east-2.compute.internal  
rook-ceph-mon-b-857fbdd86f-cbrjn                                  1/1     Running    11m     10.131.0.10    ip-10-0-177-178.us-east-2.compute.internal  
rook-ceph-mon-c-7975ddfdb9-p86sr                                  1/1     Running    60m     10.129.2.18    ip-10-0-216-231.us-east-2.compute.internal  
rook-ceph-operator-6546fc9ccc-twjp4                               1/1     Running    11m     10.129.2.32    ip-10-0-216-231.us-east-2.compute.internal  
rook-ceph-osd-0-54dc779b9f-759pb                                  1/1     Running    11m     10.128.2.7     ip-10-0-132-167.us-east-2.compute.internal  
rook-ceph-osd-1-5b846dc49d-v42tz                                  1/1     Running    59m     10.129.2.23    ip-10-0-216-231.us-east-2.compute.internal  
rook-ceph-osd-2-7669fbfdd5-sjggb                                  1/1     Running    11m     10.131.0.9     ip-10-0-177-178.us-east-2.compute.internal  
rook-ceph-osd-prepare-ocs-deviceset-1-data-0-5sf9x-qtnnn          0/1     Completed  60m     10.129.2.20    ip-10-0-216-231.us-east-2.compute.internal  
rook-ceph-tools-cb97b47d6-cmgwd                                   1/1     Running    11m     10.0.216.231   ip-10-0-216-231.us-east-2.compute.internal  


*'csi-cephfsplugin-provisioner-745957785f-qz7l7' and 'csi-cephfsplugin-provisioner-745957785f-z9zdz' are located on the same node (bug)
*'csi-rbdplugin-provisioner-7d4596b7d6-7ds28' and 'csi-rbdplugin-provisioner-7d4596b7d6-l5pb9' are located on the same node (bug); a compact placement check is sketched below
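
One compact way to confirm the co-location, using standard oc/kubectl custom columns (the field paths are from the core pod spec, nothing OCS-specific):

$ oc -n openshift-storage get pods -o custom-columns=POD:.metadata.name,NODE:.spec.nodeName | grep provisioner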


Actual results:
Both replicas of each provisioner deployment are running on the same worker node.

Expected results:
The provisioner replicas run on separate worker nodes.

Additional info:
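The linked fix (openshift/rook pull 99) adds pod anti-affinity to the CSI provisioner pods. Assuming the anti-affinity lands in the usual place in the pod template (an assumption based on the PR title, not on its diff), a fixed build can be spot-checked with:

$ oc -n openshift-storage get deployment csi-rbdplugin-provisioner \
    -o jsonpath='{.spec.template.spec.affinity.podAntiAffinity}'
$ oc -n openshift-storage get deployment csi-cephfsplugin-provisioner \
    -o jsonpath='{.spec.template.spec.affinity.podAntiAffinity}'

A non-empty result indicates the provisioner deployments now ask the scheduler to spread their replicas across nodes.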

Comment 3 Travis Nielsen 2020-08-07 13:47:41 UTC
@Madhu Looks like the bot didn't pick up the backport to 1.3 (4.5), so we will need to open a backport PR manually for that. There must have been a merge conflict with the backport, but the bot isn't showing the details. 

The risk for backport is low, so I'll ack it with that assumption.
@Oded please mark it as a blocker if considered as such.

Comment 5 Mudit Agarwal 2020-08-08 02:52:51 UTC
Please add the blocker flag so that we get all the acks.

Comment 8 Oded 2020-08-24 22:37:15 UTC
The bug did not reproduce.

Setup:
Provider: VMware
OCP version: 4.5.0-0.nightly-2020-08-15-052753
OCS version: 4.5.0-54.ci

sh-4.4# rook version
rook: 4.5-43.884c3eee.release_4.5
go: go1.13.4

sh-4.4# ceph versions
{
    "mon": {
        "ceph version 14.2.8-91.el8cp (75b4845da7d469665bd48d1a49badcc3677bf5cd) nautilus (stable)": 3
    },
    "mgr": {
        "ceph version 14.2.8-91.el8cp (75b4845da7d469665bd48d1a49badcc3677bf5cd) nautilus (stable)": 1
    },
    "osd": {
        "ceph version 14.2.8-91.el8cp (75b4845da7d469665bd48d1a49badcc3677bf5cd) nautilus (stable)": 2
    },
    "mds": {
        "ceph version 14.2.8-91.el8cp (75b4845da7d469665bd48d1a49badcc3677bf5cd) nautilus (stable)": 2
    },
    "rgw": {
        "ceph version 14.2.8-91.el8cp (75b4845da7d469665bd48d1a49badcc3677bf5cd) nautilus (stable)": 2
    },
    "overall": {
        "ceph version 14.2.8-91.el8cp (75b4845da7d469665bd48d1a49badcc3677bf5cd) nautilus (stable)": 10
    }
}


Test Process:
1. Get 'csi-cephfsplugin-provisioner' and 'csi-rbdplugin-provisioner' pods (openshift-storage):
$ oc get pods -n openshift-storage -o wide | grep csi-rbdplugin-provisioner
csi-rbdplugin-provisioner-8c87b76ff-8b25m                         5/5     Running                 0          7d6h   10.128.2.8     compute-0   <none>           <none>
csi-rbdplugin-provisioner-8c87b76ff-wfkhk                         5/5     Running                 0          7d6h   10.131.0.15    compute-1   <none>           <none>

$ oc get pods -n openshift-storage -o wide | grep csi-cephfsplugin-provisioner
csi-cephfsplugin-provisioner-c748c89bf-cdgpp                      5/5     Running                 0          7d6h   10.131.0.16    compute-1   <none>           <none>
csi-cephfsplugin-provisioner-c748c89bf-hhk2j                      5/5     Running                 0          7d6h   10.128.2.12    compute-0   <none>           <none>


2. Shut down nodes [compute-0, compute-1] (an illustrative govc command is shown below):
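
On vSphere the same shutdown could be done with govc instead of the vCenter UI; the VM names below assume they match the node names, which may not hold in every environment:

$ govc vm.power -off compute-0 compute-1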

3. Check node status:
$ oc get nodes
NAME              STATUS     ROLES    AGE    VERSION
compute-0         NotReady   worker   7d6h   v1.18.3+2cf11e2
compute-1         NotReady   worker   7d6h   v1.18.3+2cf11e2
compute-2         Ready      worker   7d6h   v1.18.3+2cf11e2
control-plane-0   Ready      master   7d6h   v1.18.3+2cf11e2
control-plane-1   Ready      master   7d6h   v1.18.3+2cf11e2
control-plane-2   Ready      master   7d6h   v1.18.3+2cf11e2

4. Wait 10 minutes.

5. Get 'csi-cephfsplugin-provisioner' and 'csi-rbdplugin-provisioner' pods (openshift-storage):
$ oc get pods -n openshift-storage -o wide | grep provisioner
csi-cephfsplugin-provisioner-c748c89bf-cdgpp                      5/5     Terminating   0          7d6h    10.131.0.16    compute-1   <none>           <none>
csi-cephfsplugin-provisioner-c748c89bf-hhk2j                      5/5     Terminating   0          7d6h    10.128.2.12    compute-0   <none>           <none>
csi-cephfsplugin-provisioner-c748c89bf-qk6mp                      0/5     Pending       0          4m8s    <none>         <none>      <none>           <none>
csi-cephfsplugin-provisioner-c748c89bf-s54rl                      5/5     Running       0          4m28s   10.129.2.69    compute-2   <none>           <none>
csi-rbdplugin-provisioner-8c87b76ff-475kk                         0/5     Pending       0          4m8s    <none>         <none>      <none>           <none>
csi-rbdplugin-provisioner-8c87b76ff-8b25m                         5/5     Terminating   0          7d6h    10.128.2.8     compute-0   <none>           <none>
csi-rbdplugin-provisioner-8c87b76ff-q4mq7                         5/5     Running       0          4m28s   10.129.2.70    compute-2   <none>           <none>
csi-rbdplugin-provisioner-8c87b76ff-wfkhk                         5/5     Terminating   0          7d6h    10.131.0.15    compute-1   <none>           <none>

6. Power up nodes [compute-0, compute-1].

7. Check node status:
$ oc get nodes
NAME              STATUS   ROLES    AGE    VERSION
compute-0         Ready    worker   7d6h   v1.18.3+2cf11e2
compute-1         Ready    worker   7d6h   v1.18.3+2cf11e2
compute-2         Ready    worker   7d6h   v1.18.3+2cf11e2
control-plane-0   Ready    master   7d7h   v1.18.3+2cf11e2
control-plane-1   Ready    master   7d7h   v1.18.3+2cf11e2
control-plane-2   Ready    master   7d7h   v1.18.3+2cf11e2

8. Get 'csi-cephfsplugin-provisioner' and 'csi-rbdplugin-provisioner' pods (openshift-storage):
The provisioner pods are no longer running on the same worker node.

$ oc get pods -n openshift-storage -o wide | grep provisioner
csi-cephfsplugin-provisioner-c748c89bf-qk6mp                      5/5     Running      0          7m42s   10.131.0.7     compute-1   <none>           <none>
csi-cephfsplugin-provisioner-c748c89bf-s54rl                      5/5     Running      0          8m2s    10.129.2.69    compute-2   <none>           <none>
csi-rbdplugin-provisioner-8c87b76ff-475kk                         5/5     Running      0          7m42s   10.128.2.4     compute-0   <none>           <none>
csi-rbdplugin-provisioner-8c87b76ff-q4mq7                         5/5     Running      0          8m2s    10.129.2.70    compute-2   <none>           <none>

Comment 11 errata-xmlrpc 2020-09-15 10:18:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Container Storage 4.5.0 bug fix and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3754

