Bug 1915851 - OCS PodDisruptionBudget redesign for OSDs to allow multiple nodes to drain in the same failure domain
Summary: OCS PodDisruptionBudget redesign for OSDs to allow multiple nodes to drain in the same failure domain
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: rook
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: OCS 4.7.0
Assignee: Santosh Pillai
QA Contact: Shrivaibavi Raghaventhiran
URL:
Whiteboard:
Depends On: 1861104 1924682
Blocks: 1899743 1916585
 
Reported: 2021-01-13 14:44 UTC by Santosh Pillai
Modified: 2021-06-01 08:48 UTC (History)
CC List: 32 users

Fixed In Version: 4.7.0-185.ci
Doc Type: No Doc Update
Doc Text:
Clone Of: 1861104
Clones: 1916585
Environment:
Last Closed: 2021-05-19 09:18:01 UTC
Embargoed:



Links:
Red Hat Product Errata RHSA-2021:2041   2021-05-19 09:18:40 UTC

Comment 2 Santosh Pillai 2021-01-13 14:49:25 UTC
Some notes on testing:

1. OCS/OCP upgrades should work correctly.
2. Users should be able to drain multiple nodes in the same failure domain. Test with different failure domain types (zones, racks, etc.); see the drain sketch after this list.
3. Important: test with load.
4. The node-drain-canary pods should be removed after upgrade.
5. The old PodDisruptionBudgets for OSDs (where there was one PDB per OSD) should be removed after upgrade.
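
For item 2, a minimal sketch of draining two nodes in the same failure domain (node names are placeholders; exact drain flags may vary by client version):

$ oc adm drain <node-1-in-zone-a> --force --delete-local-data --ignore-daemonsets
$ oc adm drain <node-2-in-zone-a> --force --delete-local-data --ignore-daemonsets
$ oc get pdb -n openshift-storage   # the OSD PDB should permit disruptions within a single failure domain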

Comment 4 Harish NV Rao 2021-01-27 06:45:54 UTC
@santosh, can you please update "Fixed In Version:" for this BZ?

Comment 9 Shrivaibavi Raghaventhiran 2021-04-10 15:43:30 UTC
Followed the same procedure as outlined in comment #2.

Tested Environment:
--------------------
AWS IPI, 3 masters and 6 workers
With load

Test Steps:
------------

1. Upgraded OCP version from 4.6.23 to 4.7.6
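
The OCP upgrade in step 1 would typically be driven with the standard upgrade command (target version shown for illustration):

$ oc adm upgrade --to=4.7.6
$ oc get clusterversion   # watch until the update completes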

2. Upgraded OCS version from ocs-operator.v4.6.0-195.ci to ocs-operator.v4.7.0-344.ci
$ oc get csv -n openshift-storage
NAME                         DISPLAY                       VERSION        REPLACES                     PHASE
ocs-operator.v4.7.0-344.ci   OpenShift Container Storage   4.7.0-344.ci   ocs-operator.v4.6.0-195.ci   Succeeded

3. Drained multiple nodes from different zones (topology.kubernetes.io/zone=us-east-2b and topology.kubernetes.io/zone=us-east-2a); mons and OSDs started running on other nodes in their respective zones, as shown below.
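
The placements below can be captured with a wide pod listing along these lines (grep pattern is illustrative):

$ oc get pods -n openshift-storage -o wide | grep -E 'rook-ceph-(mon|osd)-' | grep -v prepare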

Initial mons and osds before drains
------------------------------------
rook-ceph-mon-b-8545666cd9-b2kbf                                  2/2     Running   0          6h39m   10.129.3.59    ip-10-0-221-215.us-east-2.compute.internal   <none>           <none>
rook-ceph-mon-c-589dd4b76f-s5ns6                                  2/2     Running   0          4h33m   10.131.0.90    ip-10-0-156-145.us-east-2.compute.internal   <none>           <none>
rook-ceph-mon-e-84dc9bd6f7-b8hmz                                  2/2     Running   0          4h33m   10.128.2.109   ip-10-0-182-218.us-east-2.compute.internal   <none>           <none>
rook-ceph-osd-0-776d4d8487-xlpwj                                  2/2     Running   0          4h33m   10.131.0.89    ip-10-0-156-145.us-east-2.compute.internal   <none>           <none>
rook-ceph-osd-1-64786f9cc-kjfxl                                   2/2     Running   0          5h6m    10.128.2.108   ip-10-0-182-218.us-east-2.compute.internal   <none>           <none>
rook-ceph-osd-2-5b8cb477bf-pjzbq                                  2/2     Running   0          6h39m   10.129.3.58    ip-10-0-221-215.us-east-2.compute.internal   <none>           <none>

Mons and osds after drains
--------------------------
rook-ceph-mon-b-8545666cd9-b2kbf                                  2/2     Running   0          6h48m   10.129.3.59    ip-10-0-221-215.us-east-2.compute.internal   <none>           <none>
rook-ceph-mon-c-589dd4b76f-c7b2t                                  2/2     Running   0          6m49s   10.128.4.23    ip-10-0-130-102.us-east-2.compute.internal   <none>           <none>
rook-ceph-mon-e-84dc9bd6f7-99h2l                                  2/2     Running   0          7m17s   10.130.2.31    ip-10-0-180-234.us-east-2.compute.internal   <none>           <none>
rook-ceph-osd-0-776d4d8487-j8pfx                                  2/2     Running   0          88s     10.128.4.25    ip-10-0-130-102.us-east-2.compute.internal   <none>           <none>
rook-ceph-osd-1-64786f9cc-t8js7                                   2/2     Running   0          7m17s   10.130.2.30    ip-10-0-180-234.us-east-2.compute.internal   <none>           <none>
rook-ceph-osd-2-5b8cb477bf-pjzbq                                  2/2     Running   0          6h48m   10.129.3.58    ip-10-0-221-215.us-east-2.compute.internal   <none>           <none>

Nodes:
------

$ oc get nodes --show-labels | grep ocs
ip-10-0-130-102.us-east-2.compute.internal   Ready                      worker   68m   v1.20.0+bafe72f   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.4xlarge,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2a,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-130-102,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m5.4xlarge,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2a,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2a
ip-10-0-156-145.us-east-2.compute.internal   Ready,SchedulingDisabled   worker   32h   v1.20.0+bafe72f   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.4xlarge,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2a,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-156-145,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m5.4xlarge,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2a,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2a
ip-10-0-180-234.us-east-2.compute.internal   Ready                      worker   68m   v1.20.0+bafe72f   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.4xlarge,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2b,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-180-234,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m5.4xlarge,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2b,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2b
ip-10-0-182-218.us-east-2.compute.internal   Ready,SchedulingDisabled   worker   32h   v1.20.0+bafe72f   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.4xlarge,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2b,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-182-218,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m5.4xlarge,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2b,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2b
ip-10-0-210-213.us-east-2.compute.internal   Ready                      worker   68m   v1.20.0+bafe72f   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.4xlarge,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2c,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-210-213,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m5.4xlarge,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2c,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2c
ip-10-0-221-215.us-east-2.compute.internal   Ready                      worker   32h   v1.20.0+bafe72f   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.4xlarge,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2c,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-221-215,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m5.4xlarge,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2c,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2c

4. The node-drain-canary pods were removed post upgrade; see the check below.
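
Assuming the canary deployments carry the usual app=rook-ceph-drain-canary label, their removal can be confirmed with:

$ oc get pods -n openshift-storage -l app=rook-ceph-drain-canary   # expect: No resources found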

5. The old per-OSD PDB design was removed and replaced with a single OSD PDB:

Before upgrade:
$ oc get pdb -n openshift-storage
NAME                                              MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mds-ocs-storagecluster-cephfilesystem   1               N/A               1                     61m
rook-ceph-mon-pdb                                 2               N/A               1                     61m
rook-ceph-osd-0                                   N/A             0                 0                     58m
rook-ceph-osd-1                                   N/A             0                 0                     58m
rook-ceph-osd-2                                   N/A             0                 0                     58m

After upgrade:
$ oc get pdb -n openshift-storage
NAME                                              MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mds-ocs-storagecluster-cephfilesystem   1               N/A               1                     23h
rook-ceph-mon-pdb                                 N/A             1                 1                     23h
rook-ceph-osd                                     N/A             1                 1                     37m
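
The consolidated OSD PDB can also be inspected directly to confirm the new maxUnavailable-based design:

$ oc get pdb rook-ceph-osd -n openshift-storage -o jsonpath='{.spec.maxUnavailable}'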

6. Drained nodes from the same zone (topology.kubernetes.io/zone=us-east-2a): ip-10-0-156-145.us-east-2.compute.internal and ip-10-0-130-102.us-east-2.compute.internal. The drain completed and left one mon and one OSD in Pending state, as expected:

rook-ceph-mon-b-8545666cd9-b2kbf                                  2/2     Running   0          7h29m   10.129.3.59    ip-10-0-221-215.us-east-2.compute.internal   <none>           <none>
rook-ceph-mon-c-589dd4b76f-6mcnn                                  0/2     Pending   0          108s    <none>         <none>                                       <none>           <none>
rook-ceph-mon-e-84dc9bd6f7-99h2l                                  2/2     Running   0          48m     10.130.2.31    ip-10-0-180-234.us-east-2.compute.internal   <none>           <none>
rook-ceph-osd-0-776d4d8487-s5lxz                                  0/2     Pending   0          108s    <none>         <none>                                       <none>           <none>
rook-ceph-osd-1-64786f9cc-t8js7                                   2/2     Running   0          48m     10.130.2.30    ip-10-0-180-234.us-east-2.compute.internal   <none>           <none>
rook-ceph-osd-2-5b8cb477bf-pjzbq                                  2/2     Running   0          7h29m   10.129.3.58    ip-10-0-221-215.us-east-2.compute.internal   <none>           <none>
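
That the displaced pods stay Pending (rather than being rescheduled into another zone) can be double-checked with a field selector:

$ oc get pods -n openshift-storage --field-selector=status.phase=Pending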

7. Recovered the cluster (uncordoned the drained nodes); all pods returned to Running. See the sketch below.
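
A sketch of the recovery, using the two drained nodes from step 6:

$ oc adm uncordon ip-10-0-156-145.us-east-2.compute.internal
$ oc adm uncordon ip-10-0-130-102.us-east-2.compute.internal
$ oc get pods -n openshift-storage | grep -vE 'Running|Completed'   # expect no remaining pods listed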

Based on the above observations, moving the bug to VERIFIED.

Comment 11 errata-xmlrpc 2021-05-19 09:18:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2041

