Bug 1915851
| Summary: | OCS PodDisruptionBudget redesign for OSDs to allow multiple nodes to drain in the same failure domain | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Container Storage | Reporter: | Santosh Pillai <sapillai> |
| Component: | rook | Assignee: | Santosh Pillai <sapillai> |
| Status: | CLOSED ERRATA | QA Contact: | Shrivaibavi Raghaventhiran <sraghave> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.6 | CC: | aclewett, alchan, apolak, assingh, bengland, bkunal, cblum, dahernan, dkochuka, dmoessne, dyocum, etamir, hnallurv, jelopez, jhopper, lars, madam, mgugino, muagarwa, mwasher, nberry, nelluri, ocs-bugs, owasserm, ratamir, r.martinez, rojoseph, rperiyas, shan, tnielsen, wking, ykaul |
| Target Milestone: | --- | Keywords: | AutomationBackLog, Upgrades |
| Target Release: | OCS 4.7.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | 4.7.0-185.ci | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 1861104 | | |
| : | 1916585 (view as bug list) | Environment: | |
| Last Closed: | 2021-05-19 09:18:01 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1861104, 1924682 | | |
| Bug Blocks: | 1899743, 1916585 | | |
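
For context: this redesign replaces the per-OSD PodDisruptionBudgets (`rook-ceph-osd-0/1/2`, each with `maxUnavailable: 0`) with a single OSD PDB that allows one disruption at a time, so nodes in the same failure domain can be drained sequentially. A minimal sketch of what the post-upgrade PDB could look like, reconstructed from the `oc get pdb` output in the verification comment below; the label selector is an assumption, and the API group shown is the one served on OCP 4.7 (Kubernetes 1.20):

```yaml
# Hypothetical sketch of the single post-upgrade OSD PDB.
# Name, namespace, and maxUnavailable match the "After upgrade"
# `oc get pdb` output in this BZ; the selector is an assumption.
apiVersion: policy/v1beta1   # policy/v1 on Kubernetes >= 1.21
kind: PodDisruptionBudget
metadata:
  name: rook-ceph-osd
  namespace: openshift-storage
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: rook-ceph-osd
```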
|
Comment 2
Santosh Pillai
2021-01-13 14:49:25 UTC
@santosh, can you please update "Fixed In Version:" for this BZ?

Followed the same procedure as comment #2.

Tested Environment:
-------------------
AWS-IPI, 3M and 6W, with load

Test Steps:
-----------
1. Upgraded OCP version from 4.6.23 to 4.7.6.

2. Upgraded OCS version from ocs-operator.v4.6.0-195.ci to ocs-operator.v4.7.0-344.ci:

$ oc get csv -n openshift-storage
NAME                         DISPLAY                       VERSION        REPLACES                     PHASE
ocs-operator.v4.7.0-344.ci   OpenShift Container Storage   4.7.0-344.ci   ocs-operator.v4.6.0-195.ci   Succeeded

3. Drained multiple nodes from different zones (topology.kubernetes.io/zone=us-east-2b and topology.kubernetes.io/zone=us-east-2a); mons and OSDs started running on other nodes of the respective zones.

Initial mons and OSDs before drains:
------------------------------------
rook-ceph-mon-b-8545666cd9-b2kbf   2/2   Running   0   6h39m   10.129.3.59    ip-10-0-221-215.us-east-2.compute.internal   <none>   <none>
rook-ceph-mon-c-589dd4b76f-s5ns6   2/2   Running   0   4h33m   10.131.0.90    ip-10-0-156-145.us-east-2.compute.internal   <none>   <none>
rook-ceph-mon-e-84dc9bd6f7-b8hmz   2/2   Running   0   4h33m   10.128.2.109   ip-10-0-182-218.us-east-2.compute.internal   <none>   <none>
rook-ceph-osd-0-776d4d8487-xlpwj   2/2   Running   0   4h33m   10.131.0.89    ip-10-0-156-145.us-east-2.compute.internal   <none>   <none>
rook-ceph-osd-1-64786f9cc-kjfxl    2/2   Running   0   5h6m    10.128.2.108   ip-10-0-182-218.us-east-2.compute.internal   <none>   <none>
rook-ceph-osd-2-5b8cb477bf-pjzbq   2/2   Running   0   6h39m   10.129.3.58    ip-10-0-221-215.us-east-2.compute.internal   <none>   <none>

Mons and OSDs after drains:
---------------------------
rook-ceph-mon-b-8545666cd9-b2kbf   2/2   Running   0   6h48m   10.129.3.59   ip-10-0-221-215.us-east-2.compute.internal   <none>   <none>
rook-ceph-mon-c-589dd4b76f-c7b2t   2/2   Running   0   6m49s   10.128.4.23   ip-10-0-130-102.us-east-2.compute.internal   <none>   <none>
rook-ceph-mon-e-84dc9bd6f7-99h2l   2/2   Running   0   7m17s   10.130.2.31   ip-10-0-180-234.us-east-2.compute.internal   <none>   <none>
rook-ceph-osd-0-776d4d8487-j8pfx   2/2   Running   0   88s     10.128.4.25   ip-10-0-130-102.us-east-2.compute.internal   <none>   <none>
rook-ceph-osd-1-64786f9cc-t8js7    2/2   Running   0   7m17s   10.130.2.30   ip-10-0-180-234.us-east-2.compute.internal   <none>   <none>
rook-ceph-osd-2-5b8cb477bf-pjzbq   2/2   Running   0   6h48m   10.129.3.58   ip-10-0-221-215.us-east-2.compute.internal   <none>   <none>

Nodes:
------
$ oc get nodes --show-labels | grep ocs
ip-10-0-130-102.us-east-2.compute.internal   Ready                      worker   68m   v1.20.0+bafe72f   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.4xlarge,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2a,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-130-102,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m5.4xlarge,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2a,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2a
ip-10-0-156-145.us-east-2.compute.internal   Ready,SchedulingDisabled   worker   32h   v1.20.0+bafe72f   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.4xlarge,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2a,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-156-145,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m5.4xlarge,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2a,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2a
ip-10-0-180-234.us-east-2.compute.internal   Ready                      worker   68m   v1.20.0+bafe72f   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.4xlarge,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2b,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-180-234,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m5.4xlarge,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2b,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2b
ip-10-0-182-218.us-east-2.compute.internal   Ready,SchedulingDisabled   worker   32h   v1.20.0+bafe72f   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.4xlarge,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2b,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-182-218,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m5.4xlarge,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2b,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2b
ip-10-0-210-213.us-east-2.compute.internal   Ready                      worker   68m   v1.20.0+bafe72f   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.4xlarge,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2c,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-210-213,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m5.4xlarge,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2c,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2c
ip-10-0-221-215.us-east-2.compute.internal   Ready                      worker   32h   v1.20.0+bafe72f   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.4xlarge,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2c,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-221-215,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m5.4xlarge,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2c,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2c

4. Drain canary pods were removed post upgrade.

5. The old PDB design was removed.

Before upgrade:
$ oc get pdb -n openshift-storage
NAME                                              MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mds-ocs-storagecluster-cephfilesystem   1               N/A               1                     61m
rook-ceph-mon-pdb                                 2               N/A               1                     61m
rook-ceph-osd-0                                   N/A             0                 0                     58m
rook-ceph-osd-1                                   N/A             0                 0                     58m
rook-ceph-osd-2                                   N/A             0                 0                     58m

After upgrade:
$ oc get pdb -n openshift-storage
NAME                                              MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mds-ocs-storagecluster-cephfilesystem   1               N/A               1                     23h
rook-ceph-mon-pdb                                 N/A             1                 1                     23h
rook-ceph-osd                                     N/A             1                 1                     37m

6. Drained nodes from the same zone, topology.kubernetes.io/zone=us-east-2a (ip-10-0-156-145.us-east-2.compute.internal and ip-10-0-130-102.us-east-2.compute.internal). The drains completed and left a mon and an OSD in Pending state, as expected:

rook-ceph-mon-b-8545666cd9-b2kbf   2/2   Running   0   7h29m   10.129.3.59   ip-10-0-221-215.us-east-2.compute.internal   <none>   <none>
rook-ceph-mon-c-589dd4b76f-6mcnn   0/2   Pending   0   108s    <none>        <none>                                       <none>   <none>
rook-ceph-mon-e-84dc9bd6f7-99h2l   2/2   Running   0   48m     10.130.2.31   ip-10-0-180-234.us-east-2.compute.internal   <none>   <none>
rook-ceph-osd-0-776d4d8487-s5lxz   0/2   Pending   0   108s    <none>        <none>                                       <none>   <none>
rook-ceph-osd-1-64786f9cc-t8js7    2/2   Running   0   48m     10.130.2.30   ip-10-0-180-234.us-east-2.compute.internal   <none>   <none>
rook-ceph-osd-2-5b8cb477bf-pjzbq   2/2   Running   0   7h29m   10.129.3.58   ip-10-0-221-215.us-east-2.compute.internal   <none>

7. Recovered the cluster; all pods were running fine.

Based on the above observations, moving the bug to the verified state.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2041
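
For anyone re-running this verification, the drain and recovery steps above can be sketched with standard `oc` commands. This is a command sketch, not the exact QA script; node names are the ones from the logs above and must be adjusted for your cluster:

```
# Drain two nodes in different zones (us-east-2a and us-east-2b);
# with the redesigned single OSD PDB, both drains should complete
# and the displaced mon/OSD pods reschedule within their zones.
oc adm drain ip-10-0-156-145.us-east-2.compute.internal \
    --ignore-daemonsets --delete-emptydir-data --force
oc adm drain ip-10-0-182-218.us-east-2.compute.internal \
    --ignore-daemonsets --delete-emptydir-data --force

# Watch the mon/OSD pods move to other nodes in the same zone.
oc get pods -n openshift-storage -o wide | grep -E 'rook-ceph-(mon|osd)'

# Confirm the PDB state (post-upgrade there should be a single
# rook-ceph-osd PDB with MAX UNAVAILABLE 1, no per-OSD PDBs).
oc get pdb -n openshift-storage

# Recover: make the drained nodes schedulable again.
oc adm uncordon ip-10-0-156-145.us-east-2.compute.internal
oc adm uncordon ip-10-0-182-218.us-east-2.compute.internal
```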