Some notes on testing:
1. OCS/OCP upgrades should work correctly.
2. The user should be able to drain multiple nodes in the same failure domain. Should be tested with different failure domains (zones, racks, etc.).
3. Important: test with load.
4. The node-drain-canary pods should be removed after upgrade.
5. The old PodDisruptionBudgets for OSDs (where we had one PDB for each OSD) should be removed after upgrade.
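A minimal shell sketch of note 2, draining every OCS node in one failure domain. This is an illustration only: the label selector and drain flags are assumptions based on common `oc adm drain` usage (kubectl/oc 1.20+), not taken from this report.

```shell
#!/usr/bin/env bash
# Hedged sketch: drain all OCS-labelled worker nodes in a single zone.
# The selector and flags below are assumptions; adjust for your cluster.
drain_nodes_in_zone() {
  local zone="$1"
  local selector="topology.kubernetes.io/zone=${zone},cluster.ocs.openshift.io/openshift-storage="
  # Drain each matching node; --ignore-daemonsets is needed because
  # daemonset pods (e.g. CSI drivers) cannot be evicted.
  for node in $(oc get nodes -l "$selector" -o name); do
    oc adm drain "$node" --ignore-daemonsets --delete-emptydir-data --force
  done
}

# Against a live cluster: drain_nodes_in_zone us-east-2a
```

After the drain, the mons and OSDs from those nodes should be rescheduled onto other nodes within the same zone, as verified in the steps below.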
@santosh, can you please update "Fixed In Version:" for this BZ?
Followed the same procedure as comment #2.

Tested Environment:
--------------------
AWS-IPI, 3 masters and 6 workers, with load

Test Steps:
------------
1. Upgraded OCP from 4.6.23 to 4.7.6.

2. Upgraded OCS from ocs-operator.v4.6.0-195.ci to ocs-operator.v4.7.0-344.ci.

$ oc get csv -n openshift-storage
NAME                         DISPLAY                       VERSION        REPLACES                     PHASE
ocs-operator.v4.7.0-344.ci   OpenShift Container Storage   4.7.0-344.ci   ocs-operator.v4.6.0-195.ci   Succeeded

3. Drained multiple nodes from different zones (topology.kubernetes.io/zone=us-east-2b and topology.kubernetes.io/zone=us-east-2a); mons and OSDs started running on other nodes of the respective zones.

Initial mons and OSDs before drains:
------------------------------------
rook-ceph-mon-b-8545666cd9-b2kbf   2/2   Running   0   6h39m   10.129.3.59    ip-10-0-221-215.us-east-2.compute.internal   <none>   <none>
rook-ceph-mon-c-589dd4b76f-s5ns6   2/2   Running   0   4h33m   10.131.0.90    ip-10-0-156-145.us-east-2.compute.internal   <none>   <none>
rook-ceph-mon-e-84dc9bd6f7-b8hmz   2/2   Running   0   4h33m   10.128.2.109   ip-10-0-182-218.us-east-2.compute.internal   <none>   <none>
rook-ceph-osd-0-776d4d8487-xlpwj   2/2   Running   0   4h33m   10.131.0.89    ip-10-0-156-145.us-east-2.compute.internal   <none>   <none>
rook-ceph-osd-1-64786f9cc-kjfxl    2/2   Running   0   5h6m    10.128.2.108   ip-10-0-182-218.us-east-2.compute.internal   <none>   <none>
rook-ceph-osd-2-5b8cb477bf-pjzbq   2/2   Running   0   6h39m   10.129.3.58    ip-10-0-221-215.us-east-2.compute.internal   <none>   <none>

Mons and OSDs after drains:
---------------------------
rook-ceph-mon-b-8545666cd9-b2kbf   2/2   Running   0   6h48m   10.129.3.59   ip-10-0-221-215.us-east-2.compute.internal   <none>   <none>
rook-ceph-mon-c-589dd4b76f-c7b2t   2/2   Running   0   6m49s   10.128.4.23   ip-10-0-130-102.us-east-2.compute.internal   <none>   <none>
rook-ceph-mon-e-84dc9bd6f7-99h2l   2/2   Running   0   7m17s   10.130.2.31   ip-10-0-180-234.us-east-2.compute.internal   <none>   <none>
rook-ceph-osd-0-776d4d8487-j8pfx   2/2   Running   0   88s     10.128.4.25   ip-10-0-130-102.us-east-2.compute.internal   <none>   <none>
rook-ceph-osd-1-64786f9cc-t8js7    2/2   Running   0   7m17s   10.130.2.30   ip-10-0-180-234.us-east-2.compute.internal   <none>   <none>
rook-ceph-osd-2-5b8cb477bf-pjzbq   2/2   Running   0   6h48m   10.129.3.58   ip-10-0-221-215.us-east-2.compute.internal   <none>   <none>

Nodes:
------
$ oc get nodes --show-labels | grep ocs
ip-10-0-130-102.us-east-2.compute.internal   Ready                      worker   68m   v1.20.0+bafe72f   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.4xlarge,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2a,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-130-102,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m5.4xlarge,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2a,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2a
ip-10-0-156-145.us-east-2.compute.internal   Ready,SchedulingDisabled   worker   32h   v1.20.0+bafe72f   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.4xlarge,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2a,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-156-145,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m5.4xlarge,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2a,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2a
ip-10-0-180-234.us-east-2.compute.internal   Ready                      worker   68m   v1.20.0+bafe72f   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.4xlarge,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2b,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-180-234,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m5.4xlarge,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2b,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2b
ip-10-0-182-218.us-east-2.compute.internal   Ready,SchedulingDisabled   worker   32h   v1.20.0+bafe72f   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.4xlarge,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2b,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-182-218,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m5.4xlarge,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2b,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2b
ip-10-0-210-213.us-east-2.compute.internal   Ready                      worker   68m   v1.20.0+bafe72f   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.4xlarge,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2c,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-210-213,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m5.4xlarge,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2c,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2c
ip-10-0-221-215.us-east-2.compute.internal   Ready                      worker   32h   v1.20.0+bafe72f   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=m5.4xlarge,beta.kubernetes.io/os=linux,cluster.ocs.openshift.io/openshift-storage=,failure-domain.beta.kubernetes.io/region=us-east-2,failure-domain.beta.kubernetes.io/zone=us-east-2c,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-10-0-221-215,kubernetes.io/os=linux,node-role.kubernetes.io/worker=,node.kubernetes.io/instance-type=m5.4xlarge,node.openshift.io/os_id=rhcos,topology.ebs.csi.aws.com/zone=us-east-2c,topology.kubernetes.io/region=us-east-2,topology.kubernetes.io/zone=us-east-2c

4. Drain canary pods were removed post upgrade.

5. The old PDB design was removed.

Before upgrade:
$ oc get pdb -n openshift-storage
NAME                                              MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mds-ocs-storagecluster-cephfilesystem   1               N/A               1                     61m
rook-ceph-mon-pdb                                 2               N/A               1                     61m
rook-ceph-osd-0                                   N/A             0                 0                     58m
rook-ceph-osd-1                                   N/A             0                 0                     58m
rook-ceph-osd-2                                   N/A             0                 0                     58m

After upgrade:
$ oc get pdb -n openshift-storage
NAME                                              MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mds-ocs-storagecluster-cephfilesystem   1               N/A               1                     23h
rook-ceph-mon-pdb                                 N/A             1                 1                     23h
rook-ceph-osd                                     N/A             1                 1                     37m

6. Drained nodes from the same zone (topology.kubernetes.io/zone=us-east-2a): the drain completed and left a mon and an OSD in Pending state, as expected (ip-10-0-156-145.us-east-2.compute.internal and ip-10-0-130-102.us-east-2.compute.internal).

rook-ceph-mon-b-8545666cd9-b2kbf   2/2   Running   0   7h29m   10.129.3.59   ip-10-0-221-215.us-east-2.compute.internal   <none>   <none>
rook-ceph-mon-c-589dd4b76f-6mcnn   0/2   Pending   0   108s    <none>        <none>                                       <none>   <none>
rook-ceph-mon-e-84dc9bd6f7-99h2l   2/2   Running   0   48m     10.130.2.31   ip-10-0-180-234.us-east-2.compute.internal   <none>   <none>
rook-ceph-osd-0-776d4d8487-s5lxz   0/2   Pending   0   108s    <none>        <none>                                       <none>   <none>
rook-ceph-osd-1-64786f9cc-t8js7    2/2   Running   0   48m     10.130.2.30   ip-10-0-180-234.us-east-2.compute.internal   <none>   <none>
rook-ceph-osd-2-5b8cb477bf-pjzbq   2/2   Running   0   7h29m   10.129.3.58   ip-10-0-221-215.us-east-2.compute.internal   <none>

7. Recovered the cluster; all pods were running fine.

Based on the above observations, moving the bug to the verified state.
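The post-upgrade checks in steps 4 and 5 and the recovery in step 7 can be sketched as shell helpers. Note this is an illustrative sketch: the function names and the `drain-canary` grep pattern are my own assumptions; only `oc adm uncordon`, the namespace, and the `rook-ceph-osd` PDB name come from the output above.

```shell
#!/usr/bin/env bash
# Hedged sketch of the post-upgrade verification and recovery done above.

verify_post_upgrade() {
  # Step 4: no drain-canary pods should remain after the upgrade.
  if oc get pods -n openshift-storage --no-headers | grep -q 'drain-canary'; then
    echo "FAIL: drain-canary pods still present" >&2
    return 1
  fi
  # Step 5: the per-OSD PDBs should be replaced by one rook-ceph-osd PDB.
  oc get pdb rook-ceph-osd -n openshift-storage >/dev/null || return 1
  ! oc get pdb rook-ceph-osd-0 -n openshift-storage >/dev/null 2>&1
}

recover_cluster() {
  # Step 7: uncordon the drained nodes (left in SchedulingDisabled) so
  # the Pending mon and OSD pods can be scheduled again.
  for node in "$@"; do
    oc adm uncordon "$node"
  done
}
```

Against a live cluster: `recover_cluster ip-10-0-156-145.us-east-2.compute.internal ip-10-0-130-102.us-east-2.compute.internal`, then `verify_post_upgrade`.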
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2041