Platforms: AWS and VMware (dynamic)

Performed the following tests:

1. Simultaneous node drain on 2 worker nodes in different zones (cluster with only 3 worker nodes).

Observations:
- The drain on the first node succeeds and blocks the drain on the other node.
- On the drained node, the OSD pods move to Pending state.
- The PDB list changes to zone-scoped PDBs (rook-ceph-osd-zone-us-east-2x) instead of the single rook-ceph-osd PDB.
- Once we uncordon the drained node and the PGs become active+clean, the previously blocked drain command continues and succeeds.

2. Simultaneous node drain on 2 worker nodes in different zones (6 worker nodes), with one spare node available in each zone.

Observations:
- The drain on the first node succeeds and blocks the drain on the node in the other zone.
- The OSD pods hosted on the drained node move to the spare node available in the same zone.
- The PDB list changes to zone-scoped PDBs (rook-ceph-osd-zone-us-east-2x) instead of the single rook-ceph-osd PDB.
- Once we uncordon the drained node and the PGs are active+clean, the previously blocked drain command continues and succeeds.

3. Simultaneous node drain on 2 worker nodes in the same zone (6 worker nodes), such that each node in the zone hosts at least one OSD.

Observations:
- The drains of the nodes in the same zone succeed without any issue.
- The OSD pods move to Pending state since there are no extra nodes available in that zone.
- The PDB list changes to zone-scoped PDBs (rook-ceph-osd-zone-us-east-2x) instead of the single rook-ceph-osd PDB.
- Drains of nodes in the other zones are blocked.
- Once we uncordon the drained nodes and the PGs are active+clean, the previously blocked drain command continues and succeeds.
- Note: While performing this test, we observed that 2 mons were running in a single zone and one zone had no mon running. We suspect we are hitting "Bug 1861093 - MON gets rescheduled in wrong failure domain after node failure test". That BZ was marked as a duplicate of bug 1788492, which is ON_QA in 4.7.0. Please note that the scenarios described in those BZs differ from what we tried (simultaneous node drains on nodes in the same AZ). @Santosh, could you please check these BZs and confirm the mon behaviour?

Apart from the above observation, the OSD PodDisruptionBudget and drain behaviour worked as expected. (A sketch of the drain and verification commands used in these tests is included at the end of this comment.)

Versions:
OCP: 4.7.0-fc.4
OCS: ocs-operator.v4.6.2-233.ci

ceph versions
{
    "mon": {
        "ceph version 14.2.8-115.el8cp (183dfafff0de1f79fccc983d82e733fedc0e988b) nautilus (stable)": 3
    },
    "mgr": {
        "ceph version 14.2.8-115.el8cp (183dfafff0de1f79fccc983d82e733fedc0e988b) nautilus (stable)": 1
    },
    "osd": {
        "ceph version 14.2.8-115.el8cp (183dfafff0de1f79fccc983d82e733fedc0e988b) nautilus (stable)": 6
    },
    "mds": {
        "ceph version 14.2.8-115.el8cp (183dfafff0de1f79fccc983d82e733fedc0e988b) nautilus (stable)": 1
    },
    "overall": {
        "ceph version 14.2.8-115.el8cp (183dfafff0de1f79fccc983d82e733fedc0e988b) nautilus (stable)": 11
    }
}

Logs: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-1916585/must-gather/
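For reference, a minimal sketch of one drain iteration as run in the tests above. The node names are placeholders, the openshift-storage namespace and the rook-ceph-tools toolbox deployment are assumptions based on a default OCS install, and the exact drain flags may vary with the oc version:

    # Drain two worker nodes in parallel; the second drain is expected to block
    # until the OSD PodDisruptionBudgets allow another disruption.
    oc adm drain <node-1> --ignore-daemonsets --delete-local-data --force &
    oc adm drain <node-2> --ignore-daemonsets --delete-local-data --force &

    # Watch the OSD PDBs: the single rook-ceph-osd PDB should be replaced by
    # zone-scoped rook-ceph-osd-zone-* PDBs while the drain is in progress.
    oc get pdb -n openshift-storage -w

    # Check PG state from the toolbox pod and wait for all PGs to be active+clean.
    oc -n openshift-storage exec deploy/rook-ceph-tools -- ceph status

    # Uncordon the drained node so that the blocked drain can proceed.
    oc adm uncordon <node-1>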
Correction: In test 3 of comment 9, please replace "Bug 1861093 - MON gets rescheduled in wrong failure domain after node failure test" with "Bug 1783204 - In AWS 3 AZ setup, multiple Mon pods running in same zone when doing node failure". Sorry for the inconvenience caused.
To summarize...
1. The PDBs worked correctly for the OSDs.
2. Two mons were observed running in the same zone, which is due to https://bugzilla.redhat.com/show_bug.cgi?id=1788492 (fixed in 4.7).

The PDB redesign only affects the OSDs, so the mon issue in item 2 is independent of that change. That issue is quite serious, so we should consider backporting the fix for 4.6.3.

Thanks for the verification!
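As a side note, a quick way to confirm the mon distribution across zones during such a test (a sketch; it assumes the default openshift-storage namespace, the rook-ceph-mon app label, and the standard topology zone label on the nodes):

    # List the mon pods together with the nodes they are scheduled on.
    oc get pods -n openshift-storage -l app=rook-ceph-mon -o wide

    # Show the zone label of each node to see how the mons are spread across zones.
    oc get nodes -L topology.kubernetes.io/zone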
Thanks Travis. Based on comment 11 above and our test results, moving the BZ to VERIFIED state.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Container Storage 4.6.2 container bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:0305