Bug 1916585 - OCS PodDisruptionBudget redesign for OSDs to allow multiple nodes to drain in the same failure domain
Summary: OCS PodDisruptionBudget redesign for OSDs to allow multiple nodes to drain in...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: rook
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: OCS 4.6.2
Assignee: Santosh Pillai
QA Contact: akarsha
URL:
Whiteboard:
Depends On: 1861104 1915851
Blocks: 1899743
 
Reported: 2021-01-15 07:24 UTC by Mudit Agarwal
Modified: 2021-06-01 08:44 UTC
CC: 39 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, the OSD design used one blocking PodDisruptionBudget (PDB) per OSD, which meant only one node could be drained at a time. With the redesign, a single OSD PDB exists initially and allows only one OSD to go down at a time. Once an OSD goes down, its failure domain is determined, blocking OSD PDBs are created for the other failure domains, and the original OSD PDB is deleted so that all OSDs in the affected failure domain can go down. With this new design, multiple nodes can be drained in the same failure domain.
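A minimal way to observe this behaviour on a live cluster; the openshift-storage namespace, the rook-ceph-osd PDB name, and the maxUnavailable values are assumptions based on the description above and the per-zone PDB names seen in comment 9:

    # Before any drain, a single OSD PDB is expected to allow one disruption.
    oc get pdb -n openshift-storage
    oc describe pdb rook-ceph-osd -n openshift-storage   # expect Max unavailable: 1 (assumption)

    # While an OSD in one failure domain is down, blocking per-zone PDBs
    # (e.g. rook-ceph-osd-zone-<zone>, assumed maxUnavailable: 0) should exist
    # for the other failure domains, and the original rook-ceph-osd PDB should be gone.
    oc get pdb -n openshift-storage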
Clone Of: 1915851
Environment:
Last Closed: 2021-02-01 13:18:34 UTC
Embargoed:




Links
Red Hat Product Errata RHBA-2021:0305 (last updated 2021-02-01 13:18:49 UTC)

Comment 9 akarsha 2021-01-29 14:32:37 UTC
Platforms: AWS and VMware (dynamic)

Performed the following tests:

1. Simultaneous node drain of 2 worker nodes in different zones (cluster with only 3 worker nodes)
	
	Observations:
	- Drain on the first node succeeds and blocks the drain of the other node
	- OSD pods on the drained node move to the Pending state
	- The PDB list changes from rook-ceph-osd to per-zone PDBs of the form rook-ceph-osd-zone-us-east-2x
	- Once the drained node is uncordoned and the PGs become active+clean, the other drain command (which was blocked until then) continues and succeeds (a command-level sketch of this flow follows below)
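A rough command-level sketch of this scenario, assuming placeholder node names and the openshift-storage namespace (exact drain flags may vary with the oc version):

    # Terminal 1: drain a worker node in zone A
    oc adm drain <worker-node-zone-a> --ignore-daemonsets --delete-local-data --force

    # Terminal 2: drain a worker node in a different zone; this drain is expected
    # to stay blocked while the first node's OSD is down
    oc adm drain <worker-node-zone-b> --ignore-daemonsets --delete-local-data --force

    # Watch the PDB list change from rook-ceph-osd to the per-zone blocking PDBs
    oc get pdb -n openshift-storage -w

    # Recover: uncordon the first node, wait for the PGs to return to active+clean,
    # and confirm that the second drain then completes
    oc adm uncordon <worker-node-zone-a>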

2. Simultaneous node drain of 2 worker nodes in different zones (cluster with 6 worker nodes), so that one spare node is available in each zone.

	Observations:
	- Drain on the first node succeeds and blocks the drain of the other node in the other zone
	- The OSD pod hosted on the drained node moves to the other available node in the same zone
	- The PDB list changes from rook-ceph-osd to per-zone PDBs of the form rook-ceph-osd-zone-us-east-2x
	- Once the drained node is uncordoned and the PGs are active+clean, the other drain command (which was blocked until then) continues and succeeds (see the pod-placement check below)
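To verify the OSD pod movement in this scenario, something like the following can be used (app=rook-ceph-osd is the usual Rook label for OSD pods; treat it as an assumption):

    # Watch where the OSD pods land; the pod from the drained node should be
    # rescheduled onto the spare node in the same zone
    oc get pods -n openshift-storage -l app=rook-ceph-osd -o wide -w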

3. Simultaneous node drain of 2 worker nodes in the same zone (cluster with 6 worker nodes), such that each node in the zone hosts at least one OSD.

	Observations:
	- Node drains in the same zone succeed without any issue
	- OSD pods move to the Pending state since there are no extra nodes available in the same zone
	- The PDB list changes from rook-ceph-osd to per-zone PDBs of the form rook-ceph-osd-zone-us-east-2x
	- Node drains in other zones are blocked
	- Once the drained nodes are uncordoned and the PGs are active+clean, the other drain command (which was blocked until then) continues and succeeds
- Note: While performing this test, we observed that 2 mons were running in a single zone and one of the zones had no mon running. We suspect we are hitting "Bug 1861093 - MON gets rescheduled in wrong failure domain after node failure test". That BZ has been marked as a duplicate of bug 1788492, which is ON_QA in 4.7.0. Note that the scenarios in those BZs differ from what we tried here (simultaneous node drains on nodes in the same AZ). @Santosh, could you please check these BZs and confirm the mon behaviour? (A quick way to check the mon placement is sketched below.)
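One way to double-check the mon placement mentioned in the note above, assuming the usual Rook mon label and standard zone labels on the nodes:

    # Map each mon pod to a node, then map nodes to zones
    oc get pods -n openshift-storage -l app=rook-ceph-mon -o wide
    oc get nodes -L topology.kubernetes.io/zone
    # (older clusters may carry failure-domain.beta.kubernetes.io/zone instead)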

Apart from the above observation, the OSD PodDisruptionBudgets and node drains worked as expected.

Versions

OCP: 4.7.0-fc.4
OCS: ocs-operator.v4.6.2-233.ci
ceph versions
{
    "mon": {
        "ceph version 14.2.8-115.el8cp (183dfafff0de1f79fccc983d82e733fedc0e988b) nautilus (stable)": 3
    },
    "mgr": {
        "ceph version 14.2.8-115.el8cp (183dfafff0de1f79fccc983d82e733fedc0e988b) nautilus (stable)": 1
    },
    "osd": {
        "ceph version 14.2.8-115.el8cp (183dfafff0de1f79fccc983d82e733fedc0e988b) nautilus (stable)": 6
    },
    "mds": {
        "ceph version 14.2.8-115.el8cp (183dfafff0de1f79fccc983d82e733fedc0e988b) nautilus (stable)": 1
    },
    "overall": {
        "ceph version 14.2.8-115.el8cp (183dfafff0de1f79fccc983d82e733fedc0e988b) nautilus (stable)": 11
    }
}
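For reference, output like the above can be gathered from the Ceph toolbox pod, assuming the toolbox is enabled and carries the usual rook-ceph-tools label:

    TOOLS_POD=$(oc get pods -n openshift-storage -l app=rook-ceph-tools -o name | head -n 1)
    oc rsh -n openshift-storage "$TOOLS_POD" ceph versions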

Logs: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-1916585/must-gather/

Comment 10 Harish NV Rao 2021-01-29 14:41:12 UTC
Correction: In test 3 of comment 9, please replace "Bug 1861093 - MON gets rescheduled in wrong failure domain after node failure test" with "Bug 1783204 - In AWS 3 AZ setup, multiple Mon pods running in same zone when doing node failure".

Sorry for the inconvenience caused.

Comment 11 Travis Nielsen 2021-01-29 17:04:10 UTC
To summarize...
1. The PDBs worked correctly for the OSDs.
2. Two mons were observed running in the same zone, which is due to https://bugzilla.redhat.com/show_bug.cgi?id=1788492 (fixed in 4.7).

The PDB redesign only affects the OSDs, so the mon issue in item 2 is independent of this change. That issue is quite serious, so we should consider backporting the fix to 4.6.3.

Thanks for the verification!

Comment 12 akarsha 2021-01-29 17:11:29 UTC
Thanks, Travis. Based on comment 11 above and our test results, moving the BZ to the VERIFIED state.

Comment 16 errata-xmlrpc 2021-02-01 13:18:34 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Red Hat OpenShift Container Storage 4.6.2 container bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0305

