This is a fork off of the research in https://bugzilla.redhat.com/show_bug.cgi?id=1788126. It was suggested that research for fixing the PDB be done in an RFE; this is it.
Quoting https://bugzilla.redhat.com/show_bug.cgi?id=1788126#c37:

> My suggestion is that we try to escalate the alert inhibition with the OCP architect team as they did ACK on this exception to how PDBs are used.
>
> In the meantime, since this seems to be a major source of confusion, we should explore the possibility of getting the user to install the inhibition at install-time.
> Or at the very least somewhere in the documentation that reassures users who know how PDBs are normally supposed to be used that we're not completely crazy.

So the discussion is still continuing in that other BZ.
https://bugzilla.redhat.com/show_bug.cgi?id=1788126 is targeted for 4.7
The design discussion is happening in https://bugzilla.redhat.com/show_bug.cgi?id=1861104
Assigning to Santosh based on his recent work on PDB in Rook. Santosh, is this doable for 4.7?
This has been merged downstream to release-4.7. The BZ for backporting to 4.6.z is here: https://bugzilla.redhat.com/show_bug.cgi?id=1899743
If this merged downstream, we should move this to MODIFIED, right?
(In reply to leseb from comment #10)
> If this merged downstream, we should move this to MODIFIED, right?

Yes, I missed moving it to MODIFIED. Thanks. (Removing needinfo.)
- This BZ is based on https://bugzilla.redhat.com/show_bug.cgi?id=1788126, which is about continuous OSD PDB alerts when upgrading.
- The new OSD PDB design was merged in 4.7.
- The new design changes how the blocking and non-blocking PDBs are created and cleaned up. It does not directly affect the `alert inhibition` in any way.
- The new design can be tested to see its effect on the alerts. (However, if a node drain is blocked because OSDs from a previous drain are not yet up or PGs are still rebalancing, the alerts can still appear.)
Changing the BZ title to `use appropriate PDB values for OSD` because the original issue was caused only by the OSD PDBs.
Test Environment:
-------------------
AWS-IPI 3W 3M

Test Steps:
-----------
1. Verify the old PDB design on OCP 4.6.23 and OCS ocs-operator.v4.6.0-195.ci

$ oc get pdb -n openshift-storage
NAME                                              MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mds-ocs-storagecluster-cephfilesystem   1               N/A               1                     61m
rook-ceph-mon-pdb                                 2               N/A               1                     61m
rook-ceph-osd-0                                   N/A             0                 0                     58m
rook-ceph-osd-1                                   N/A             0                 0                     58m
rook-ceph-osd-2                                   N/A             0                 0                     58m

2. Upgrade OCP 4.6.23 to 4.7.6

3. Upgrade OCS 4.6.0-195.ci to ocs-operator.v4.7.0-344.ci

$ oc get csv -n openshift-storage
NAME                         DISPLAY                       VERSION        REPLACES                     PHASE
ocs-operator.v4.7.0-344.ci   OpenShift Container Storage   4.7.0-344.ci   ocs-operator.v4.6.0-195.ci   Succeeded

4. Verify the new PDB design

$ oc get pdb -n openshift-storage
NAME                                              MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mds-ocs-storagecluster-cephfilesystem   1               N/A               1                     15h
rook-ceph-mon-pdb                                 N/A             1                 1                     15h
rook-ceph-osd                                     N/A             1                 1                     80s

5. Performed node drains (single, multiple, and simultaneous). Blocking PDBs were created and no issues were found.

$ oc get pdb -n openshift-storage
NAME                                              MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mds-ocs-storagecluster-cephfilesystem   1               N/A               1                     16h
rook-ceph-mon-pdb                                 N/A             1                 0                     16h
rook-ceph-osd-zone-us-east-2b                     N/A             0                 0                     63s
rook-ceph-osd-zone-us-east-2c                     N/A             0                 0                     63s

Based on these observations, moving the bug to VERIFIED.
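For anyone repeating this verification, the check in steps 4 and 5 (which PDBs currently allow zero disruptions) could be scripted. The following is only a minimal sketch, assuming the default five-column `oc get pdb` text output shown in the test steps; the function names `parse_pdb_table` and `blocking_pdbs` are hypothetical helpers and not part of `oc` or Rook. The sample text is copied from step 5.

```python
from typing import Dict, List, Union

# Sample copied from step 5 of the verification (output during a node drain).
SAMPLE = """\
NAME                                              MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mds-ocs-storagecluster-cephfilesystem   1               N/A               1                     16h
rook-ceph-mon-pdb                                 N/A             1                 0                     16h
rook-ceph-osd-zone-us-east-2b                     N/A             0                 0                     63s
rook-ceph-osd-zone-us-east-2c                     N/A             0                 0                     63s
"""

Row = Dict[str, Union[str, int]]


def parse_pdb_table(text: str) -> List[Row]:
    """Parse the plain-text table into one dict per PDB.

    Assumption: every data row has exactly five whitespace-separated
    fields (name, min available, max unavailable, allowed disruptions,
    age), as in the output above. Rows that do not match are skipped.
    """
    rows: List[Row] = []
    for line in text.strip().splitlines()[1:]:  # skip the header line
        fields = line.split()
        if len(fields) != 5:
            continue
        name, min_avail, max_unavail, allowed, age = fields
        rows.append(
            {
                "name": name,
                "min_available": min_avail,      # kept as text; may be "N/A"
                "max_unavailable": max_unavail,  # kept as text; may be "N/A"
                "allowed_disruptions": int(allowed),
                "age": age,
            }
        )
    return rows


def blocking_pdbs(rows: List[Row]) -> List[str]:
    """Return names of PDBs that currently allow no disruptions."""
    return [str(r["name"]) for r in rows if r["allowed_disruptions"] == 0]


print(blocking_pdbs(parse_pdb_table(SAMPLE)))
```

With the step 5 sample, this lists `rook-ceph-mon-pdb` and the two per-zone OSD PDBs. With the step 4 output it would return an empty list, because the single `rook-ceph-osd` PDB allows one disruption.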
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2041