Bug 1861878

Summary: [RFE] use appropriate PDB values for OSD
Product: [Red Hat Storage] Red Hat OpenShift Container Storage Reporter: Robert Bost <rbost>
Component: rook Assignee: Santosh Pillai <sapillai>
Status: CLOSED ERRATA QA Contact: Shrivaibavi Raghaventhiran <sraghave>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 4.6 CC: aeyal, assingh, bkunal, bniver, cblum, etamir, gmeno, madam, muagarwa, nberry, ocs-bugs, sapillai, shan, sostapov, tnielsen, uchapaga
Target Milestone: --- Keywords: FutureFeature
Target Release: OCS 4.7.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-05-19 09:14:56 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1924682    
Bug Blocks:    

Description Robert Bost 2020-07-29 19:20:48 UTC
This is a fork off of the research in https://bugzilla.redhat.com/show_bug.cgi?id=1788126. It was suggested that research for fixing the PDB be done in an RFE; this is it.

Comment 5 Michael Adam 2020-09-08 14:44:40 UTC
quoting https://bugzilla.redhat.com/show_bug.cgi?id=1788126#c37:

> My suggestion is that we try to escalate the alert inhibition with the OCP architect team as they did ACK on this exception to how PDBs are used.
> 
> In the meantime, since this seems to be a major source of confusion, we should explore the possibility of getting the user to install the inhibition at install-time.
> Or at the very least somewhere in the documentation that reassures users who know how PDBs are normally supposed to be used that we're not completely crazy.

So the discussion is still continuing in that other BZ.

Comment 6 Mudit Agarwal 2020-09-28 04:04:12 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1788126 is targeted for 4.7

Comment 7 Travis Nielsen 2020-10-15 18:37:30 UTC
The design discussion is happening in https://bugzilla.redhat.com/show_bug.cgi?id=1861104

Comment 8 Sébastien Han 2020-11-26 14:34:59 UTC
Assigning to Santosh based on his recent work on PDB in Rook.
Santosh, is this doable for 4.7?

Comment 9 Santosh Pillai 2020-11-26 14:55:26 UTC
This has been merged downstream to release-4.7.

The BZ for backporting to 4.6.z is here: https://bugzilla.redhat.com/show_bug.cgi?id=1899743

Comment 10 Sébastien Han 2020-11-26 14:58:42 UTC
If this merged downstream, we should move this to MODIFIED, right?

Comment 11 Santosh Pillai 2020-12-01 05:47:11 UTC
(In reply to leseb from comment #10)
> If this merged downstream, we should move this to MODIFIED, right?

Yes, I missed moving it to MODIFIED. Thanks. (Removing needinfo.)

Comment 12 Santosh Pillai 2021-02-01 06:57:49 UTC
 
- This BZ is based on https://bugzilla.redhat.com/show_bug.cgi?id=1788126, which is about continuous OSD PDB alerts during upgrades.
- The new OSD PDB design was merged in 4.7.
- The new design changes how the blocking and non-blocking PDBs are created and cleaned up. It does not directly affect the `alert inhibition` in any way.
- The new design can be tested to see its effect on the alerts.
  (However, if a node drain is blocked because OSDs from a previous drain are not up yet or PGs are still rebalancing, we can still see the alerts.)

Comment 17 Santosh Pillai 2021-03-31 10:08:46 UTC
Changing the BZ title to `use appropriate PDB values for OSD` because the original issue was caused only by the OSD PDBs.

Comment 18 Shrivaibavi Raghaventhiran 2021-04-10 08:50:32 UTC
Test Environment:
-------------------
AWS-IPI 3W 3M

Test Steps:
-----------
1. Verify old PDB design on OCP 4.6.23 and OCS ocs-operator.v4.6.0-195.ci

$ oc get pdb -n openshift-storage
NAME                                              MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mds-ocs-storagecluster-cephfilesystem   1               N/A               1                     61m
rook-ceph-mon-pdb                                 2               N/A               1                     61m
rook-ceph-osd-0                                   N/A             0                 0                     58m
rook-ceph-osd-1                                   N/A             0                 0                     58m
rook-ceph-osd-2                                   N/A             0                 0                     58m

2. Upgrade OCP 4.6.23 to 4.7.6

3. Upgrade OCS 4.6.0-195.ci to ocs-operator.v4.7.0-344.ci
$ oc get csv -n openshift-storage
NAME                         DISPLAY                       VERSION        REPLACES                     PHASE
ocs-operator.v4.7.0-344.ci   OpenShift Container Storage   4.7.0-344.ci   ocs-operator.v4.6.0-195.ci   Succeeded

4. Verify new PDB design
$ oc get pdb -n openshift-storage
NAME                                              MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mds-ocs-storagecluster-cephfilesystem   1               N/A               1                     15h
rook-ceph-mon-pdb                                 N/A             1                 1                     15h
rook-ceph-osd                                     N/A             1                 1                     80s
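
The check in step 4 can be scripted. The following is a minimal sketch, not part of the original verification: the function name `has_default_osd_pdb` is hypothetical, and it assumes the table layout shown above, where each data row has five whitespace-separated fields and the third is MAX UNAVAILABLE.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the table on stdin contains a PDB named
# exactly "rook-ceph-osd" whose MAX UNAVAILABLE column is 1.
# Assumes the five-column layout printed by `oc get pdb` in this report.
has_default_osd_pdb() {
    awk '$1 == "rook-ceph-osd" && $3 == "1" { found = 1 }
         END { exit !found }'
}

# Usage (assumes a logged-in oc session):
#   oc get pdb -n openshift-storage | has_default_osd_pdb && echo "new PDB design in place"
```

Against the step 1 output the helper fails, because only the per-OSD PDBs (`rook-ceph-osd-0` and so on) exist there.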

5. Performed node drains (single, multiple, and simultaneous). Blocking PDBs were created and no issues were found.
$ oc get pdb -n openshift-storage
NAME                                              MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mds-ocs-storagecluster-cephfilesystem   1               N/A               1                     16h
rook-ceph-mon-pdb                                 N/A             1                 0                     16h
rook-ceph-osd-zone-us-east-2b                     N/A             0                 0                     63s
rook-ceph-osd-zone-us-east-2c                     N/A             0                 0                     63s
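
To pick out the blocking PDBs from output like the above, a small filter can help. This is a sketch under assumptions: the function name `list_blocking_pdbs` is hypothetical, "blocking" is taken to mean a MAX UNAVAILABLE value of 0, and the column positions are those of the table shown in this report.

```shell
#!/bin/sh
# Hypothetical helper: print the names of PDBs whose MAX UNAVAILABLE column
# is 0. Reads `oc get pdb` table output on stdin.
list_blocking_pdbs() {
    # NR > 1 skips the header row; field 1 is NAME, field 3 is MAX UNAVAILABLE.
    awk 'NR > 1 && $3 == "0" { print $1 }'
}

# Usage (assumes a logged-in oc session):
#   oc get pdb -n openshift-storage | list_blocking_pdbs
```

Applied to the step 5 output, this prints the two `rook-ceph-osd-zone-*` entries and skips the mds and mon PDBs.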



Based on these observations, moving the bug to the VERIFIED state.

Comment 20 errata-xmlrpc 2021-05-19 09:14:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2041