1861878 – [RFE] use appropriate PDB values for OSD

Summary: [RFE] use appropriate PDB values for OSD

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenShift Container Storage
Classification:	Red Hat Storage
Component:	rook
Sub Component:
Version:	4.6
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	OCS 4.7.0
Assignee:	Santosh Pillai
QA Contact:	Shrivaibavi Raghaventhiran
Docs Contact:
URL:
Whiteboard:
Depends On:	1924682
Blocks:

Reported:	2020-07-29 19:20 UTC by Robert Bost
Modified:	2021-05-19 09:16 UTC
CC List:	16 users
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-05-19 09:14:56 UTC
Embargoed:
Dependent Products:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2021:2041	0	None	None	None	2021-05-19 09:16:07 UTC

Description Robert Bost 2020-07-29 19:20:48 UTC

This is a fork off of the research in https://bugzilla.redhat.com/show_bug.cgi?id=1788126. It was suggested that research for fixing the PDB be done in an RFE; this is it.

Comment 5 Michael Adam 2020-09-08 14:44:40 UTC

quoting https://bugzilla.redhat.com/show_bug.cgi?id=1788126#c37:

> My suggestion is that we try to escalate the alert inhibition with the OCP architect team as they did ACK on this exception to how PDBs are used.
> 
> In the meantime, since this seems to be a major source of confusion, we should explore the possibility of getting the user to install the inhibition at install-time.
> Or at the very least somewhere in the documentation that reassures users who know how PDBs are normally supposed to be used that we're not completely crazy.

So the discussion is still continuing in that other BZ.

Comment 6 Mudit Agarwal 2020-09-28 04:04:12 UTC

https://bugzilla.redhat.com/show_bug.cgi?id=1788126 is targeted for 4.7

Comment 7 Travis Nielsen 2020-10-15 18:37:30 UTC

The design discussion is happening in https://bugzilla.redhat.com/show_bug.cgi?id=1861104

Comment 8 Sébastien Han 2020-11-26 14:34:59 UTC

Assigning to Santosh based on his recent work on PDB in Rook.
Santosh, is this doable for 4.7?

Comment 9 Santosh Pillai 2020-11-26 14:55:26 UTC

This has been merged downstream to release-4.7.

The BZ for backporting to 4.6.z is here: https://bugzilla.redhat.com/show_bug.cgi?id=1899743

Comment 10 Sébastien Han 2020-11-26 14:58:42 UTC

If this was merged downstream, we should move this to MODIFIED, right?

Comment 11 Santosh Pillai 2020-12-01 05:47:11 UTC

(In reply to leseb from comment #10)
> If this merged downstream, we should move this to MODIFIED, right?

Yes, I missed moving it to modified. Thanks.  (removing need info)

Comment 12 Santosh Pillai 2021-02-01 06:57:49 UTC

 
- This BZ is based on https://bugzilla.redhat.com/show_bug.cgi?id=1788126, which is about continuous OSD PDB alerts during upgrades.
- The new OSD PDB design was merged in 4.7.
- The new design changes how the blocking and non-blocking PDBs are created and cleaned up. It does not directly affect the `alert inhibition` in any way.
- The new design can be tested to see its effect on the alerts.
  (However, if a node drain is blocked because OSDs from a previous drain are not up yet or PGs are still rebalancing, we can still see the alerts.)
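
For readers less familiar with how a PDB blocks a drain, here is a minimal sketch of the disruption budget arithmetic for an integer maxUnavailable, as I understand the Kubernetes behaviour. The function name `allowed_disruptions` and its argument order are made up for illustration; nothing here is taken from the Rook code. The pod counts (one pod per old-style PDB, three OSD pods under the single new PDB) are assumptions for a three-OSD cluster.

```shell
# allowed_disruptions EXPECTED HEALTHY MAX_UNAVAILABLE
# Hypothetical helper (not part of Rook or oc). It mirrors the budget
# arithmetic for an integer maxUnavailable:
#   desired healthy     = expected pods - maxUnavailable   (floored at 0)
#   allowed disruptions = healthy pods  - desired healthy  (floored at 0)
allowed_disruptions() {
  expected=$1
  healthy=$2
  max_unavailable=$3

  desired=$((expected - max_unavailable))
  if [ "$desired" -lt 0 ]; then
    desired=0
  fi

  allowed=$((healthy - desired))
  if [ "$allowed" -lt 0 ]; then
    allowed=0
  fi

  echo "$allowed"
}

# One PDB per OSD with maxUnavailable=0: eviction is never allowed.
allowed_disruptions 1 1 0   # prints 0

# One PDB covering three healthy OSDs with maxUnavailable=1.
allowed_disruptions 3 3 1   # prints 1

# Same PDB while one OSD is already down: further evictions are blocked.
allowed_disruptions 3 2 1   # prints 0
```

A PDB with maxUnavailable=0 therefore always reports 0 allowed disruptions, which is what makes it "blocking".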

Comment 17 Santosh Pillai 2021-03-31 10:08:46 UTC

Changing the BZ title to `use appropriate PDB values for OSD` because the original issue was caused only by the OSD PDBs.

Comment 18 Shrivaibavi Raghaventhiran 2021-04-10 08:50:32 UTC

Test Environment:
-------------------
AWS-IPI 3W 3M

Test Steps:
-----------
1. Verify old PDB design on OCP 4.6.23 and OCS ocs-operator.v4.6.0-195.ci

$ oc get pdb -n openshift-storage
NAME                                              MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mds-ocs-storagecluster-cephfilesystem   1               N/A               1                     61m
rook-ceph-mon-pdb                                 2               N/A               1                     61m
rook-ceph-osd-0                                   N/A             0                 0                     58m
rook-ceph-osd-1                                   N/A             0                 0                     58m
rook-ceph-osd-2                                   N/A             0                 0                     58m

2. Upgrade OCP 4.6.23 to 4.7.6

3. Upgrade OCS 4.6.0-195.ci to ocs-operator.v4.7.0-344.ci
$ oc get csv -n openshift-storage
NAME                         DISPLAY                       VERSION        REPLACES                     PHASE
ocs-operator.v4.7.0-344.ci   OpenShift Container Storage   4.7.0-344.ci   ocs-operator.v4.6.0-195.ci   Succeeded

4. Verify new PDB design
$ oc get pdb -n openshift-storage
NAME                                              MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mds-ocs-storagecluster-cephfilesystem   1               N/A               1                     15h
rook-ceph-mon-pdb                                 N/A             1                 1                     15h
rook-ceph-osd                                     N/A             1                 1                     80s

5. Performed node drains (single, multiple, and simultaneous). Blocking PDBs were created and no issues were found.
$ oc get pdb -n openshift-storage
NAME                                              MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mds-ocs-storagecluster-cephfilesystem   1               N/A               1                     16h
rook-ceph-mon-pdb                                 N/A             1                 0                     16h
rook-ceph-osd-zone-us-east-2b                     N/A             0                 0                     63s
rook-ceph-osd-zone-us-east-2c                     N/A             0                 0                     63s
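
The check in step 5 can be scripted rather than read off by eye. The sketch below is only an illustration: `blocking_pdbs` is a hypothetical helper name, and it assumes the five-column `oc get pdb` layout shown above, where every data row has exactly five whitespace-separated fields. It prints the name of each PDB that currently allows 0 disruptions.

```shell
# blocking_pdbs: read `oc get pdb` table output on stdin and print the NAME
# of every PDB whose ALLOWED DISRUPTIONS value is 0.
# Hypothetical helper; assumes data rows of exactly five fields:
#   NAME, MIN AVAILABLE, MAX UNAVAILABLE, ALLOWED DISRUPTIONS, AGE
# The header row is skipped (NR > 1).
blocking_pdbs() {
  awk 'NR > 1 && NF == 5 && $4 == "0" { print $1 }'
}

# Demo on the output captured during the drain in step 5.
blocking_pdbs <<'EOF'
NAME                                              MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mds-ocs-storagecluster-cephfilesystem   1               N/A               1                     16h
rook-ceph-mon-pdb                                 N/A             1                 0                     16h
rook-ceph-osd-zone-us-east-2b                     N/A             0                 0                     63s
rook-ceph-osd-zone-us-east-2c                     N/A             0                 0                     63s
EOF
```

Against a live cluster the same function would be used as `oc get pdb -n openshift-storage | blocking_pdbs`. Note that it also lists rook-ceph-mon-pdb here, since that PDB reports 0 allowed disruptions during the drain even though its MAX UNAVAILABLE is 1.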



Based on these observations, moving the bug to the VERIFIED state.

Comment 20 errata-xmlrpc 2021-05-19 09:14:56 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2041
