1889518 – Unable to update cluster running OCS due to PodDisruptionBudget on ceph-rook-osd pods

Keywords:
Status:	CLOSED DUPLICATE of bug 1861104
Alias:	None
Product:	Red Hat OpenShift Container Storage
Classification:	Red Hat Storage
Component:	unclassified
Sub Component:
Version:	4.5
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Target Release:	---
Assignee:	Michael Adam
QA Contact:	Raz Tamir
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-10-19 20:14 UTC by Lars Kellogg-Stedman
Modified:	2020-10-21 12:18 UTC
CC List:	5 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-10-21 12:18:54 UTC
Embargoed:
Dependent Products:

Description Lars Kellogg-Stedman 2020-10-19 20:14:01 UTC

Description of problem (please be as detailed as possible and provide log
snippets):

I'm running OCS 4.5 on an OCP 4.5.13 baremetal cluster. I tried upgrading to 4.5.14, but the upgrade got stuck because some of the worker nodes failed to drain. It turns out this was due to the rook-ceph-osd pods. For example:


I1019 19:59:18.058276 2555207 daemon.go:320] evicting pod openshift-storage/rook-ceph-osd-4-668448bd77-jdjjj
I1019 19:59:18.058586 2555207 daemon.go:320] evicting pod openshift-storage/rook-ceph-osd-3-5cc7d9b9bc-s6hrx
I1019 19:59:18.058822 2555207 daemon.go:320] evicting pod openshift-storage/rook-ceph-osd-5-dc7967dbc-h7jz2
E1019 19:59:18.065127 2555207 daemon.go:320] error when evicting pod "rook-ceph-osd-5-dc7967dbc-h7jz2" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
E1019 19:59:18.065679 2555207 daemon.go:320] error when evicting pod "rook-ceph-osd-4-668448bd77-jdjjj" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
E1019 19:59:18.065830 2555207 daemon.go:320] error when evicting pod "rook-ceph-osd-3-5cc7d9b9bc-s6hrx" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.

Looking at the PodDisruptionBudget, I see:


$ oc -n openshift-storage get poddisruptionbudget | grep rook-ceph-osd
rook-ceph-osd-0                                   N/A             0                 0                     9d
rook-ceph-osd-1                                   N/A             0                 0                     9d
rook-ceph-osd-2                                   N/A             0                 0                     9d
rook-ceph-osd-3                                   N/A             0                 0                     9d
rook-ceph-osd-4                                   N/A             0                 0                     9d
rook-ceph-osd-5                                   N/A             0                 0                     9d
rook-ceph-osd-6                                   N/A             0                 0                     9d
rook-ceph-osd-7                                   N/A             0                 0                     9d
rook-ceph-osd-8                                   N/A             0                 0                     9d
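
Because the grep drops the header row, it may help to note that the default columns of `oc get poddisruptionbudget` are NAME, MIN AVAILABLE, MAX UNAVAILABLE, ALLOWED DISRUPTIONS and AGE. Every OSD budget in the listing above therefore has MAX UNAVAILABLE 0 and ALLOWED DISRUPTIONS 0. The following is a minimal sketch for listing such budgets; the helper name `blocking_pdbs` is hypothetical, and it assumes that default five-column layout.

```shell
# Hypothetical helper: print the names of PDBs that allow zero disruptions.
# Reads lines in the default `oc get pdb --no-headers` layout on stdin:
#   NAME  MIN-AVAILABLE  MAX-UNAVAILABLE  ALLOWED-DISRUPTIONS  AGE
blocking_pdbs() {
  # Field 4 is ALLOWED DISRUPTIONS; field 1 is the PDB name.
  awk '$4 == 0 { print $1 }'
}

# Usage (assumes a logged-in `oc` session):
#   oc -n openshift-storage get poddisruptionbudget --no-headers | blocking_pdbs
```

Any pod covered by a budget that this prints cannot be evicted by a node drain while the budget stays at zero allowed disruptions.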


If I understand correctly how this works, it means that we can never update the cluster (or even install a MachineConfig object) because these pods will prevent the nodes from draining. What is the recommended way to deal with this?


Version of all relevant components (if applicable):

OCP 4.5.13 baremetal (IPI)
OCS 4.5.0

Is there any workaround available to the best of your knowledge?

I modified the PodDisruptionBudget for these pods to set maxUnavailable to 1. I don't know if this was correct or appropriate; the upgrade is still running so I don't know if things will come back up correctly.
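
The change described above could be expressed roughly as follows. This is only a sketch of that workaround, not a recommended fix: the helper name `pdb_patch_cmd` is hypothetical, the PDB names are taken from the listing in this report, and it assumes the PDB spec can be updated in place on this cluster. It prints the commands for review rather than running them.

```shell
# Hypothetical helper: print (do not run) a merge patch that sets
# spec.maxUnavailable to 1 on the named PDB in openshift-storage.
pdb_patch_cmd() {
  printf "oc -n openshift-storage patch poddisruptionbudget %s --type merge -p '%s'\n" \
    "$1" '{"spec":{"maxUnavailable":1}}'
}

# Print one command per OSD budget shown in this report (rook-ceph-osd-0 .. 8).
for i in 0 1 2 3 4 5 6 7 8; do
  pdb_patch_cmd "rook-ceph-osd-$i"
done
```

Whether this is safe for the Ceph cluster, and whether the change persists, is not established in this report.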

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?

4

Is this issue reproducible?

Yes

Can this issue be reproduced from the UI?

Yes

Comment 2 Jose A. Rivera 2020-10-20 14:39:25 UTC

PDBs are working as intended, so this is not a bug. Since this is a request for a workaround and not an immediate work item for OCS 4.6, moving to OCS 4.7.

I believe there is another RFE that is tracking the work to fully fix this; a link will be provided shortly.

Comment 3 Lars Kellogg-Stedman 2020-10-21 12:14:54 UTC

I kind of think that "not able to upgrade the cluster when OCS is installed" is a bug. I'm not asking for a workaround. I'm asking for a permanent fix.

Comment 4 Lars Kellogg-Stedman 2020-10-21 12:18:54 UTC


*** This bug has been marked as a duplicate of bug 1861104 ***
