Description of problem (please be as detailed as possible and provide log snippets):
I'm running OCS 4.5 on an OCP 4.5.13 baremetal cluster. I tried upgrading to 4.5.14, but the upgrade got stuck because of a failure to drain some of the worker nodes. It turns out this was due to the rook-ceph-osd pods. For example:
I1019 19:59:18.058276 2555207 daemon.go:320] evicting pod openshift-storage/rook-ceph-osd-4-668448bd77-jdjjj
I1019 19:59:18.058586 2555207 daemon.go:320] evicting pod openshift-storage/rook-ceph-osd-3-5cc7d9b9bc-s6hrx
I1019 19:59:18.058822 2555207 daemon.go:320] evicting pod openshift-storage/rook-ceph-osd-5-dc7967dbc-h7jz2
E1019 19:59:18.065127 2555207 daemon.go:320] error when evicting pod "rook-ceph-osd-5-dc7967dbc-h7jz2" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
E1019 19:59:18.065679 2555207 daemon.go:320] error when evicting pod "rook-ceph-osd-4-668448bd77-jdjjj" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
E1019 19:59:18.065830 2555207 daemon.go:320] error when evicting pod "rook-ceph-osd-3-5cc7d9b9bc-s6hrx" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
Looking at the PodDisruptionBudgets, I see:
$ oc -n openshift-storage get poddisruptionbudget | grep rook-ceph-osd
rook-ceph-osd-0 N/A 0 0 9d
rook-ceph-osd-1 N/A 0 0 9d
rook-ceph-osd-2 N/A 0 0 9d
rook-ceph-osd-3 N/A 0 0 9d
rook-ceph-osd-4 N/A 0 0 9d
rook-ceph-osd-5 N/A 0 0 9d
rook-ceph-osd-6 N/A 0 0 9d
rook-ceph-osd-7 N/A 0 0 9d
rook-ceph-osd-8 N/A 0 0 9d
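For reference, each of these PDBs appears to be shaped roughly like the following sketch (reconstructed from the listing above; the exact selector labels are an assumption, check with `oc -n openshift-storage get pdb rook-ceph-osd-0 -o yaml`). With maxUnavailable set to 0, ALLOWED DISRUPTIONS is always 0, so the eviction API refuses every voluntary eviction of the matching pod:

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: rook-ceph-osd-0
  namespace: openshift-storage
spec:
  maxUnavailable: 0        # no voluntary disruption ever allowed
  selector:
    matchLabels:           # label keys below are assumed, not confirmed
      app: rook-ceph-osd
      ceph-osd-id: "0"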
If I understand correctly how this works, it means we can never update the cluster (or even apply a MachineConfig object), because these pods will prevent the nodes from draining. What is the recommended way to deal with this?
Version of all relevant components (if applicable):
OCP 4.5.13 baremetal (IPI)
Is there any workaround available to the best of your knowledge?
I modified the PodDisruptionBudget for these pods to set maxUnavailable to 1. I don't know if this was correct or appropriate; the upgrade is still running, so I don't yet know whether everything will come back up correctly.
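For anyone hitting the same drain deadlock, the edit was roughly this (a sketch of the workaround I applied, not an endorsed procedure; repeat per stuck OSD's PDB, and note the operator may reconcile the change back):

$ oc -n openshift-storage patch poddisruptionbudget rook-ceph-osd-4 \
    --type merge -p '{"spec":{"maxUnavailable":1}}'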
Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
Is this issue reproducible?
Can this issue reproduce from the UI?
PDBs are working as intended, so this is not a bug. Since this is a request for a workaround and not an immediate work item for OCS 4.6, moving to OCS 4.7.
I believe there is another RFE that is tracking the work to fully fix this, link will be provided shortly.
I kind of think that "not able to upgrade the cluster when OCS is installed" is a bug. I'm not asking for a workaround. I'm asking for a permanent fix.
*** This bug has been marked as a duplicate of bug 1861104 ***