Description of problem (please be as detailed as possible and provide log snippets):
I'm running OCS 4.5 on an OCP 4.5.13 baremetal cluster. I tried upgrading to 4.5.14, but the upgrade got stuck because of a failure to drain some of the worker nodes. It turns out this was due to the rook-ceph-osd pods. For example:
I1019 19:59:18.058276 2555207 daemon.go:320] evicting pod openshift-storage/rook-ceph-osd-4-668448bd77-jdjjj
I1019 19:59:18.058586 2555207 daemon.go:320] evicting pod openshift-storage/rook-ceph-osd-3-5cc7d9b9bc-s6hrx
I1019 19:59:18.058822 2555207 daemon.go:320] evicting pod openshift-storage/rook-ceph-osd-5-dc7967dbc-h7jz2
E1019 19:59:18.065127 2555207 daemon.go:320] error when evicting pod "rook-ceph-osd-5-dc7967dbc-h7jz2" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
E1019 19:59:18.065679 2555207 daemon.go:320] error when evicting pod "rook-ceph-osd-4-668448bd77-jdjjj" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
E1019 19:59:18.065830 2555207 daemon.go:320] error when evicting pod "rook-ceph-osd-3-5cc7d9b9bc-s6hrx" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
Looking at the PodDisruptionBudgets, I see:
$ oc -n openshift-storage get poddisruptionbudget | grep rook-ceph-osd
rook-ceph-osd-0 N/A 0 0 9d
rook-ceph-osd-1 N/A 0 0 9d
rook-ceph-osd-2 N/A 0 0 9d
rook-ceph-osd-3 N/A 0 0 9d
rook-ceph-osd-4 N/A 0 0 9d
rook-ceph-osd-5 N/A 0 0 9d
rook-ceph-osd-6 N/A 0 0 9d
rook-ceph-osd-7 N/A 0 0 9d
rook-ceph-osd-8 N/A 0 0 9d
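For reference, each of these PDBs appears to be shaped roughly like the following sketch (reconstructed from the listing above; the exact selector labels are an assumption, check with `oc -n openshift-storage get pdb rook-ceph-osd-0 -o yaml`). With maxUnavailable set to 0, ALLOWED DISRUPTIONS is always 0, so the eviction API refuses every voluntary eviction of the matching pod:

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: rook-ceph-osd-0
  namespace: openshift-storage
spec:
  maxUnavailable: 0        # no voluntary disruption ever allowed
  selector:
    matchLabels:           # label keys below are assumed, not confirmed
      app: rook-ceph-osd
      ceph-osd-id: "0"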
If I understand correctly how this works, it means we can never update the cluster (or even apply a MachineConfig object), because these pods will prevent the nodes from draining. What is the recommended way to deal with this?
Version of all relevant components (if applicable):
OCP 4.5.13 baremetal (IPI)
Is there any workaround available to the best of your knowledge?
I modified the PodDisruptionBudget for these pods to set maxUnavailable to 1. I don't know if this was correct or appropriate; the upgrade is still running, so I don't yet know whether everything will come back up correctly.
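For anyone hitting the same drain deadlock, the edit was roughly this (a sketch of the workaround I applied, not an endorsed procedure; repeat per stuck OSD's PDB, and note the operator may reconcile the change back):

$ oc -n openshift-storage patch poddisruptionbudget rook-ceph-osd-4 \
    --type merge -p '{"spec":{"maxUnavailable":1}}'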
Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
Is this issue reproducible?
Can this issue reproduce from the UI?
PDBs are working as intended, so this is not a bug. Since this is a request for a workaround and not an immediate work item for OCS 4.6, moving to OCS 4.7.
I believe there is another RFE that is tracking the work to fully fix this, link will be provided shortly.
I kind of think that "not able to upgrade the cluster when OCS is installed" is a bug. I'm not asking for a workaround. I'm asking for a permanent fix.
*** This bug has been marked as a duplicate of bug 1861104 ***