Bug 2222606

Summary: Changes on MCP that require node drain are stuck due to ODF PDBs
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Ram Lavi <ralavi>
Component: rook
Assignee: Santosh Pillai <sapillai>
Status: NEW
QA Contact: Neha Berry <nberry>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 4.14
CC: odf-bz-bot, sapillai, tnielsen
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Ram Lavi 2023-07-13 09:13:48 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
On a 4.14 OCP cluster with more than one node, after adding or updating an MCP that applies to one node, the MCP update cycle (in which the node is cordoned, drained, rebooted, etc.) is stuck indefinitely because of the ODF PDBs in the openshift-storage namespace.
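
For reference, the PDBs blocking the eviction can be inspected with, for example:
oc get pdb -n openshift-storage
oc describe pdb -n openshift-storage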

Version of all relevant components (if applicable):
OCP 4.14

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
There is a workaround, so no.

Is there any workaround available to the best of your knowledge?
Working workaround: remove all the PDBs in the openshift-storage namespace:
oc delete -n openshift-storage pdb --all
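
After deleting the PDBs, the drain should resume; this can be watched with, for example:
oc get mcp -w
oc get nodes -w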

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
100%

Can this issue be reproduced from the UI?
yes

If this is a regression, please provide more details to justify this:
Not being able to update an MCP is a regression.

Steps to Reproduce:
1. Apply the MCP changes described in the DPDK-readiness downstream documentation: https://docs.openshift.com/container-platform/4.13/virt/virtual_machines/vm_networking/virt-attaching-vm-to-sriov-network.html#virt-configuring-cluster-dpdk_virt-attaching-vm-to-sriov-network
2. Monitor the node's status with `oc get nodes` and notice that the node is not rebooted and is stuck during the drain.
3. Review the logs of the `machine-config-controller` pod in the `openshift-machine-config-operator` namespace and see that the ODF pods cannot be deleted because of their PDBs (example commands below).
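
Example commands for steps 2 and 3 (the machine-config-controller pod name is a placeholder and will differ per cluster):
oc get nodes -w
oc get pods -n openshift-machine-config-operator | grep machine-config-controller
oc logs -n openshift-machine-config-operator <machine-config-controller-pod>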


Actual results:
`oc get mcp` shows that the node is stuck in the update.


Expected results:
The node should reboot and return to Ready.

Additional info:

Comment 2 Santosh Pillai 2023-07-31 07:21:09 UTC
Hi Ram

Can you provide the following info:
- ODF must gather
- What was the ceph health before the MCP operation was performed?

Thanks

Comment 3 Ram Lavi 2023-07-31 07:24:59 UTC
Hey Santosh
can you provide a set of commands for me to follow in order to get the ODF must gather and ceph health?

Comment 4 Ram Lavi 2023-07-31 07:35:34 UTC
I currently do not have an available cluster, I'll update when I get the information you need.

Comment 5 Ram Lavi 2023-08-01 06:54:06 UTC
>Hey Santosh
>can you provide a set of commands for me to follow in order to get the ODF must gather and ceph health?

Comment 6 Santosh Pillai 2023-08-01 10:42:05 UTC
(In reply to Ram Lavi from comment #5)
> >Hey Santosh
> >can you provide a set of commands for me to follow in order to get the ODF must gather and ceph health?

Hi. I shared the steps on Google Chat.
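
The usual way to collect this information is roughly the following (assumptions, since the exact steps were shared off-bug: the must-gather image tag for ODF 4.14, and that the rook-ceph-tools deployment is enabled in openshift-storage):
oc adm must-gather --image=registry.redhat.io/odf4/odf-must-gather-rhel9:v4.14
oc -n openshift-storage rsh deploy/rook-ceph-tools ceph health detail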

Comment 8 Santosh Pillai 2023-08-08 05:39:48 UTC
Hi Ram.
Did you get a chance to generate the ODF must-gather?

Comment 9 Ram Lavi 2023-08-09 05:47:33 UTC
Unfortunately not yet. My cluster is SNO (so there is no drain).
I've asked a few cluster admins to let me know when they trigger a drain so that we can fetch this info for you.