Bug 2222606

Summary: Changes on MCP that require node drain are stuck due to ODF PDBs
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: Ram Lavi <ralavi>
Component: rook
Assignee: Santosh Pillai <sapillai>
Status: NEW
QA Contact: Neha Berry <nberry>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 4.14
CC: odf-bz-bot, sapillai, tnielsen
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Ram Lavi 2023-07-13 09:13:48 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
On a 4.14 OCP cluster with more than one node, after adding or updating an MCP that applies to one node, the MCP update cycle (in which the node is cordoned, drained, rebooted, etc.) is stuck indefinitely because of the ODF PDBs in the openshift-storage namespace.
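
For reference, the PDBs blocking the eviction can be inspected with, for example:
oc get pdb -n openshift-storage
oc describe pdb -n openshift-storage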

Version of all relevant components (if applicable):
OCP 4.14

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
There is a workaround, so no.

Is there any workaround available to the best of your knowledge?
Working workaround: remove all the PDBs in the openshift-storage namespace:
oc delete -n openshift-storage pdb --all
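
After deleting the PDBs, the drain should resume; this can be watched with, for example:
oc get mcp -w
oc get nodes -w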

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
100%

Can this issue be reproduced from the UI?
yes

If this is a regression, please provide more details to justify this:
Not being able to update an MCP is a regression.

Steps to Reproduce:
1. Apply the MCP changes described in the DPDK-readiness downstream documentation: https://docs.openshift.com/container-platform/4.13/virt/virtual_machines/vm_networking/virt-attaching-vm-to-sriov-network.html#virt-configuring-cluster-dpdk_virt-attaching-vm-to-sriov-network
2. Monitor the node's status with `oc get nodes` and notice that the node is not rebooted and is stuck during the drain.
3. Review the logs of the `machine-config-controller` pod in the `openshift-machine-config-operator` namespace and see that the ODF pods cannot be deleted because of their PDBs (example commands below).
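
Example commands for steps 2 and 3 (the machine-config-controller pod name is a placeholder and will differ per cluster):
oc get nodes -w
oc get pods -n openshift-machine-config-operator | grep machine-config-controller
oc logs -n openshift-machine-config-operator <machine-config-controller-pod>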


Actual results:
`oc get mcp` shows that the node is stuck in the update.


Expected results:
The node should reboot and return to Ready.

Additional info:

Comment 2 Santosh Pillai 2023-07-31 07:21:09 UTC
Hi Ram

Can you provide the following info:
- ODF must gather
- What was the ceph health before the MCP operation was performed?

Thanks

Comment 3 Ram Lavi 2023-07-31 07:24:59 UTC
Hey Santosh
can you provide a set of commands for me to follow in order to get the ODF must gather and ceph health?

Comment 4 Ram Lavi 2023-07-31 07:35:34 UTC
I currently do not have an available cluster, I'll update when I get the information you need.

Comment 5 Ram Lavi 2023-08-01 06:54:06 UTC
>Hey Santosh
>can you provide a set of commands for me to follow in order to get the ODF must gather and ceph health?

Comment 6 Santosh Pillai 2023-08-01 10:42:05 UTC
(In reply to Ram Lavi from comment #5)
> >Hey Santosh
> >can you provide a set of commands for me to follow in order to get the ODF must gather and ceph health?

Hi. I shared the steps on Google Chat.
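
The usual way to collect this information is roughly the following (assumptions, since the exact steps were shared off-bug: the must-gather image tag for ODF 4.14, and that the rook-ceph-tools deployment is enabled in openshift-storage):
oc adm must-gather --image=registry.redhat.io/odf4/odf-must-gather-rhel9:v4.14
oc -n openshift-storage rsh deploy/rook-ceph-tools ceph health detail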

Comment 8 Santosh Pillai 2023-08-08 05:39:48 UTC
Hi Ram.
Did you get a chance to generate the ODF must-gather?

Comment 9 Ram Lavi 2023-08-09 05:47:33 UTC
Unfortunately not yet. My cluster is SNO (so there is no drain).
I've asked a few cluster admins to let me know when they trigger a drain so that we can fetch this info for you.