Description of problem:
Fresh installation of OCP-4.6.13 on bare metal with only three nodes, each having both the master and worker roles. The MCO failed to drain a node because of a pod that would not terminate, and marked the node as degraded. This happened on 3 different clusters for the same customer.

Version-Release number of selected component (if applicable): OCP-4.6.13

How reproducible:
Applying a MachineConfig after a compliance-operator scan with auto-remediation gets stuck because the oauth-openshift pod on one of the master nodes fails to terminate (global timeout error). As a result, the node becomes degraded with scheduling disabled, and several cluster operators (e.g. authentication, ingress, machine-config, and openshift-apiserver) become degraded as well.

Steps to Reproduce:
1.
2.
3.

Actual results:
The node is marked as degraded because the drain failed due to a pending pod while the machine config was being applied.

Expected results:
The node should be successfully updated with the latest machine config and end up in the Done state, not Degraded.

Additional info:
This is a fresh installation of OCP-4.6.13 on bare metal with only 3 nodes having both the master and worker roles, and the customer reproduced the issue on 3 different clusters. Attaching all 3 must-gathers for your reference.
Hi jerzhang,

Thanks for analyzing it and sharing your feedback. You are right that this fails for different pods, and the customer replicated it on 3 clusters with the same kind of setup (bare metal, with only 3 nodes having both the master and worker roles). I am attaching another must-gather for your review, since you mentioned you were not able to check the first one. But in my opinion, shouldn't the MCO be able to drain the node despite the pending pod instead of marking the node as degraded? Attaching the second must-gather for your analysis; I will take it further accordingly.

Thanks & Regards,
Nirupma
Apologies, access has been granted now. Thanks for sharing your feedback on the issue.