Bug 1943564 - MCO failed to drain node because of a pending pod and marked the node as degraded.
Summary: MCO failed to drain node because of a pending pod and marked the node as degraded.
Keywords:
Status: CLOSED DUPLICATE of bug 1903228
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.6.z
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Yu Qi Zhang
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-03-26 13:19 UTC by Nirupma Kashyap
Modified: 2021-11-01 16:19 UTC (History)
CC: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-11-01 16:19:56 UTC
Target Upstream Version:
Embargoed:



Description Nirupma Kashyap 2021-03-26 13:19:29 UTC
Description of problem:
Fresh installation of OCP 4.6.13 on bare metal with only three nodes, each having both the master and worker roles. The MCO failed to drain a node because of a pending pod and marked the node as degraded. This happened on 3 different clusters for the same customer.
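
For reference, these are roughly the checks that show the state described above (the node name is a placeholder and the commands are standard OpenShift client tooling, not taken from the attached must-gathers or verified against this exact 4.6.z build):

  $ oc get mcp                                # master pool reports DEGRADED=True
  $ oc get nodes                              # affected node shows Ready,SchedulingDisabled
  $ oc describe node <affected-master>        # events/annotations around the failed drain
  $ oc -n openshift-machine-config-operator logs deployment/machine-config-controller | grep -i drain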

Version-Release number of selected component (if applicable):
OCP-4.6.13

How reproducible:
Applying a MachineConfigPool update after a compliance-operator scan with auto remediation gets stuck because the oauth-openshift pod on one of the master nodes fails to terminate.
The oauth pod failed to terminate on the master node due to a global timeout error; some of the cluster operators (for example authentication, ingress, machine-config and openshift-apiserver) became degraded, and because of this the node became degraded with scheduling disabled.
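
While the pool update is rolling, the stuck termination and the resulting operator degradation can be observed with roughly the following (namespace and pod names are the usual defaults, not copied from the customer clusters):

  $ oc get mcp master
  $ oc -n openshift-authentication get pods -o wide     # oauth-openshift pod stuck terminating / pending
  $ oc -n openshift-machine-config-operator logs deployment/machine-config-controller | grep -iE 'drain|timeout'
  $ oc get co authentication ingress machine-config openshift-apiserver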

Steps to Reproduce:
1.
2.
3.

Actual results:
The node is marked as degraded because the MCO failed to drain it due to a pending pod while applying the machine config.


Expected results:
The node should be successfully updated with the latest machine config and end up in the Done state, not Degraded.


Additional info:
It is a fresh installation of OCP 4.6.13 on bare metal with only 3 nodes having both the master and worker roles, and the customer reproduced this issue on 3 different clusters.
Attaching all the 3 must-gathers for your reference.

Comment 4 Nirupma Kashyap 2021-03-30 09:40:05 UTC
Hi jerzhang,

Thanks for analyzing it and sharing your feedback. You are right, this is failing for different pods, and the customer replicated it on 3 clusters with the same kind of setup (bare metal, with only 3 nodes having both the master and worker roles). I am attaching another must-gather as well for your review, as you mentioned you were not able to check the first one.

But in my opinion, shouldn't the MCO be able to drain the node with the pending pod instead of marking the node as degraded?
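
For what it's worth, the manual equivalent of what the MCO is attempting would be roughly the following (node and pod names are placeholders; the flags come from the generic oc adm drain help and are not verified against this exact 4.6.z client):

  $ oc adm cordon <affected-master>
  $ oc adm drain <affected-master> --ignore-daemonsets --delete-local-data --force --timeout=600s
  # if the oauth pod still refuses to go away:
  $ oc -n openshift-authentication delete pod <oauth-openshift-pod> --grace-period=0 --force

The MCO should uncordon the node itself once the new config is applied, so only the drain step would need manual help.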

Attaching the second must-gather for your analysis; I will take it further accordingly.

Thanks & Regards,
Nirupma

Comment 7 Nirupma Kashyap 2021-04-01 12:01:15 UTC

Apologies, access has been given now. Thanks for sharing feedback on the issue.

