Description of problem:
Fresh installation of OCP-4.6.13 on bare metal with only three nodes, each having both the master and worker roles. The MCO failed to drain a node because of a pod that would not terminate, and marked the node as degraded. This happened on 3 different clusters for the same customer.

Version-Release number of selected component (if applicable): OCP-4.6.13

How reproducible:
Applying a MachineConfig after a compliance-operator scan with auto-remediation gets stuck because the oauth-openshift pod on one of the master nodes fails to terminate (global timeout error). As a result, the node becomes degraded with scheduling disabled, and several cluster operators (e.g. authentication, ingress, machine-config, and openshift-apiserver) become degraded as well.

Steps to Reproduce:
1.
2.
3.

Actual results:
The node is marked as degraded because the drain failed due to a pending pod while the machine config was being applied.

Expected results:
The node should be successfully updated with the latest machine config and end up in the Done state, not Degraded.

Additional info:
This is a fresh installation of OCP-4.6.13 on bare metal with only 3 nodes having both the master and worker roles, and the customer reproduced the issue on 3 different clusters. Attaching all 3 must-gathers for your reference.
Hi jerzhang,

Thanks for analyzing it and sharing your feedback. You are right that this fails for different pods, and the customer replicated it on 3 clusters with the same kind of setup (bare metal, with only 3 nodes having both the master and worker roles). I am attaching another must-gather for your review, since you mentioned you were not able to check the first one. But in my opinion, shouldn't the MCO be able to drain the node despite the pending pod instead of marking the node as degraded? Attaching the second must-gather for your analysis; I will take it further accordingly.

Thanks & Regards,
Nirupma
Apologies, access has been granted now. Thanks for sharing your feedback on the issue.