Description of problem:

After running the e2e job on a cluster and then attempting to upgrade the cluster, the cluster has an unschedulable node. The node reports it is unschedulable because it is in the process of draining pods in preparation for reboot, but the drain never completes because of a pod left behind by a scheduling test:

  machineconfiguration.openshift.io/reason: 'failed to drain node (5 tries): timed out waiting for the condition: [error when evicting pod "forbid-1605161940-7fpr8": pods "forbid-1605161940-7fpr8" is forbidden:

Version-Release number of selected component (if applicable):
4.6

How reproducible:
Unknown, probably always.

Steps to Reproduce:
1. Run the e2e job against a 4.6 cluster.
2. Upgrade the cluster to a new version.
3. See nodes fail to drain.

Actual results:
Nodes fail to drain due to unevictable pods.

Expected results:
Nodes finish draining.

Additional info:
The test needs to remove this "forbidden" pod before completing.
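For anyone hitting this before the test is fixed, a minimal sketch of clearing the stuck eviction by hand (the node and namespace below are placeholders; only the pod name comes from the reason annotation above):

  # Find which namespace the blocking pod lives in, and confirm it sits on the draining node:
  oc get pods -A --field-selector spec.nodeName=<node> | grep forbid-1605161940

  # Deleting the pod (and the e2e CronJob/Job that owns it, if the test left one behind)
  # should let the drain proceed:
  oc delete pod forbid-1605161940-7fpr8 -n <namespace>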
Ben, can you provide a link to logs for further inspection? Based on the description and the lack of logs, the issue is about node draining being stuck, which falls under the cloud team.
*** Bug 1898612 has been marked as a duplicate of this bug. ***
Thanks for providing logs. I will include a separate comment for each node I investigate.

Node ip-10-0-208-80.us-east-2.compute.internal:

  e2e-pods-2601   pod-submit-status-2-1   0/1   Terminating   0   10d   <none>   ip-10-0-208-80.us-east-2.compute.internal   <none>   <none>

  Jan 28 04:51:46 ip-10-0-208-80 hyperkube[1543]: I0128 04:51:46.353775 1543 kubelet_pods.go:980] Pod "pod-submit-status-2-1_e2e-pods-2601(d37641cd-d814-498e-90f9-b1870f215885)" is terminated, but pod cgroup sandbox has not been cleaned up

  e2e-pods-5889   pod-submit-status-1-6   0/1   Terminating   0   9d   <none>   ip-10-0-208-80.us-east-2.compute.internal   <none>   <none>

  Jan 28 04:52:06 ip-10-0-208-80 hyperkube[1543]: I0128 04:52:06.356010 1543 kubelet_pods.go:980] Pod "pod-submit-status-1-6_e2e-pods-5889(742e427d-8cd9-4f61-beaa-5903ca24a5b2)" is terminated, but pod cgroup sandbox has not been cleaned up

(over and over)

  e2e-test-build-webhooks-wrssw   pushbuild-3-build   0/1   Terminating   0   11d   <none>   ip-10-0-208-80.us-east-2.compute.internal   <none>   <none>

  Jan 28 04:52:21 ip-10-0-208-80 hyperkube[1543]: I0128 04:52:21.622142 1543 kubelet_pods.go:980] Pod "pushbuild-3-build_e2e-test-build-webhooks-wrssw(4d061bea-86d5-46e6-8734-cdf6ed181c04)" is terminated, but pod cgroup sandbox has not been cleaned up

All the same error:
https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kubelet_pods.go#L974-L980

These were deleted too long ago for me to be able to see how they got stuck in this loop, but these pods are not otherwise showing up in the sync loop. I suspect these pods were affected by #1915085. I can confirm they were all created and deleted within a short period of time, which would trigger that race:

  name: pod-submit-status-2-1
  namespace: e2e-pods-2601
  creationTimestamp: "2021-01-19T06:29:59Z"
  deletionTimestamp: "2021-01-19T06:30:01Z"

  name: pod-submit-status-1-6
  namespace: e2e-pods-5889
  creationTimestamp: "2021-01-20T06:33:07Z"
  deletionTimestamp: "2021-01-20T06:33:09Z"

  name: pushbuild-3-build
  namespace: e2e-test-build-webhooks-wrssw
  creationTimestamp: "2021-01-18T06:20:34Z"
  deletionTimestamp: "2021-01-18T06:21:10Z"
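For reference, a sketch of one way to pull the timestamps above, plus a way to see the repeating kubelet message without a node shell (first pod shown; the others follow the same pattern):

  # Creation vs. deletion timestamps for one of the stuck pods:
  oc get pod pod-submit-status-2-1 -n e2e-pods-2601 \
    -o jsonpath='{.metadata.creationTimestamp}{"\n"}{.metadata.deletionTimestamp}{"\n"}'

  # The looping kubelet message, via node logs:
  oc adm node-logs ip-10-0-208-80.us-east-2.compute.internal -u kubelet | grep 'pod cgroup sandbox has not been cleaned up'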
Thanks... is there anything else you'd like to get from the cluster before I reinstall it? (Though I realize I'm likely to hit the issue again, as 1915085 is not resolved yet, let alone backported to 4.6.z.)
Node ip-10-0-165-70.us-east-2.compute.internal:

I queried the node as follows to see if there were any patterns in termination times and statuses, and found that essentially all of the pods had their termination time set between 2021-01-26T06:46:45Z and 2021-01-26T06:56:53Z.

Query (header shown first, then filtered to this node):

  oc get pod -A -o custom-columns=NODE:.spec.nodeName,NAME:.metadata.name,NAMESPACE:.metadata.namespace,STATUS:.status.phase,KILLTIME:.metadata.deletionTimestamp | head -1
  oc get pod -A -o custom-columns=NODE:.spec.nodeName,NAME:.metadata.name,NAMESPACE:.metadata.namespace,STATUS:.status.phase,KILLTIME:.metadata.deletionTimestamp | grep ip-10-0-165-70.us-east-2.compute.internal

Node status indicates this node went offline around then. The node is still in NotReady/Unknown status and has been in that state since the time of the interruption. The only thing I'm puzzled by is that these pods are shown as Pending/Running status rather than Unknown status, but since this kubelet still hasn't reconnected to the cluster, I'm not surprised it's in a wonky state.

  conditions:
  - lastHeartbeatTime: "2021-01-26T06:46:16Z"
    lastTransitionTime: "2021-01-26T06:49:48Z"
    message: Kubelet stopped posting node status.
    reason: NodeStatusUnknown
    status: Unknown
    type: MemoryPressure
  - lastHeartbeatTime: "2021-01-26T06:46:16Z"
    lastTransitionTime: "2021-01-26T06:49:48Z"
    message: Kubelet stopped posting node status.
    reason: NodeStatusUnknown
    status: Unknown
    type: DiskPressure
  - lastHeartbeatTime: "2021-01-26T06:46:16Z"
    lastTransitionTime: "2021-01-26T06:49:48Z"
    message: Kubelet stopped posting node status.
    reason: NodeStatusUnknown
    status: Unknown
    type: PIDPressure
  - lastHeartbeatTime: "2021-01-26T06:46:17Z"
    lastTransitionTime: "2021-01-26T06:49:48Z"
    message: Kubelet stopped posting node status.
    reason: NodeStatusUnknown
    status: Unknown
    type: Ready

I'll double-check this, as I would have expected these pods to be marked as Unknown.
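(For completeness, a sketch of the follow-up checks; the node name is from above, everything else is standard oc output:)

  # Node conditions in compact form:
  oc get node ip-10-0-165-70.us-east-2.compute.internal \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'

  # Phase vs. deletion timestamp for every pod still bound to this node:
  oc get pod -A --field-selector spec.nodeName=ip-10-0-165-70.us-east-2.compute.internal \
    -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,KILLTIME:.metadata.deletionTimestamp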