Description of problem:

After running the e2e job on a cluster and then attempting to upgrade the cluster, the cluster has an unschedulable node. The node reports it is unschedulable because it is in the process of draining pods in preparation for reboot, but the drain never completes because of a pod left behind by a scheduling test:

  machineconfiguration.openshift.io/reason: 'failed to drain node (5 tries): timed out waiting for the condition: [error when evicting pod "forbid-1605161940-7fpr8": pods "forbid-1605161940-7fpr8" is forbidden:

Version-Release number of selected component (if applicable):
4.6

How reproducible:
Unknown, probably always.

Steps to Reproduce:
1. Run the e2e job against a 4.6 cluster.
2. Upgrade the cluster to a new version.
3. See nodes fail to drain.

Actual results:
Nodes fail to drain due to unevictable pods.

Expected results:
Nodes finish draining.

Additional info:
The test needs to remove this "forbidden" pod before completing.
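For anyone hitting this before the test is fixed, a minimal sketch of clearing the stuck eviction by hand (the node and namespace below are placeholders; only the pod name comes from the reason annotation above):

  # Find which namespace the blocking pod lives in, and confirm it sits on the draining node:
  oc get pods -A --field-selector spec.nodeName=<node> | grep forbid-1605161940

  # Deleting the pod (and the e2e CronJob/Job that owns it, if the test left one behind)
  # should let the drain proceed:
  oc delete pod forbid-1605161940-7fpr8 -n <namespace>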
Ben, can you provide a link to logs for further inspection? Based on the description and the lack of logs, the issue is about node draining being stuck, which falls under the cloud team.
*** Bug 1898612 has been marked as a duplicate of this bug. ***
Thanks for providing logs. I will include a separate comment for each node I investigate.

Node ip-10-0-208-80.us-east-2.compute.internal:

  e2e-pods-2601   pod-submit-status-2-1   0/1   Terminating   0   10d   <none>   ip-10-0-208-80.us-east-2.compute.internal   <none>   <none>

  Jan 28 04:51:46 ip-10-0-208-80 hyperkube[1543]: I0128 04:51:46.353775 1543 kubelet_pods.go:980] Pod "pod-submit-status-2-1_e2e-pods-2601(d37641cd-d814-498e-90f9-b1870f215885)" is terminated, but pod cgroup sandbox has not been cleaned up

  e2e-pods-5889   pod-submit-status-1-6   0/1   Terminating   0   9d   <none>   ip-10-0-208-80.us-east-2.compute.internal   <none>   <none>

  Jan 28 04:52:06 ip-10-0-208-80 hyperkube[1543]: I0128 04:52:06.356010 1543 kubelet_pods.go:980] Pod "pod-submit-status-1-6_e2e-pods-5889(742e427d-8cd9-4f61-beaa-5903ca24a5b2)" is terminated, but pod cgroup sandbox has not been cleaned up

(over and over)

  e2e-test-build-webhooks-wrssw   pushbuild-3-build   0/1   Terminating   0   11d   <none>   ip-10-0-208-80.us-east-2.compute.internal   <none>   <none>

  Jan 28 04:52:21 ip-10-0-208-80 hyperkube[1543]: I0128 04:52:21.622142 1543 kubelet_pods.go:980] Pod "pushbuild-3-build_e2e-test-build-webhooks-wrssw(4d061bea-86d5-46e6-8734-cdf6ed181c04)" is terminated, but pod cgroup sandbox has not been cleaned up

All the same error:
https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kubelet_pods.go#L974-L980

These were deleted too long ago for me to be able to see how they got stuck in this loop, but these pods are not otherwise showing up in the sync loop. I suspect these pods were affected by #1915085. I can confirm they were all created and deleted within a short period of time, which would trigger that race:

  name: pod-submit-status-2-1
  namespace: e2e-pods-2601
  creationTimestamp: "2021-01-19T06:29:59Z"
  deletionTimestamp: "2021-01-19T06:30:01Z"

  name: pod-submit-status-1-6
  namespace: e2e-pods-5889
  creationTimestamp: "2021-01-20T06:33:07Z"
  deletionTimestamp: "2021-01-20T06:33:09Z"

  name: pushbuild-3-build
  namespace: e2e-test-build-webhooks-wrssw
  creationTimestamp: "2021-01-18T06:20:34Z"
  deletionTimestamp: "2021-01-18T06:21:10Z"
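For reference, a sketch of one way to pull the timestamps above, plus a way to see the repeating kubelet message without a node shell (first pod shown; the others follow the same pattern):

  # Creation vs. deletion timestamps for one of the stuck pods:
  oc get pod pod-submit-status-2-1 -n e2e-pods-2601 \
    -o jsonpath='{.metadata.creationTimestamp}{"\n"}{.metadata.deletionTimestamp}{"\n"}'

  # The looping kubelet message, via node logs:
  oc adm node-logs ip-10-0-208-80.us-east-2.compute.internal -u kubelet | grep 'pod cgroup sandbox has not been cleaned up'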
Thanks... is there anything else you'd like to get from the cluster before I reinstall it? (Though I realize I'm likely to hit the issue again, as 1915085 is not resolved yet, let alone backported to 4.6.z.)
Node ip-10-0-165-70.us-east-2.compute.internal:

I queried the node as follows to see if there were any patterns in termination times and statuses, and found that essentially all of the pods had their termination time set between 2021-01-26T06:46:45Z and 2021-01-26T06:56:53Z.

Query (header shown first, then filtered to this node):

  oc get pod -A -o custom-columns=NODE:.spec.nodeName,NAME:.metadata.name,NAMESPACE:.metadata.namespace,STATUS:.status.phase,KILLTIME:.metadata.deletionTimestamp | head -1
  oc get pod -A -o custom-columns=NODE:.spec.nodeName,NAME:.metadata.name,NAMESPACE:.metadata.namespace,STATUS:.status.phase,KILLTIME:.metadata.deletionTimestamp | grep ip-10-0-165-70.us-east-2.compute.internal

Node status indicates this node went offline around then. The node is still in NotReady/Unknown status and has been in that state since the time of the interruption. The only thing I'm puzzled by is that these pods are shown as Pending/Running status rather than Unknown status, but since this kubelet still hasn't reconnected to the cluster, I'm not surprised it's in a wonky state.

  conditions:
  - lastHeartbeatTime: "2021-01-26T06:46:16Z"
    lastTransitionTime: "2021-01-26T06:49:48Z"
    message: Kubelet stopped posting node status.
    reason: NodeStatusUnknown
    status: Unknown
    type: MemoryPressure
  - lastHeartbeatTime: "2021-01-26T06:46:16Z"
    lastTransitionTime: "2021-01-26T06:49:48Z"
    message: Kubelet stopped posting node status.
    reason: NodeStatusUnknown
    status: Unknown
    type: DiskPressure
  - lastHeartbeatTime: "2021-01-26T06:46:16Z"
    lastTransitionTime: "2021-01-26T06:49:48Z"
    message: Kubelet stopped posting node status.
    reason: NodeStatusUnknown
    status: Unknown
    type: PIDPressure
  - lastHeartbeatTime: "2021-01-26T06:46:17Z"
    lastTransitionTime: "2021-01-26T06:49:48Z"
    message: Kubelet stopped posting node status.
    reason: NodeStatusUnknown
    status: Unknown
    type: Ready

I'll double-check this, as I would have expected these pods to be marked as Unknown.
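(For completeness, a sketch of the follow-up checks; the node name is from above, everything else is standard oc output:)

  # Node conditions in compact form:
  oc get node ip-10-0-165-70.us-east-2.compute.internal \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'

  # Phase vs. deletion timestamp for every pod still bound to this node:
  oc get pod -A --field-selector spec.nodeName=ip-10-0-165-70.us-east-2.compute.internal \
    -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,KILLTIME:.metadata.deletionTimestamp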