Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1898612

Summary: build test leaves node un-drainable
Product: OpenShift Container Platform    Reporter: Ben Parees <bparees>
Component: Node                          Assignee: Elana Hashman <ehashman>
Node sub component: Kubelet              QA Contact: Weinan Liu <weinliu>
Status: CLOSED DUPLICATE                 Docs Contact:
Severity: high
Priority: high                           CC: adam.kaplan, aos-bugs, nagrawal, rphillips, tsweeney, wking
Version: 4.6                             Keywords: UpcomingSprint
Target Milestone: ---   
Target Release: 4.7.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-01-25 14:56:36 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1912521, 1912880

Description Ben Parees 2020-11-17 16:23:14 UTC
Description of problem:
After running the e2e job on a cluster and then attempting to upgrade the cluster, the cluster is left with an unschedulable node.

The node reports that it is unschedulable because it is in the process of draining pods in preparation for a reboot, but the drain never completes because of a build-related pod:

 machineconfiguration.openshift.io/reason: 'failed to drain node (5 tries): timed
      out waiting for the condition: [error when evicting pod "pushbuild-4-build":
      pods "pushbuild-4-build" is forbidden: unable to create new content in namespace
      e2e-test-build-webhooks-42jlp because it is being terminated
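The stuck-drain condition can be confirmed from the node's machine-config annotation; a minimal sketch, assuming cluster access and a placeholder node name:

```shell
# Hypothetical check; <node-name> is a placeholder for the affected node.
# The machine-config operator records the drain failure reason as a
# machineconfiguration.openshift.io annotation on the node.
oc get node <node-name> -o yaml | grep -A 4 'machineconfiguration.openshift.io/reason'

# The pod blocking the drain should still be visible in the terminating namespace:
oc get pods -n e2e-test-build-webhooks-42jlp
```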


Version-Release number of selected component (if applicable):
4.6

How reproducible:
unknown, probably always

Steps to Reproduce:
1. run e2e job against a 4.6 cluster
2. upgrade the cluster to a new version
3. see nodes fail to drain
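The steps above, sketched as illustrative commands (the exact e2e invocation and target version are assumptions):

```shell
# 1. Run the conformance e2e suite against the 4.6 cluster (invocation is illustrative).
openshift-tests run openshift/conformance/parallel

# 2. Upgrade the cluster to a new version (<target-version> is a placeholder).
oc adm upgrade --to=<target-version>

# 3. Watch for nodes stuck in SchedulingDisabled while the drain is retried.
oc get nodes -w
```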

Actual results:
nodes fail to drain due to unevictable pods

Expected results:
nodes finish draining

Comment 1 Corey Daley 2020-12-17 13:50:50 UTC
Ben, 
Is it always the same pod name? Or does it change each time?

Comment 2 Ben Parees 2020-12-17 14:05:19 UTC
I only have the one example so far, but the pod name looks like it would be produced by kicking off a build (in particular, this was the 4th one), and I imagine the test kicks off the same number of builds each time, so I'd expect it to be consistent.

Comment 3 Adam Kaplan 2020-12-22 16:41:07 UTC
I ran the following test on a GCP 4.6 cluster:

1. Create a Dockerfile build with a long-running RUN instruction (bash with long while loop)
2. Run the build
3. Delete the namespace

Namespace seems to clean itself up just fine, within 10-15 seconds the pods are deleted. Build containers seem to be doing the right thing and terminating on SIGTERM.
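A rough sketch of the test described above; the project name, build name, and base image are assumptions, not from the original test:

```shell
# Illustrative names only.
oc new-project build-terminate-test

# 1. Create a Dockerfile build with a long-running RUN instruction
#    (bash with a long while loop); the Dockerfile is read from stdin.
oc new-build --name=longrun --dockerfile=- <<'EOF'
FROM registry.access.redhat.com/ubi8/ubi-minimal
RUN i=0; while [ "$i" -lt 600 ]; do sleep 1; i=$((i+1)); done
EOF

# 2. Run the build, then 3. delete the namespace while the build is still running.
oc start-build longrun
oc delete project build-terminate-test

# Expected: the build container terminates on SIGTERM and the namespace
# finishes cleaning up within roughly 10-15 seconds.
```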

Comment 7 Ben Parees 2021-01-10 00:19:36 UTC
It's a 4.6 nightly from 12/21, so I'd say it's even odds as to whether or not it has the fix. I'll install a new one and see if the problem recurs.

Comment 18 Ben Parees 2021-01-25 14:56:36 UTC

*** This bug has been marked as a duplicate of bug 1898614 ***