Bug 1898612 - build test leaves node un-drainable
Summary: build test leaves node un-drainable
Keywords:
Status: CLOSED DUPLICATE of bug 1898614
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Elana Hashman
QA Contact: Weinan Liu
URL:
Whiteboard:
Depends On:
Blocks: 1912521 1912880
 
Reported: 2020-11-17 16:23 UTC by Ben Parees
Modified: 2021-01-25 14:56 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-01-25 14:56:36 UTC
Target Upstream Version:
Embargoed:



Description Ben Parees 2020-11-17 16:23:14 UTC
Description of problem:
After running the e2e job on a cluster and then attempting to upgrade it, the cluster is left with an unschedulable node.

The node reports that it is unschedulable because it is draining pods in preparation for a reboot, but the drain never completes because of a build-related pod:

 machineconfiguration.openshift.io/reason: 'failed to drain node (5 tries): timed
      out waiting for the condition: [error when evicting pod "pushbuild-4-build":
      pods "pushbuild-4-build" is forbidden: unable to create new content in namespace
      e2e-test-build-webhooks-42jlp because it is being terminated
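
For context: oc adm drain removes pods through the eviction API, and an eviction is a create request in the pod's namespace. A namespace in the Terminating phase rejects all new object creation, which is why the eviction is forbidden even though the pod itself is already being deleted. A minimal sketch for inspecting a node stuck in this state (the node name is a placeholder; the pod and namespace names come from the error above):

$ oc get nodes                                    # look for Ready,SchedulingDisabled
$ oc describe node <node-name> | grep -A3 machineconfiguration.openshift.io/reason
$ oc get namespace e2e-test-build-webhooks-42jlp -o jsonpath='{.status.phase}'  # expect Terminating
$ oc get pod pushbuild-4-build -n e2e-test-build-webhooks-42jlp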


Version-Release number of selected component (if applicable):
4.6

How reproducible:
unknown, probably always

Steps to Reproduce:
1. run e2e job against a 4.6 cluster
2. upgrade the cluster to a new version
3. see nodes fail to drain (a command sketch for these steps follows below)
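
A hedged sketch of those steps; the test suite invocation and the release image placeholder are illustrative assumptions, not values from this report:

$ openshift-tests run openshift/conformance/parallel     # 1. run the e2e suite
$ oc adm upgrade --to-image=<new-release-payload-image>  # 2. start the upgrade
$ oc get nodes -w                                        # 3. watch for nodes stuck in Ready,SchedulingDisabled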

Actual results:
nodes fail to drain due to unevictable pods

Expected results:
nodes finish draining

Comment 1 Corey Daley 2020-12-17 13:50:50 UTC
Ben, 
Is it always the same pod name? Or does it change each time?

Comment 2 Ben Parees 2020-12-17 14:05:19 UTC
I only have the one example so far, but the pod name looks like it would be produced by kicking off a build (in particular, this was the 4th one?), and I imagine the test kicks off the same number of builds each time, so I'd expect it to be consistent.
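
For reference, each run of a BuildConfig creates a Build named <buildconfig>-<n>, and that build runs in a pod named <build>-build, so pushbuild-4-build would be the pod for the 4th build of a BuildConfig named pushbuild. A quick way to check, with the namespace as a placeholder:

$ oc get builds -n <test-namespace>   # builds are named <buildconfig>-<n>
$ oc get pods -n <test-namespace>     # each build pod is named <build>-build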

Comment 3 Adam Kaplan 2020-12-22 16:41:07 UTC
I ran the following test on a GCP 4.6 cluster:

1. Create a Dockerfile build with a long-running RUN instruction (bash with long while loop)
2. Run the build
3. Delete the namespace

The namespace seems to clean itself up just fine; within 10-15 seconds the pods are deleted. Build containers seem to be doing the right thing and terminating on SIGTERM.
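
A rough sketch of that test, assuming the standard build tooling; the project name, base image, and loop length are illustrative:

$ oc new-project drain-test
$ cat <<'EOF' | oc new-build --name=longbuild --dockerfile=-
FROM registry.access.redhat.com/ubi8/ubi-minimal
RUN i=0; while [ "$i" -lt 600 ]; do sleep 1; i=$((i+1)); done
EOF
$ oc start-build longbuild        # step 2 (new-build may already have triggered build #1)
$ oc delete project drain-test    # step 3: delete the namespace while the build is running
$ oc get pods -n drain-test -w    # the build pod should be gone within 10-15 seconds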

Comment 7 Ben Parees 2021-01-10 00:19:36 UTC
It's a 4.6 nightly from 12/21, so I'd say it's even odds as to whether or not it has the fix. I'll install a new one and see if the problem recurs.

Comment 18 Ben Parees 2021-01-25 14:56:36 UTC

*** This bug has been marked as a duplicate of bug 1898614 ***

