Description of problem:
After running the e2e job on a cluster and then attempting to upgrade the cluster, the cluster has an unschedulable node. The node reports that it is unschedulable because it is in the process of draining pods in preparation for a reboot, but the drain never completes because of a build-related pod:

machineconfiguration.openshift.io/reason: 'failed to drain node (5 tries): timed out waiting for the condition: [error when evicting pod "pushbuild-4-build": pods "pushbuild-4-build" is forbidden: unable to create new content in namespace e2e-test-build-webhooks-42jlp because it is being terminated]'

Version-Release number of selected component (if applicable):
4.6

How reproducible:
Unknown, probably always

Steps to Reproduce:
1. Run the e2e job against a 4.6 cluster
2. Upgrade the cluster to a new version
3. See nodes fail to drain

Actual results:
Nodes fail to drain due to unevictable pods

Expected results:
Nodes finish draining
Ben, is it always the same pod name? Or does it change each time?
I only have the one example so far, but the pod name looks like it would be produced by kicking off a build (in particular, this was the 4th one), and I imagine the test kicks off the same number of builds each time, so I'd expect it to be consistent.
I ran the following test on a GCP 4.6 cluster:
1. Create a Dockerfile build with a long-running RUN instruction (bash with a long while loop)
2. Run the build
3. Delete the namespace

The namespace seems to clean itself up just fine; within 10-15 seconds the pods are deleted. Build containers seem to be doing the right thing and terminating on SIGTERM.
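For reference, a minimal sketch of the kind of build used in the test above. The base image and loop duration here are assumptions for illustration, not the exact test input:

```dockerfile
# Hypothetical long-running Dockerfile build: the RUN instruction spins
# in a shell loop so the build is still in progress when the namespace
# is deleted, exercising the SIGTERM/cleanup path described above.
FROM registry.access.redhat.com/ubi8/ubi-minimal
RUN i=0; while [ "$i" -lt 300 ]; do sleep 1; i=$((i+1)); done
```

Something along these lines can be started with `oc new-build` in a throwaway project, then the project deleted mid-build (e.g. `oc delete project <name>`) to check whether the build pod terminates promptly.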
It's a 4.6 nightly from 12/21, so I'd say it's even odds as to whether or not it has the fix. I'll install a new one and see if the problem recurs.
*** This bug has been marked as a duplicate of bug 1898614 ***