1855444 – Post-upgrade the MCO failing to evict "build" pods during drain

Summary: Post-upgrade the MCO failing to evict "build" pods during drain

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Node
Sub Component:
Version:	4.5
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	high
Target Milestone:	---
Target Release:	4.5.0
Assignee:	Ryan Phillips
QA Contact:	Sunil Choudhary
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1855477
Depends On:	1853786
Blocks:	1855478

Reported:	2020-07-09 20:58 UTC by W. Trevor King
Modified:	2021-04-05 17:36 UTC
CC List:	16 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:	1853786
Environment:
Last Closed:	2020-07-13 17:44:51 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift origin pull 25263	0	None	closed	Bug 1855444: UPSTREAM: 91648: Changes to ManagedFields is not mutation for GC	2020-12-10 07:40:02 UTC
Red Hat Product Errata	RHBA-2020:2409	0	None	None	None	2020-07-13 17:45:10 UTC

Description W. Trevor King 2020-07-09 20:58:02 UTC

+++ This bug was initially created as a clone of Bug #1853786 +++

Description of problem:
During the rollout of the machine-config, post-upgrade, the MCO is getting stuck trying to drain nodes that have "build" pods running on them.  The nodes appear to be stuck in a loop trying to evict the pods. 

Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-rc.6   True        False         8h      Cluster version is 4.5.0-rc.6

How reproducible:
This has happened on both production clusters, 2 upgrades in a row (4.5.0-rc.5 and 4.5.0-rc.6).

Steps to Reproduce:
1. We have fairly large clusters (53 nodes) with many users, and build pods run continuously on them.  Begin an upgrade to 4.5.0-rc.5 (or 4.5.0-rc.6).
2. The upgrade goes all the way through successfully.  The clusterversion reports complete and all operators report healthy, but the machine-config rollout is hung.

Actual results:
The machine-config rollout has been hung for hours.

Expected results:
The machine-config upgrade should roll through all nodes successfully.

Additional info:
$ oc get nodes
NAME                                         STATUS                     ROLES    AGE    VERSION
ip-10-0-128-103.us-west-1.compute.internal   Ready                      worker   309d   v1.18.3+6025c28
ip-10-0-128-106.us-west-1.compute.internal   Ready                      master   309d   v1.18.3+6025c28
ip-10-0-133-180.us-west-1.compute.internal   Ready,SchedulingDisabled   worker   309d   v1.18.3+f291db1
ip-10-0-138-37.us-west-1.compute.internal    Ready                      worker   309d   v1.18.3+f291db1
ip-10-0-138-69.us-west-1.compute.internal    Ready                      master   309d   v1.18.3+6025c28
ip-10-0-139-193.us-west-1.compute.internal   Ready,SchedulingDisabled   worker   309d   v1.18.3+f291db1
ip-10-0-140-219.us-west-1.compute.internal   Ready                      worker   309d   v1.18.3+6025c28
ip-10-0-140-241.us-west-1.compute.internal   Ready                      worker   309d   v1.18.3+f291db1
...
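
A cordoned node shows SchedulingDisabled in the STATUS column above. As a rough sketch (not part of the original report), the following Python helper picks those nodes out of saved `oc get nodes` text. The function name `cordoned_nodes` is hypothetical, and the column layout is assumed to match the output shown here.

```python
from typing import List


def cordoned_nodes(oc_get_nodes_output: str) -> List[str]:
    """Return names of nodes whose STATUS column includes SchedulingDisabled.

    Assumes the default `oc get nodes` layout shown in this report:
    NAME in the first column and a comma-separated STATUS in the second.
    """
    nodes = []
    for line in oc_get_nodes_output.splitlines():
        fields = line.split()
        # Skip blank lines, the header row, and elision markers such as "...".
        if len(fields) < 2 or fields[0] == "NAME":
            continue
        name, status = fields[0], fields[1]
        if "SchedulingDisabled" in status.split(","):
            nodes.append(name)
    return nodes
```

Run against the listing above, this would report the two workers that are cordoned, which are the nodes whose drain never finishes.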

$ oc get pods -o wide | grep ip-10-0-133-180.us-west-1.compute.internal
machine-config-daemon-kcv4c                  2/2     Running   0          8h    10.0.133.180   ip-10-0-133-180.us-west-1.compute.internal   <none>           <none>

$ oc logs machine-config-daemon-kcv4c -c machine-config-daemon 
<snip>
I0704 03:13:12.894123 2719952 daemon.go:321] evicting pod eta-karthika-demo/django-ex-1-build
I0704 03:13:12.894138 2719952 daemon.go:321] evicting pod eta-karthika-demo/ruby-hello-world-2-build
I0704 03:13:12.894406 2719952 daemon.go:321] evicting pod openshiftdemosree/ruby-hello-world-1-build
I0704 03:14:42.918021 2719952 update.go:173] Draining failed with: [error when waiting for pod "ruby-hello-world-2-build" terminating: global timeout reached: 1m30s, error when waiting for pod "django-ex-1-build" terminating: global timeout reached: 1m30s, error when waiting for pod "ruby-hello-world-1-build" terminating: global timeout reached: 1m30s], retrying
E0704 03:15:03.066532 2719952 daemon.go:321] WARNING: ignoring DaemonSet-managed Pods: openshift-cluster-node-tuning-operator/tuned-mpvdk, openshift-dns/dns-default-pmd46, openshift-image-registry/node-ca-ntctx, openshift-machine-config-operator/machine-config-daemon-kcv4c, openshift-monitoring/crel-monitors-clamav-container-scanner-wfrcb, openshift-monitoring/crel-monitors-sdn-health-dpw6b, openshift-monitoring/node-exporter-pxxkm, openshift-multus/multus-g5kn2, openshift-sdn/ovs-6bvfd, openshift-sdn/sdn-bhx4g
</snip>

Our nodes have been stuck in this state for over 8 hours now.
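
To see which pods are blocking the drain, the two message shapes in the log excerpt above ("evicting pod <namespace>/<name>" and "error when waiting for pod "<name>" terminating") can be matched up. The Python sketch below is illustrative only: the helper name `stuck_pods` is hypothetical, and the regular expressions assume the exact wording seen in this excerpt, which may differ in other releases.

```python
import re
from collections import defaultdict
from typing import List, Tuple

# Patterns based only on the log lines quoted in this report.
EVICTING = re.compile(r"evicting pod (?P<ns>[^/\s]+)/(?P<pod>\S+)")
TIMED_OUT = re.compile(r'error when waiting for pod "(?P<pod>[^"]+)" terminating')


def stuck_pods(log_text: str) -> List[Tuple[str, str]]:
    """Return sorted (namespace, pod) pairs that were evicted and then timed out.

    The drain error names pods without a namespace, so a pod name seen in
    several namespaces yields one pair per namespace.
    """
    namespaces = defaultdict(set)
    timed_out = set()
    for line in log_text.splitlines():
        for match in EVICTING.finditer(line):
            namespaces[match.group("pod")].add(match.group("ns"))
        for match in TIMED_OUT.finditer(line):
            timed_out.add(match.group("pod"))
    return sorted(
        (ns, pod) for pod in timed_out for ns in namespaces.get(pod, ())
    )
```

Feeding it the output of `oc logs machine-config-daemon-kcv4c -c machine-config-daemon` should list the build pods that keep hitting the 1m30s timeout.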

--- Additional comment from  on 2020-07-04 03:35:31 UTC ---

I have been able to manually move things along by forcefully deleting the specified pods:
$ oc delete -n eta-karthika-demo --force --grace-period=0 pod/ruby-hello-world-1-build pod/ruby-hello-world-2-build
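
When many pods are stuck, the same workaround can be generated per namespace. The Python sketch below only builds the argument lists for the `oc delete` invocation shown above and does not run anything. The helper name `force_delete_commands` is hypothetical. Note that `--force --grace-period=0` skips graceful termination, so this is a stopgap and not the fix.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def force_delete_commands(pods: Iterable[Tuple[str, str]]) -> List[List[str]]:
    """Build one `oc delete` argument list per namespace.

    `pods` is an iterable of (namespace, pod_name) pairs. The flags mirror
    the manual command in this comment; nothing is executed here.
    """
    by_namespace = defaultdict(list)
    for namespace, pod in pods:
        by_namespace[namespace].append(f"pod/{pod}")
    return [
        ["oc", "delete", "-n", namespace, "--force", "--grace-period=0", *sorted(names)]
        for namespace, names in sorted(by_namespace.items())
    ]
```

Each returned list can be reviewed first and then passed to `subprocess.run(cmd, check=True)` on a host where `oc` is logged in to the cluster.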

--- Additional comment from David Eads on 2020-07-09 16:57:44 UTC ---

2020-07-04T02:23:53.430453488Z E0704 02:23:53.430413       1 garbagecollector.go:314] error syncing item ***** unable to validate against any security context constraint: [spec.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used spec.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used spec.containers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed]

This garbage collector error is fixed by https://github.com/kubernetes/kubernetes/pull/91648 in 1.18.5.

Comment 3 W. Trevor King 2020-07-09 21:05:26 UTC

Eric fixed the bot.  Resetting the target to 4.5.0.

Comment 6 Scott Dodson 2020-07-10 12:07:09 UTC

*** Bug 1855477 has been marked as a duplicate of this bug. ***

Comment 8 errata-xmlrpc 2020-07-13 17:44:51 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

Comment 9 W. Trevor King 2021-04-05 17:36:12 UTC

Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1].  If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475
