Bug 1855444

Summary: Post-upgrade the MCO failing to evict "build" pods during drain
Product: OpenShift Container Platform
Reporter: W. Trevor King <wking>
Component: Node
Sub Component: Kubelet
Assignee: Ryan Phillips <rphillips>
QA Contact: Sunil Choudhary <schoudha>
Docs Contact:
Status: CLOSED ERRATA
Severity: high
Priority: urgent
CC: aos-bugs, brad.williams, ccoleman, deads, dwalsh, jhou, jokerman, lmohanty, mgugino, minmli, mpatel, openshift-bugzilla-robot, rphillips, scuppett, vlaad, wking
Version: 4.5
Keywords: UpcomingSprint, Upgrades
Target Milestone: ---
Target Release: 4.5.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1853786
Environment:
Last Closed: 2020-07-13 17:44:51 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1853786
Bug Blocks: 1855478

Description W. Trevor King 2020-07-09 20:58:02 UTC
+++ This bug was initially created as a clone of Bug #1853786 +++

Description of problem:
During the post-upgrade rollout of the machine-config, the MCO gets stuck trying to drain nodes that have "build" pods running on them. The nodes appear to be stuck in a loop trying to evict those pods.

Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-rc.6   True        False         8h      Cluster version is 4.5.0-rc.6

How reproducible:
This has happened on both of our production clusters, two upgrades in a row (4.5.0-rc.5 and 4.5.0-rc.6).

Steps to Reproduce:
1. On a fairly large cluster (53 nodes) with many users continuously running build pods, begin an upgrade to 4.5.0-rc.5 (or 4.5.0-rc.6).
2. The upgrade goes all the way through successfully. The clusterversion is reporting complete and all operators are reporting healthy, but the machine-config rollout is hung.

Actual results:
The machine-config rollout has been hung for hours.

Expected results:
The machine-config upgrade should roll through all nodes successfully.

Additional info:
$ oc get nodes
NAME                                         STATUS                     ROLES    AGE    VERSION
ip-10-0-128-103.us-west-1.compute.internal   Ready                      worker   309d   v1.18.3+6025c28
ip-10-0-128-106.us-west-1.compute.internal   Ready                      master   309d   v1.18.3+6025c28
ip-10-0-133-180.us-west-1.compute.internal   Ready,SchedulingDisabled   worker   309d   v1.18.3+f291db1
ip-10-0-138-37.us-west-1.compute.internal    Ready                      worker   309d   v1.18.3+f291db1
ip-10-0-138-69.us-west-1.compute.internal    Ready                      master   309d   v1.18.3+6025c28
ip-10-0-139-193.us-west-1.compute.internal   Ready,SchedulingDisabled   worker   309d   v1.18.3+f291db1
ip-10-0-140-219.us-west-1.compute.internal   Ready                      worker   309d   v1.18.3+6025c28
ip-10-0-140-241.us-west-1.compute.internal   Ready                      worker   309d   v1.18.3+f291db1
...

$ oc get pods -o wide | grep ip-10-0-133-180.us-west-1.compute.internal
machine-config-daemon-kcv4c                  2/2     Running   0          8h    10.0.133.180   ip-10-0-133-180.us-west-1.compute.internal   <none>           <none>

$ oc logs machine-config-daemon-kcv4c -c machine-config-daemon 
<snip>
I0704 03:13:12.894123 2719952 daemon.go:321] evicting pod eta-karthika-demo/django-ex-1-build
I0704 03:13:12.894138 2719952 daemon.go:321] evicting pod eta-karthika-demo/ruby-hello-world-2-build
I0704 03:13:12.894406 2719952 daemon.go:321] evicting pod openshiftdemosree/ruby-hello-world-1-build
I0704 03:14:42.918021 2719952 update.go:173] Draining failed with: [error when waiting for pod "ruby-hello-world-2-build" terminating: global timeout reached: 1m30s, error when waiting for pod "django-ex-1-build" terminating: global timeout reached: 1m30s, error when waiting for pod "ruby-hello-world-1-build" terminating: global timeout reached: 1m30s], retrying
E0704 03:15:03.066532 2719952 daemon.go:321] WARNING: ignoring DaemonSet-managed Pods: openshift-cluster-node-tuning-operator/tuned-mpvdk, openshift-dns/dns-default-pmd46, openshift-image-registry/node-ca-ntctx, openshift-machine-config-operator/machine-config-daemon-kcv4c, openshift-monitoring/crel-monitors-clamav-container-scanner-wfrcb, openshift-monitoring/crel-monitors-sdn-health-dpw6b, openshift-monitoring/node-exporter-pxxkm, openshift-multus/multus-g5kn2, openshift-sdn/ovs-6bvfd, openshift-sdn/sdn-bhx4g
</snip>

Our nodes have been stuck in this state for over 8 hours now.
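For anyone hitting something similar, a rough sketch of how to watch the drain progress (assuming the worker pool and the daemon pod name shown above):

$ oc get machineconfigpool worker
$ oc -n openshift-machine-config-operator logs machine-config-daemon-kcv4c -c machine-config-daemon | grep -E 'evicting pod|Draining failed'

The pool should show UPDATING=True while nodes are being drained, and the daemon log shows which evictions keep timing out.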

--- Additional comment from  on 2020-07-04 03:35:31 UTC ---

I have been able to manually move things along by force-deleting the stuck build pods:
$ oc delete -n eta-karthika-demo --force --grace-period=0 pod/ruby-hello-world-1-build pod/ruby-hello-world-2-build
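A hedged sketch for spotting which pods on the cordoned node are wedged before resorting to a force delete (node name taken from the output above); this lists anything not in Running state, e.g. Terminating or Completed pods:

$ oc get pods -A --field-selector spec.nodeName=ip-10-0-133-180.us-west-1.compute.internal | grep -v Running

Note that --force --grace-period=0 removes the pod object from the API without waiting for the kubelet to confirm termination, so it is best treated as a last resort once the drain is clearly stuck.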

--- Additional comment from David Eads on 2020-07-09 16:57:44 UTC ---

2020-07-04T02:23:53.430453488Z E0704 02:23:53.430413       1 garbagecollector.go:314] error syncing item ***** unable to validate against any security context constraint: [spec.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used spec.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used spec.containers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed]

This garbage-collector error is fixed by https://github.com/kubernetes/kubernetes/pull/91648 in 1.18.5.
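For anyone checking whether their own cluster is hitting the same garbage-collector failure, a rough sketch (the kube-controller-manager pod name is per control-plane node and will differ):

$ oc -n openshift-kube-controller-manager get pods
$ oc -n openshift-kube-controller-manager logs <kube-controller-manager-pod> -c kube-controller-manager | grep garbagecollector.go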

Comment 3 W. Trevor King 2020-07-09 21:05:26 UTC
Eric fixed the bot.  Resetting the target to 4.5.0.

Comment 6 Scott Dodson 2020-07-10 12:07:09 UTC
*** Bug 1855477 has been marked as a duplicate of this bug. ***

Comment 8 errata-xmlrpc 2020-07-13 17:44:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

Comment 9 W. Trevor King 2021-04-05 17:36:12 UTC
Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475