+++ This bug was initially created as a clone of Bug #1853786 +++
Description of problem:
During the post-upgrade rollout of the machine config, the MCO gets stuck trying to drain nodes that have "build" pods running on them. The affected nodes appear to be stuck in a loop repeatedly trying to evict those pods.
Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-rc.6   True        False         8h      Cluster version is 4.5.0-rc.6
This has happened on both of our production clusters, on two consecutive upgrades (to 4.5.0-rc.5 and then to 4.5.0-rc.6).
Steps to Reproduce:
1. On a fairly large cluster (53 nodes) with many active users, where build pods run continuously, begin an upgrade to 4.5.0-rc.5 (or 4.5.0-rc.6).
2. The upgrade goes all the way through successfully: the clusterversion reports complete and all operators report healthy, but the machine-config rollout hangs.
Actual results:
The machine-config rollout has been hung for hours.
Expected results:
The machine-config upgrade should roll through all nodes successfully.
$ oc get nodes
NAME                                         STATUS                     ROLES    AGE    VERSION
ip-10-0-128-103.us-west-1.compute.internal   Ready                      worker   309d   v1.18.3+6025c28
ip-10-0-128-106.us-west-1.compute.internal   Ready                      master   309d   v1.18.3+6025c28
ip-10-0-133-180.us-west-1.compute.internal   Ready,SchedulingDisabled   worker   309d   v1.18.3+f291db1
ip-10-0-138-37.us-west-1.compute.internal    Ready                      worker   309d   v1.18.3+f291db1
ip-10-0-138-69.us-west-1.compute.internal    Ready                      master   309d   v1.18.3+6025c28
ip-10-0-139-193.us-west-1.compute.internal   Ready,SchedulingDisabled   worker   309d   v1.18.3+f291db1
ip-10-0-140-219.us-west-1.compute.internal   Ready                      worker   309d   v1.18.3+6025c28
ip-10-0-140-241.us-west-1.compute.internal   Ready                      worker   309d   v1.18.3+f291db1
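For completeness, the pool status and the MCD's per-node state annotation show the same hang. These are standard oc commands; the machineconfiguration.openshift.io annotations on the node are the usual place the daemon records Working/Done state, cited here from memory rather than from this cluster:
$ oc get machineconfigpool worker
$ oc describe node ip-10-0-133-180.us-west-1.compute.internal | grep machineconfiguration.openshift.io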
$ oc get pods -o wide | grep ip-10-0-133-180.us-west-1.compute.internal
machine-config-daemon-kcv4c 2/2 Running 0 8h 10.0.133.180 ip-10-0-133-180.us-west-1.compute.internal <none> <none>
$ oc logs machine-config-daemon-kcv4c -c machine-config-daemon
I0704 03:13:12.894123 2719952 daemon.go:321] evicting pod eta-karthika-demo/django-ex-1-build
I0704 03:13:12.894138 2719952 daemon.go:321] evicting pod eta-karthika-demo/ruby-hello-world-2-build
I0704 03:13:12.894406 2719952 daemon.go:321] evicting pod openshiftdemosree/ruby-hello-world-1-build
I0704 03:14:42.918021 2719952 update.go:173] Draining failed with: [error when waiting for pod "ruby-hello-world-2-build" terminating: global timeout reached: 1m30s, error when waiting for pod "django-ex-1-build" terminating: global timeout reached: 1m30s, error when waiting for pod "ruby-hello-world-1-build" terminating: global timeout reached: 1m30s], retrying
E0704 03:15:03.066532 2719952 daemon.go:321] WARNING: ignoring DaemonSet-managed Pods: openshift-cluster-node-tuning-operator/tuned-mpvdk, openshift-dns/dns-default-pmd46, openshift-image-registry/node-ca-ntctx, openshift-machine-config-operator/machine-config-daemon-kcv4c, openshift-monitoring/crel-monitors-clamav-container-scanner-wfrcb, openshift-monitoring/crel-monitors-sdn-health-dpw6b, openshift-monitoring/node-exporter-pxxkm, openshift-multus/multus-g5kn2, openshift-sdn/ovs-6bvfd, openshift-sdn/sdn-bhx4g
Our nodes have been stuck in this state for over 8 hours now.
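For triage, the pods the drain is waiting on can be listed per node (the -A and --field-selector flags are standard oc/kubectl options; node name is the stuck node above):
$ oc get pods -A -o wide --field-selector spec.nodeName=ip-10-0-133-180.us-west-1.compute.internal | grep -- '-build'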
--- Additional comment from on 2020-07-04 03:35:31 UTC ---
I have been able to manually move things along by force-deleting the stuck build pods:
$ oc delete -n eta-karthika-demo --force --grace-period=0 pod/ruby-hello-world-1-build pod/ruby-hello-world-2-build
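To avoid doing that pod-by-pod across 53 nodes, a rough sketch of the same workaround looped over one node's remaining build pods (this assumes every pod blocking the drain ends in "-build"; review the generated list before deleting anything):
$ NODE=ip-10-0-133-180.us-west-1.compute.internal
$ oc get pods -A -o wide --field-selector spec.nodeName=$NODE --no-headers \
    | awk '$2 ~ /-build$/ {print $1, $2}' \
    | while read ns pod; do oc delete pod "$pod" -n "$ns" --force --grace-period=0; done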
--- Additional comment from David Eads on 2020-07-09 16:57:44 UTC ---
2020-07-04T02:23:53.430453488Z E0704 02:23:53.430413 1 garbagecollector.go:314] error syncing item ***** unable to validate against any security context constraint: [spec.volumes: Invalid value: "hostPath": hostPath volumes are not allowed to be used spec.volumes: Invalid value: "hostPath": hostPath volumes are not allowed to be used spec.containers.securityContext.privileged: Invalid value: true: Privileged containers are not allowed]
This is fixed by https://github.com/kubernetes/kubernetes/pull/91648 in Kubernetes 1.18.5.
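If the garbage collector is indeed what is failing to remove these pods, their owning Builds have presumably already been deleted. One hedged way to check that for a wedged pod, assuming the usual OpenShift setup where build pods carry an ownerReference to their Build (names taken from the logs above):
$ oc get pod django-ex-1-build -n eta-karthika-demo -o jsonpath='{range .metadata.ownerReferences[*]}{.kind}/{.name}{"\n"}{end}'
$ oc get build django-ex-1 -n eta-karthika-demo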
Eric fixed the bot. Resetting the target to 4.5.0.
*** Bug 1855477 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.