+++ This bug was initially created as a clone of Bug #1853786 +++
Description of problem:
During the rollout of the machine-config, post-upgrade, the MCO is getting stuck trying to drain nodes that have "build" pods running on them. The nodes appear to be stuck in a loop trying to evict the pods.
Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.5.0-rc.6 True False 8h Cluster version is 4.5.0-rc.6
How reproducible:
This has happened on both production clusters, 2 upgrades in a row (4.5.0-rc.5 and 4.5.0-rc.6).
Steps to Reproduce:
1. We have pretty large clusters (53 nodes), with lots of users on them. Folks are running build pods continuously on these clusters. Begin an upgrade to 4.5.0-rc.5 (or 4.5.0-rc.6).
2. The upgrade goes all the way through successfully. The clusterversion is reporting complete, all operators are reporting healthy, but the machine-config rollout is hung.
Actual results:
The machine-config rollout has been hung for hours.
Expected results:
The machine-config upgrade should roll through all nodes successfully.
Additional info:
$ oc get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-128-103.us-west-1.compute.internal Ready worker 309d v1.18.3+6025c28
ip-10-0-128-106.us-west-1.compute.internal Ready master 309d v1.18.3+6025c28
ip-10-0-133-180.us-west-1.compute.internal Ready,SchedulingDisabled worker 309d v1.18.3+f291db1
ip-10-0-138-37.us-west-1.compute.internal Ready worker 309d v1.18.3+f291db1
ip-10-0-138-69.us-west-1.compute.internal Ready master 309d v1.18.3+6025c28
ip-10-0-139-193.us-west-1.compute.internal Ready,SchedulingDisabled worker 309d v1.18.3+f291db1
ip-10-0-140-219.us-west-1.compute.internal Ready worker 309d v1.18.3+6025c28
ip-10-0-140-241.us-west-1.compute.internal Ready worker 309d v1.18.3+f291db1
...
$ oc get pods -o wide | grep ip-10-0-133-180.us-west-1.compute.internal
machine-config-daemon-kcv4c 2/2 Running 0 8h 10.0.133.180 ip-10-0-133-180.us-west-1.compute.internal <none> <none>
$ oc logs machine-config-daemon-kcv4c -c machine-config-daemon
<snip>
I0704 03:13:12.894123 2719952 daemon.go:321] evicting pod eta-karthika-demo/django-ex-1-build
I0704 03:13:12.894138 2719952 daemon.go:321] evicting pod eta-karthika-demo/ruby-hello-world-2-build
I0704 03:13:12.894406 2719952 daemon.go:321] evicting pod openshiftdemosree/ruby-hello-world-1-build
I0704 03:14:42.918021 2719952 update.go:173] Draining failed with: [error when waiting for pod "ruby-hello-world-2-build" terminating: global timeout reached: 1m30s, error when waiting for pod "django-ex-1-build" terminating: global timeout reached: 1m30s, error when waiting for pod "ruby-hello-world-1-build" terminating: global timeout reached: 1m30s], retrying
E0704 03:15:03.066532 2719952 daemon.go:321] WARNING: ignoring DaemonSet-managed Pods: openshift-cluster-node-tuning-operator/tuned-mpvdk, openshift-dns/dns-default-pmd46, openshift-image-registry/node-ca-ntctx, openshift-machine-config-operator/machine-config-daemon-kcv4c, openshift-monitoring/crel-monitors-clamav-container-scanner-wfrcb, openshift-monitoring/crel-monitors-sdn-health-dpw6b, openshift-monitoring/node-exporter-pxxkm, openshift-multus/multus-g5kn2, openshift-sdn/ovs-6bvfd, openshift-sdn/sdn-bhx4g
</snip>
Our nodes have been stuck in this state for over 8 hours now.
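To see at a glance which pods are blocking the drain, the daemon log can be filtered for the pods named in the "Draining failed with" message. The following is only a rough sketch, assuming the log lines keep the exact wording shown in the excerpt above; the function name `stuck_pods` and the "?" placeholder for an unknown namespace are made up for illustration and are not part of the MCO.

```python
import re

# Patterns copied from the wording of the log excerpt in this report.
# Assumption: other MCO versions may phrase these messages differently.
EVICTING = re.compile(r"evicting pod (?P<ns>[^/\s]+)/(?P<name>\S+)")
TIMED_OUT = re.compile(r'error when waiting for pod "(?P<name>[^"]+)" terminating')


def stuck_pods(log_text):
    """Return sorted 'namespace/name' strings for pods whose eviction timed out.

    The drain error only lists pod names, so namespaces are recovered from the
    earlier 'evicting pod <namespace>/<name>' lines. A name seen in several
    namespaces is reported once per namespace; a name never seen in an
    'evicting pod' line is reported with '?' as its namespace.
    """
    namespaces = {}   # pod name -> set of namespaces it was evicted from
    timed_out = set() # pod names listed in a drain failure message
    for line in log_text.splitlines():
        evicting = EVICTING.search(line)
        if evicting:
            namespaces.setdefault(evicting.group("name"), set()).add(evicting.group("ns"))
        if "Draining failed with" in line:
            timed_out.update(TIMED_OUT.findall(line))

    result = set()
    for name in timed_out:
        for ns in namespaces.get(name, {"?"}):
            result.add(f"{ns}/{name}")
    return sorted(result)
```

Calling `stuck_pods()` on the saved output of the `oc logs` command above should list the build pods the daemon keeps retrying.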
--- Additional comment from on 2020-07-04 03:35:31 UTC ---
I have been able to manually move things along by forcefully deleting the specified pod:
$ oc delete -n eta-karthika-demo --force --grace-period=0 pod/ruby-hello-world-1-build pod/ruby-hello-world-2-build
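When several namespaces are affected, the same workaround has to be repeated once per namespace. A minimal sketch of generating those commands is below; it only builds argument lists using the flags from the command above and does not run anything. The helper name `force_delete_commands` and the "namespace/name" input format are assumptions made for this example.

```python
from collections import defaultdict


def force_delete_commands(pods):
    """Build one 'oc delete' argument list per namespace.

    pods: iterable of 'namespace/name' strings (hypothetical input format).
    Returns a list of argv lists, sorted by namespace, that mirror the manual
    workaround: oc delete -n <ns> --force --grace-period=0 pod/<name> ...
    """
    by_namespace = defaultdict(list)
    for ref in pods:
        namespace, name = ref.split("/", 1)
        by_namespace[namespace].append(name)

    commands = []
    for namespace, names in sorted(by_namespace.items()):
        argv = ["oc", "delete", "-n", namespace, "--force", "--grace-period=0"]
        argv += [f"pod/{name}" for name in sorted(names)]
        commands.append(argv)
    return commands
```

Printing each list with `" ".join(argv)` gives commands that can be reviewed before being run by hand; force-deleting a pod skips graceful termination, so the list is worth checking first.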
--- Additional comment from David Eads on 2020-07-09 16:57:44 UTC ---
2020-07-04T02:23:53.430453488Z E0704 02:23:53.430413 1 garbagecollector.go:314] error syncing item ***** unable to validate against any security context constraint: [spec.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used spec.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used spec.containers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed]
The error above is fixed by https://github.com/kubernetes/kubernetes/pull/91648 in 1.18.5.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2020:2409
Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel this bug still needs to be a suspect, please add the keyword again.
[1]: https://github.com/openshift/enhancements/pull/475