Description of problem:
During the rollout of the machine-config, post-upgrade, the MCO is getting stuck trying to drain nodes that have "build" pods running on them. The nodes appear to be stuck in a loop trying to evict the pods.

Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-rc.6   True        False         8h      Cluster version is 4.5.0-rc.6

How reproducible:
This has happened on both production clusters, two upgrades in a row (4.5.0-rc.5 and 4.5.0-rc.6).

Steps to Reproduce:
1. We have fairly large clusters (53 nodes) with lots of users on them, and folks are running build pods on them continuously. Begin an upgrade to 4.5.0-rc.5 (or 4.5.0-rc.6).
2. The upgrade goes all the way through successfully. The clusterversion reports complete and all operators report healthy, but the machine-config rollout is hung.

Actual results:
The machine-config rollout has been hung for hours.

Expected results:
The machine-config upgrade should roll through all nodes successfully.

Additional info:
$ oc get nodes
NAME                                         STATUS                     ROLES    AGE    VERSION
ip-10-0-128-103.us-west-1.compute.internal   Ready                      worker   309d   v1.18.3+6025c28
ip-10-0-128-106.us-west-1.compute.internal   Ready                      master   309d   v1.18.3+6025c28
ip-10-0-133-180.us-west-1.compute.internal   Ready,SchedulingDisabled   worker   309d   v1.18.3+f291db1
ip-10-0-138-37.us-west-1.compute.internal    Ready                      worker   309d   v1.18.3+f291db1
ip-10-0-138-69.us-west-1.compute.internal    Ready                      master   309d   v1.18.3+6025c28
ip-10-0-139-193.us-west-1.compute.internal   Ready,SchedulingDisabled   worker   309d   v1.18.3+f291db1
ip-10-0-140-219.us-west-1.compute.internal   Ready                      worker   309d   v1.18.3+6025c28
ip-10-0-140-241.us-west-1.compute.internal   Ready                      worker   309d   v1.18.3+f291db1
...

$ oc get pods -o wide | grep ip-10-0-133-180.us-west-1.compute.internal
machine-config-daemon-kcv4c   2/2   Running   0   8h   10.0.133.180   ip-10-0-133-180.us-west-1.compute.internal   <none>   <none>

$ oc logs machine-config-daemon-kcv4c -c machine-config-daemon
<snip>
I0704 03:13:12.894123 2719952 daemon.go:321] evicting pod eta-karthika-demo/django-ex-1-build
I0704 03:13:12.894138 2719952 daemon.go:321] evicting pod eta-karthika-demo/ruby-hello-world-2-build
I0704 03:13:12.894406 2719952 daemon.go:321] evicting pod openshiftdemosree/ruby-hello-world-1-build
I0704 03:14:42.918021 2719952 update.go:173] Draining failed with: [error when waiting for pod "ruby-hello-world-2-build" terminating: global timeout reached: 1m30s, error when waiting for pod "django-ex-1-build" terminating: global timeout reached: 1m30s, error when waiting for pod "ruby-hello-world-1-build" terminating: global timeout reached: 1m30s], retrying
E0704 03:15:03.066532 2719952 daemon.go:321] WARNING: ignoring DaemonSet-managed Pods: openshift-cluster-node-tuning-operator/tuned-mpvdk, openshift-dns/dns-default-pmd46, openshift-image-registry/node-ca-ntctx, openshift-machine-config-operator/machine-config-daemon-kcv4c, openshift-monitoring/crel-monitors-clamav-container-scanner-wfrcb, openshift-monitoring/crel-monitors-sdn-health-dpw6b, openshift-monitoring/node-exporter-pxxkm, openshift-multus/multus-g5kn2, openshift-sdn/ovs-6bvfd, openshift-sdn/sdn-bhx4g
</snip>

Our nodes have been stuck in this state for over 8 hours now.
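For reference, a minimal sketch of how the upgrade was kicked off and how the hang shows up; the commands are illustrative, with the target version substituted as appropriate:

```
# Illustrative: start the upgrade and watch the machine-config rollout.
oc adm upgrade --to=4.5.0-rc.6

# All cluster operators go healthy, but the worker pool never finishes:
oc get clusterversion
oc get machineconfigpools                 # worker pool stuck with UPDATED=False
oc get nodes | grep SchedulingDisabled    # node(s) hung mid-drain
```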
I have been able to manually move things along by forcefully deleting the affected pods:

$ oc delete -n eta-karthika-demo --force --grace-period=0 pod/ruby-hello-world-1-build pod/ruby-hello-world-2-build
We've always behaved like this: we just retry over and over until a clean drain finishes. Why that build pod is still around, I don't know; a must-gather and system logs would shed some light. As far as the MCO goes, the expected result you posted in the first comment isn't actionable. Either we force-kill that pod, at the cost of losing data, or we don't, and keep retrying until the drain succeeds. If there's an issue with that pod specifically, again, it's not on the MCO; if a pod takes a long time to drain, we respect that, and we cannot decide to kill it. I want to echo that this has always been the behavior unless I'm missing something: we (the MCD) wait _forever_, retrying over and over, and we never kill anything to move on. Can we narrow down what we're really expecting here? What's the issue with the MCO? The wait-forever isn't a bug; it's by design, at least until we get smarter (which isn't the case today). Is this about that specific build pod? Is this about pods that take a long time to drain?
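To make the retry-forever semantics concrete, here is a minimal shell sketch of what the MCD is effectively doing during a drain. This is illustrative only, not the actual implementation (the real code drives the drain library in-process rather than shelling out to oc); the 90s per-attempt timeout matches the "global timeout reached: 1m30s" in the logs above:

```
# Illustrative sketch of the MCD drain loop; node name is the stuck node from this bug.
NODE=ip-10-0-133-180.us-west-1.compute.internal

until oc adm drain "$NODE" --ignore-daemonsets --delete-local-data --timeout=90s; do
    # Each failed attempt is logged ("Draining failed with: ..., retrying")
    # and the whole drain is retried. Nothing is ever force-killed.
    sleep 60
done
```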
Antonio,

Thanks for the information. I understand that this is the expected behavior and that this probably is not an MCO bug. However, I am concerned about why, all of a sudden, the build pods are preventing the nodes from draining. I had posted a must-gather in comment #1 above. I'm collecting the journal logs now and will post them shortly. Any help/guidance is greatly appreciated. Also, if you'd like access to the environment, I can provide it.

Thanks,
Brad
That's kind of the issue: all the "build" pods that appear to be impacted are in a "Terminating" state...

$ oc get pods -n <REDACTED>
NAME                         READY   STATUS        RESTARTS   AGE
django-ex-1-build            0/1     Terminating   0          4d8h
django-ex-f684456b-dx8tm     1/1     Running       0          4d5h
karthi-nodejs-1-build        0/1     Terminating   0          4d9h
mysql-1-deploy               0/1     Completed     0          4d8h
nodejs-ex-git-1-build        0/1     Completed     0          4d5h
nodejs-ex-git-new-1-29v5s    1/1     Running       0          3d9h
nodejs-ex-git-new-1-build    0/1     Completed     0          3d9h
nodejs-ex-git-new-1-deploy   0/1     Completed     0          3d9h
ruby-ex-1-build              0/1     Terminating   0          4d8h
ruby-hello-world-1-build     0/1     Terminating   0          4d8h
ruby-hello-world-2-build     0/1     Terminating   0          4d8h
(In reply to brad.williams from comment #9)
> That's kind of the issue: all the "build" pods that appear to be impacted
> are in a "Terminating" state...
>
> $ oc get pods -n <REDACTED>
> NAME                         READY   STATUS        RESTARTS   AGE
> django-ex-1-build            0/1     Terminating   0          4d8h
> django-ex-f684456b-dx8tm     1/1     Running       0          4d5h
> karthi-nodejs-1-build        0/1     Terminating   0          4d9h
> mysql-1-deploy               0/1     Completed     0          4d8h
> nodejs-ex-git-1-build        0/1     Completed     0          4d5h
> nodejs-ex-git-new-1-29v5s    1/1     Running       0          3d9h
> nodejs-ex-git-new-1-build    0/1     Completed     0          3d9h
> nodejs-ex-git-new-1-deploy   0/1     Completed     0          3d9h
> ruby-ex-1-build              0/1     Terminating   0          4d8h
> ruby-hello-world-1-build     0/1     Terminating   0          4d8h
> ruby-hello-world-2-build     0/1     Terminating   0          4d8h

Right, that's what I suspected. Can you oc describe them? Maybe that'll give a clue before diving into the crio/runc world.
Moving to the Containers component to check further why we can't even access logs from those pods/containers; somehow they're stuck in Terminating (the MCO is retrying over and over, as it should).
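For whoever picks this up on the Containers side, a hedged sketch of the usual node-level checks; the pod names are examples taken from this bug:

```
# On the affected node (oc debug node/<node>, then chroot /host):
crictl pods | grep ruby-hello-world                  # is the pod sandbox still around?
crictl ps -a | grep ruby-hello-world                 # container states as CRI-O sees them
journalctl -u crio -u kubelet --since "1 hour ago"   # failed kill/unmount errors, etc.
```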
Here is the relevant code from GarbageCollector#attemptToDeleteWorker (elided; on failure the error is logged as "error syncing item ...", which matches the entries below, and the item is requeued so the GC keeps retrying the delete):

```
err := gc.attemptToDeleteItem(n)
if err != nil {
	// logs "error syncing item ..." and requeues the item for retry
	...
}
```

I found similar errors for two of the stuck Pods in ./quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-db464f9453800beb0a9744e4d4320e77e1e0ac6da6399f0081ee2e555171c83a/namespaces/openshift-kube-controller-manager/pods/kube-controller-manager-ip-10-0-145-145.us-west-1.compute.internal/kube-controller-manager/kube-controller-manager/logs/current.log:

```
2020-07-04T02:23:53.430453488Z E0704 02:23:53.430413 1 garbagecollector.go:314] error syncing item &garbagecollector.node{identity:garbagecollector.objectReference{OwnerReference:v1.OwnerReference{APIVersion:"v1", Kind:"Pod", Name:"django-ex-1-build", UID:"f870ac4d-205d-4030-adaa-fa1e6c7fa0d9", Controller:(*bool)(nil), BlockOwnerDeletion:(*bool)(nil)}, Namespace:"eta-karthika-demo"}, dependentsLock:sync.RWMutex{w:sync.Mutex{state:0, sema:0x0}, writerSem:0x0, readerSem:0x0, readerCount:1, readerWait:0}, dependents:map[*garbagecollector.node]struct {}{}, deletingDependents:true, deletingDependentsLock:sync.RWMutex{w:sync.Mutex{state:0, sema:0x0}, writerSem:0x0, readerSem:0x0, readerCount:0, readerWait:0}, beingDeleted:true, beingDeletedLock:sync.RWMutex{w:sync.Mutex{state:0, sema:0x0}, writerSem:0x0, readerSem:0x0, readerCount:0, readerWait:0}, virtual:false, virtualLock:sync.RWMutex{w:sync.Mutex{state:0, sema:0x0}, writerSem:0x0, readerSem:0x0, readerCount:0, readerWait:0}, owners:[]v1.OwnerReference{v1.OwnerReference{APIVersion:"build.openshift.io/v1", Kind:"Build", Name:"django-ex-1", UID:"799a6aba-9aed-485c-8a25-e7a806392b8c", Controller:(*bool)(0xc005d43ddb), BlockOwnerDeletion:(*bool)(nil)}}}: pods "django-ex-1-build" is forbidden: unable to validate against any security context constraint: [spec.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used spec.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used spec.containers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed]

2020-07-04T02:23:53.398886469Z E0704 02:23:53.398847 1 garbagecollector.go:314] error syncing item &garbagecollector.node{identity:garbagecollector.objectReference{OwnerReference:v1.OwnerReference{APIVersion:"v1", Kind:"Pod", Name:"akhilpvghi-devapp-1-build", UID:"b7a8a3cb-aa41-4548-950e-03ff3f095f32", Controller:(*bool)(nil), BlockOwnerDeletion:(*bool)(nil)}, Namespace:"bluff-backend"}, dependentsLock:sync.RWMutex{w:sync.Mutex{state:0, sema:0x0}, writerSem:0x0, readerSem:0x0, readerCount:1, readerWait:0}, dependents:map[*garbagecollector.node]struct {}{}, deletingDependents:true, deletingDependentsLock:sync.RWMutex{w:sync.Mutex{state:0, sema:0x0}, writerSem:0x0, readerSem:0x0, readerCount:0, readerWait:0}, beingDeleted:true, beingDeletedLock:sync.RWMutex{w:sync.Mutex{state:0, sema:0x0}, writerSem:0x0, readerSem:0x0, readerCount:0, readerWait:0}, virtual:false, virtualLock:sync.RWMutex{w:sync.Mutex{state:0, sema:0x0}, writerSem:0x0, readerSem:0x0, readerCount:0, readerWait:0}, owners:[]v1.OwnerReference{v1.OwnerReference{APIVersion:"build.openshift.io/v1", Kind:"Build", Name:"akhilpvghi-devapp-1", UID:"2dfde915-ef3e-47f7-8c98-f61e2067a77d", Controller:(*bool)(0xc00ae8e55d), BlockOwnerDeletion:(*bool)(nil)}}}: pods "akhilpvghi-devapp-1-build" is forbidden: unable to validate against any security context constraint: [spec.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used spec.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used spec.containers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed]
```
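Worth noting: deletingDependents:true and beingDeleted:true in those dumps suggest the pods were deleted with foreground cascading deletion, which keeps a foregroundDeletion finalizer on the pod until the GC finishes processing its dependents. A quick hedged check (pod/namespace taken from this bug):

```
# A pod stuck behind the GC should show a deletionTimestamp plus the
# foregroundDeletion finalizer that the garbage collector must clear.
oc get pod django-ex-1-build -n eta-karthika-demo \
  -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.finalizers}{"\n"}{.metadata.ownerReferences}{"\n"}'
```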
> Invalid value: true: Privileged containers are not allowed

This looks as if a pod spec that was valid prior to the upgrade is now failing SCC validation after the cluster upgrade, as if some security policy was enabled or tightened during the upgrade.
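One hedged way to dig into that: see which SCC admitted the pod originally, and whether its saved spec still validates now (pod name is an example from this bug; the SCC chosen at admission time is recorded in the openshift.io/scc annotation):

```
# Which SCC admitted the pod when it was created?
oc get pod django-ex-1-build -n eta-karthika-demo \
  -o jsonpath='{.metadata.annotations.openshift\.io/scc}{"\n"}'

# Re-check the saved spec against the SCCs its service account can use:
oc get pod django-ex-1-build -n eta-karthika-demo -o yaml > pod.yaml
oc adm policy scc-review -f pod.yaml
```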
```
2020-07-04T02:23:53.430453488Z E0704 02:23:53.430413 1 garbagecollector.go:314] error syncing item ***** unable to validate against any security context constraint: [spec.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used spec.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used spec.containers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed]
```

This is fixed by https://github.com/kubernetes/kubernetes/pull/91648 in Kubernetes 1.18.5.
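To confirm a given cluster already carries the fix, checking the reported Kubernetes level should suffice (1.18.5 or later):

```
oc version              # reports the server's Kubernetes version
oc get nodes -o wide    # VERSION column shows each node's kubelet level
```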
Could not reproduce when upgrading from 4.5.0-0.nightly-2020-07-17-221127 to 4.6.0-0.nightly-2020-07-17-230106; marking as verified!
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196