Description of problem:
During the rollout of the machine-config, post-upgrade, the MCO is getting stuck trying to drain nodes that have "build" pods running on them. The nodes appear to be stuck in a loop trying to evict the pods.

Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-rc.6   True        False         8h      Cluster version is 4.5.0-rc.6

How reproducible:
This has happened on both production clusters, two upgrades in a row (4.5.0-rc.5 and 4.5.0-rc.6).

Steps to Reproduce:
1. We have fairly large clusters (53 nodes) with lots of users on them, and folks are running build pods on them continuously. Begin an upgrade to 4.5.0-rc.5 (or 4.5.0-rc.6).
2. The upgrade goes all the way through successfully. The clusterversion reports complete and all operators report healthy, but the machine-config rollout is hung.

Actual results:
The machine-config rollout has been hung for hours.

Expected results:
The machine-config upgrade should roll through all nodes successfully.

Additional info:
$ oc get nodes
NAME                                         STATUS                     ROLES    AGE    VERSION
ip-10-0-128-103.us-west-1.compute.internal   Ready                      worker   309d   v1.18.3+6025c28
ip-10-0-128-106.us-west-1.compute.internal   Ready                      master   309d   v1.18.3+6025c28
ip-10-0-133-180.us-west-1.compute.internal   Ready,SchedulingDisabled   worker   309d   v1.18.3+f291db1
ip-10-0-138-37.us-west-1.compute.internal    Ready                      worker   309d   v1.18.3+f291db1
ip-10-0-138-69.us-west-1.compute.internal    Ready                      master   309d   v1.18.3+6025c28
ip-10-0-139-193.us-west-1.compute.internal   Ready,SchedulingDisabled   worker   309d   v1.18.3+f291db1
ip-10-0-140-219.us-west-1.compute.internal   Ready                      worker   309d   v1.18.3+6025c28
ip-10-0-140-241.us-west-1.compute.internal   Ready                      worker   309d   v1.18.3+f291db1
...

$ oc get pods -o wide | grep ip-10-0-133-180.us-west-1.compute.internal
machine-config-daemon-kcv4c   2/2   Running   0   8h   10.0.133.180   ip-10-0-133-180.us-west-1.compute.internal   <none>   <none>

$ oc logs machine-config-daemon-kcv4c -c machine-config-daemon
<snip>
I0704 03:13:12.894123 2719952 daemon.go:321] evicting pod eta-karthika-demo/django-ex-1-build
I0704 03:13:12.894138 2719952 daemon.go:321] evicting pod eta-karthika-demo/ruby-hello-world-2-build
I0704 03:13:12.894406 2719952 daemon.go:321] evicting pod openshiftdemosree/ruby-hello-world-1-build
I0704 03:14:42.918021 2719952 update.go:173] Draining failed with: [error when waiting for pod "ruby-hello-world-2-build" terminating: global timeout reached: 1m30s, error when waiting for pod "django-ex-1-build" terminating: global timeout reached: 1m30s, error when waiting for pod "ruby-hello-world-1-build" terminating: global timeout reached: 1m30s], retrying
E0704 03:15:03.066532 2719952 daemon.go:321] WARNING: ignoring DaemonSet-managed Pods: openshift-cluster-node-tuning-operator/tuned-mpvdk, openshift-dns/dns-default-pmd46, openshift-image-registry/node-ca-ntctx, openshift-machine-config-operator/machine-config-daemon-kcv4c, openshift-monitoring/crel-monitors-clamav-container-scanner-wfrcb, openshift-monitoring/crel-monitors-sdn-health-dpw6b, openshift-monitoring/node-exporter-pxxkm, openshift-multus/multus-g5kn2, openshift-sdn/ovs-6bvfd, openshift-sdn/sdn-bhx4g
</snip>

Our nodes have been stuck in this state for over 8 hours now.
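For reference, a minimal sketch of how the upgrade was kicked off and how the hang shows up; the commands are illustrative, with the target version substituted as appropriate:

```
# Illustrative: start the upgrade and watch the machine-config rollout.
oc adm upgrade --to=4.5.0-rc.6

# All cluster operators go healthy, but the worker pool never finishes:
oc get clusterversion
oc get machineconfigpools                 # worker pool stuck with UPDATED=False
oc get nodes | grep SchedulingDisabled    # node(s) hung mid-drain
```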
I have been able to manually move things along by forcefully deleting the affected pods:

$ oc delete -n eta-karthika-demo --force --grace-period=0 pod/ruby-hello-world-1-build pod/ruby-hello-world-2-build
We've always behaved like this: we just retry over and over until a clean drain finishes. Why that build pod is still around, I don't know; a must-gather and system logs would shed some light. As far as the MCO goes, the expected result you posted in the first comment isn't actionable. Either we force-kill that pod, at the cost of losing data, or we don't, and keep retrying until the drain succeeds. If there's an issue with that pod specifically, again, it's not on the MCO; if a pod takes a long time to drain, we respect that, and we cannot decide to kill it. I want to echo that this has always been the behavior unless I'm missing something: we (the MCD) wait _forever_, retrying over and over, and we never kill anything to move on. Can we narrow down what we're really expecting here? What's the issue with the MCO? The wait-forever isn't a bug; it's by design, at least until we get smarter (which isn't the case today). Is this about that specific build pod? Is this about pods that take a long time to drain?
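To make the retry-forever semantics concrete, here is a minimal shell sketch of what the MCD is effectively doing during a drain. This is illustrative only, not the actual implementation (the real code drives the drain library in-process rather than shelling out to oc); the 90s per-attempt timeout matches the "global timeout reached: 1m30s" in the logs above:

```
# Illustrative sketch of the MCD drain loop; node name is the stuck node from this bug.
NODE=ip-10-0-133-180.us-west-1.compute.internal

until oc adm drain "$NODE" --ignore-daemonsets --delete-local-data --timeout=90s; do
    # Each failed attempt is logged ("Draining failed with: ..., retrying")
    # and the whole drain is retried. Nothing is ever force-killed.
    sleep 60
done
```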
Antonio,

Thanks for the information. I understand that this is the expected behavior and that this probably is not an MCO bug. However, I am concerned about why, all of a sudden, the build pods are preventing the nodes from draining. I had posted a must-gather in comment #1 above. I'm collecting the journal logs now and will post them shortly. Any help/guidance is greatly appreciated. Also, if you'd like access to the environment, I can provide it.

Thanks,
Brad
That's kind of the issue: all the "build" pods that appear to be impacted are in a "Terminating" state...

$ oc get pods -n <REDACTED>
NAME                         READY   STATUS        RESTARTS   AGE
django-ex-1-build            0/1     Terminating   0          4d8h
django-ex-f684456b-dx8tm     1/1     Running       0          4d5h
karthi-nodejs-1-build        0/1     Terminating   0          4d9h
mysql-1-deploy               0/1     Completed     0          4d8h
nodejs-ex-git-1-build        0/1     Completed     0          4d5h
nodejs-ex-git-new-1-29v5s    1/1     Running       0          3d9h
nodejs-ex-git-new-1-build    0/1     Completed     0          3d9h
nodejs-ex-git-new-1-deploy   0/1     Completed     0          3d9h
ruby-ex-1-build              0/1     Terminating   0          4d8h
ruby-hello-world-1-build     0/1     Terminating   0          4d8h
ruby-hello-world-2-build     0/1     Terminating   0          4d8h
(In reply to brad.williams from comment #9)
> That's kind of the issue: all the "build" pods that appear to be impacted
> are in a "Terminating" state...
>
> $ oc get pods -n <REDACTED>
> NAME                         READY   STATUS        RESTARTS   AGE
> django-ex-1-build            0/1     Terminating   0          4d8h
> django-ex-f684456b-dx8tm     1/1     Running       0          4d5h
> karthi-nodejs-1-build        0/1     Terminating   0          4d9h
> mysql-1-deploy               0/1     Completed     0          4d8h
> nodejs-ex-git-1-build        0/1     Completed     0          4d5h
> nodejs-ex-git-new-1-29v5s    1/1     Running       0          3d9h
> nodejs-ex-git-new-1-build    0/1     Completed     0          3d9h
> nodejs-ex-git-new-1-deploy   0/1     Completed     0          3d9h
> ruby-ex-1-build              0/1     Terminating   0          4d8h
> ruby-hello-world-1-build     0/1     Terminating   0          4d8h
> ruby-hello-world-2-build     0/1     Terminating   0          4d8h

Right, that's what I suspected. Can you oc describe them? Maybe that'll give a clue before diving into the crio/runc world.
Moving to the Containers component to check further why we can't even access logs from those pods/containers; somehow they're stuck in Terminating (the MCO is retrying over and over, as it should).
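For whoever picks this up on the Containers side, a hedged sketch of the usual node-level checks; the pod names are examples taken from this bug:

```
# On the affected node (oc debug node/<node>, then chroot /host):
crictl pods | grep ruby-hello-world                  # is the pod sandbox still around?
crictl ps -a | grep ruby-hello-world                 # container states as CRI-O sees them
journalctl -u crio -u kubelet --since "1 hour ago"   # failed kill/unmount errors, etc.
```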
Here is the relevant code from GarbageCollector#attemptToDeleteWorker (elided; on failure the error is logged as "error syncing item ...", which matches the entries below, and the item is requeued so the GC keeps retrying the delete):

```
err := gc.attemptToDeleteItem(n)
if err != nil {
	// logs "error syncing item ..." and requeues the item for retry
	...
}
```

I found similar errors for two of the stuck Pods in ./quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-db464f9453800beb0a9744e4d4320e77e1e0ac6da6399f0081ee2e555171c83a/namespaces/openshift-kube-controller-manager/pods/kube-controller-manager-ip-10-0-145-145.us-west-1.compute.internal/kube-controller-manager/kube-controller-manager/logs/current.log:

```
2020-07-04T02:23:53.430453488Z E0704 02:23:53.430413 1 garbagecollector.go:314] error syncing item &garbagecollector.node{identity:garbagecollector.objectReference{OwnerReference:v1.OwnerReference{APIVersion:"v1", Kind:"Pod", Name:"django-ex-1-build", UID:"f870ac4d-205d-4030-adaa-fa1e6c7fa0d9", Controller:(*bool)(nil), BlockOwnerDeletion:(*bool)(nil)}, Namespace:"eta-karthika-demo"}, dependentsLock:sync.RWMutex{w:sync.Mutex{state:0, sema:0x0}, writerSem:0x0, readerSem:0x0, readerCount:1, readerWait:0}, dependents:map[*garbagecollector.node]struct {}{}, deletingDependents:true, deletingDependentsLock:sync.RWMutex{w:sync.Mutex{state:0, sema:0x0}, writerSem:0x0, readerSem:0x0, readerCount:0, readerWait:0}, beingDeleted:true, beingDeletedLock:sync.RWMutex{w:sync.Mutex{state:0, sema:0x0}, writerSem:0x0, readerSem:0x0, readerCount:0, readerWait:0}, virtual:false, virtualLock:sync.RWMutex{w:sync.Mutex{state:0, sema:0x0}, writerSem:0x0, readerSem:0x0, readerCount:0, readerWait:0}, owners:[]v1.OwnerReference{v1.OwnerReference{APIVersion:"build.openshift.io/v1", Kind:"Build", Name:"django-ex-1", UID:"799a6aba-9aed-485c-8a25-e7a806392b8c", Controller:(*bool)(0xc005d43ddb), BlockOwnerDeletion:(*bool)(nil)}}}: pods "django-ex-1-build" is forbidden: unable to validate against any security context constraint: [spec.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used spec.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used spec.containers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed]

2020-07-04T02:23:53.398886469Z E0704 02:23:53.398847 1 garbagecollector.go:314] error syncing item &garbagecollector.node{identity:garbagecollector.objectReference{OwnerReference:v1.OwnerReference{APIVersion:"v1", Kind:"Pod", Name:"akhilpvghi-devapp-1-build", UID:"b7a8a3cb-aa41-4548-950e-03ff3f095f32", Controller:(*bool)(nil), BlockOwnerDeletion:(*bool)(nil)}, Namespace:"bluff-backend"}, dependentsLock:sync.RWMutex{w:sync.Mutex{state:0, sema:0x0}, writerSem:0x0, readerSem:0x0, readerCount:1, readerWait:0}, dependents:map[*garbagecollector.node]struct {}{}, deletingDependents:true, deletingDependentsLock:sync.RWMutex{w:sync.Mutex{state:0, sema:0x0}, writerSem:0x0, readerSem:0x0, readerCount:0, readerWait:0}, beingDeleted:true, beingDeletedLock:sync.RWMutex{w:sync.Mutex{state:0, sema:0x0}, writerSem:0x0, readerSem:0x0, readerCount:0, readerWait:0}, virtual:false, virtualLock:sync.RWMutex{w:sync.Mutex{state:0, sema:0x0}, writerSem:0x0, readerSem:0x0, readerCount:0, readerWait:0}, owners:[]v1.OwnerReference{v1.OwnerReference{APIVersion:"build.openshift.io/v1", Kind:"Build", Name:"akhilpvghi-devapp-1", UID:"2dfde915-ef3e-47f7-8c98-f61e2067a77d", Controller:(*bool)(0xc00ae8e55d), BlockOwnerDeletion:(*bool)(nil)}}}: pods "akhilpvghi-devapp-1-build" is forbidden: unable to validate against any security context constraint: [spec.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used spec.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used spec.containers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed]
```
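Worth noting: deletingDependents:true and beingDeleted:true in those dumps suggest the pods were deleted with foreground cascading deletion, which keeps a foregroundDeletion finalizer on the pod until the GC finishes processing its dependents. A quick hedged check (pod/namespace taken from this bug):

```
# A pod stuck behind the GC should show a deletionTimestamp plus the
# foregroundDeletion finalizer that the garbage collector must clear.
oc get pod django-ex-1-build -n eta-karthika-demo \
  -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.finalizers}{"\n"}{.metadata.ownerReferences}{"\n"}'
```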
> Invalid value: true: Privileged containers are not allowed

This looks as if a pod spec that was valid prior to the upgrade is now failing SCC validation after the cluster upgrade, as if some security policy was enabled or tightened during the upgrade.
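One hedged way to dig into that: see which SCC admitted the pod originally, and whether its saved spec still validates now (pod name is an example from this bug; the SCC chosen at admission time is recorded in the openshift.io/scc annotation):

```
# Which SCC admitted the pod when it was created?
oc get pod django-ex-1-build -n eta-karthika-demo \
  -o jsonpath='{.metadata.annotations.openshift\.io/scc}{"\n"}'

# Re-check the saved spec against the SCCs its service account can use:
oc get pod django-ex-1-build -n eta-karthika-demo -o yaml > pod.yaml
oc adm policy scc-review -f pod.yaml
```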
```
2020-07-04T02:23:53.430453488Z E0704 02:23:53.430413 1 garbagecollector.go:314] error syncing item ***** unable to validate against any security context constraint: [spec.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used spec.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used spec.containers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed]
```

This is fixed by https://github.com/kubernetes/kubernetes/pull/91648 in Kubernetes 1.18.5.
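To confirm a given cluster already carries the fix, checking the reported Kubernetes level should suffice (1.18.5 or later):

```
oc version              # reports the server's Kubernetes version
oc get nodes -o wide    # VERSION column shows each node's kubelet level
```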
Could not reproduce when upgrading from 4.5.0-0.nightly-2020-07-17-221127 to 4.6.0-0.nightly-2020-07-17-230106; marking as verified!
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196