Bug 1855444 - Post-upgrade the MCO failing to evict "build" pods during drain
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: 4.5.0
Assignee: Ryan Phillips
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Duplicates: 1855477
Depends On: 1853786
Blocks: 1855478
Reported: 2020-07-09 20:58 UTC by W. Trevor King
Modified: 2020-07-13 17:45 UTC (History)
CC List: 16 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1853786
Environment:
Last Closed: 2020-07-13 17:44:51 UTC
Target Upstream Version:


Links
System | ID | Priority | Status | Summary | Last Updated
Github | openshift origin pull 25263 | None | closed | Bug 1855444: UPSTREAM: 91648: Changes to ManagedFields is not mutation for GC | 2020-08-31 13:43:56 UTC
Red Hat Product Errata | RHBA-2020:2409 | None | None | None | 2020-07-13 17:45:10 UTC

Description W. Trevor King 2020-07-09 20:58:02 UTC
+++ This bug was initially created as a clone of Bug #1853786 +++

Description of problem:
During the post-upgrade rollout of the machine-config, the MCO gets stuck trying to drain nodes that have "build" pods running on them. The nodes appear to be stuck in a loop, repeatedly trying to evict those pods.

Version-Release number of selected component (if applicable):
$ oc get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-rc.6   True        False         8h      Cluster version is 4.5.0-rc.6

How reproducible:
This has happened on both production clusters, 2 upgrades in a row (4.5.0-rc.5 and 4.5.0-rc.6).

Steps to Reproduce:
1. We have fairly large clusters (53 nodes) with many users, who run build pods on them continuously. Begin an upgrade to 4.5.0-rc.5 (or 4.5.0-rc.6).
2. The upgrade goes all the way through successfully. The clusterversion reports complete and all operators report healthy, but the machine-config rollout is hung.

Actual results:
The machine-config rollout has been hung for hours.

Expected results:
The machine-config upgrade should roll through all nodes successfully.

Additional info:
$ oc get nodes
NAME                                         STATUS                     ROLES    AGE    VERSION
ip-10-0-128-103.us-west-1.compute.internal   Ready                      worker   309d   v1.18.3+6025c28
ip-10-0-128-106.us-west-1.compute.internal   Ready                      master   309d   v1.18.3+6025c28
ip-10-0-133-180.us-west-1.compute.internal   Ready,SchedulingDisabled   worker   309d   v1.18.3+f291db1
ip-10-0-138-37.us-west-1.compute.internal    Ready                      worker   309d   v1.18.3+f291db1
ip-10-0-138-69.us-west-1.compute.internal    Ready                      master   309d   v1.18.3+6025c28
ip-10-0-139-193.us-west-1.compute.internal   Ready,SchedulingDisabled   worker   309d   v1.18.3+f291db1
ip-10-0-140-219.us-west-1.compute.internal   Ready                      worker   309d   v1.18.3+6025c28
ip-10-0-140-241.us-west-1.compute.internal   Ready                      worker   309d   v1.18.3+f291db1
...
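
The cordoned nodes in a listing like the one above can be picked out mechanically rather than by eye. The following is a minimal sketch, assuming the default `oc get nodes` column layout (node name first, STATUS second); the function name `cordoned_nodes` is hypothetical and not part of `oc`.

```shell
# Hypothetical helper: print the names of nodes whose STATUS column
# includes SchedulingDisabled. Reads `oc get nodes` output on stdin,
# so it also works on saved output. The header row is skipped naturally
# because its second column is the literal word STATUS.
cordoned_nodes() {
  awk '$2 ~ /SchedulingDisabled/ { print $1 }'
}

# Usage against a live cluster (assumes a logged-in oc session):
#   oc get nodes | cordoned_nodes
```

Reading from stdin keeps the helper independent of cluster access, which makes it easy to run against output pasted into a bug report.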

$ oc get pods -o wide | grep ip-10-0-133-180.us-west-1.compute.internal
machine-config-daemon-kcv4c                  2/2     Running   0          8h    10.0.133.180   ip-10-0-133-180.us-west-1.compute.internal   <none>           <none>

$ oc logs machine-config-daemon-kcv4c -c machine-config-daemon 
<snip>
I0704 03:13:12.894123 2719952 daemon.go:321] evicting pod eta-karthika-demo/django-ex-1-build
I0704 03:13:12.894138 2719952 daemon.go:321] evicting pod eta-karthika-demo/ruby-hello-world-2-build
I0704 03:13:12.894406 2719952 daemon.go:321] evicting pod openshiftdemosree/ruby-hello-world-1-build
I0704 03:14:42.918021 2719952 update.go:173] Draining failed with: [error when waiting for pod "ruby-hello-world-2-build" terminating: global timeout reached: 1m30s, error when waiting for pod "django-ex-1-build" terminating: global timeout reached: 1m30s, error when waiting for pod "ruby-hello-world-1-build" terminating: global timeout reached: 1m30s], retrying
E0704 03:15:03.066532 2719952 daemon.go:321] WARNING: ignoring DaemonSet-managed Pods: openshift-cluster-node-tuning-operator/tuned-mpvdk, openshift-dns/dns-default-pmd46, openshift-image-registry/node-ca-ntctx, openshift-machine-config-operator/machine-config-daemon-kcv4c, openshift-monitoring/crel-monitors-clamav-container-scanner-wfrcb, openshift-monitoring/crel-monitors-sdn-health-dpw6b, openshift-monitoring/node-exporter-pxxkm, openshift-multus/multus-g5kn2, openshift-sdn/ovs-6bvfd, openshift-sdn/sdn-bhx4g
</snip>

Our nodes have been stuck in this state for over 8 hours now.
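
To see which pods are blocking the drain, the pod names can be pulled out of the "Draining failed with" lines in the machine-config-daemon log. This is a minimal sketch that assumes the log message format shown above; the function name `stuck_pods` is hypothetical. Note that these messages carry only the pod name, so the namespace still has to be read from the matching "evicting pod" lines.

```shell
# Hypothetical helper: list the unique pod names reported in
# "Draining failed with" log lines. Reads the daemon log on stdin.
stuck_pods() {
  grep 'Draining failed with' \
    | grep -o 'waiting for pod "[^"]*"' \
    | sed 's/^waiting for pod "//; s/"$//' \
    | sort -u
}

# Usage (pod and container names taken from the report above):
#   oc logs machine-config-daemon-kcv4c -c machine-config-daemon | stuck_pods
```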

--- Additional comment from  on 2020-07-04 03:35:31 UTC ---

I have been able to move things along manually by force-deleting the stuck pods:
$ oc delete -n eta-karthika-demo --force --grace-period=0 pod/ruby-hello-world-1-build pod/ruby-hello-world-2-build
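
With many stuck build pods, the delete commands can be generated from the daemon log instead of typed by hand. The sketch below assumes the "evicting pod <namespace>/<name>" log format shown earlier and that build pod names end in "-build", as they do in this report; `force_delete_cmds` is a hypothetical name. It only prints the commands so they can be reviewed first, since a forced delete skips graceful termination.

```shell
# Hypothetical helper: turn "evicting pod <namespace>/<name>" log lines
# into force-delete commands for pods whose name ends in -build.
# Prints the commands; it does not run them.
force_delete_cmds() {
  awk '/evicting pod / && $NF ~ /-build$/ { print $NF }' \
    | sort -u \
    | awk -F/ '{ printf "oc delete -n %s --force --grace-period=0 pod/%s\n", $1, $2 }'
}

# Usage (pod and container names taken from the report above):
#   oc logs machine-config-daemon-kcv4c -c machine-config-daemon | force_delete_cmds
```

Taking the namespace from the log line keeps each generated command pointed at the namespace the pod actually lives in.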

--- Additional comment from David Eads on 2020-07-09 16:57:44 UTC ---

2020-07-04T02:23:53.430453488Z E0704 02:23:53.430413       1 garbagecollector.go:314] error syncing item ***** unable to validate against any security context constraint: [spec.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used spec.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used spec.containers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed]

This garbage collector error is fixed by https://github.com/kubernetes/kubernetes/pull/91648 in 1.18.5.

Comment 3 W. Trevor King 2020-07-09 21:05:26 UTC
Eric fixed the bot.  Resetting the target to 4.5.0.

Comment 6 Scott Dodson 2020-07-10 12:07:09 UTC
*** Bug 1855477 has been marked as a duplicate of this bug. ***

Comment 8 errata-xmlrpc 2020-07-13 17:44:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

