Bug 1755314 - [4.1.z]machine-config clusteroperator degraded for a long time during upgrade
Summary: [4.1.z]machine-config clusteroperator degraded for a long time during upgrade
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.1.z
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: ---
Assignee: Antonio Murdaca
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-09-25 08:53 UTC by Junqi Zhao
Modified: 2019-09-25 13:07 UTC (History)
CC List: 0 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-09-25 13:07:53 UTC
Target Upstream Version:
Embargoed:


Attachments
machine-config info (961.36 KB, application/gzip)
2019-09-25 11:51 UTC, Junqi Zhao

Comment 3 Junqi Zhao 2019-09-25 11:51:01 UTC
Created attachment 1618993 [details]
machine-config info

Comment 6 Antonio Murdaca 2019-09-25 13:07:53 UTC
The rollout to the masters taking some time is expected; it was probably because there was some load on the cluster, and it did eventually reconcile.
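
For anyone watching a slow rollout like this, a quick way to check progress on the master pool (a rough sketch, assuming cluster-admin access; not taken from this cluster's output) is:

```
$ oc get machineconfigpool master
$ oc get clusteroperator machine-config
```

Once the rollout reconciles, the pool should report UPDATED=True and the clusteroperator should no longer show Progressing or Degraded.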

The workers are instead hanging because of a failure to fully terminate/remove/drain this pod:

```
$ kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=ip-10-0-66-192.us-east-2.compute.internal
NAMESPACE                                NAME                             READY   STATUS        RESTARTS   AGE     IP            NODE                                        NOMINATED NODE   READINESS GATES
default                                  orca-operator-5bd67bc7f5-xbmst   1/1     Terminating   0          10h     10.128.2.54   ip-10-0-66-192.us-east-2.compute.internal   <none>           <none>
...

```
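
As a possible workaround (a sketch, untested on this cluster), a pod stuck in Terminating can usually be cleared with a forced delete so the drain can proceed; the namespace and pod name below come from the output above:

```
$ kubectl delete pod orca-operator-5bd67bc7f5-xbmst -n default --grace-period=0 --force
```

Note this only unblocks the drain; it does not address whatever keeps the orca-operator pod from terminating cleanly.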

The drain in the MCD for that node is just waiting for the eviction to finish, but it never does:

```
...
I0925 12:59:34.904008   72033 update.go:848] Update prepared; beginning drain
I0925 12:59:35.132759   72033 update.go:93] ignoring DaemonSet-managed pods: hello-daemonset-nbgkv, tuned-p77s5, dns-default-bn6jz, node-ca-wkfgd, fluentd-rz8l8, machine-config-daemon-wvdmv, node-exporter-zslwc, multus-sxx82, ovs-55j4m, sdn-7r9d7, hello-daemonset-8c68p; deleting pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet: olm-operators-8ldjt
I0925 12:59:44.165447   72033 update.go:89] pod "olm-operators-8ldjt" removed (evicted)
```
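
For reference, the drain the MCD runs here behaves roughly like a manual drain of the node; an approximate equivalent (a sketch using the node name from above and 4.1-era oc flags, not something run on this cluster) would be:

```
$ oc adm drain ip-10-0-66-192.us-east-2.compute.internal --ignore-daemonsets --delete-local-data --force
```

Like the MCD, this skips DaemonSet-managed pods and evicts everything else, and it will hang the same way on the stuck Terminating pod.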

This isn't an MCO issue. Please get in touch with the orca operator's owners and file a bug against them; it should be pretty easy to reproduce as well.

