Bug 1755314 - [4.1.z]machine-config clusteroperator degraded for a long time during upgrade
Summary: [4.1.z]machine-config clusteroperator degraded for a long time during upgrade
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.1.z
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: ---
Assignee: Antonio Murdaca
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-09-25 08:53 UTC by Junqi Zhao
Modified: 2019-09-25 13:07 UTC (History)
CC List: 0 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-09-25 13:07:53 UTC
Target Upstream Version:
Embargoed:


Attachments
machine-config info (961.36 KB, application/gzip)
2019-09-25 11:51 UTC, Junqi Zhao

Comment 3 Junqi Zhao 2019-09-25 11:51:01 UTC
Created attachment 1618993 [details]
machine-config info

Comment 6 Antonio Murdaca 2019-09-25 13:07:53 UTC
The rollout to the masters taking some time is expected; it was probably because there was some load on the cluster, and it did eventually reconcile.
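
For anyone watching a slow rollout like this, a quick way to check progress on the master pool (a rough sketch, assuming cluster-admin access; not taken from this cluster's output) is:

```
$ oc get machineconfigpool master
$ oc get clusteroperator machine-config
```

Once the rollout reconciles, the pool should report UPDATED=True and the clusteroperator should no longer show Progressing or Degraded.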

The workers are instead hanging because of a failure to fully terminate/remove/drain this pod:

```
$ kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=ip-10-0-66-192.us-east-2.compute.internal
NAMESPACE                                NAME                             READY   STATUS        RESTARTS   AGE     IP            NODE                                        NOMINATED NODE   READINESS GATES
default                                  orca-operator-5bd67bc7f5-xbmst   1/1     Terminating   0          10h     10.128.2.54   ip-10-0-66-192.us-east-2.compute.internal   <none>           <none>
...

```
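
As a possible workaround (a sketch, untested on this cluster), a pod stuck in Terminating can usually be cleared with a forced delete so the drain can proceed; the namespace and pod name below come from the output above:

```
$ kubectl delete pod orca-operator-5bd67bc7f5-xbmst -n default --grace-period=0 --force
```

Note this only unblocks the drain; it does not address whatever keeps the orca-operator pod from terminating cleanly.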

The drain in the MCD for that node is just waiting for the eviction to finish, but it never does:

```
...
I0925 12:59:34.904008   72033 update.go:848] Update prepared; beginning drain
I0925 12:59:35.132759   72033 update.go:93] ignoring DaemonSet-managed pods: hello-daemonset-nbgkv, tuned-p77s5, dns-default-bn6jz, node-ca-wkfgd, fluentd-rz8l8, machine-config-daemon-wvdmv, node-exporter-zslwc, multus-sxx82, ovs-55j4m, sdn-7r9d7, hello-daemonset-8c68p; deleting pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet: olm-operators-8ldjt
I0925 12:59:44.165447   72033 update.go:89] pod "olm-operators-8ldjt" removed (evicted)
```
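
For reference, the drain the MCD runs here behaves roughly like a manual drain of the node; an approximate equivalent (a sketch using the node name from above and 4.1-era oc flags, not something run on this cluster) would be:

```
$ oc adm drain ip-10-0-66-192.us-east-2.compute.internal --ignore-daemonsets --delete-local-data --force
```

Like the MCD, this skips DaemonSet-managed pods and evicts everything else, and it will hang the same way on the stuck Terminating pod.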

This isn't an MCO issue. Please get in touch with the orca operator's owners and file a bug against them; it should be pretty easy to reproduce as well.

