Bug 1885921
| Summary: | [BM][IPI] MCP degraded: machineconfiguration.openshift.io/desiredConfig annotation not found | | |
| --- | --- | --- | --- |
| Product: | OpenShift Container Platform | Reporter: | Yurii Prokulevych <yprokule> |
| Component: | Cloud Compute | Assignee: | Beth White <beth.white> |
| Cloud Compute sub component: | BareMetal Provider | QA Contact: | Amit Ugol <augol> |
| Status: | CLOSED DUPLICATE | Docs Contact: | |
| Severity: | unspecified | | |
| Priority: | unspecified | CC: | amurdaca, jerzhang, mcornea, stbenjam, zbitter |
| Version: | 4.6 | | |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-10-08 18:15:03 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Yurii Prokulevych 2020-10-07 09:47:59 UTC
So there are a couple of weird things going on in your system. From your node objects:

    worker 3 (normal):
      creationTimestamp: "2020-10-02T13:17:48Z"
    worker 4 (weird):
      creationTimestamp: "2020-10-02T13:17:40Z"
      deletionGracePeriodSeconds: 0
      deletionTimestamp: "2020-10-07T07:36:49Z"
    worker 5 (weird):
      creationTimestamp: "2020-10-07T07:35:27Z"

So it seems to me that worker 5 was recreated at some point, after which an attempt was made to remove worker 4? Reading https://docs.openshift.com/container-platform/4.5/machine_management/manually-scaling-machineset.html and https://docs.openshift.com/container-platform/4.5/machine_management/deleting-machine.html, it looks like you double-deleted machines: you should only delete the Machine object or scale the MachineSet, not both.

I also tried scaling down on an AWS cluster (oc scale machineset xxx --replicas=Y -n openshift-machine-api) and the scale-down completed fine without the MCP degrading. So in this particular case I'd suspect the scale-down operation itself had issues. Could you check whether the scale-down operation you attempted is supported?

If you think my assessment makes sense, I'd suggest reassigning this to machine-api, since they would have more insight into what exactly went wrong with the scaling and how the node object was recreated when you scaled up/down.

As for the MCP degrade, I do think we should be more bulletproof there. The MCC should be writing the desiredConfig annotation when it syncs the MCP. I can check that, but even if that is fixed I don't think your scale-up/down would succeed.

Also, the MCD log you snapshotted shows this is not the initial config:

    2020-10-07T07:36:10.374082881Z W1007 07:36:10.373946 15529 daemon.go:644] Got an error from auxiliary tools: error: cannot apply annotation for SSH access due to: unable to update node "nil": node "openshift-worker-5" not found
    2020-10-07T07:36:11.373602989Z I1007 07:36:11.373470 15529 daemon.go:403] Node openshift-worker-5 is not labeled node-role.kubernetes.io/master
    2020-10-07T07:36:11.373936249Z I1007 07:36:11.373605 15529 node.go:24] No machineconfiguration.openshift.io/currentConfig annotation on node openshift-worker-5
    ...
    2020-10-07T07:36:11.376719396Z I1007 07:36:11.376626 15529 node.go:34] Setting initial node config based on current configuration on disk: rendered-worker-61fbb0947e09c761cb036c96ab321807

which looks to me like the node object itself had problems upon recreation (again, not sure what recreated it).

Yurii, can you reproduce this consistently? I'm leaning towards pushing this out of 4.6 if a) this isn't consistent and b) there's a workaround; we'll keep debugging in the meantime.

I'm going to reassign this BZ to machine-api to see what happened during the scale-down. The root cause is not in the MCO; the scale operation appears to have gone wrong somewhere. Machine-api can tell you whether your operations were correct, and hopefully why the nodes are flipping like that.

I believe this has the same cause as bug 1886028 (the Node not being deleted due to a finalizer left behind), so closing as a duplicate.

*** This bug has been marked as a duplicate of bug 1886028 ***
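For anyone hitting the same symptom, a minimal diagnostic sketch along the lines of the checks discussed above, assuming cluster-admin access; the node names are taken from this report, the `worker` pool and the `<machineset-name>`/`<N>` placeholders are assumptions to adjust to the affected cluster:

```
# Check whether the MCO annotations ever landed on the recreated node
# (the degraded pool reports that desiredConfig is missing).
oc get node openshift-worker-5 \
  -o jsonpath='{.metadata.annotations.machineconfiguration\.openshift\.io/currentConfig}{"\n"}{.metadata.annotations.machineconfiguration\.openshift\.io/desiredConfig}{"\n"}'

# Inspect why the worker pool is degraded.
oc get machineconfigpool worker \
  -o jsonpath='{range .status.conditions[*]}{.type}={.status}: {.message}{"\n"}{end}'

# Look for a Machine or Node stuck in deletion because of a leftover finalizer
# (the suspected root cause, see bug 1886028).
oc get machines -n openshift-machine-api \
  -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,DELETION:.metadata.deletionTimestamp,FINALIZERS:.metadata.finalizers
oc get node openshift-worker-4 \
  -o jsonpath='{.metadata.deletionTimestamp} {.metadata.finalizers}{"\n"}'

# Supported scale-down path: change replicas on the MachineSet (or delete the
# Machine object), but not both.
oc scale machineset <machineset-name> --replicas=<N> -n openshift-machine-api
```

If a Node or Machine shows a deletionTimestamp together with a non-empty finalizer list and never goes away, that matches the finalizer problem tracked in bug 1886028.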