Bug 1885921
| Summary: | [BM][IPI] MCP degraded: machineconfiguration.openshift.io/desiredConfig annotation not found | | |
| --- | --- | --- | --- |
| Product: | OpenShift Container Platform | Reporter: | Yurii Prokulevych <yprokule> |
| Component: | Cloud Compute | Assignee: | Beth White <beth.white> |
| Cloud Compute sub component: | BareMetal Provider | QA Contact: | Amit Ugol <augol> |
| Status: | CLOSED DUPLICATE | Docs Contact: | |
| Severity: | unspecified | | |
| Priority: | unspecified | CC: | amurdaca, jerzhang, mcornea, stbenjam, zbitter |
| Version: | 4.6 | | |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-10-08 18:15:03 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Yurii Prokulevych 2020-10-07 09:47:59 UTC
So there are a couple of weird things going on in your system. From your node objects:

    worker 3 (normal):
      creationTimestamp: "2020-10-02T13:17:48Z"
    worker 4 (weird):
      creationTimestamp: "2020-10-02T13:17:40Z"
      deletionGracePeriodSeconds: 0
      deletionTimestamp: "2020-10-07T07:36:49Z"
    worker 5 (weird):
      creationTimestamp: "2020-10-07T07:35:27Z"

So it seems to me that worker 5 was recreated at some point, after which an attempt was made to remove worker 4? Reading https://docs.openshift.com/container-platform/4.5/machine_management/manually-scaling-machineset.html and https://docs.openshift.com/container-platform/4.5/machine_management/deleting-machine.html, it looks like you double-deleted machines: you should only delete the Machine object or scale the MachineSet, not both.

I also tried scaling down on an AWS cluster (oc scale machineset xxx --replicas=Y -n openshift-machine-api) and the scale-down completed fine without the MCP degrading. So in this particular case I'd suspect the scale-down operation itself had issues. Could you check whether the scale-down operation you attempted is supported?

If you think my assessment makes sense, I'd suggest reassigning this to machine-api, since they would have more insight into what exactly went wrong with the scaling and how the node object was recreated when you scaled up/down.

As for the MCP degrade, I do think we should be more bulletproof there. The MCC should be writing the desiredConfig annotation when it syncs the MCP. I can check that, but even if that is fixed I don't think your scale-up/down would succeed.

Also, the MCD log you snapshotted shows this is not the initial config:

    2020-10-07T07:36:10.374082881Z W1007 07:36:10.373946 15529 daemon.go:644] Got an error from auxiliary tools: error: cannot apply annotation for SSH access due to: unable to update node "nil": node "openshift-worker-5" not found
    2020-10-07T07:36:11.373602989Z I1007 07:36:11.373470 15529 daemon.go:403] Node openshift-worker-5 is not labeled node-role.kubernetes.io/master
    2020-10-07T07:36:11.373936249Z I1007 07:36:11.373605 15529 node.go:24] No machineconfiguration.openshift.io/currentConfig annotation on node openshift-worker-5
    ...
    2020-10-07T07:36:11.376719396Z I1007 07:36:11.376626 15529 node.go:34] Setting initial node config based on current configuration on disk: rendered-worker-61fbb0947e09c761cb036c96ab321807

which looks to me like the node object itself had problems upon recreation (again, not sure what recreated it).

Yurii, can you reproduce this consistently? I'm leaning towards pushing this out of 4.6 if a) this isn't consistent and b) there's a workaround; we'll keep debugging in the meantime.

I'm going to reassign this BZ to machine-api to see what happened during the scale-down. The root cause is not in the MCO; the scale operation appears to have gone wrong somewhere. Machine-api can tell you whether your operations were correct, and hopefully why the nodes are flipping like that.

I believe this has the same cause as bug 1886028 (the Node not being deleted due to a finalizer left behind), so closing as a duplicate.

*** This bug has been marked as a duplicate of bug 1886028 ***
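For anyone hitting the same symptom, a minimal diagnostic sketch along the lines of the checks discussed above, assuming cluster-admin access; the node names are taken from this report, the `worker` pool and the `<machineset-name>`/`<N>` placeholders are assumptions to adjust to the affected cluster:

```
# Check whether the MCO annotations ever landed on the recreated node
# (the degraded pool reports that desiredConfig is missing).
oc get node openshift-worker-5 \
  -o jsonpath='{.metadata.annotations.machineconfiguration\.openshift\.io/currentConfig}{"\n"}{.metadata.annotations.machineconfiguration\.openshift\.io/desiredConfig}{"\n"}'

# Inspect why the worker pool is degraded.
oc get machineconfigpool worker \
  -o jsonpath='{range .status.conditions[*]}{.type}={.status}: {.message}{"\n"}{end}'

# Look for a Machine or Node stuck in deletion because of a leftover finalizer
# (the suspected root cause, see bug 1886028).
oc get machines -n openshift-machine-api \
  -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,DELETION:.metadata.deletionTimestamp,FINALIZERS:.metadata.finalizers
oc get node openshift-worker-4 \
  -o jsonpath='{.metadata.deletionTimestamp} {.metadata.finalizers}{"\n"}'

# Supported scale-down path: change replicas on the MachineSet (or delete the
# Machine object), but not both.
oc scale machineset <machineset-name> --replicas=<N> -n openshift-machine-api
```

If a Node or Machine shows a deletionTimestamp together with a non-empty finalizer list and never goes away, that matches the finalizer problem tracked in bug 1886028.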