Description of problem:

- The MachineConfigPool for the worker pool stayed in the UPDATING state for a long time after upgrading OpenShift 4.6.34 to OpenShift 4.7.31.
- The MCC was showing this error:
```
E1011 16:23:51.653195 1 render_controller.go:460] Error updating MachineConfigPool worker: Operation cannot be fulfilled on machineconfigpools.machineconfiguration.openshift.io "worker": the object has been modified; please apply your changes to the latest version and try again
```
- The machine-config-daemon was showing that the rendered config had been correctly applied to the node:
```
2021-10-08T04:43:36.948527958Z I1008 04:43:36.948470 3193 daemon.go:1085] Validating against pending config rendered-worker-bf941581bf35ab13e1e8656cf78d9a1f
2021-10-08T04:43:37.200706725Z I1008 04:43:37.200652 3193 daemon.go:1096] Validated on-disk state
2021-10-08T04:43:37.326497630Z I1008 04:43:37.326440 3193 daemon.go:1151] Completing pending config rendered-worker-bf941581bf35ab13e1e8656cf78d9a1f
2021-10-08T04:43:47.612183695Z I1008 04:43:47.612032 3193 update.go:199] cordon/uncordon failed with: cordon error: rpc error: code = Unavailable desc = transport is closing, retrying
2021-10-08T04:43:57.622007990Z I1008 04:43:57.621484 3193 update.go:1943] completed update for config rendered-worker-bf941581bf35ab13e1e8656cf78d9a1f
2021-10-08T04:43:57.699433828Z I1008 04:43:57.698383 3193 daemon.go:1167] In desired config rendered-worker-bf941581bf35ab13e1e8656cf78d9a1f
```
- We needed to manually uncordon the node to successfully apply the changes to the MachineConfigPool.

Actual results:
MachineConfigPool was in UPDATING state. Node was in SchedulingDisabled state.

Expected results:
MachineConfigPool should automatically reach the UPDATED state and the node should not remain marked as SchedulingDisabled.
Any updates on this one? @rioliu
Since nothing has happened for the last 20 days, I am kindly asking for an update.
Hi, looking at the must-gather, it seems that the worker MCP never successfully completed the update because the cordon was never successfully removed.

> E1011 16:23:51.653195 1 render_controller.go:460] Error updating MachineConfigPool worker: Operation cannot be fulfilled on machineconfigpools.machineconfiguration.openshift.io "worker": the object has been modified; please apply your changes to the latest version and try again

This is just a transient error message (we should probably make it a non-error log) and doesn't actually break anything. It should eventually resync.

> 2021-10-08T04:43:47.612183695Z I1008 04:43:47.612032 3193 update.go:199] cordon/uncordon failed with: cordon error: rpc error: code = Unavailable desc = transport is closing, retrying

This is an interesting issue. Theoretically the MCD should have retried and uncordoned right after; since there are no other error messages, if it had failed to uncordon it should have logged something like "failed to cordon/uncordon node (x tries): xxx", which it did not. But looking at the node status, I see:

```
spec:
  taints:
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    timeAdded: "2021-10-08T04:38:12Z"
  unschedulable: true
```

That timestamp is when the MCO cordoned the node before attempting the update. The MCO attempted to uncordon at 2021-10-08T04:43:47 (5 minutes later, after the update), but it seems the uncordon was not successful despite the logs saying that it was.

The workaround should be to manually uncordon the node, after which the update should complete (looking at the machineconfigpool status there are no further updates, and the case appears to indicate that the workaround was successful).

I will try to reproduce, but to understand the issue better: is the main point of investigation why this can occur? Is this reproducible or consistent? Is there any special setup with this cluster that would affect node operations and the API? In OCP CI this would have been a fatal failure, but across our countless CI runs I've never seen the uncordon fail like that to my knowledge, so I am unsure how to reproduce without understanding the environment better.
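For reference, below is a minimal sketch of the retry-and-report pattern described above, written against client-go. This is not the actual MCD code; the retry count, sleep interval, kubeconfig path, and node name are illustrative assumptions.

```go
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// setUnschedulable patches node.spec.unschedulable: true cordons, false uncordons.
func setUnschedulable(ctx context.Context, c kubernetes.Interface, node string, unschedulable bool) error {
	patch := []byte(fmt.Sprintf(`{"spec":{"unschedulable":%t}}`, unschedulable))
	_, err := c.CoreV1().Nodes().Patch(ctx, node, types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	return err
}

// uncordonWithRetry retries the uncordon a fixed number of times and, if every
// attempt fails, surfaces an explicit error instead of silently giving up.
func uncordonWithRetry(ctx context.Context, c kubernetes.Interface, node string, tries int) error {
	var lastErr error
	for i := 0; i < tries; i++ {
		if lastErr = setUnschedulable(ctx, c, node, false); lastErr == nil {
			return nil
		}
		time.Sleep(10 * time.Second) // illustrative backoff, not the MCD's actual interval
	}
	return fmt.Errorf("failed to cordon/uncordon node (%d tries): %v", tries, lastErr)
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig") // placeholder path
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	if err := uncordonWithRetry(context.TODO(), client, "worker-0", 5); err != nil {
		fmt.Println(err)
	}
}
```

The point of the sketch is only that a failed final uncordon should surface an unambiguous error; in the logs above no such message appears, which is why the silent failure is surprising.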
Hi,

To provide an update, the team has performed some testing and reproduction attempts. Unfortunately we were unable to reproduce the issue, even in single-node setups or when forcing a retry to happen. We will continue debugging this (at a lower priority, since the issue can be worked around).

Our best guess right now is that back in 4.7, the cordon/uncordon library we use from upstream Kubernetes has a behaviour where it does not properly report uncordon failure in some rare cases. This BZ and https://bugzilla.redhat.com/show_bug.cgi?id=2022387 appear to exhibit the same behaviour. To consolidate efforts, are we ok with me closing this as a duplicate of the other?
Hi,

we understand that it might be difficult to reproduce.

Are there any observations from your side which might help us figure out an automated workaround? What we would like to know/find out:
- What would be a clear symptom that the uncordon failed and won't be retried?
- How can we make the workaround fully automated, without manual intervention?

We might be able to live with this, but we need some clear diagnostic to identify that the issue has happened before applying the uncordon. This is tricky because I am not sure how doing the uncordon too early would impact the whole process.
(In reply to msi_bacalm from comment #13)
> Are there any observations from your side which might help us figure out
> an automated workaround? What we would like to know/find out:
> - What would be a clear symptom that the uncordon failed and won't be retried?
> - How can we make the workaround fully automated, without manual intervention?

I am not sure about automating it, but the best way to figure out that we have hit this issue (the MCO has completed the update from its end but the node is still marked as unschedulable) would be:

1. Check the log of the machine-config-daemon pod running on that node.
2. Look at the last few lines of the MCD log and check whether it contains the message "update.go:<line_no>] completed update for config rendered-<pool_name>-<sha>".
3. If it does, the MCD has completed the update on the node and is not going to do anything else until the node is marked schedulable again.

Note that this message got reworded in 4.9 ("Update completed for config rendered-<pool_name>-<sha> and node has been successfully uncordoned") and onward releases, and from 4.9 onward we also emit an event for node uncordon. See https://github.com/openshift/machine-config-operator/blob/release-4.9/pkg/daemon/daemon.go#L1293
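If someone did want to script this log check rather than do it by hand, a hedged sketch using client-go follows. The `openshift-machine-config-operator` namespace, the `k8s-app=machine-config-daemon` label, and the `machine-config-daemon` container name are what MCD pods normally use, but treat them as assumptions and verify against your cluster; the kubeconfig path, node name, and rendered-config name are placeholders.

```go
package main

import (
	"bufio"
	"context"
	"fmt"
	"strings"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// mcdCompletedUpdate reports whether the machine-config-daemon pod on the given
// node has logged the "completed update for config <rendered>" message.
func mcdCompletedUpdate(ctx context.Context, c kubernetes.Interface, nodeName, renderedConfig string) (bool, error) {
	// Assumed namespace and label selector for MCD pods.
	pods, err := c.CoreV1().Pods("openshift-machine-config-operator").List(ctx, metav1.ListOptions{
		LabelSelector: "k8s-app=machine-config-daemon",
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil || len(pods.Items) == 0 {
		return false, fmt.Errorf("could not find MCD pod for node %s: %v", nodeName, err)
	}
	stream, err := c.CoreV1().Pods(pods.Items[0].Namespace).
		GetLogs(pods.Items[0].Name, &corev1.PodLogOptions{Container: "machine-config-daemon"}).
		Stream(ctx)
	if err != nil {
		return false, err
	}
	defer stream.Close()

	needle := "completed update for config " + renderedConfig
	scanner := bufio.NewScanner(stream)
	for scanner.Scan() {
		if strings.Contains(scanner.Text(), needle) {
			return true, nil
		}
	}
	return false, scanner.Err()
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig") // placeholder path
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	done, err := mcdCompletedUpdate(context.TODO(), client, "worker-0",
		"rendered-worker-bf941581bf35ab13e1e8656cf78d9a1f")
	fmt.Println(done, err)
}
```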
To add on to what Sinny said, if you don't want to check pod logs, another way is to check the node object for the following (during updates only):

1. the currentConfig annotation == desiredConfig annotation, and
2. the node is currently cordoned (unschedulable = true)

Together these should mean that the MCD is not currently performing an update on the node. This is not necessarily safe, since something else could have cordoned the node for other reasons. There could be minor time differences between the desiredConfig update and the uncordon operation, but I think the MCD won't complain about that. A sketch of this check is below.
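A hedged sketch of what that check could look like in client-go follows. The `machineconfiguration.openshift.io/currentConfig` and `.../desiredConfig` annotation keys are the ones the MCO sets on nodes; as noted above, the heuristic itself is not guaranteed safe, so treat this as a starting point rather than a supported automation. The kubeconfig path and node name are placeholders.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

const (
	currentConfigAnno = "machineconfiguration.openshift.io/currentConfig"
	desiredConfigAnno = "machineconfiguration.openshift.io/desiredConfig"
)

// stuckCordoned reports whether the node looks like it finished its MCO update
// (currentConfig == desiredConfig) but was left cordoned.
func stuckCordoned(ctx context.Context, c kubernetes.Interface, nodeName string) (bool, error) {
	node, err := c.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	cur := node.Annotations[currentConfigAnno]
	des := node.Annotations[desiredConfigAnno]
	return cur != "" && cur == des && node.Spec.Unschedulable, nil
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig") // placeholder path
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	nodeName := "worker-0"
	stuck, err := stuckCordoned(context.TODO(), client, nodeName)
	if err != nil {
		panic(err)
	}
	if stuck {
		// Equivalent of `oc adm uncordon <node>`: clear spec.unschedulable.
		patch := []byte(`{"spec":{"unschedulable":false}}`)
		_, err := client.CoreV1().Nodes().Patch(context.TODO(), nodeName,
			types.StrategicMergePatchType, patch, metav1.PatchOptions{})
		fmt.Println("uncordoned:", err == nil)
	}
}
```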
*** Bug 2022387 has been marked as a duplicate of this bug. ***
Similar behavior was recently reported through bug #2054792 and was initially fixed by bug #2028525 in 4.7.40. It seems that we can still reproduce it in version 4.7.40. Please see bug #2054792 for additional logs.
*** Bug 2054792 has been marked as a duplicate of this bug. ***
https://bugzilla.redhat.com/show_bug.cgi?id=2054792 has been marked as a duplicate of this one, as the problem seems to be focused on the same thing: cordons/uncordons not being properly recorded and/or auto-reverted in the kube-apiserver.

It is also important to note that https://bugzilla.redhat.com/show_bug.cgi?id=2028525 should not have been expected to fix anything, because it was only intended to add extra logging in order to debug this bugzilla (and any other similar ones in the future).
Apologies for the delay. This BZ and https://bugzilla.redhat.com/show_bug.cgi?id=2040529 have been fixed as of 4.11:

1. The MCD no longer performs uncordons, and the new controller will always retry and has no issues with the pending config which can cause this.
2. The new controller should always be polling for the latest update, so this type of issue should generally not happen, barring API instability which we would otherwise have no control over.

So I will be closing this once 4.11 ships. The rework which fixes this, however, is a pretty significant effort to backport; if we require this in earlier versions, we would likely need to apply a separate fix. If you would like the corresponding fix backported, please state the reasoning and the version it needs to be backported to, and the team will prioritize accordingly. Otherwise this fix will only be in 4.11. Thank you.
Hello @jerzhang, I would like to have this fixed in 4.10, which is currently GA and under the full support lifecycle. Regarding this issue:
- it requires manual intervention
- it is not possible to implement a "watchdog" that would automatically recover the cluster
- `grepping` pod logs is not a viable approach
- anyone can hit this issue during any OCP upgrade
*** Bug 2040529 has been marked as a duplicate of this bug. ***
Hi,

We will look into backporting this. Since Bugzilla is deprecated for OpenShift, we will continue tracking this via Jira in https://issues.redhat.com/browse/OCPBUGS-1491