Description of problem:

- The MachineConfigPool for the worker pool stayed in the UPDATING state for a long time after upgrading OpenShift 4.6.34 to OpenShift 4.7.31.
- The MCC was showing this error:
```
E1011 16:23:51.653195 1 render_controller.go:460] Error updating MachineConfigPool worker: Operation cannot be fulfilled on machineconfigpools.machineconfiguration.openshift.io "worker": the object has been modified; please apply your changes to the latest version and try again
```
- The machine-config-daemon was showing that the rendered config had been correctly applied to the node:
```
2021-10-08T04:43:36.948527958Z I1008 04:43:36.948470 3193 daemon.go:1085] Validating against pending config rendered-worker-bf941581bf35ab13e1e8656cf78d9a1f
2021-10-08T04:43:37.200706725Z I1008 04:43:37.200652 3193 daemon.go:1096] Validated on-disk state
2021-10-08T04:43:37.326497630Z I1008 04:43:37.326440 3193 daemon.go:1151] Completing pending config rendered-worker-bf941581bf35ab13e1e8656cf78d9a1f
2021-10-08T04:43:47.612183695Z I1008 04:43:47.612032 3193 update.go:199] cordon/uncordon failed with: cordon error: rpc error: code = Unavailable desc = transport is closing, retrying
2021-10-08T04:43:57.622007990Z I1008 04:43:57.621484 3193 update.go:1943] completed update for config rendered-worker-bf941581bf35ab13e1e8656cf78d9a1f
2021-10-08T04:43:57.699433828Z I1008 04:43:57.698383 3193 daemon.go:1167] In desired config rendered-worker-bf941581bf35ab13e1e8656cf78d9a1f
```
- We needed to manually uncordon the node to successfully apply the changes to the MachineConfigPool.

Actual results:
MachineConfigPool was in UPDATING state. Node was in SchedulingDisabled state.

Expected results:
MachineConfigPool should automatically reach the UPDATED state and the node should not remain marked as SchedulingDisabled.
Any updates on this one? @rioliu
Since nothing has happened for the last 20 days, I am kindly asking for an update.
Hi, looking at the must-gather, it seems that the worker MCP never successfully completed the update because the cordon was never successfully removed.

> E1011 16:23:51.653195 1 render_controller.go:460] Error updating MachineConfigPool worker: Operation cannot be fulfilled on machineconfigpools.machineconfiguration.openshift.io "worker": the object has been modified; please apply your changes to the latest version and try again

This is just a transient error message (we should probably make it a non-error log) and doesn't actually break anything. It should eventually resync.

> 2021-10-08T04:43:47.612183695Z I1008 04:43:47.612032 3193 update.go:199] cordon/uncordon failed with: cordon error: rpc error: code = Unavailable desc = transport is closing, retrying

This is an interesting issue. Theoretically the MCD should have retried and uncordoned right after; since there are no other error messages, if it had failed to uncordon it should have logged something like "failed to cordon/uncordon node (x tries): xxx", which it did not. But looking at the node status, I see:

```
spec:
  taints:
  - effect: NoSchedule
    key: node.kubernetes.io/unschedulable
    timeAdded: "2021-10-08T04:38:12Z"
  unschedulable: true
```

That timestamp is when the MCO cordoned the node before attempting the update. The MCO attempted to uncordon at 2021-10-08T04:43:47 (5 minutes later, after the update), but it seems the uncordon was not successful despite the logs saying that it was.

The workaround should be to manually uncordon the node, after which the update should complete (looking at the machineconfigpool status there are no further updates, and the case appears to indicate that the workaround was successful).

I will try to reproduce, but to understand the issue better: is the main point of investigation why this can occur? Is this reproducible or consistent? Is there any special setup with this cluster that would affect node operations and the API? In OCP CI this would have been a fatal failure, but across our countless CI runs I've never seen the uncordon fail like that to my knowledge, so I am unsure how to reproduce without understanding the environment better.
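For reference, below is a minimal sketch of the retry-and-report pattern described above, written against client-go. This is not the actual MCD code; the retry count, sleep interval, kubeconfig path, and node name are illustrative assumptions.

```go
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// setUnschedulable patches node.spec.unschedulable: true cordons, false uncordons.
func setUnschedulable(ctx context.Context, c kubernetes.Interface, node string, unschedulable bool) error {
	patch := []byte(fmt.Sprintf(`{"spec":{"unschedulable":%t}}`, unschedulable))
	_, err := c.CoreV1().Nodes().Patch(ctx, node, types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	return err
}

// uncordonWithRetry retries the uncordon a fixed number of times and, if every
// attempt fails, surfaces an explicit error instead of silently giving up.
func uncordonWithRetry(ctx context.Context, c kubernetes.Interface, node string, tries int) error {
	var lastErr error
	for i := 0; i < tries; i++ {
		if lastErr = setUnschedulable(ctx, c, node, false); lastErr == nil {
			return nil
		}
		time.Sleep(10 * time.Second) // illustrative backoff, not the MCD's actual interval
	}
	return fmt.Errorf("failed to cordon/uncordon node (%d tries): %v", tries, lastErr)
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig") // placeholder path
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	if err := uncordonWithRetry(context.TODO(), client, "worker-0", 5); err != nil {
		fmt.Println(err)
	}
}
```

The point of the sketch is only that a failed final uncordon should surface an unambiguous error; in the logs above no such message appears, which is why the silent failure is surprising.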
Hi,

To provide an update, the team has performed some testing and reproduction attempts. Unfortunately we were unable to reproduce the issue, even in single-node setups or when forcing a retry to happen. We will continue debugging this (at a lower priority, since the issue can be worked around).

Our best guess right now is that back in 4.7, the cordon/uncordon library we use from upstream Kubernetes has a behaviour where it does not properly report uncordon failure in some rare cases. This BZ and https://bugzilla.redhat.com/show_bug.cgi?id=2022387 appear to exhibit the same behaviour. To consolidate efforts, are we ok with me closing this as a duplicate of the other?
Hi,

we understand that it might be difficult to reproduce.

Are there any observations from your side which might help us figure out an automated workaround? What we would like to know/find out:
- What would be a clear symptom that the uncordon failed and won't be retried?
- How can we make the workaround fully automated, without manual intervention?

We might be able to live with this, but we need some clear diagnostic to identify that the issue has happened before applying the uncordon. This is tricky because I am not sure how doing the uncordon too early would impact the whole process.
(In reply to msi_bacalm from comment #13)
> Are there any observations from your side which might help us figure out
> an automated workaround? What we would like to know/find out:
> - What would be a clear symptom that the uncordon failed and won't be retried?
> - How can we make the workaround fully automated, without manual intervention?

I am not sure about automating it, but the best way to figure out that we have hit this issue (the MCO has completed the update from its end but the node is still marked as unschedulable) would be:

1. Check the log of the machine-config-daemon pod running on that node.
2. Look at the last few lines of the MCD log and check whether it contains the message "update.go:<line_no>] completed update for config rendered-<pool_name>-<sha>".
3. If it does, the MCD has completed the update on the node and is not going to do anything else until the node is marked schedulable again.

Note that this message got reworded in 4.9 ("Update completed for config rendered-<pool_name>-<sha> and node has been successfully uncordoned") and onward releases, and from 4.9 onward we also emit an event for node uncordon. See https://github.com/openshift/machine-config-operator/blob/release-4.9/pkg/daemon/daemon.go#L1293
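If someone did want to script this log check rather than do it by hand, a hedged sketch using client-go follows. The `openshift-machine-config-operator` namespace, the `k8s-app=machine-config-daemon` label, and the `machine-config-daemon` container name are what MCD pods normally use, but treat them as assumptions and verify against your cluster; the kubeconfig path, node name, and rendered-config name are placeholders.

```go
package main

import (
	"bufio"
	"context"
	"fmt"
	"strings"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// mcdCompletedUpdate reports whether the machine-config-daemon pod on the given
// node has logged the "completed update for config <rendered>" message.
func mcdCompletedUpdate(ctx context.Context, c kubernetes.Interface, nodeName, renderedConfig string) (bool, error) {
	// Assumed namespace and label selector for MCD pods.
	pods, err := c.CoreV1().Pods("openshift-machine-config-operator").List(ctx, metav1.ListOptions{
		LabelSelector: "k8s-app=machine-config-daemon",
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil || len(pods.Items) == 0 {
		return false, fmt.Errorf("could not find MCD pod for node %s: %v", nodeName, err)
	}
	stream, err := c.CoreV1().Pods(pods.Items[0].Namespace).
		GetLogs(pods.Items[0].Name, &corev1.PodLogOptions{Container: "machine-config-daemon"}).
		Stream(ctx)
	if err != nil {
		return false, err
	}
	defer stream.Close()

	needle := "completed update for config " + renderedConfig
	scanner := bufio.NewScanner(stream)
	for scanner.Scan() {
		if strings.Contains(scanner.Text(), needle) {
			return true, nil
		}
	}
	return false, scanner.Err()
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig") // placeholder path
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	done, err := mcdCompletedUpdate(context.TODO(), client, "worker-0",
		"rendered-worker-bf941581bf35ab13e1e8656cf78d9a1f")
	fmt.Println(done, err)
}
```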
To add on to what Sinny said, if you don't want to check pod logs, another way is to check the node object for the following (during updates only):

1. the currentConfig annotation == desiredConfig annotation, and
2. the node is currently cordoned (unschedulable = true)

Together these should mean that the MCD is not currently performing an update on the node. This is not necessarily safe, since something else could have cordoned the node for other reasons. There could be minor time differences between the desiredConfig update and the uncordon operation, but I think the MCD won't complain about that. A sketch of this check is below.
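A hedged sketch of what that check could look like in client-go follows. The `machineconfiguration.openshift.io/currentConfig` and `.../desiredConfig` annotation keys are the ones the MCO sets on nodes; as noted above, the heuristic itself is not guaranteed safe, so treat this as a starting point rather than a supported automation. The kubeconfig path and node name are placeholders.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

const (
	currentConfigAnno = "machineconfiguration.openshift.io/currentConfig"
	desiredConfigAnno = "machineconfiguration.openshift.io/desiredConfig"
)

// stuckCordoned reports whether the node looks like it finished its MCO update
// (currentConfig == desiredConfig) but was left cordoned.
func stuckCordoned(ctx context.Context, c kubernetes.Interface, nodeName string) (bool, error) {
	node, err := c.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return false, err
	}
	cur := node.Annotations[currentConfigAnno]
	des := node.Annotations[desiredConfigAnno]
	return cur != "" && cur == des && node.Spec.Unschedulable, nil
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig") // placeholder path
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	nodeName := "worker-0"
	stuck, err := stuckCordoned(context.TODO(), client, nodeName)
	if err != nil {
		panic(err)
	}
	if stuck {
		// Equivalent of `oc adm uncordon <node>`: clear spec.unschedulable.
		patch := []byte(`{"spec":{"unschedulable":false}}`)
		_, err := client.CoreV1().Nodes().Patch(context.TODO(), nodeName,
			types.StrategicMergePatchType, patch, metav1.PatchOptions{})
		fmt.Println("uncordoned:", err == nil)
	}
}
```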
*** Bug 2022387 has been marked as a duplicate of this bug. ***
Similar behavior was recently reported through bug #2054792 and was initially fixed by bug #2028525 in 4.7.40. It seems that we can still reproduce it in version 4.7.40. Please see bug #2054792 for additional logs.
*** Bug 2054792 has been marked as a duplicate of this bug. ***
https://bugzilla.redhat.com/show_bug.cgi?id=2054792 has been marked as a duplicate of this one, as the problem seems to be focused on the same thing: cordons/uncordons not being properly recorded and/or auto-reverted in the kube-apiserver.

It is also important to note that https://bugzilla.redhat.com/show_bug.cgi?id=2028525 should not have been expected to fix anything, because it was only intended to add extra logging in order to debug this bugzilla (and any other similar ones in the future).
Apologies for the delay. This BZ and https://bugzilla.redhat.com/show_bug.cgi?id=2040529 have been fixed as of 4.11:

1. The MCD no longer performs uncordons, and the new controller will always retry and has no issues with the pending config which can cause this.
2. The new controller should always be polling for the latest update, so this type of issue should generally not happen, barring API instability which we would otherwise have no control over.

So I will be closing this once 4.11 ships. The rework which fixes this, however, is a pretty significant effort to backport; if we require this in earlier versions, we would likely need to apply a separate fix. If you would like the corresponding fix backported, please state the reasoning and the version it needs to be backported to, and the team will prioritize accordingly. Otherwise this fix will only be in 4.11. Thank you.
Hello @jerzhang, I would like to have this fixed in 4.10, which is currently GA and under the full support lifecycle. Regarding this issue:
- it requires manual intervention
- it is not possible to implement a "watchdog" that would automatically recover the cluster
- `grepping` pod logs is not a viable approach
- anyone can hit this issue during any OCP upgrade
*** Bug 2040529 has been marked as a duplicate of this bug. ***
Hi,

We will look into backporting this. Since Bugzilla is deprecated for OpenShift, we will continue tracking this via Jira in https://issues.redhat.com/browse/OCPBUGS-1491