Bug 2050698

Summary: After upgrading the cluster, the console still shows 0 of N, 0% progress for worker nodes
Product: OpenShift Container Platform
Reporter: Gabriel Meghnagi <gmeghnag>
Component: Management Console
Assignee: Yadan Pei <yapei>
Status: CLOSED ERRATA
QA Contact: Yadan Pei <yapei>
Severity: medium
Priority: medium
Version: 4.8
CC: aacostab, abraj, aos-bugs, bshephar, jerzhang, jhadvig, mkrejci, rhamilto, wking, yanpzhan, yapei, ychoukse
Target Release: 4.11.0
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Last Closed: 2022-08-10 10:47:09 UTC
Type: Bug
Bug Blocks: 2074571
Attachments:
- 410 to 411 upgrade finish
- 410 to 411 upgrade just happen

Description Gabriel Meghnagi 2022-02-04 13:37:32 UTC
Description of problem:

From (Administration > Cluster Settings > Details), even after the upgrade completes successfully, the console still shows 0 of N, 0% progress for worker nodes.


Version-Release number of selected component (if applicable): 

4.9.17


How reproducible: 

Upgrade from 4.9.15 to 4.9.17

Comment 2 Robb Hamilton 2022-02-04 15:48:23 UTC
Reassigning to the MCO team for investigation, as I believe this is an API issue.

We've had multiple reports [1][2] of this same bug over the last couple of days. In both cases, the underlying data from the worker MCP that we use to determine whether the worker nodes have completed their update does not appear to be updating. That data point is the worker MCP Updating condition's lastTransitionTime. We compare this time to the CVO status.history[0].startedTime for the spec.desiredUpdate.version, since the worker nodes update later in the update cycle and can continue updating after the CVO has finished its updates. Any ideas on why the MCPs appear to not be updating in this scenario?

[1] https://coreos.slack.com/archives/C6A3NV5J9/p1643841199166689
[2] https://coreos.slack.com/archives/C6A3NV5J9/p1643970982911949
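The comparison described above can be sketched as follows. This is a minimal illustration of the heuristic, not the console's actual source code; the field names match the MachineConfigPool and ClusterVersion APIs, but the helper names are hypothetical.

```python
from datetime import datetime

def parse_ts(ts: str) -> datetime:
    # Kubernetes timestamps are RFC 3339 / ISO 8601, e.g. "2022-01-27T10:00:00Z"
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

def workers_updated(mcp_status: dict, cvo_status: dict) -> bool:
    # Hypothetical helper mirroring the console heuristic: the worker pool
    # counts as updated once its Updating condition last transitioned
    # *after* the CVO started the current update.
    updating = next(c for c in mcp_status["conditions"] if c["type"] == "Updating")
    started = cvo_status["history"][0]["startedTime"]
    return parse_ts(updating["lastTransitionTime"]) > parse_ts(started)
```

With this logic, a pool whose Updating condition never transitions during the update (as in this bug) is never reported as updated.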

Comment 3 Yu Qi Zhang 2022-02-04 17:36:02 UTC
The diffs between https://amd64.ocp.releases.ci.openshift.org/releasestream/4-stable/release/4.9.15 and https://amd64.ocp.releases.ci.openshift.org/releasestream/4-stable/release/4.9.17

indicate that there was no update to either the base OS or the MCO templates. So neither pool was updated, and the MCO reported the upgrade as successful.

The must-gather in https://access.redhat.com/support/cases/#/case/03138509/discussion?attachmentId=a092K0000336kzKQAQ corroborates that claim. The most recent MCs,

rendered-master-5741dcbb1c6dc2460c89871e158a9138
rendered-worker-189700adbe95c6e3c93fd10805794a6b

were both created on Jan. 27th, which was presumably the previous update, so the lastTransitionTime still reflects that update, which is expected.

TL;DR: this will happen whenever the MCO needs to perform no update between versions, so we simply don't roll out a pool update. Is there a reason `lastTransitionTime` is being used to determine success? It will cause problems in situations like this.
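For illustration, one signal that does not depend on lastTransitionTime is the pool's own machine counts, which the MCO maintains in MachineConfigPool status. This is a hypothetical sketch of such a check, not the actual console fix:

```python
def pool_fully_updated(mcp_status: dict) -> bool:
    # Hypothetical alternative check: rely on the pool's machine counts
    # instead of condition transition times. A pool whose updatedMachineCount
    # equals machineCount is up to date even when no new rendered config was
    # rolled out (in which case lastTransitionTime never moves).
    return (
        mcp_status["machineCount"] > 0
        and mcp_status["updatedMachineCount"] == mcp_status["machineCount"]
    )
```

A count-based check reports "updated" immediately for a no-op upgrade, avoiding the stuck 0% progress bar described in this bug.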

Comment 4 Robb Hamilton 2022-02-04 18:15:27 UTC
Reassigning to console as this is indeed a console bug.

Comment 5 Jakub Hadvig 2022-02-10 07:54:09 UTC
*** Bug 2052046 has been marked as a duplicate of this bug. ***

Comment 6 Jakub Hadvig 2022-02-10 07:55:12 UTC
This issue was also reported for 4.8 in https://bugzilla.redhat.com/show_bug.cgi?id=2052046
Updating the 'Version' field to 4.8 accordingly.

Comment 8 Jakub Hadvig 2022-02-15 15:31:44 UTC
*** Bug 2054722 has been marked as a duplicate of this bug. ***

Comment 9 Robb Hamilton 2022-02-21 13:12:32 UTC
@yapei@redhat.com, can you please see that this gets verified? We've seen a number of duplicate bugs filed.

Comment 11 Yadan Pei 2022-02-23 07:41:40 UTC
Created attachment 1862812 [details]
410 to 411 upgrade finish

1. Set up a 4.10.0-rc.3 cluster, create a custom MCP
$ oc label node yapeiup-l77hw-worker-c-2f26s.c.openshift-qe.internal node-role.kubernetes.io/infra=""
$ cat infra-mcp.yaml 
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: infra
spec:
  machineConfigSelector:
    matchExpressions:
      - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker,infra]}
  maxUnavailable: null
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/infra: ""
  paused: false
$ oc create -f infra-mcp.yaml 

2. Upgrade the testing cluster to 4.11.0-0.nightly-2022-02-18-121223, which contains the bug fix. When the upgrade finished, the 'Worker Nodes' and 'infra Nodes' progress bars both showed the correct status, and they no longer appear on the Cluster Settings page.

Verified on 4.11.0-0.nightly-2022-02-18-121223

Comment 12 Yadan Pei 2022-02-23 07:42:25 UTC
Created attachment 1862813 [details]
410 to 411 upgrade just happen

Comment 13 Kirsten Garrison 2022-03-02 21:05:23 UTC
*** Bug 1921529 has been marked as a duplicate of this bug. ***

Comment 16 errata-xmlrpc 2022-08-10 10:47:09 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069