Bug 2050698

Summary: After upgrading the cluster, the console still shows 0 of N, 0% progress for worker nodes
Product: OpenShift Container Platform
Reporter: Gabriel Meghnagi <gmeghnag>
Component: Management Console
Assignee: Yadan Pei <yapei>
Status: CLOSED ERRATA
QA Contact: Yadan Pei <yapei>
Severity: medium
Priority: medium
Version: 4.8
CC: aacostab, abraj, aos-bugs, bshephar, jerzhang, jhadvig, mkrejci, rhamilto, wking, yanpzhan, yapei, ychoukse
Target Release: 4.11.0
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Last Closed: 2022-08-10 10:47:09 UTC
Type: Bug
Bug Blocks: 2074571
Attachments:
- 410 to 411 upgrade finish
- 410 to 411 upgrade just happen

Description Gabriel Meghnagi 2022-02-04 13:37:32 UTC
Description of problem:

From (Administration > Cluster Settings > Details), even after the upgrade completes successfully, the console still shows 0 of N, 0% progress for worker nodes.


Version-Release number of selected component (if applicable): 

4.9.17


How reproducible: 

Upgrade from 4.9.15 to 4.9.17

Comment 2 Robb Hamilton 2022-02-04 15:48:23 UTC
Reassigning to the MCO team for investigation, as I believe this is an API issue.

We've had multiple reports [1][2] of this same bug over the last couple of days. In both cases, the underlying data from the worker MCP that we use to determine whether the worker nodes have completed their update does not appear to be updating. That data point is the worker MCP Updating condition's lastTransitionTime. We compare this time to the CVO status.history[0].startedTime for the spec.desiredUpdate.version, since the worker nodes update later in the update cycle and can continue updating after the CVO has finished its updates. Any ideas on why the MCPs appear to not be updating in this scenario?

[1] https://coreos.slack.com/archives/C6A3NV5J9/p1643841199166689
[2] https://coreos.slack.com/archives/C6A3NV5J9/p1643970982911949
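The comparison described above can be sketched as follows. This is a minimal illustration of the heuristic, not the console's actual source code; the field names match the MachineConfigPool and ClusterVersion APIs, but the helper names are hypothetical.

```python
from datetime import datetime

def parse_ts(ts: str) -> datetime:
    # Kubernetes timestamps are RFC 3339 / ISO 8601, e.g. "2022-01-27T10:00:00Z"
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

def workers_updated(mcp_status: dict, cvo_status: dict) -> bool:
    # Hypothetical helper mirroring the console heuristic: the worker pool
    # counts as updated once its Updating condition last transitioned
    # *after* the CVO started the current update.
    updating = next(c for c in mcp_status["conditions"] if c["type"] == "Updating")
    started = cvo_status["history"][0]["startedTime"]
    return parse_ts(updating["lastTransitionTime"]) > parse_ts(started)
```

With this logic, a pool whose Updating condition never transitions during the update (as in this bug) is never reported as updated.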

Comment 3 Yu Qi Zhang 2022-02-04 17:36:02 UTC
The diffs between https://amd64.ocp.releases.ci.openshift.org/releasestream/4-stable/release/4.9.15 and https://amd64.ocp.releases.ci.openshift.org/releasestream/4-stable/release/4.9.17

indicate that there was no update to either the base OS or the MCO templates. So neither pool was updated, and the MCO reported the upgrade as successful.

The must-gather in https://access.redhat.com/support/cases/#/case/03138509/discussion?attachmentId=a092K0000336kzKQAQ corroborates that claim. The most recent MCs,

rendered-master-5741dcbb1c6dc2460c89871e158a9138
rendered-worker-189700adbe95c6e3c93fd10805794a6b

were both created on Jan. 27th, which was presumably the previous update, so the lastTransitionTime still reflects that update, which is expected.

TL;DR: this will happen whenever the MCO needs to perform no update between versions, so we simply don't roll out a pool update. Is there a reason `lastTransitionTime` is being used to determine success? It will cause problems in situations like this.
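For illustration, one signal that does not depend on lastTransitionTime is the pool's own machine counts, which the MCO maintains in MachineConfigPool status. This is a hypothetical sketch of such a check, not the actual console fix:

```python
def pool_fully_updated(mcp_status: dict) -> bool:
    # Hypothetical alternative check: rely on the pool's machine counts
    # instead of condition transition times. A pool whose updatedMachineCount
    # equals machineCount is up to date even when no new rendered config was
    # rolled out (in which case lastTransitionTime never moves).
    return (
        mcp_status["machineCount"] > 0
        and mcp_status["updatedMachineCount"] == mcp_status["machineCount"]
    )
```

A count-based check reports "updated" immediately for a no-op upgrade, avoiding the stuck 0% progress bar described in this bug.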

Comment 4 Robb Hamilton 2022-02-04 18:15:27 UTC
Reassigning to console as this is indeed a console bug.

Comment 5 Jakub Hadvig 2022-02-10 07:54:09 UTC
*** Bug 2052046 has been marked as a duplicate of this bug. ***

Comment 6 Jakub Hadvig 2022-02-10 07:55:12 UTC
This issue was also reported for 4.8 in https://bugzilla.redhat.com/show_bug.cgi?id=2052046
Updating the 'Version' field to 4.8 accordingly.

Comment 8 Jakub Hadvig 2022-02-15 15:31:44 UTC
*** Bug 2054722 has been marked as a duplicate of this bug. ***

Comment 9 Robb Hamilton 2022-02-21 13:12:32 UTC
@yapei@redhat.com, can you please see that this gets verified? We've seen a number of duplicate bugs filed.

Comment 11 Yadan Pei 2022-02-23 07:41:40 UTC
Created attachment 1862812 [details]
410 to 411 upgrade finish

1. Set up a 4.10.0-rc.3 cluster, create a custom MCP
$ oc label node yapeiup-l77hw-worker-c-2f26s.c.openshift-qe.internal node-role.kubernetes.io/infra=""
$ cat infra-mcp.yaml 
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: infra
spec:
  machineConfigSelector:
    matchExpressions:
      - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker,infra]}
  maxUnavailable: null
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/infra: ""
  paused: false
$ oc create -f infra-mcp.yaml 

2. Upgrade the testing cluster to 4.11.0-0.nightly-2022-02-18-121223, which contains the bug fix. When the upgrade finished, the 'Worker Nodes' and 'infra Nodes' progress bars both showed the correct status, and they no longer appear on the Cluster Settings page.

Verified on 4.11.0-0.nightly-2022-02-18-121223

Comment 12 Yadan Pei 2022-02-23 07:42:25 UTC
Created attachment 1862813 [details]
410 to 411 upgrade just happen

Comment 13 Kirsten Garrison 2022-03-02 21:05:23 UTC
*** Bug 1921529 has been marked as a duplicate of this bug. ***

Comment 16 errata-xmlrpc 2022-08-10 10:47:09 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069