Bug 1999556 - "master" pool should be updated before the CVO reports available at the new version occurred
Summary: "master" pool should be updated before the CVO reports available at the new v...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 4.10.0
Assignee: Kirsten Garrison
QA Contact: Rio Liu
URL:
Whiteboard:
Duplicates: 2019850 (view as bug list)
Depends On:
Blocks: 2025474
 
Reported: 2021-08-31 10:35 UTC by Stephen Benjamin
Modified: 2022-03-12 04:38 UTC
CC List: 6 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 2025396 (view as bug list)
Environment:
[sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]
job=periodic-ci-openshift-release-master-nightly-4.9-e2e-metal-ipi-upgrade=all
job=periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-upgrade=all
Last Closed: 2022-03-12 04:37:58 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift machine-config-operator pull 2918 0 None open Bug 1999556: annotate rendered config with OCP version 2022-01-26 23:51:28 UTC
Red Hat Product Errata RHSA-2022:0056 0 None None None 2022-03-12 04:38:16 UTC

Description Stephen Benjamin 2021-08-31 10:35:05 UTC
We're still seeing instances of the following error:

    the "master" pool should be updated before the CVO reports available at the new version occurred

It looks like this may be a regression of https://bugzilla.redhat.com/show_bug.cgi?id=1970150.

See: 

https://search.ci.openshift.org/?search=pool+should+be+updated+before+the+CVO+reports+available+at+the+new+version&maxAge=168h&context=1&type=bug%2Bjunit&name=4.9&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Comment 1 Kirsten Garrison 2021-09-01 23:45:17 UTC
I'll take a look at this since I worked on the previous fix and want to figure out what's happening here.

Comment 2 Kirsten Garrison 2021-09-02 02:11:54 UTC
Timeline from https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-upgrade/1432377737405796352


17:42:20 default machineconfigoperator machine-config OperatorVersionChanged clusteroperator/machine-config-operator started a version change from [{operator 4.9.0-0.nightly-2021-08-30-070917}] to [{operator 4.9.0-0.nightly-2021-08-30-161832}]

  - lastTransitionTime: "2021-08-30T17:42:49Z"
    message: sync completed towards (2) generation using controller version v4.9.0-202108281318.p0.git.00f349e.assembly.stream-dirty
    status: "True"
    type: TemplateControllerCompleted

Same osimageurl, same controller version: machineconfiguration.openshift.io/generated-by-version: v4.9.0-202108281318.p0.git.00f349e.assembly.stream-dirty

17:42:51 default machineconfigoperator machine-config OperatorVersionChanged clusteroperator/machine-config-operator version changed from [{operator 4.9.0-0.nightly-2021-08-30-070917}] to [{operator 4.9.0-0.nightly-2021-08-30-161832}]

17:42:53 default machineconfigcontroller-rendercontroller master RenderedConfigGenerated rendered-master-5d8b2493d5d9fd8e6a6762f985bf5828 successfully generated

This is only happening intermittently, probably due to some sort of timing issue, especially on the metal platform; it looks like we'll have to harden this up more, likely by verifying checks against the release version. TBD.
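
As a rough sketch of the kind of check being discussed (not the actual fix; the rendered config name is taken from the event above and the annotation key from the quote above), one could compare the version annotation on the rendered config against the version the CVO is moving to:

# Version annotation recorded on the rendered master config (key as quoted above)
$ oc get machineconfig rendered-master-5d8b2493d5d9fd8e6a6762f985bf5828 -o yaml | grep generated-by-version

# Version the CVO is trying to reach, for comparison
$ oc get clusterversion version -o jsonpath='{.status.desired.version}{"\n"}'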

Comment 3 Scott Dodson 2021-11-18 15:44:27 UTC
This issue comes up as a late emergency debugging situation any time we ship a z-stream which hasn't updated the MCO and osImageURL (suspected). It happens infrequently enough that everyone forgets about this bug and sinks a substantial amount of effort into figuring out what's wrong at the 11th hour before we ship a release. As such, I'm going to mark this as blocker+ so that we can avoid that fire drill in the future. Once fixed, we should backport this to all currently supported releases unless there's a technical reason not to.

Comment 4 Sinny Kumari 2021-12-07 13:19:29 UTC
*** Bug 2019850 has been marked as a duplicate of this bug. ***

Comment 8 Sergio 2022-01-28 11:33:45 UTC
Verified by executing this upgrade: 4.11.0-0.nightly-2022-01-28-002827 to 4.11.0-0.nightly-2022-01-28-013835


 4.11.0-0.nightly-2022-01-28-013835
  machine-os-content                             quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1b53708a85ae2b23e83926e7ced936756cd171858d815b363da613c9a3b8e422
  machine-config-operator                        https://github.com/openshift/machine-config-operator                        47436bdb7b8c49425d6813abca594485171e1221

 4.11.0-0.nightly-2022-01-28-002827
  machine-os-content                             quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1b53708a85ae2b23e83926e7ced936756cd171858d815b363da613c9a3b8e422
  machine-config-operator                        https://github.com/openshift/machine-config-operator                        47436bdb7b8c49425d6813abca594485171e1221


Both images have the same osImage and the same machine-config-operator commit ID.

- The upgrade finished OK.

- We can see the operator falling back to checking the "release-image-version" annotation:

E0128 10:40:33.194881       1 sync.go:721] Error syncing Required MachineConfigPools: "pool master has not progressed to latest configuration: release image version mismatch for master in rendered-master-a4c664deb290edd15e19b1588f852b39 expected: 4.11.0-0.nightly-2022-01-28-013835 got: 4.11.0-0.nightly-2022-01-28-002827, retrying"
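
For reference, a hedged way to observe the same comparison by hand (the annotation key is inferred from the error message above, and the rendered config name is resolved from the pool status):

# Rendered config currently assigned to the master pool
$ CURRENT_MC=$(oc get mcp master -o jsonpath='{.status.configuration.name}')

# Release version stamped on that rendered config vs. the version the CVO wants
$ oc get machineconfig "$CURRENT_MC" -o yaml | grep release-image-version
$ oc get clusterversion version -o jsonpath='{.status.desired.version}{"\n"}'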


- We can see the upgrade finishing after the machine-config operator:

$ oc get co machine-config -o yaml | grep Progressing -B 3
  - lastTransitionTime: "2022-01-28T10:51:06Z"
    message: Cluster version is 4.11.0-0.nightly-2022-01-28-013835
    status: "False"
    type: Progressing

$ oc get clusterversion -o yaml | grep Progressing -B 3
    - lastTransitionTime: "2022-01-28T10:51:14Z"
      message: Cluster version is 4.11.0-0.nightly-2022-01-28-013835
      status: "False"
      type: Progressing

$ oc get clusterversions.config.openshift.io -o yaml | grep Completed -A2 -B3
    - completionTime: "2022-01-28T10:51:14Z"               <<<------------------------------ COMPLETED AFTER "2022-01-28T10:51:06Z"
      image: registry.ci.openshift.org/ocp/release:4.11.0-0.nightly-2022-01-28-013835
      startedTime: "2022-01-28T10:14:20Z"
      state: Completed
      verified: false
      version: 4.11.0-0.nightly-2022-01-28-013835
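
The same ordering can also be checked in one step; a small sketch using the field paths shown in the output above:

# MCO Progressing=False transition time vs. CVO completion time; the first
# should not be later than the second
$ oc get co machine-config -o jsonpath='{.status.conditions[?(@.type=="Progressing")].lastTransitionTime}{"\n"}'
$ oc get clusterversion version -o jsonpath='{.status.history[0].completionTime}{"\n"}'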


Nevertheless, the machine-config operator reported a Degraded status because of a timeout:

$ oc get co machine-config 
NAME             VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
machine-config   4.11.0-0.nightly-2022-01-28-002827   True        True          True       57m     Unable to apply 4.11.0-0.nightly-2022-01-28-013835: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: release image version mismatch for master in rendered-master-a4c664deb290edd15e19b1588f852b39 expected: 4.11.0-0.nightly-2022-01-28-013835 got: 4.11.0-0.nightly-2022-01-28-002827, retrying


But after retrying, the upgrade finished OK.
$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-01-28-013835   True        False         5m10s   Cluster version is 4.11.0-0.nightly-2022-01-28-013835
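
To watch for that recovery without polling by hand, something like the following could be used (a sketch; the timeout value is arbitrary):

# Wait for the machine-config ClusterOperator to clear Degraded and stop progressing
$ oc wait clusteroperator/machine-config --for=condition=Degraded=False --timeout=30m
$ oc wait clusteroperator/machine-config --for=condition=Progressing=False --timeout=30m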


We move the BZ to VERIFIED status.

Comment 11 errata-xmlrpc 2022-03-12 04:37:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

