Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1932105

Summary:	machine-config ClusterOperator claims level while control-plane still updating
Product:	OpenShift Container Platform	Reporter:	W. Trevor King <wking>
Component:	Machine Config Operator	Assignee:	Kirsten Garrison <kgarriso>
Status:	CLOSED ERRATA	QA Contact:	Michael Nguyen <mnguyen>
Severity:	urgent	Docs Contact:
Priority:	high
Version:	4.7	CC:	yanyang
Target Milestone:	---	Keywords:	Upgrades
Target Release:	4.8.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	No Doc Update
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2021-07-27 22:48:06 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1933368

Description W. Trevor King 2021-02-24 00:31:19 UTC

Seen in master/4.8 CI [1]:

1. 17:55:04Z, update to 4.8.0-0.ci.test-2021-02-23-170859-ci-op-m36h5d5b initiated [2].
2. 18:34:42Z, machine-config goes Degraded=True on [4]:
     RenderConfigFailed: Failed to resync 4.8.0-0.ci.test-2021-02-23-170341-ci-op-m36h5d5b because: refusing to read osImageURL version "4.8.0-0.ci.test-2021-02-23-170859-ci-op-m36h5d5b", operator version "4.8.0-0.ci.test-2021-02-23-170341-ci-op-m36h5d5b"
3. 18:34:43Z, machine-config goes Available=False on [4]:
     Cluster not available for 4.8.0-0.ci.test-2021-02-23-170341-ci-op-m36h5d5b
4. 18:37:33Z, machine-config goes Progressing=True on [4]:
     Working towards 4.8.0-0.ci.test-2021-02-23-170859-ci-op-m36h5d5b
5. 18:37:33Z, machine-config goes Degraded=False [4].
6. 18:40:39Z, rendered-master-4c88d1c343b9e2e390bbc5fdca7b6095 created [3].
7. 18:40:44Z, ci-op-m36h5d5b-8aaeb-szd6x-master-1 asked to update to rendered-master-4c88d1c343b9e2e390bbc5fdca7b6095 [7].
8. 18:41:12Z, machine-config sets Available=True [4].
9. 18:41:12Z, machine-config bumps its 'operator' version [4], but 'operator' should only be bumped after the operator and all its operands have completed their update [5].  This unblocks the CVO, because machine-config is Available=True, Degraded=False, and 'operator' is the only version the CVO has been told to wait for [6].
10. 18:41:13Z, machine-config sets Progressing=False on [4]:
      Cluster version is 4.8.0-0.ci.test-2021-02-23-170859-ci-op-m36h5d5b
11. 18:43:25Z, update nominally completes (per ClusterVersion [2]).
12. 18:48:13Z, ci-op-m36h5d5b-8aaeb-szd6x-master-0 asked to update to rendered-master-4c88d1c343b9e2e390bbc5fdca7b6095 [7].
13. 18:55:33Z, ci-op-m36h5d5b-8aaeb-szd6x-master-2 asked to update to rendered-master-4c88d1c343b9e2e390bbc5fdca7b6095 [7].

So issues:

a. machine-config should not go Available=False (step 3) or Degraded=True (step 2) during this healthy update.  Clayton has [9] up to guard against this.
b. machine-config should delay bumping the 'operator' entry in status.versions (step 9) until the 'master' pool completes (after step 13).  Clayton has [8] up to help guard against this.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/988/pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn-upgrade/1364259570968432640
[2]: $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/988/pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn-upgrade/1364259570968432640/artifacts/e2e-gcp-ovn-upgrade/gather-extra/artifacts/clusterversion.json | jq -r '.items[].status.history[] | .startedTime + " " + (.completionTime // "-") + " " + .state + " " + .version'
     2021-02-23T17:55:04Z 2021-02-23T18:43:25Z Completed 4.8.0-0.ci.test-2021-02-23-170859-ci-op-m36h5d5b
     2021-02-23T17:22:37Z 2021-02-23T17:52:39Z Completed 4.8.0-0.ci.test-2021-02-23-170341-ci-op-m36h5d5b
[3]: $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/988/pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn-upgrade/1364259570968432640/artifacts/e2e-gcp-ovn-upgrade/gather-extra/artifacts/machineconfigs.json | jq -r '.items[].metadata | .creationTimestamp + " " + .name' | grep render | sort
     2021-02-23T17:33:03Z rendered-master-0791a04211638c1b7706db957757d22c
     2021-02-23T17:33:03Z rendered-worker-94205363958f282f378ae9bb3586a9c0
     2021-02-23T18:40:39Z rendered-master-4c88d1c343b9e2e390bbc5fdca7b6095
     2021-02-23T18:40:39Z rendered-worker-99e80e3ee204abddcbf2411760dc2e93
[4]: $ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/988/pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn-upgrade/1364259570968432640/build-log.txt | grep clusteroperator/machin
e-config
     Feb 23 18:34:42.844 E clusteroperator/machine-config changed Degraded to True: RenderConfigFailed: Failed to resync 4.8.0-0.ci.test-2021-02-23-170341-ci-op-m36h5d5b because: refusing to read osImageURL version "4.8.0-0.ci.test-2021-02-23-170859-ci-op-m36h5d5b", operator version "4.8.0-0.ci.test-2021-02-23-170341-ci-op-m36h5d5b"
     Feb 23 18:34:43.247 W clusteroperator/machine-config changed Available to False: Cluster not available for 4.8.0-0.ci.test-2021-02-23-170341-ci-op-m36h5d5b
     Feb 23 18:37:33.310 W clusteroperator/machine-config changed Progressing to True: Working towards 4.8.0-0.ci.test-2021-02-23-170859-ci-op-m36h5d5b
     Feb 23 18:37:33.311 I /machine-config reason/OperatorVersionChanged clusteroperator/machine-config-operator started a version change from [{operator 4.8.0-0.ci.test-2021-02-23-170341-ci-op-m36h5d5b}] to [{operator 4.8.0-0.ci.test-2021-02-23-170859-ci-op-m36h5d5b}]
     Feb 23 18:37:33.348 W clusteroperator/machine-config changed Degraded to False
     Feb 23 18:41:12.774 W clusteroperator/machine-config changed Available to True: Cluster has deployed 4.8.0-0.ci.test-2021-02-23-170859-ci-op-m36h5d5b
     Feb 23 18:41:12.834 I /machine-config reason/OperatorVersionChanged clusteroperator/machine-config-operator version changed from [{operator 4.8.0-0.ci.test-2021-02-23-170341-ci-op-m36h5d5b}] to [{operator 4.8.0-0.ci.test-2021-02-23-170859-ci-op-m36h5d5b}]
     Feb 23 18:41:12.904 I clusteroperator/machine-config versions: operator 4.8.0-0.ci.test-2021-02-23-170341-ci-op-m36h5d5b -> 4.8.0-0.ci.test-2021-02-23-170859-ci-op-m36h5d5b
     Feb 23 18:41:13.497 W clusteroperator/machine-config changed Progressing to False: Cluster version is 4.8.0-0.ci.test-2021-02-23-170859-ci-op-m36h5d5b
[5]: https://github.com/openshift/api/blob/70b287c0a0a51be5cdc88d40858de59e31222f5f/config/v1/types_cluster_operator.go#L45
[6]: https://github.com/openshift/machine-config-operator/blob/1402af23b7b6f5acb6dd42ab2ed41c2e01acbbcc/install/0000_80_machine-config-operator_06_clusteroperator.yaml#L12-L13
[7]: $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/988/pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn-upgrade/1364259570968432640/artifacts/e2e-gcp-
ovn-upgrade/gather-extra/artifacts/events.json | jq -r '.items[] | select(.reason == "SetDesiredConfig") | .firstTimestamp + " " + (.count | tostring) + " " + .message' | sort | grep master
     2021-02-23T18:40:44Z 1 Targeted node ci-op-m36h5d5b-8aaeb-szd6x-master-1 to config rendered-master-4c88d1c343b9e2e390bbc5fdca7b6095
     2021-02-23T18:48:13Z 1 Targeted node ci-op-m36h5d5b-8aaeb-szd6x-master-0 to config rendered-master-4c88d1c343b9e2e390bbc5fdca7b6095
     2021-02-23T18:55:33Z 1 Targeted node ci-op-m36h5d5b-8aaeb-szd6x-master-2 to config rendered-master-4c88d1c343b9e2e390bbc5fdca7b6095
[8]: https://github.com/openshift/origin/pull/25922
[9]: https://github.com/openshift/origin/pull/25920

Comment 1 Yu Qi Zhang 2021-02-24 01:35:58 UTC

Hey Trevor, a few questions:

1. is this a new issue in 4.8 only and not in previous versions? Do you know if this happens every time (in terms of unavailable + degraded during the upgrade timeframe)

2. for "a. machine-config should not go Available=False (step 3) or Degraded=True (step 2) during this healthy update.  Clayton has [9] up to guard against this."

How exactly does this failure manifest? Does it stop the upgrade?

3. b. machine-config should delay bumping the 'operator' entry in status.versions (step 9) until the 'master' pool completes (after step 13).  Clayton has [8] up to help guard against this.

Is the MCO prematurely reporting upgrade completion before master has finished? Is this also new behaviour?

Comment 2 W. Trevor King 2021-02-24 02:37:07 UTC

> ...is this a new issue in 4.8 only and not in previous versions?

No information yet on frequency.  Clayton's origin#25920 and 25922 should give us that information.  Might be able to query CI search in the meantime, but I haven't looked for a fingerprint to use.

> How exactly does this failure manifest? Does it stop the upgrade?

It temporarily blocks the update.  If it lasts for 10+ minutes (was 6m or less this update), it would trigger a page-at-midnight ClsuterOperatorDown alert.  And anyone who polls ClusterOperator during the update and sees Available=False from machine-config might justifiably believe that their cluster was exploding, and then be more likely to ignore later Available=False that represented real "my component is dead" concerns.

> Is the MCO prematurely reporting upgrade completion before master has finished?

Yes.

> Is this also new behaviour?

Unclear.

Comment 3 W. Trevor King 2021-02-24 03:29:19 UTC

Rough handle on how often this happens:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=build-log&search=Pool+master+is+still+reporting.*Updating:+true' | grep 'failures match' | sort
pull-ci-openshift-cluster-cloud-controller-manager-operator-master-e2e-aws-upgrade - 7 runs, 71% failed, 20% of failures match
pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn-upgrade - 10 runs, 30% failed, 33% of failures match
pull-ci-openshift-cluster-node-tuning-operator-master-e2e-upgrade - 5 runs, 60% failed, 100% of failures match
pull-ci-openshift-cluster-openshift-controller-manager-operator-release-4.7-e2e-upgrade - 2 runs, 50% failed, 100% of failures match
pull-ci-openshift-installer-master-e2e-aws-upgrade - 28 runs, 75% failed, 29% of failures match
pull-ci-openshift-kubernetes-release-4.7-e2e-azure-upgrade - 2 runs, 50% failed, 100% of failures match
pull-ci-openshift-machine-api-operator-master-e2e-aws-upgrade - 9 runs, 78% failed, 29% of failures match
pull-ci-openshift-oc-master-e2e-aws-upgrade - 14 runs, 50% failed, 43% of failures match
pull-ci-openshift-openshift-apiserver-master-e2e-aws-upgrade - 3 runs, 67% failed, 50% of failures match
pull-ci-openshift-ovn-kubernetes-master-e2e-gcp-ovn-upgrade - 4 runs, 25% failed, 100% of failures match
pull-ci-openshift-service-ca-operator-master-e2e-aws-upgrade - 2 runs, 50% failed, 100% of failures match
release-openshift-okd-installer-e2e-aws-upgrade - 11 runs, 36% failed, 50% of failures match
release-openshift-origin-installer-e2e-azure-upgrade-4.7 - 4 runs, 75% failed, 100% of failures match
release-openshift-origin-installer-e2e-azure-upgrade-4.8 - 4 runs, 25% failed, 100% of failures match

Searching with [1] turns up [2] as one of the older entries still in CI-search's 14d history, and [2] has:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/25884/pull-ci-openshift-origin-release-4.7-e2e-gcp-upgrade/1359664951764979712/build-log.txt | grep -1 'Pool master is still reporting.*Updating: true'
Feb 11 02:16:04.771: INFO: Waiting on pools to be upgraded
Feb 11 02:16:04.816: INFO: Pool master is still reporting (Updated: false, Updating: true, Degraded: false)
...
Feb 11 02:33:54.879: INFO: Pool master is still reporting (Updated: false, Updating: true, Degraded: false)
Feb 11 02:34:04.894: INFO: All pools completed upgrade

So that suggests at least 4.7 is impacted, this failure mode is not new, and this failure mode is easy to reproduce.

[1]: https://search.ci.openshift.org/?search=Pool+master+is+still+reporting.*Updating%3A+true&maxAge=336h&context=1&type=build-log&name=upgrade&maxMatches=5&maxBytes=20971520&groupBy=job
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25884/pull-ci-openshift-origin-release-4.7-e2e-gcp-upgrade/1359664951764979712

Comment 4 W. Trevor King 2021-02-24 04:05:34 UTC

4.6.18 -> 4.7.0 CI job [1]:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws/1363777305826037760/artifacts/launch/container-logs/test.log | grep -A1 'Waiting on pools'
Feb 22 10:51:05.232: INFO: Waiting on pools to be upgraded
Feb 22 10:51:05.460: INFO: All pools completed upgrade

4.7.0-0.nightly-2021-02-22-135912 -> 4.7.0-0.nightly-2021-02-22-210958 [2]:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upgrade-4.7/1363960100204580864/build-log.txt | grep -A1 'Waiting on pools to be upgraded'
Feb 22 22:06:11.253: INFO: Waiting on pools to be upgraded
Feb 22 22:06:11.332: INFO: All pools completed upgrade

4.8.0-0.nightly-2021-02-23-160229 -> 4.8.0-0.nightly-2021-02-23-200827 [3]:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upgrade-4.8/1364308785111240704/build-log.txt | grep -A1 'Waiting on pools'
Feb 23 21:57:36.319: INFO: Waiting on pools to be upgraded
Feb 23 21:57:36.372: INFO: Pool worker is still reporting (Updated: false, Updating: true, Degraded: false)
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upgrade-4.8/1364308785111240704/build-log.txt | grep -c 'Pool master is'      
0

So those three release-informers all look good.  Not clear to me why the 'failures match' percentages from the previous comment are so high, but the number of matching job names is so limited.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws/1363777305826037760
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upgrade-4.7/1363960100204580864
[3]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upgrade-4.8/1364308785111240704

Comment 5 W. Trevor King 2021-02-24 05:55:26 UTC

Looking at all e2e failures over the past 4d:

$ curl -s 'https://search.ci.openshift.org/search?maxAge=96h&type=build-log&context=2&search=Waiting+on+pools+to+be+upgraded' | jq -r 'length as $all | ([to_entries[] | tostring | select(contains("Pool master is still reporting"))] | length) as $late_master | 
($late_master | tostring) + " / " + ($all | tostring)'
72 / 519

So 72 jobs where we had late masters out of 519 update jobs which failed e2e.  Breaking down by job name:

$ curl -s 'https://search.ci.openshift.org/search?maxAge=96h&type=build-log&context=2&search=Waiting+on+pools+to+be+upgraded' | jq -r '. as $data | ([to_entries[].value | to_entries[0].value[0].name]) | unique[] | . as $name | ([$data | to_entries[].value | to_entries[0].value[0] | select(.name == $name)]) as $jobs | ($jobs | length) as $all | ([$jobs[] | tostring | select(contains("Pool master is still reporting"))] | length) as $late_master | ($late_master / $all | tostring) + " " + ($late_master 
| tostring) + " / " + ($all | tostring) + " " + $name' | sort -n
0 0 / 1 endurance-upgrade-aws-4.6
...
0 0 / 9 periodic-ci-openshift-release-master-origin-4.6-e2e-azure-upgrade
0.08695652173913043 2 / 23 release-openshift-origin-installer-e2e-aws-upgrade
0.125 1 / 8 pull-ci-openshift-origin-master-e2e-gcp-upgrade
0.125 2 / 16 periodic-ci-openshift-release-master-ocp-4.8-e2e-metal-ipi-upgrade
0.16666666666666666 1 / 6 release-openshift-origin-installer-e2e-azure-upgrade-4.9
0.18181818181818182 2 / 11 pull-ci-openshift-service-ca-operator-master-e2e-aws-upgrade
0.2 2 / 10 release-openshift-ocp-installer-e2e-aws-upgrade-4.8
0.25 1 / 4 pull-ci-openshift-cluster-cloud-controller-manager-operator-master-e2e-aws-upgrade
0.2608695652173913 6 / 23 pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-upgrade
0.2682926829268293 11 / 41 pull-ci-openshift-installer-master-e2e-aws-upgrade
0.3333333333333333 1 / 3 pull-ci-openshift-cluster-ingress-operator-master-e2e-upgrade
0.3333333333333333 2 / 6 pull-ci-openshift-machine-api-operator-master-e2e-aws-upgrade
0.3333333333333333 3 / 9 pull-ci-openshift-cluster-kube-apiserver-operator-master-e2e-upgrade
0.36363636363636365 4 / 11 pull-ci-openshift-oc-master-e2e-aws-upgrade
0.5 1 / 2 pull-ci-openshift-openshift-apiserver-master-e2e-aws-upgrade
0.5384615384615384 7 / 13 release-openshift-okd-installer-e2e-aws-upgrade
0.6666666666666666 2 / 3 pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn-upgrade
0.75 3 / 4 pull-ci-openshift-cluster-node-tuning-operator-master-e2e-upgrade
1 1 / 1 pull-ci-openshift-cluster-api-provider-aws-master-e2e-aws-upgrade
1 1 / 1 pull-ci-openshift-cluster-openshift-controller-manager-operator-release-4.7-e2e-upgrade
1 1 / 1 pull-ci-openshift-images-master-e2e-aws-upgrade
1 1 / 1 pull-ci-openshift-kubernetes-release-4.7-e2e-azure-upgrade
1 1 / 1 pull-ci-openshift-openshift-apiserver-release-4.7-e2e-aws-upgrade
1 1 / 1 pull-ci-openshift-ovn-kubernetes-master-e2e-gcp-ovn-upgrade
1 1 / 1 release-openshift-origin-installer-e2e-azure-upgrade-4.8
1 5 / 5 release-openshift-origin-installer-e2e-gcp-upgrade-4.7
1 9 / 9 release-openshift-origin-installer-e2e-azure-upgrade-4.7

So 4.7 OKD Azure and GCP updates seem like reliable reproducers.  Getting a bunch of job links:

$ curl -s 'https://search.ci.openshift.org/search?maxAge=96h&name=^release-openshift-origin-installer-e2e-azure-upgrade-4
.7$&type=build-log&context=2&search=Waiting+on+pools+to+be+upgraded' | jq -r 'to_entries[] | .key + "\n  " + ([.value | to_entries[].value[].context[]] | join("\n  "))' | grep '^http\|Pool master'
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.7/1362996944573566976
  Feb 20 06:56:02.052: INFO: Pool master is still reporting (Updated: false, Updating: true, Degraded: false)
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.7/1363087725145624578
  Feb 20 12:59:13.866: INFO: Pool master is still reporting (Updated: false, Updating: true, Degraded: false)
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.7/1363269331844075520
  Feb 21 01:06:58.210: INFO: Pool master is still reporting (Updated: false, Updating: true, Degraded: false)
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.7/1363632187471040512
  Feb 22 01:03:30.494: INFO: Pool master is still reporting (Updated: false, Updating: true, Degraded: false)
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.7/1363813696526094336
  Feb 22 13:26:27.738: INFO: Pool master is still reporting (Updated: false, Updating: true, Degraded: false)
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.7/1363904442780356608
  Feb 22 19:01:16.860: INFO: Pool master is still reporting (Updated: false, Updating: true, Degraded: false)
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.7/1364085961394229248
  Feb 23 06:58:17.556: INFO: Pool master is still reporting (Updated: false, Updating: true, Degraded: false)
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.7/1364267562774302720
  Feb 23 19:07:57.801: INFO: Pool master is still reporting (Updated: false, Updating: true, Degraded: false)
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.7/1364358329765203968
  Feb 24 01:10:24.861: INFO: Pool master is still reporting (Updated: false, Updating: true, Degraded: false)

And releases used in those updates:

$ curl -s 'https://search.ci.openshift.org/search?maxAge=96h&name=^release-openshift-origin-installer-e2e-azure-upgrade-4.7$&type=build-log&context=1&search=Resolved+release' | jq -r 'to_entries[] | .key + "\n  " + ([.value | to_entries[].value[].context[]] | join("\n  "))' | grep '^http\|Resolved release'
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.7/1362996944573566976
  2021/02/20 05:26:34 Resolved release initial to registry.ci.openshift.org/ocp/release:4.7.0-0.ci-2021-02-18-184351
  2021/02/20 05:26:34 Resolved release latest to registry.ci.openshift.org/ocp/release:4.7.0-0.ci-2021-02-18-205909
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.7/1363087725145624578
  2021/02/20 11:27:03 Resolved release initial to registry.ci.openshift.org/ocp/release:4.7.0-0.ci-2021-02-18-184351
  2021/02/20 11:27:03 Resolved release latest to registry.ci.openshift.org/ocp/release:4.7.0-0.ci-2021-02-18-205909
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.7/1363269331844075520
  2021/02/20 23:28:40 Resolved release initial to registry.ci.openshift.org/ocp/release:4.7.0-0.ci-2021-02-18-184351
  2021/02/20 23:28:40 Resolved release latest to registry.ci.openshift.org/ocp/release:4.7.0-0.ci-2021-02-18-205909
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.7/1363632187471040512
  2021/02/21 23:30:44 Resolved release initial to registry.ci.openshift.org/ocp/release:4.7.0-0.ci-2021-02-18-184351
  2021/02/21 23:30:44 Resolved release latest to registry.ci.openshift.org/ocp/release:4.7.0-0.ci-2021-02-18-205909
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.7/1363722874543673344
  2021/02/22 05:30:59 Resolved release initial to registry.ci.openshift.org/ocp/release:4.7.0-0.ci-2021-02-18-205909
  2021/02/22 05:30:59 Resolved release latest to registry.ci.openshift.org/ocp/release:4.7.0-0.ci-2021-02-22-030505
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.7/1363813696526094336
  2021/02/22 11:31:44 Resolved release initial to registry.ci.openshift.org/ocp/release:4.7.0-0.ci-2021-02-22-030505
  2021/02/22 11:31:44 Resolved release latest to registry.ci.openshift.org/ocp/release:4.7.0-0.ci-2021-02-22-033505
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.7/1363904442780356608
  2021/02/22 17:32:23 Resolved release initial to registry.ci.openshift.org/ocp/release:4.7.0-0.ci-2021-02-22-033505
  2021/02/22 17:32:23 Resolved release latest to registry.ci.openshift.org/ocp/release:4.7.0-0.ci-2021-02-22-100111
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.7/1364085961394229248
  2021/02/23 05:33:42 Resolved release initial to registry.ci.openshift.org/ocp/release:4.7.0-0.ci-2021-02-22-164056
  2021/02/23 05:33:42 Resolved release latest to registry.ci.openshift.org/ocp/release:4.7.0-0.ci-2021-02-22-185213
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.7/1364267562774302720
  2021/02/23 17:35:20 Resolved release initial to registry.ci.openshift.org/ocp/release:4.7.0-0.ci-2021-02-22-164056
  2021/02/23 17:35:20 Resolved release latest to registry.ci.openshift.org/ocp/release:4.7.0-0.ci-2021-02-22-185213
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.7/1364358329765203968
  2021/02/23 23:35:57 Resolved release initial to registry.ci.openshift.org/ocp/release:4.7.0-0.ci-2021-02-22-164056
  2021/02/23 23:35:57 Resolved release latest to registry.ci.openshift.org/ocp/release:4.7.0-0.ci-2021-02-22-185213

But it's currently:

$ date --utc --iso=m
2021-02-24T05:52+00:00

So maybe something has been fixed in 4.7 in this space in the past two days?  Nope, looks like 4.7 is just really quiet [1].  [2] shows the only change between the impacted 4.7.0-0.ci-2021-02-22-185213 and the current tip to be a rebuild of the ovirt-csi-driver container.

[1]: https://amd64.ocp.releases.ci.openshift.org/#4.7.0-0.ci
[2]: https://amd64.ocp.releases.ci.openshift.org/releasestream/4.7.0-0.ci/release/4.7.0-0.ci-2021-02-22-185213?from=4.7.0-0.ci-2021-02-22-171056

Comment 6 Yu Qi Zhang 2021-02-24 18:22:50 UTC

I observed a 4.7.0 to 4.8-nightly upgrade on GCP to see the state of the upgrade, dumping some of my findings here:

The upgrade progresses to 83% (MCO) and the MCO gets to the following state on co status:

  Conditions:
    Last Transition Time:  2021-02-24T17:53:59Z
    Message:               Working towards 4.8.0-0.nightly-2021-02-24-063313
    Status:                True                                                    <- correct
    Type:                  Progressing
    Last Transition Time:  2021-02-24T17:53:59Z
    Status:                False                                                   <- correct
    Type:                  Degraded
    Last Transition Time:  2021-02-24T17:52:01Z
    Message:               Cluster not available for 4.7.0
    Status:                False                                                   <- incorrect?
    Type:                  Available
    Last Transition Time:  2021-02-24T16:04:40Z
    Reason:                AsExpected
    Status:                True                                                    <- also incorrect?
    Type:                  Upgradeable

The availability of the operator should be true, and in the code,

cov1helpers.IsStatusConditionTrue(co.Status.Conditions, configv1.OperatorDegraded)

Is the only evaluation criteria we seem to have for setting available=false. That is obviously false, so how come available is evaluating false?

This ties a bit back into the fact that we're using available==degraded which is also wrong, but that's another discussion.

The operator does actually not go progressing=True or update its version until master pool is done, which is correct, so this is not consistently reproducible. Will try to look into the above job logs to see what's happening there.

A few more observations:

DNS auth openshift-apiserver goes degraded while MCO is updating, this in turn causes clusterversion status to report

Unable to apply 4.8.0-0.nightly-2021-02-24-063313: wait has exceeded 40 minutes for these operators: openshift-apiserver

and then eventually wraps around to:

Working towards 4.8.0-0.nightly-2021-02-24-063313: 172 of 669 done (25% complete)

I assume this is the CVO attempting to reconcile and starting from the beginning?

After the master MCP completes, the MCO reports complete, all other operators are back to not-degraded, and eventually the clusterversion reports: Cluster version is 4.8.0-0.nightly-2021-02-24-063313

And the update completes successfully.

It feels like this is a bit of a mess. My first investigation will be why the availability of the operator is not being evaluated correctly. If you have some idea of whether the above is "normal" @Trevor it'd be appreciated.

Comment 7 Yu Qi Zhang 2021-02-24 19:32:55 UTC

Looking at the status sync, the majority of the code hasn't changed in awhile:

https://github.com/openshift/machine-config-operator/blob/master/pkg/operator/status.go

The version sync happens here:

https://github.com/openshift/machine-config-operator/blob/789951039423aa1745af3b23f06ce00ba5ebe8f9/pkg/operator/sync.go#L95

which would happen only if

https://github.com/openshift/machine-config-operator/blob/789951039423aa1745af3b23f06ce00ba5ebe8f9/pkg/operator/sync.go#L66 

as finished, which can only happen if

https://github.com/openshift/machine-config-operator/blob/789951039423aa1745af3b23f06ce00ba5ebe8f9/pkg/operator/sync.go#L568

has completed.

So theoretically while the pool is still progressing, the version should never get to update.

Now unfortunately in some of the failing runs I looked at, the MCO pod logs start after the version has been updated, which means a previous MCO pod set the version wrongly and the logs are now lost. I will try to replicate this to see what exactly is happening, but without the relevant logs I don't quite see how we got to a situation where the version is updated wrongly (and why this wouldn't have been the case 2 releases ago, since the code hasn't really changed)

Comment 8 W. Trevor King 2021-02-24 23:01:35 UTC

(In reply to Yu Qi Zhang from comment #6)
>     Last Transition Time:  2021-02-24T17:52:01Z
>     Message:               Cluster not available for 4.7.0
>     Status:                False                                             <- incorrect?
>     Type:                  Available

Yeah.  You should only go Available=False when an operand (the pool here?) is non-functional or non-available [1].

>     Last Transition Time:  2021-02-24T16:04:40Z
>     Reason:                AsExpected
>     Status:                True                                              <- also incorrect?
>     Type:                  Upgradeable

Yeah, ideally we'd go Upgradeable=False during 4.7->4.8 to block folks from redirecting their update to 4.9 until the pools had all caught up to 4.8, unless we get [2] or similar to get safe n-2 version skew.

> The availability of the operator should be true, and in the code,
> 
> cov1helpers.IsStatusConditionTrue(co.Status.Conditions,
> configv1.OperatorDegraded)
> 
> Is the only evaluation criteria we seem to have for setting available=false.
> That is obviously false, so how come available is evaluating false?

I'm not sure about this.  Will poke around some.

> The operator does actually not go progressing=True or update its version
> until master pool is done, which is correct, so this is not consistently
> reproducible. Will try to look into the above job logs to see what's
> happening there.

I think you mean "back to progressing=False".  And yeah, I haven't been able to reproduce the operator-leveling-before-master-pool behavior in the few jobs I've launched myself.

> Working towards 4.8.0-0.nightly-2021-02-24-063313: 172 of 669 done (25%
> complete)
> 
> I assume this is the CVO attempting to reconcile and starting from the
> beginning?

CVO doesn't persist the % completion, so if the CVO pod is drained, the replacement CVO pod will come back up at 0% and should quickly burn through the bulk of the already-reconciled update before sticking somewhere near where it had been before.

(In reply to Yu Qi Zhang from comment #7)
> Now unfortunately in some of the failing runs I looked at, the MCO pod logs start after the version has been updated, which means a previous MCO pod set the version wrongly and the logs are now lost.

Yeah, this is a pain.  We're rolling out some hosted-Loki stuff, e.g. [3] which should help mitigate.

[1]: https://github.com/openshift/api/blob/70b287c0a0a51be5cdc88d40858de59e31222f5f/config/v1/types_cluster_operator.go#L143-L145
[2]: https://github.com/openshift/enhancements/blob/41692eb45ffaddea8bcb8c002462c527fd59c707/enhancements/update/eus-upgrades-mvp.md
[3]: https://github.com/openshift/release/pull/16186

Comment 9 Yu Qi Zhang 2021-02-24 23:28:55 UTC

Still digging into some details but https://github.com/openshift/machine-config-operator/pull/2433 should be a potential remediation. Let me know if you are able to reproduce outside of specific CI runs

Comment 10 W. Trevor King 2021-02-25 04:24:26 UTC

Sanity check showing that the initial CI job discussed in comment 0 does have the required-for-upgrade label set [1]:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/988/pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn-upgrade/1364259570968432640/artifacts/e2e-gcp-ovn-upgrade/gather-extra/artifacts/machineconfigpools.json | jq -r '.items[].metadata | .name + "\n  " + (.labels | keys | join("\n  "))'
master
  machineconfiguration.openshift.io/mco-built-in
  operator.machineconfiguration.openshift.io/required-for-upgrade
  pools.operator.machineconfiguration.openshift.io/master
worker
  machineconfiguration.openshift.io/mco-built-in
  pools.operator.machineconfiguration.openshift.io/worker

[1]: https://github.com/openshift/machine-config-operator/blame/f53cb9aa4a9c81ec4d0dfe6cb734957d4523676a/pkg/operator/sync.go#L38

Comment 11 W. Trevor King 2021-02-26 00:51:24 UTC

Jerry and I dug into this some more today.  Going back to comment 0, this window is just too short:

6. 18:40:39Z, rendered-master-4c88d1c343b9e2e390bbc5fdca7b6095 created...
7. 18:40:44Z, ci-op-m36h5d5b-8aaeb-szd6x-master-1 asked to update to rendered-master-4c88d1c343b9e2e390bbc5fdca7b6095...
8. 18:41:12Z, machine-config sets Available=True...
9. 18:41:12Z, machine-config bumps its 'operator' version...

So we suspect syncRequiredMachineConfigPools, which was bumped in 4.7 with [1], learning to watch all pools instead of just watching required pools.  Before that PR, the file had [2]:

	if pool.Generation <= pool.Status.ObservedGeneration &&
		isPoolStatusConditionTrue(pool, mcfgv1.MachineConfigPoolUpdated) &&
		!degraded {
		continue
	}
	lastErr = fmt.Errorf("error pool %s is not ready, retrying. Status: (pool degraded: %v total: %d, ready %d, updated: %d, unavailable: %d)", pool.Name, degraded, pool.Status.MachineCount, pool.Status.ReadyMachineCount, pool.Status.UpdatedMachineCount, pool.Status.UnavailableMachineCount)
	return false, nil

That's accepting the pool as valid only when:

a. The pool controller is caught up (metadata.generation <= status.observedGeneration).  I.e. the pool controller has noticed the bumped rendered MachineConfig, because that bump brought a metadata.generation bump, and the controller has passed the metadata.generation bump on into status.observedGeneration.
b. The pool is updated.  And we know the pool controller's claim about up-to-date-ness isn't too stale, because of (a).
c. The pool isn't degraded.

Any of those conditions fail, and we set "pool ... not ready" and take another pass through the Poll loop.

After [1]:

a. Any degraded pool will block the MCO from reconciling [3], which is a bit harsh.  I'd rather have any degraded pool block future minor updates via Upgradeable=False, but non-required pools should not be able to directly block the current update by going degraded.
b. We grow an isMachineConfigPoolConfigurationValid call [4].  Not quite sure what that's about.
c. If we pass the generation and updated checks, we continue [5], like before.
d. But (this is the main bug), after that continue, there is no longer the "pool ... not ready" error (that error is now only triggered by a degraded pool; it used to be triggered by all not-level-yet required pools).

[1]: https://github.com/openshift/machine-config-operator/pull/2231
[2]: https://github.com/openshift/machine-config-operator/blame/6531aac963b1ef9c2508838f7593ed6e91cdad62/pkg/operator/sync.go#L607-L613
[3]: https://github.com/openshift/machine-config-operator/pull/2231/files#diff-3e205577bec1a1d1711df8bfeff30e63f044b24cfbdc69bbd7d0052fdc881891R595
[4]: https://github.com/openshift/machine-config-operator/pull/2231/files#diff-3e205577bec1a1d1711df8bfeff30e63f044b24cfbdc69bbd7d0052fdc881891R601
[5]: https://github.com/openshift/machine-config-operator/pull/2231/files#diff-3e205577bec1a1d1711df8bfeff30e63f044b24cfbdc69bbd7d0052fdc881891R608

Comment 12 W. Trevor King 2021-02-26 21:16:55 UTC

Down to one referenced PR, and that one has merged.  Moving to MODIFIED.

Comment 14 Yu Qi Zhang 2021-03-01 15:21:37 UTC

Reassigning to Kirsten who provided the fix

Comment 15 W. Trevor King 2021-03-01 18:18:41 UTC

Still seeing this in CI, even though the jobs are now using mco#2440:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=the+%22master%22+pool+should+be+updated+before+the+CVO+reports+available+at+the+new+version&maxAge=12h' | grep 'failures match' | sort
periodic-ci-openshift-release-master-nightly-4.8-e2e-metal-ipi-upgrade - 3 runs, 67% failed, 100% of failures match
pull-ci-openshift-cluster-image-registry-operator-master-e2e-upgrade - 2 runs, 100% failed, 50% of failures match
pull-ci-openshift-cluster-network-operator-master-e2e-agnostic-upgrade - 2 runs, 50% failed, 100% of failures match
pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn-upgrade - 2 runs, 100% failed, 50% of failures match
pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-upgrade - 3 runs, 100% failed, 33% of failures match
rehearse-15646-pull-ci-openshift-cluster-api-provider-baremetal-master-e2e-metal-ipi-upgrade - 1 runs, 100% failed, 100% of failures match
rehearse-16324-pull-ci-openshift-cloud-credential-operator-master-e2e-upgrade - 2 runs, 50% failed, 100% of failures match
rehearse-16324-pull-ci-openshift-machine-api-operator-master-e2e-vsphere-upgrade - 2 runs, 50% failed, 100% of failures match
release-openshift-ocp-installer-e2e-aws-upgrade-4.8 - 3 runs, 67% failed, 100% of failures match
release-openshift-okd-installer-e2e-aws-upgrade - 8 runs, 25% failed, 50% of failures match

Moving back to ASSIGNED while we investigate.

Comment 17 Michael Nguyen 2021-03-09 20:44:19 UTC

Upgraded from 4.7.1 to 4.8.0-0.nightly-2021-03-08-133419

Watched `oc get co/machine-config`, `oc get mcp/master`, and `oc get clusterversion`.  The transition from mcp/master updated go from false to true at the same time co/machine-config goes from unavailable to available.
 
Tue Mar  9 15:37:03 EST 2021
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.1     True        True          60m     Working towards 4.8.0-0.nightly-2021-03-08-133419: 173 of 668 done (25% complete), waiting up to 40 minutes on openshift-apiserver
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
machine-config   4.7.1     False       True          False      17m
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-7639ea368e9a6940c480cd4c7ebc9570   False     True       False      3              2                   2                     0                      88m
 
Tue Mar  9 15:37:34 EST 2021
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.1     True        True          60m     Working towards 4.8.0-0.nightly-2021-03-08-133419: 173 of 668 done (25% complete), waiting up to 40 minutes on openshift-apiserver
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
machine-config   4.7.1     False       True          False      18m
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-7639ea368e9a6940c480cd4c7ebc9570   False     True       False      3              2                   2                     0                      88m
 
Tue Mar  9 15:38:05 EST 2021
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.1     True        True          61m     Working towards 4.8.0-0.nightly-2021-03-08-133419: 173 of 668 done (25% complete), waiting up to 40 minutes on openshift-apiserver
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
machine-config   4.7.1     False       True          False      18m
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-7639ea368e9a6940c480cd4c7ebc9570   False     True       False      3              2                   2                     0                      89m
 
Tue Mar  9 15:38:37 EST 2021
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.1     True        True          61m     Working towards 4.8.0-0.nightly-2021-03-08-133419: 173 of 668 done (25% complete), waiting up to 40 minutes on openshift-apiserver
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
machine-config   4.7.1     False       True          False      19m
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-7639ea368e9a6940c480cd4c7ebc9570   False     True       False      3              2                   2                     0                      89m
 
Tue Mar  9 15:39:08 EST 2021
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.1     True        True          62m     Working towards 4.8.0-0.nightly-2021-03-08-133419: 173 of 668 done (25% complete), waiting up to 40 minutes on openshift-apiserver
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
machine-config   4.7.1     False       True          False      19m
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-7639ea368e9a6940c480cd4c7ebc9570   False     True       False      3              2                   2                     0                      90m
 
Tue Mar  9 15:39:47 EST 2021
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.1     True        True          63m     Working towards 4.8.0-0.nightly-2021-03-08-133419: 173 of 668 done (25% complete), waiting up to 40 minutes on openshift-apiserver
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
machine-config   4.7.1     False       True          False      20m
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-7639ea368e9a6940c480cd4c7ebc9570   False     True       False      3              2                   2                     0                      91m
 
Tue Mar  9 15:40:18 EST 2021
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.1     True        True          63m     Working towards 4.8.0-0.nightly-2021-03-08-133419: 260 of 668 done (38% complete)
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
machine-config   4.8.0-0.nightly-2021-03-08-133419   True        False         False      27s
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-a5c62ad5e856b0d6b72504e1c2c6bf4c   True      False      False      3              3                   3                     0                      91m
 
Tue Mar  9 15:40:50 EST 2021
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.1     True        True          64m     Working towards 4.8.0-0.nightly-2021-03-08-133419: 664 of 668 done (99% complete)
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
machine-config   4.8.0-0.nightly-2021-03-08-133419   True        False         False      59s
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-a5c62ad5e856b0d6b72504e1c2c6bf4c   True      False      False      3              3                   3                     0                      92m
 
Tue Mar  9 15:41:21 EST 2021
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-03-08-133419   True        False         31s     Cluster version is 4.8.0-0.nightly-2021-03-08-133419
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
machine-config   4.8.0-0.nightly-2021-03-08-133419   True        False         False      90s
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-a5c62ad5e856b0d6b72504e1c2c6bf4c   True      False      False      3              3                   3                     0                      92m



MCO LOG
------------------ion rendered-master-a5c62ad5e856b0d6b72504e1c2c6bf4c, retrying
I0309 20:39:49.983283       1 sync.go:613] pool master has not progressed to latest configuration: controller version mismatch for rendered-master-7639ea368e9a6940c480cd4c7ebc9570 expected 82868e63176fee2bc806c1deb308ed1fc8965d84 has ff9431921ef6c22f602da46c07980622f2809546: 2 (ready 2) out of 3 nodes are updating to latest configuration rendered-master-a5c62ad5e856b0d6b72504e1c2c6bf4c, retrying
I0309 20:39:50.982471       1 sync.go:613] pool master has not progressed to latest configuration: controller version mismatch for rendered-master-7639ea368e9a6940c480cd4c7ebc9570 expected 82868e63176fee2bc806c1deb308ed1fc8965d84 has ff9431921ef6c22f602da46c07980622f2809546: 2 (ready 2) out of 3 nodes are updating to latest configuration rendered-master-a5c62ad5e856b0d6b72504e1c2c6bf4c, retrying
I0309 20:39:51.983070       1 status.go:395] pool master references machine config rendered-master-a5c62ad5e856b0d6b72504e1c2c6bf4c at the expected 82868e63176fee2bc806c1deb308ed1fc8965d84
I0309 20:39:51.983177       1 status.go:395] pool master references machine config 00-master at the expected 82868e63176fee2bc806c1deb308ed1fc8965d84
I0309 20:39:51.983205       1 status.go:395] pool master references machine config 01-master-container-runtime at the expected 82868e63176fee2bc806c1deb308ed1fc8965d84
I0309 20:39:51.983229       1 status.go:395] pool master references machine config 01-master-kubelet at the expected 82868e63176fee2bc806c1deb308ed1fc8965d84
I0309 20:39:51.983252       1 status.go:395] pool master references machine config 99-master-generated-registries at the expected 82868e63176fee2bc806c1deb308ed1fc8965d84
I0309 20:39:51.983277       1 status.go:397] pool master references machine config 99-master-ssh, which does not declare a version
I0309 20:39:51.983302       1 sync.go:619] pool master is up to date
I0309 20:39:51.995090       1 sync.go:631] required machine-config pools synchronized
I0309 20:39:52.378496       1 event.go:282] Event(v1.ObjectReference{Kind:"", Namespace:"", Name:"machine-config", UID:"5c085e65-0442-4af8-9786-a61807ed558e", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorVersionChanged' clusteroperator/machine-config-operator version changed from [{operator 4.7.1}] to [{operator 4.8.0-0.nightly-2021-03-08-133419}]
W0309 20:39:53.196631       1 warnings.go:70] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
I0309 20:39:58.706206       1 sync.go:569] syncing Required MachineConfigPools
I0309 20:39:59.706361       1 status.go:395] pool master references machine config rendered-master-a5c62ad5e856b0d6b72504e1c2c6bf4c at the expected 82868e63176fee2bc806c1deb308ed1fc8965d84
I0309 20:39:59.706484       1 status.go:395] pool master references machine config 00-master at the expected 82868e63176fee2bc806c1deb308ed1fc8965d84
I0309 20:39:59.706498       1 status.go:395] pool master references machine config 01-master-container-runtime at the expected 82868e63176fee2bc806c1deb308ed1fc8965d84
I0309 20:39:59.706773       1 status.go:395] pool master references machine config 01-master-kubelet at the expected 82868e63176fee2bc806c1deb308ed1fc8965d84
I0309 20:39:59.706839       1 status.go:395] pool master references machine config 99-master-generated-registries at the expected 82868e63176fee2bc806c1deb308ed1fc8965d84
I0309 20:39:59.706852       1 status.go:397] pool master references machine config 99-master-ssh, which does not declare a version
I0309 20:39:59.706875       1 sync.go:619] pool master is up to date
I0309 20:39:59.717163       1 sync.go:631] required machine-config pools synchronized
W0309 20:40:19.603073       1 warnings.go:70] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
I0309 20:40:24.707010       1 sync.go:569] syncing Required MachineConfigPools
I0309 20:40:25.707241       1 status.go:395] pool master references machine config rendered-master-a5c62ad5e856b0d6b72504e1c2c6bf4c at the expected 82868e63176fee2bc806c1deb308ed1fc8965d84
I0309 20:40:25.707281       1 status.go:395] pool master references machine config 00-master at the expected 82868e63176fee2bc806c1deb308ed1fc8965d84
I0309 20:40:25.707287       1 status.go:395] pool master references machine config 01-master-container-runtime at the expected 82868e63176fee2bc806c1deb308ed1fc8965d84
I0309 20:40:25.707292       1 status.go:395] pool master references machine config 01-master-kubelet at the expected 82868e63176fee2bc806c1deb308ed1fc8965d84
I0309 20:40:25.707298       1 status.go:395] pool master references machine config 99-master-generated-registries at the expected 82868e63176fee2bc806c1deb308ed1fc8965d84
I0309 20:40:25.707303       1 status.go:397] pool master references machine config 99-master-ssh, which does not declare a version
I0309 20:40:25.707309       1 sync.go:619] pool master is up to date
I0309 20:40:25.721609       1 sync.go:631] required machine-config pools synchronized
W0309 20:40:26.519764       1 warnings.go:70] apiextensions.k8s.io/v1beta1 CustomResourceDefinition is deprecated in v1.16+, unavailable in v1.22+; use apiextensions.k8s.io/v1 CustomResourceDefinition
I0309 20:40:31.982034       1 sync.go:569] syncing Required MachineConfigPools
I0309 20:40:32.982249       1 status.go:395] pool master references machine config rendered-master-a5c62ad5e856b0d6b72504e1c2c6bf4c at the expected 82868e63176fee2bc806c1deb308ed1fc8965d84

Comment 20 errata-xmlrpc 2021-07-27 22:48:06 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438