Bug 1933368 - machine-config ClusterOperator claims level while control-plane still updating
Summary: machine-config ClusterOperator claims level while control-plane still updating
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.7
Hardware: Unspecified
OS: Unspecified
unspecified
urgent
Target Milestone: ---
: 4.7.z
Assignee: Yu Qi Zhang
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On: 1932105
Blocks:
TreeView+ depends on / blocked
 
Reported: 2021-02-26 20:02 UTC by OpenShift BugZilla Robot
Modified: 2021-03-25 01:53 UTC (History)
1 user (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-03-25 01:53:00 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github openshift machine-config-operator pull 2441 0 None open [release-4.7] Bug 1933368: operator/sync.go restore err when required pools not leveled 2021-03-12 23:56:21 UTC
Red Hat Product Errata RHBA-2021:0821 0 None None None 2021-03-25 01:53:14 UTC

Description OpenShift BugZilla Robot 2021-02-26 20:02:19 UTC
+++ This bug was initially created as a clone of Bug #1932105 +++

Seen in master/4.8 CI [1]:

1. 17:55:04Z, update to 4.8.0-0.ci.test-2021-02-23-170859-ci-op-m36h5d5b initiated [2].
2. 18:34:42Z, machine-config goes Degraded=True on [4]:
     RenderConfigFailed: Failed to resync 4.8.0-0.ci.test-2021-02-23-170341-ci-op-m36h5d5b because: refusing to read osImageURL version "4.8.0-0.ci.test-2021-02-23-170859-ci-op-m36h5d5b", operator version "4.8.0-0.ci.test-2021-02-23-170341-ci-op-m36h5d5b"
3. 18:34:43Z, machine-config goes Available=False on [4]:
     Cluster not available for 4.8.0-0.ci.test-2021-02-23-170341-ci-op-m36h5d5b
4. 18:37:33Z, machine-config goes Progressing=True on [4]:
     Working towards 4.8.0-0.ci.test-2021-02-23-170859-ci-op-m36h5d5b
5. 18:37:33Z, machine-config goes Degraded=False [4].
6. 18:40:39Z, rendered-master-4c88d1c343b9e2e390bbc5fdca7b6095 created [3].
7. 18:40:44Z, ci-op-m36h5d5b-8aaeb-szd6x-master-1 asked to update to rendered-master-4c88d1c343b9e2e390bbc5fdca7b6095 [7].
8. 18:41:12Z, machine-config sets Available=True [4].
9. 18:41:12Z, machine-config bumps its 'operator' version [4], but 'operator' should only be bumped after the operator and all its operands have completed their update [5].  This unblocks the CVO, because machine-config is Available=True, Degraded=False, and 'operator' is the only version the CVO has been told to wait for [6].
10. 18:41:13Z, machine-config sets Progressing=False on [4]:
      Cluster version is 4.8.0-0.ci.test-2021-02-23-170859-ci-op-m36h5d5b
11. 18:43:25Z, update nominally completes (per ClusterVersion [2]).
12. 18:48:13Z, ci-op-m36h5d5b-8aaeb-szd6x-master-0 asked to update to rendered-master-4c88d1c343b9e2e390bbc5fdca7b6095 [7].
13. 18:55:33Z, ci-op-m36h5d5b-8aaeb-szd6x-master-2 asked to update to rendered-master-4c88d1c343b9e2e390bbc5fdca7b6095 [7].

So issues:

a. machine-config should not go Available=False (step 3) or Degraded=True (step 2) during this healthy update.  Clayton has [9] up to guard against this.
b. machine-config should delay bumping the 'operator' entry in status.versions (step 9) until the 'master' pool completes (after step 13).  Clayton has [8] up to help guard against this.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/988/pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn-upgrade/1364259570968432640
[2]: $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/988/pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn-upgrade/1364259570968432640/artifacts/e2e-gcp-ovn-upgrade/gather-extra/artifacts/clusterversion.json | jq -r '.items[].status.history[] | .startedTime + " " + (.completionTime // "-") + " " + .state + " " + .version'
     2021-02-23T17:55:04Z 2021-02-23T18:43:25Z Completed 4.8.0-0.ci.test-2021-02-23-170859-ci-op-m36h5d5b
     2021-02-23T17:22:37Z 2021-02-23T17:52:39Z Completed 4.8.0-0.ci.test-2021-02-23-170341-ci-op-m36h5d5b
[3]: $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/988/pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn-upgrade/1364259570968432640/artifacts/e2e-gcp-ovn-upgrade/gather-extra/artifacts/machineconfigs.json | jq -r '.items[].metadata | .creationTimestamp + " " + .name' | grep render | sort
     2021-02-23T17:33:03Z rendered-master-0791a04211638c1b7706db957757d22c
     2021-02-23T17:33:03Z rendered-worker-94205363958f282f378ae9bb3586a9c0
     2021-02-23T18:40:39Z rendered-master-4c88d1c343b9e2e390bbc5fdca7b6095
     2021-02-23T18:40:39Z rendered-worker-99e80e3ee204abddcbf2411760dc2e93
[4]: $ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/988/pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn-upgrade/1364259570968432640/build-log.txt | grep clusteroperator/machin
e-config
     Feb 23 18:34:42.844 E clusteroperator/machine-config changed Degraded to True: RenderConfigFailed: Failed to resync 4.8.0-0.ci.test-2021-02-23-170341-ci-op-m36h5d5b because: refusing to read osImageURL version "4.8.0-0.ci.test-2021-02-23-170859-ci-op-m36h5d5b", operator version "4.8.0-0.ci.test-2021-02-23-170341-ci-op-m36h5d5b"
     Feb 23 18:34:43.247 W clusteroperator/machine-config changed Available to False: Cluster not available for 4.8.0-0.ci.test-2021-02-23-170341-ci-op-m36h5d5b
     Feb 23 18:37:33.310 W clusteroperator/machine-config changed Progressing to True: Working towards 4.8.0-0.ci.test-2021-02-23-170859-ci-op-m36h5d5b
     Feb 23 18:37:33.311 I /machine-config reason/OperatorVersionChanged clusteroperator/machine-config-operator started a version change from [{operator 4.8.0-0.ci.test-2021-02-23-170341-ci-op-m36h5d5b}] to [{operator 4.8.0-0.ci.test-2021-02-23-170859-ci-op-m36h5d5b}]
     Feb 23 18:37:33.348 W clusteroperator/machine-config changed Degraded to False
     Feb 23 18:41:12.774 W clusteroperator/machine-config changed Available to True: Cluster has deployed 4.8.0-0.ci.test-2021-02-23-170859-ci-op-m36h5d5b
     Feb 23 18:41:12.834 I /machine-config reason/OperatorVersionChanged clusteroperator/machine-config-operator version changed from [{operator 4.8.0-0.ci.test-2021-02-23-170341-ci-op-m36h5d5b}] to [{operator 4.8.0-0.ci.test-2021-02-23-170859-ci-op-m36h5d5b}]
     Feb 23 18:41:12.904 I clusteroperator/machine-config versions: operator 4.8.0-0.ci.test-2021-02-23-170341-ci-op-m36h5d5b -> 4.8.0-0.ci.test-2021-02-23-170859-ci-op-m36h5d5b
     Feb 23 18:41:13.497 W clusteroperator/machine-config changed Progressing to False: Cluster version is 4.8.0-0.ci.test-2021-02-23-170859-ci-op-m36h5d5b
[5]: https://github.com/openshift/api/blob/70b287c0a0a51be5cdc88d40858de59e31222f5f/config/v1/types_cluster_operator.go#L45
[6]: https://github.com/openshift/machine-config-operator/blob/1402af23b7b6f5acb6dd42ab2ed41c2e01acbbcc/install/0000_80_machine-config-operator_06_clusteroperator.yaml#L12-L13
[7]: $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/988/pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn-upgrade/1364259570968432640/artifacts/e2e-gcp-
ovn-upgrade/gather-extra/artifacts/events.json | jq -r '.items[] | select(.reason == "SetDesiredConfig") | .firstTimestamp + " " + (.count | tostring) + " " + .message' | sort | grep master
     2021-02-23T18:40:44Z 1 Targeted node ci-op-m36h5d5b-8aaeb-szd6x-master-1 to config rendered-master-4c88d1c343b9e2e390bbc5fdca7b6095
     2021-02-23T18:48:13Z 1 Targeted node ci-op-m36h5d5b-8aaeb-szd6x-master-0 to config rendered-master-4c88d1c343b9e2e390bbc5fdca7b6095
     2021-02-23T18:55:33Z 1 Targeted node ci-op-m36h5d5b-8aaeb-szd6x-master-2 to config rendered-master-4c88d1c343b9e2e390bbc5fdca7b6095
[8]: https://github.com/openshift/origin/pull/25922
[9]: https://github.com/openshift/origin/pull/25920

--- Additional comment from jerzhang@redhat.com on 2021-02-24 01:35:58 UTC ---

Hey Trevor, a few questions:

1. is this a new issue in 4.8 only and not in previous versions? Do you know if this happens every time (in terms of unavailable + degraded during the upgrade timeframe)

2. for "a. machine-config should not go Available=False (step 3) or Degraded=True (step 2) during this healthy update.  Clayton has [9] up to guard against this."

How exactly does this failure manifest? Does it stop the upgrade?

3. b. machine-config should delay bumping the 'operator' entry in status.versions (step 9) until the 'master' pool completes (after step 13).  Clayton has [8] up to help guard against this.

Is the MCO prematurely reporting upgrade completion before master has finished? Is this also new behaviour?

--- Additional comment from wking@redhat.com on 2021-02-24 02:37:07 UTC ---

> ...is this a new issue in 4.8 only and not in previous versions?

No information yet on frequency.  Clayton's origin#25920 and 25922 should give us that information.  Might be able to query CI search in the meantime, but I haven't looked for a fingerprint to use.

> How exactly does this failure manifest? Does it stop the upgrade?

It temporarily blocks the update.  If it lasts for 10+ minutes (was 6m or less this update), it would trigger a page-at-midnight ClsuterOperatorDown alert.  And anyone who polls ClusterOperator during the update and sees Available=False from machine-config might justifiably believe that their cluster was exploding, and then be more likely to ignore later Available=False that represented real "my component is dead" concerns.

> Is the MCO prematurely reporting upgrade completion before master has finished?

Yes.

> Is this also new behaviour?

Unclear.

--- Additional comment from wking@redhat.com on 2021-02-24 03:29:19 UTC ---

Rough handle on how often this happens:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=build-log&search=Pool+master+is+still+reporting.*Updating:+true' | grep 'failures match' | sort
pull-ci-openshift-cluster-cloud-controller-manager-operator-master-e2e-aws-upgrade - 7 runs, 71% failed, 20% of failures match
pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn-upgrade - 10 runs, 30% failed, 33% of failures match
pull-ci-openshift-cluster-node-tuning-operator-master-e2e-upgrade - 5 runs, 60% failed, 100% of failures match
pull-ci-openshift-cluster-openshift-controller-manager-operator-release-4.7-e2e-upgrade - 2 runs, 50% failed, 100% of failures match
pull-ci-openshift-installer-master-e2e-aws-upgrade - 28 runs, 75% failed, 29% of failures match
pull-ci-openshift-kubernetes-release-4.7-e2e-azure-upgrade - 2 runs, 50% failed, 100% of failures match
pull-ci-openshift-machine-api-operator-master-e2e-aws-upgrade - 9 runs, 78% failed, 29% of failures match
pull-ci-openshift-oc-master-e2e-aws-upgrade - 14 runs, 50% failed, 43% of failures match
pull-ci-openshift-openshift-apiserver-master-e2e-aws-upgrade - 3 runs, 67% failed, 50% of failures match
pull-ci-openshift-ovn-kubernetes-master-e2e-gcp-ovn-upgrade - 4 runs, 25% failed, 100% of failures match
pull-ci-openshift-service-ca-operator-master-e2e-aws-upgrade - 2 runs, 50% failed, 100% of failures match
release-openshift-okd-installer-e2e-aws-upgrade - 11 runs, 36% failed, 50% of failures match
release-openshift-origin-installer-e2e-azure-upgrade-4.7 - 4 runs, 75% failed, 100% of failures match
release-openshift-origin-installer-e2e-azure-upgrade-4.8 - 4 runs, 25% failed, 100% of failures match

Searching with [1] turns up [2] as one of the older entries still in CI-search's 14d history, and [2] has:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/25884/pull-ci-openshift-origin-release-4.7-e2e-gcp-upgrade/1359664951764979712/build-log.txt | grep -1 'Pool master is still reporting.*Updating: true'
Feb 11 02:16:04.771: INFO: Waiting on pools to be upgraded
Feb 11 02:16:04.816: INFO: Pool master is still reporting (Updated: false, Updating: true, Degraded: false)
...
Feb 11 02:33:54.879: INFO: Pool master is still reporting (Updated: false, Updating: true, Degraded: false)
Feb 11 02:34:04.894: INFO: All pools completed upgrade

So that suggests at least 4.7 is impacted, this failure mode is not new, and this failure mode is easy to reproduce.

[1]: https://search.ci.openshift.org/?search=Pool+master+is+still+reporting.*Updating%3A+true&maxAge=336h&context=1&type=build-log&name=upgrade&maxMatches=5&maxBytes=20971520&groupBy=job
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25884/pull-ci-openshift-origin-release-4.7-e2e-gcp-upgrade/1359664951764979712

--- Additional comment from wking@redhat.com on 2021-02-24 04:05:34 UTC ---

4.6.18 -> 4.7.0 CI job [1]:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws/1363777305826037760/artifacts/launch/container-logs/test.log | grep -A1 'Waiting on pools'
Feb 22 10:51:05.232: INFO: Waiting on pools to be upgraded
Feb 22 10:51:05.460: INFO: All pools completed upgrade

4.7.0-0.nightly-2021-02-22-135912 -> 4.7.0-0.nightly-2021-02-22-210958 [2]:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upgrade-4.7/1363960100204580864/build-log.txt | grep -A1 'Waiting on pools to be upgraded'
Feb 22 22:06:11.253: INFO: Waiting on pools to be upgraded
Feb 22 22:06:11.332: INFO: All pools completed upgrade

4.8.0-0.nightly-2021-02-23-160229 -> 4.8.0-0.nightly-2021-02-23-200827 [3]:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upgrade-4.8/1364308785111240704/build-log.txt | grep -A1 'Waiting on pools'
Feb 23 21:57:36.319: INFO: Waiting on pools to be upgraded
Feb 23 21:57:36.372: INFO: Pool worker is still reporting (Updated: false, Updating: true, Degraded: false)
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upgrade-4.8/1364308785111240704/build-log.txt | grep -c 'Pool master is'      
0

So those three release-informers all look good.  Not clear to me why the 'failures match' percentages from the previous comment are so high, but the number of matching job names is so limited.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws/1363777305826037760
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upgrade-4.7/1363960100204580864
[3]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upgrade-4.8/1364308785111240704

--- Additional comment from wking@redhat.com on 2021-02-24 05:55:26 UTC ---

Looking at all e2e failures over the past 4d:

$ curl -s 'https://search.ci.openshift.org/search?maxAge=96h&type=build-log&context=2&search=Waiting+on+pools+to+be+upgraded' | jq -r 'length as $all | ([to_entries[] | tostring | select(contains("Pool master is still reporting"))] | length) as $late_master | 
($late_master | tostring) + " / " + ($all | tostring)'
72 / 519

So 72 jobs where we had late masters out of 519 update jobs which failed e2e.  Breaking down by job name:

$ curl -s 'https://search.ci.openshift.org/search?maxAge=96h&type=build-log&context=2&search=Waiting+on+pools+to+be+upgraded' | jq -r '. as $data | ([to_entries[].value | to_entries[0].value[0].name]) | unique[] | . as $name | ([$data | to_entries[].value | to_entries[0].value[0] | select(.name == $name)]) as $jobs | ($jobs | length) as $all | ([$jobs[] | tostring | select(contains("Pool master is still reporting"))] | length) as $late_master | ($late_master / $all | tostring) + " " + ($late_master 
| tostring) + " / " + ($all | tostring) + " " + $name' | sort -n
0 0 / 1 endurance-upgrade-aws-4.6
...
0 0 / 9 periodic-ci-openshift-release-master-origin-4.6-e2e-azure-upgrade
0.08695652173913043 2 / 23 release-openshift-origin-installer-e2e-aws-upgrade
0.125 1 / 8 pull-ci-openshift-origin-master-e2e-gcp-upgrade
0.125 2 / 16 periodic-ci-openshift-release-master-ocp-4.8-e2e-metal-ipi-upgrade
0.16666666666666666 1 / 6 release-openshift-origin-installer-e2e-azure-upgrade-4.9
0.18181818181818182 2 / 11 pull-ci-openshift-service-ca-operator-master-e2e-aws-upgrade
0.2 2 / 10 release-openshift-ocp-installer-e2e-aws-upgrade-4.8
0.25 1 / 4 pull-ci-openshift-cluster-cloud-controller-manager-operator-master-e2e-aws-upgrade
0.2608695652173913 6 / 23 pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-upgrade
0.2682926829268293 11 / 41 pull-ci-openshift-installer-master-e2e-aws-upgrade
0.3333333333333333 1 / 3 pull-ci-openshift-cluster-ingress-operator-master-e2e-upgrade
0.3333333333333333 2 / 6 pull-ci-openshift-machine-api-operator-master-e2e-aws-upgrade
0.3333333333333333 3 / 9 pull-ci-openshift-cluster-kube-apiserver-operator-master-e2e-upgrade
0.36363636363636365 4 / 11 pull-ci-openshift-oc-master-e2e-aws-upgrade
0.5 1 / 2 pull-ci-openshift-openshift-apiserver-master-e2e-aws-upgrade
0.5384615384615384 7 / 13 release-openshift-okd-installer-e2e-aws-upgrade
0.6666666666666666 2 / 3 pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn-upgrade
0.75 3 / 4 pull-ci-openshift-cluster-node-tuning-operator-master-e2e-upgrade
1 1 / 1 pull-ci-openshift-cluster-api-provider-aws-master-e2e-aws-upgrade
1 1 / 1 pull-ci-openshift-cluster-openshift-controller-manager-operator-release-4.7-e2e-upgrade
1 1 / 1 pull-ci-openshift-images-master-e2e-aws-upgrade
1 1 / 1 pull-ci-openshift-kubernetes-release-4.7-e2e-azure-upgrade
1 1 / 1 pull-ci-openshift-openshift-apiserver-release-4.7-e2e-aws-upgrade
1 1 / 1 pull-ci-openshift-ovn-kubernetes-master-e2e-gcp-ovn-upgrade
1 1 / 1 release-openshift-origin-installer-e2e-azure-upgrade-4.8
1 5 / 5 release-openshift-origin-installer-e2e-gcp-upgrade-4.7
1 9 / 9 release-openshift-origin-installer-e2e-azure-upgrade-4.7

So 4.7 OKD Azure and GCP updates seem like reliable reproducers.  Getting a bunch of job links:

$ curl -s 'https://search.ci.openshift.org/search?maxAge=96h&name=^release-openshift-origin-installer-e2e-azure-upgrade-4
.7$&type=build-log&context=2&search=Waiting+on+pools+to+be+upgraded' | jq -r 'to_entries[] | .key + "\n  " + ([.value | to_entries[].value[].context[]] | join("\n  "))' | grep '^http\|Pool master'
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.7/1362996944573566976
  Feb 20 06:56:02.052: INFO: Pool master is still reporting (Updated: false, Updating: true, Degraded: false)
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.7/1363087725145624578
  Feb 20 12:59:13.866: INFO: Pool master is still reporting (Updated: false, Updating: true, Degraded: false)
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.7/1363269331844075520
  Feb 21 01:06:58.210: INFO: Pool master is still reporting (Updated: false, Updating: true, Degraded: false)
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.7/1363632187471040512
  Feb 22 01:03:30.494: INFO: Pool master is still reporting (Updated: false, Updating: true, Degraded: false)
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.7/1363813696526094336
  Feb 22 13:26:27.738: INFO: Pool master is still reporting (Updated: false, Updating: true, Degraded: false)
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.7/1363904442780356608
  Feb 22 19:01:16.860: INFO: Pool master is still reporting (Updated: false, Updating: true, Degraded: false)
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.7/1364085961394229248
  Feb 23 06:58:17.556: INFO: Pool master is still reporting (Updated: false, Updating: true, Degraded: false)
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.7/1364267562774302720
  Feb 23 19:07:57.801: INFO: Pool master is still reporting (Updated: false, Updating: true, Degraded: false)
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.7/1364358329765203968
  Feb 24 01:10:24.861: INFO: Pool master is still reporting (Updated: false, Updating: true, Degraded: false)

And releases used in those updates:

$ curl -s 'https://search.ci.openshift.org/search?maxAge=96h&name=^release-openshift-origin-installer-e2e-azure-upgrade-4.7$&type=build-log&context=1&search=Resolved+release' | jq -r 'to_entries[] | .key + "\n  " + ([.value | to_entries[].value[].context[]] | join("\n  "))' | grep '^http\|Resolved release'
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.7/1362996944573566976
  2021/02/20 05:26:34 Resolved release initial to registry.ci.openshift.org/ocp/release:4.7.0-0.ci-2021-02-18-184351
  2021/02/20 05:26:34 Resolved release latest to registry.ci.openshift.org/ocp/release:4.7.0-0.ci-2021-02-18-205909
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.7/1363087725145624578
  2021/02/20 11:27:03 Resolved release initial to registry.ci.openshift.org/ocp/release:4.7.0-0.ci-2021-02-18-184351
  2021/02/20 11:27:03 Resolved release latest to registry.ci.openshift.org/ocp/release:4.7.0-0.ci-2021-02-18-205909
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.7/1363269331844075520
  2021/02/20 23:28:40 Resolved release initial to registry.ci.openshift.org/ocp/release:4.7.0-0.ci-2021-02-18-184351
  2021/02/20 23:28:40 Resolved release latest to registry.ci.openshift.org/ocp/release:4.7.0-0.ci-2021-02-18-205909
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.7/1363632187471040512
  2021/02/21 23:30:44 Resolved release initial to registry.ci.openshift.org/ocp/release:4.7.0-0.ci-2021-02-18-184351
  2021/02/21 23:30:44 Resolved release latest to registry.ci.openshift.org/ocp/release:4.7.0-0.ci-2021-02-18-205909
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.7/1363722874543673344
  2021/02/22 05:30:59 Resolved release initial to registry.ci.openshift.org/ocp/release:4.7.0-0.ci-2021-02-18-205909
  2021/02/22 05:30:59 Resolved release latest to registry.ci.openshift.org/ocp/release:4.7.0-0.ci-2021-02-22-030505
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.7/1363813696526094336
  2021/02/22 11:31:44 Resolved release initial to registry.ci.openshift.org/ocp/release:4.7.0-0.ci-2021-02-22-030505
  2021/02/22 11:31:44 Resolved release latest to registry.ci.openshift.org/ocp/release:4.7.0-0.ci-2021-02-22-033505
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.7/1363904442780356608
  2021/02/22 17:32:23 Resolved release initial to registry.ci.openshift.org/ocp/release:4.7.0-0.ci-2021-02-22-033505
  2021/02/22 17:32:23 Resolved release latest to registry.ci.openshift.org/ocp/release:4.7.0-0.ci-2021-02-22-100111
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.7/1364085961394229248
  2021/02/23 05:33:42 Resolved release initial to registry.ci.openshift.org/ocp/release:4.7.0-0.ci-2021-02-22-164056
  2021/02/23 05:33:42 Resolved release latest to registry.ci.openshift.org/ocp/release:4.7.0-0.ci-2021-02-22-185213
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.7/1364267562774302720
  2021/02/23 17:35:20 Resolved release initial to registry.ci.openshift.org/ocp/release:4.7.0-0.ci-2021-02-22-164056
  2021/02/23 17:35:20 Resolved release latest to registry.ci.openshift.org/ocp/release:4.7.0-0.ci-2021-02-22-185213
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.7/1364358329765203968
  2021/02/23 23:35:57 Resolved release initial to registry.ci.openshift.org/ocp/release:4.7.0-0.ci-2021-02-22-164056
  2021/02/23 23:35:57 Resolved release latest to registry.ci.openshift.org/ocp/release:4.7.0-0.ci-2021-02-22-185213

But it's currently:

$ date --utc --iso=m
2021-02-24T05:52+00:00

So maybe something has been fixed in 4.7 in this space in the past two days?  Nope, looks like 4.7 is just really quiet [1].  [2] shows the only change between the impacted 4.7.0-0.ci-2021-02-22-185213 and the current tip to be a rebuild of the ovirt-csi-driver container.

[1]: https://amd64.ocp.releases.ci.openshift.org/#4.7.0-0.ci
[2]: https://amd64.ocp.releases.ci.openshift.org/releasestream/4.7.0-0.ci/release/4.7.0-0.ci-2021-02-22-185213?from=4.7.0-0.ci-2021-02-22-171056

--- Additional comment from jerzhang@redhat.com on 2021-02-24 18:22:50 UTC ---

I observed a 4.7.0 to 4.8-nightly upgrade on GCP to see the state of the upgrade, dumping some of my findings here:

The upgrade progresses to 83% (MCO) and the MCO gets to the following state on co status:

  Conditions:
    Last Transition Time:  2021-02-24T17:53:59Z
    Message:               Working towards 4.8.0-0.nightly-2021-02-24-063313
    Status:                True                                                    <- correct
    Type:                  Progressing
    Last Transition Time:  2021-02-24T17:53:59Z
    Status:                False                                                   <- correct
    Type:                  Degraded
    Last Transition Time:  2021-02-24T17:52:01Z
    Message:               Cluster not available for 4.7.0
    Status:                False                                                   <- incorrect?
    Type:                  Available
    Last Transition Time:  2021-02-24T16:04:40Z
    Reason:                AsExpected
    Status:                True                                                    <- also incorrect?
    Type:                  Upgradeable

The availability of the operator should be true, and in the code,

cov1helpers.IsStatusConditionTrue(co.Status.Conditions, configv1.OperatorDegraded)

Is the only evaluation criteria we seem to have for setting available=false. That is obviously false, so how come available is evaluating false?

This ties a bit back into the fact that we're using available==degraded which is also wrong, but that's another discussion.

The operator does actually not go progressing=True or update its version until master pool is done, which is correct, so this is not consistently reproducible. Will try to look into the above job logs to see what's happening there.

A few more observations:

DNS auth openshift-apiserver goes degraded while MCO is updating, this in turn causes clusterversion status to report

Unable to apply 4.8.0-0.nightly-2021-02-24-063313: wait has exceeded 40 minutes for these operators: openshift-apiserver

and then eventually wraps around to:

Working towards 4.8.0-0.nightly-2021-02-24-063313: 172 of 669 done (25% complete)

I assume this is the CVO attempting to reconcile and starting from the beginning?

After the master MCP completes, the MCO reports complete, all other operators are back to not-degraded, and eventually the clusterversion reports: Cluster version is 4.8.0-0.nightly-2021-02-24-063313

And the update completes successfully.

It feels like this is a bit of a mess. My first investigation will be why the availability of the operator is not being evaluated correctly. If you have some idea of whether the above is "normal" @Trevor it'd be appreciated.

--- Additional comment from jerzhang@redhat.com on 2021-02-24 19:32:55 UTC ---

Looking at the status sync, the majority of the code hasn't changed in awhile:

https://github.com/openshift/machine-config-operator/blob/master/pkg/operator/status.go

The version sync happens here:

https://github.com/openshift/machine-config-operator/blob/789951039423aa1745af3b23f06ce00ba5ebe8f9/pkg/operator/sync.go#L95

which would happen only if

https://github.com/openshift/machine-config-operator/blob/789951039423aa1745af3b23f06ce00ba5ebe8f9/pkg/operator/sync.go#L66 

as finished, which can only happen if

https://github.com/openshift/machine-config-operator/blob/789951039423aa1745af3b23f06ce00ba5ebe8f9/pkg/operator/sync.go#L568

has completed.

So theoretically while the pool is still progressing, the version should never get to update.

Now unfortunately in some of the failing runs I looked at, the MCO pod logs start after the version has been updated, which means a previous MCO pod set the version wrongly and the logs are now lost. I will try to replicate this to see what exactly is happening, but without the relevant logs I don't quite see how we got to a situation where the version is updated wrongly (and why this wouldn't have been the case 2 releases ago, since the code hasn't really changed)

--- Additional comment from wking@redhat.com on 2021-02-24 23:01:35 UTC ---

(In reply to Yu Qi Zhang from comment #6)
>     Last Transition Time:  2021-02-24T17:52:01Z
>     Message:               Cluster not available for 4.7.0
>     Status:                False                                             <- incorrect?
>     Type:                  Available

Yeah.  You should only go Available=False when an operand (the pool here?) is non-functional or non-available [1].

>     Last Transition Time:  2021-02-24T16:04:40Z
>     Reason:                AsExpected
>     Status:                True                                              <- also incorrect?
>     Type:                  Upgradeable

Yeah, ideally we'd go Upgradeable=False during 4.7->4.8 to block folks from redirecting their update to 4.9 until the pools had all caught up to 4.8, unless we get [2] or similar to get safe n-2 version skew.

> The availability of the operator should be true, and in the code,
> 
> cov1helpers.IsStatusConditionTrue(co.Status.Conditions,
> configv1.OperatorDegraded)
> 
> Is the only evaluation criteria we seem to have for setting available=false.
> That is obviously false, so how come available is evaluating false?

I'm not sure about this.  Will poke around some.

> The operator does actually not go progressing=True or update its version
> until master pool is done, which is correct, so this is not consistently
> reproducible. Will try to look into the above job logs to see what's
> happening there.

I think you mean "back to progressing=False".  And yeah, I haven't been able to reproduce the operator-leveling-before-master-pool behavior in the few jobs I've launched myself.

> Working towards 4.8.0-0.nightly-2021-02-24-063313: 172 of 669 done (25%
> complete)
> 
> I assume this is the CVO attempting to reconcile and starting from the
> beginning?

CVO doesn't persist the % completion, so if the CVO pod is drained, the replacement CVO pod will come back up at 0% and should quickly burn through the bulk of the already-reconciled update before sticking somewhere near where it had been before.

(In reply to Yu Qi Zhang from comment #7)
> Now unfortunately in some of the failing runs I looked at, the MCO pod logs start after the version has been updated, which means a previous MCO pod set the version wrongly and the logs are now lost.

Yeah, this is a pain.  We're rolling out some hosted-Loki stuff, e.g. [3] which should help mitigate.

[1]: https://github.com/openshift/api/blob/70b287c0a0a51be5cdc88d40858de59e31222f5f/config/v1/types_cluster_operator.go#L143-L145
[2]: https://github.com/openshift/enhancements/blob/41692eb45ffaddea8bcb8c002462c527fd59c707/enhancements/update/eus-upgrades-mvp.md
[3]: https://github.com/openshift/release/pull/16186

--- Additional comment from jerzhang@redhat.com on 2021-02-24 23:28:55 UTC ---

Still digging into some details but https://github.com/openshift/machine-config-operator/pull/2433 should be a potential remediation. Let me know if you are able to reproduce outside of specific CI runs

--- Additional comment from wking@redhat.com on 2021-02-25 04:24:26 UTC ---

Sanity check showing that the initial CI job discussed in comment 0 does have the required-for-upgrade label set [1]:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/988/pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn-upgrade/1364259570968432640/artifacts/e2e-gcp-ovn-upgrade/gather-extra/artifacts/machineconfigpools.json | jq -r '.items[].metadata | .name + "\n  " + (.labels | keys | join("\n  "))'
master
  machineconfiguration.openshift.io/mco-built-in
  operator.machineconfiguration.openshift.io/required-for-upgrade
  pools.operator.machineconfiguration.openshift.io/master
worker
  machineconfiguration.openshift.io/mco-built-in
  pools.operator.machineconfiguration.openshift.io/worker

[1]: https://github.com/openshift/machine-config-operator/blame/f53cb9aa4a9c81ec4d0dfe6cb734957d4523676a/pkg/operator/sync.go#L38

--- Additional comment from wking@redhat.com on 2021-02-26 00:51:24 UTC ---

Jerry and I dug into this some more today.  Going back to comment 0, this window is just too short:

6. 18:40:39Z, rendered-master-4c88d1c343b9e2e390bbc5fdca7b6095 created...
7. 18:40:44Z, ci-op-m36h5d5b-8aaeb-szd6x-master-1 asked to update to rendered-master-4c88d1c343b9e2e390bbc5fdca7b6095...
8. 18:41:12Z, machine-config sets Available=True...
9. 18:41:12Z, machine-config bumps its 'operator' version...

So we suspect syncRequiredMachineConfigPools, which was bumped in 4.7 with [1], learning to watch all pools instead of just watching required pools.  Before that PR, the file had [2]:

	if pool.Generation <= pool.Status.ObservedGeneration &&
		isPoolStatusConditionTrue(pool, mcfgv1.MachineConfigPoolUpdated) &&
		!degraded {
		continue
	}
	lastErr = fmt.Errorf("error pool %s is not ready, retrying. Status: (pool degraded: %v total: %d, ready %d, updated: %d, unavailable: %d)", pool.Name, degraded, pool.Status.MachineCount, pool.Status.ReadyMachineCount, pool.Status.UpdatedMachineCount, pool.Status.UnavailableMachineCount)
	return false, nil

That's accepting the pool as valid only when:

a. The pool controller is caught up (metadata.generation <= status.observedGeneration).  I.e. the pool controller has noticed the bumped rendered MachineConfig, because that bump brought a metadata.generation bump, and the controller has passed the metadata.generation bump on into status.observedGeneration.
b. The pool is updated.  And we know the pool controller's claim about up-to-date-ness isn't too stale, because of (a).
c. The pool isn't degraded.

Any of those conditions fail, and we set "pool ... not ready" and take another pass through the Poll loop.

After [1]:

a. Any degraded pool will block the MCO from reconciling [3], which is a bit harsh.  I'd rather have any degraded pool block future minor updates via Upgradeable=False, but non-required pools should not be able to directly block the current update by going degraded.
b. We grow an isMachineConfigPoolConfigurationValid call [4].  Not quite sure what that's about.
c. If we pass the generation and updated checks, we continue [5], like before.
d. But (this is the main bug), after that continue, there is no longer the "pool ... not ready" error (that error is now only triggered by a degraded pool; it used to be triggered by all not-level-yet required pools).

[1]: https://github.com/openshift/machine-config-operator/pull/2231
[2]: https://github.com/openshift/machine-config-operator/blame/6531aac963b1ef9c2508838f7593ed6e91cdad62/pkg/operator/sync.go#L607-L613
[3]: https://github.com/openshift/machine-config-operator/pull/2231/files#diff-3e205577bec1a1d1711df8bfeff30e63f044b24cfbdc69bbd7d0052fdc881891R595
[4]: https://github.com/openshift/machine-config-operator/pull/2231/files#diff-3e205577bec1a1d1711df8bfeff30e63f044b24cfbdc69bbd7d0052fdc881891R601
[5]: https://github.com/openshift/machine-config-operator/pull/2231/files#diff-3e205577bec1a1d1711df8bfeff30e63f044b24cfbdc69bbd7d0052fdc881891R608

Comment 3 Michael Nguyen 2021-03-16 15:36:56 UTC
Verified on 4.7.0-0.nightly-2021-03-14-223051.  Upgraded from 4.7.1 to 4.7.0-0.nightly-2021-03-14-223051 and watched the progress.  The machine-config cluster operator is not available until the master machine config pool is updated and becomes available.


NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.1     True        True          54m     Working towards 4.7.0-0.nightly-2021-03-14-223051: 176 of 668 done (26% complete), waiting up to 40 minutes on openshift-apiserver
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
machine-config   4.7.1     False       True          False      20m
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-d7b53fee67394972404f20d7abc19546   False     True       False      3              2                   2                     0                      127m
 
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.1     True        True          55m     Working towards 4.7.0-0.nightly-2021-03-14-223051: 176 of 668 done (26% complete), waiting up to 40 minutes on openshift-apiserver
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
machine-config   4.7.1     False       True          False      20m
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-d7b53fee67394972404f20d7abc19546   False     True       False      3              2                   2                     0                      128m
 
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.1     True        True          55m     Working towards 4.7.0-0.nightly-2021-03-14-223051: 176 of 668 done (26% complete), waiting up to 40 minutes on openshift-apiserver
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
machine-config   4.7.1     False       True          False      20m
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-d7b53fee67394972404f20d7abc19546   False     True       False      3              2                   2                     0                      128m
 
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.1     True        True          56m     Working towards 4.7.0-0.nightly-2021-03-14-223051: 176 of 668 done (26% complete), waiting up to 40 minutes on openshift-apiserver
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
machine-config   4.7.1     False       True          False      21m
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-d7b53fee67394972404f20d7abc19546   False     True       False      3              2                   2                     0                      129m
 
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.1     True        True          56m     Working towards 4.7.0-0.nightly-2021-03-14-223051: 176 of 668 done (26% complete), waiting up to 40 minutes on openshift-apiserver
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
machine-config   4.7.0-0.nightly-2021-03-14-223051   True        False         False      17s
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-683abfce7dccbf54f19b49189cbc1ad7   True      False      False      3              3                   3                     0                      129m
 
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.1     True        True          56m     Working towards 4.7.0-0.nightly-2021-03-14-223051: 310 of 668 done (46% complete)
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
machine-config   4.7.0-0.nightly-2021-03-14-223051   True        False         False      38s
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-683abfce7dccbf54f19b49189cbc1ad7   True      False      False      3              3                   3                     0                      129m
 
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.1     True        True          57m     Working towards 4.7.0-0.nightly-2021-03-14-223051: 563 of 668 done (84% complete)
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
machine-config   4.7.0-0.nightly-2021-03-14-223051   True        False         False      60s
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-683abfce7dccbf54f19b49189cbc1ad7   True      False      False      3              3                   3                     0                      130m
 
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-03-14-223051   True        False         14s     Cluster version is 4.7.0-0.nightly-2021-03-14-223051
NAME             VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
machine-config   4.7.0-0.nightly-2021-03-14-223051   True        False         False      82s
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-683abfce7dccbf54f19b49189cbc1ad7   True      False      False      3              3                   3                     0                      130m

Comment 5 errata-xmlrpc 2021-03-25 01:53:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.7.3 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0821


Note You need to log in before you can comment on or make changes to this bug.