+++ This bug was initially created as a clone of Bug #1932105 +++

Seen in master/4.8 CI [1]:

1. 17:55:04Z, update to 4.8.0-0.ci.test-2021-02-23-170859-ci-op-m36h5d5b initiated [2].
2. 18:34:42Z, machine-config goes Degraded=True on [4]: RenderConfigFailed: Failed to resync 4.8.0-0.ci.test-2021-02-23-170341-ci-op-m36h5d5b because: refusing to read osImageURL version "4.8.0-0.ci.test-2021-02-23-170859-ci-op-m36h5d5b", operator version "4.8.0-0.ci.test-2021-02-23-170341-ci-op-m36h5d5b"
3. 18:34:43Z, machine-config goes Available=False on [4]: Cluster not available for 4.8.0-0.ci.test-2021-02-23-170341-ci-op-m36h5d5b
4. 18:37:33Z, machine-config goes Progressing=True on [4]: Working towards 4.8.0-0.ci.test-2021-02-23-170859-ci-op-m36h5d5b
5. 18:37:33Z, machine-config goes Degraded=False [4].
6. 18:40:39Z, rendered-master-4c88d1c343b9e2e390bbc5fdca7b6095 created [3].
7. 18:40:44Z, ci-op-m36h5d5b-8aaeb-szd6x-master-1 asked to update to rendered-master-4c88d1c343b9e2e390bbc5fdca7b6095 [7].
8. 18:41:12Z, machine-config sets Available=True [4].
9. 18:41:12Z, machine-config bumps its 'operator' version [4], but 'operator' should only be bumped after the operator and all its operands have completed their update [5]. This unblocks the CVO, because machine-config is Available=True, Degraded=False, and 'operator' is the only version the CVO has been told to wait for [6].
10. 18:41:13Z, machine-config sets Progressing=False on [4]: Cluster version is 4.8.0-0.ci.test-2021-02-23-170859-ci-op-m36h5d5b
11. 18:43:25Z, update nominally completes (per ClusterVersion [2]).
12. 18:48:13Z, ci-op-m36h5d5b-8aaeb-szd6x-master-0 asked to update to rendered-master-4c88d1c343b9e2e390bbc5fdca7b6095 [7].
13. 18:55:33Z, ci-op-m36h5d5b-8aaeb-szd6x-master-2 asked to update to rendered-master-4c88d1c343b9e2e390bbc5fdca7b6095 [7].

So issues:

a. machine-config should not go Available=False (step 3) or Degraded=True (step 2) during this healthy update. Clayton has [9] up to guard against this.
b. machine-config should delay bumping the 'operator' entry in status.versions (step 9) until the 'master' pool completes (after step 13). Clayton has [8] up to help guard against this.
[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/988/pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn-upgrade/1364259570968432640
[2]: $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/988/pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn-upgrade/1364259570968432640/artifacts/e2e-gcp-ovn-upgrade/gather-extra/artifacts/clusterversion.json | jq -r '.items[].status.history[] | .startedTime + " " + (.completionTime // "-") + " " + .state + " " + .version'
2021-02-23T17:55:04Z 2021-02-23T18:43:25Z Completed 4.8.0-0.ci.test-2021-02-23-170859-ci-op-m36h5d5b
2021-02-23T17:22:37Z 2021-02-23T17:52:39Z Completed 4.8.0-0.ci.test-2021-02-23-170341-ci-op-m36h5d5b
[3]: $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/988/pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn-upgrade/1364259570968432640/artifacts/e2e-gcp-ovn-upgrade/gather-extra/artifacts/machineconfigs.json | jq -r '.items[].metadata | .creationTimestamp + " " + .name' | grep render | sort
2021-02-23T17:33:03Z rendered-master-0791a04211638c1b7706db957757d22c
2021-02-23T17:33:03Z rendered-worker-94205363958f282f378ae9bb3586a9c0
2021-02-23T18:40:39Z rendered-master-4c88d1c343b9e2e390bbc5fdca7b6095
2021-02-23T18:40:39Z rendered-worker-99e80e3ee204abddcbf2411760dc2e93
[4]: $ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/988/pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn-upgrade/1364259570968432640/build-log.txt | grep clusteroperator/machine-config
Feb 23 18:34:42.844 E clusteroperator/machine-config changed Degraded to True: RenderConfigFailed: Failed to resync 4.8.0-0.ci.test-2021-02-23-170341-ci-op-m36h5d5b because: refusing to read osImageURL version "4.8.0-0.ci.test-2021-02-23-170859-ci-op-m36h5d5b", operator version "4.8.0-0.ci.test-2021-02-23-170341-ci-op-m36h5d5b"
Feb 23 18:34:43.247 W clusteroperator/machine-config changed Available to False: Cluster not available for 4.8.0-0.ci.test-2021-02-23-170341-ci-op-m36h5d5b
Feb 23 18:37:33.310 W clusteroperator/machine-config changed Progressing to True: Working towards 4.8.0-0.ci.test-2021-02-23-170859-ci-op-m36h5d5b
Feb 23 18:37:33.311 I /machine-config reason/OperatorVersionChanged clusteroperator/machine-config-operator started a version change from [{operator 4.8.0-0.ci.test-2021-02-23-170341-ci-op-m36h5d5b}] to [{operator 4.8.0-0.ci.test-2021-02-23-170859-ci-op-m36h5d5b}]
Feb 23 18:37:33.348 W clusteroperator/machine-config changed Degraded to False
Feb 23 18:41:12.774 W clusteroperator/machine-config changed Available to True: Cluster has deployed 4.8.0-0.ci.test-2021-02-23-170859-ci-op-m36h5d5b
Feb 23 18:41:12.834 I /machine-config reason/OperatorVersionChanged clusteroperator/machine-config-operator version changed from [{operator 4.8.0-0.ci.test-2021-02-23-170341-ci-op-m36h5d5b}] to [{operator 4.8.0-0.ci.test-2021-02-23-170859-ci-op-m36h5d5b}]
Feb 23 18:41:12.904 I clusteroperator/machine-config versions: operator 4.8.0-0.ci.test-2021-02-23-170341-ci-op-m36h5d5b -> 4.8.0-0.ci.test-2021-02-23-170859-ci-op-m36h5d5b
Feb 23 18:41:13.497 W clusteroperator/machine-config changed Progressing to False: Cluster version is 4.8.0-0.ci.test-2021-02-23-170859-ci-op-m36h5d5b
[5]: https://github.com/openshift/api/blob/70b287c0a0a51be5cdc88d40858de59e31222f5f/config/v1/types_cluster_operator.go#L45
[6]: https://github.com/openshift/machine-config-operator/blob/1402af23b7b6f5acb6dd42ab2ed41c2e01acbbcc/install/0000_80_machine-config-operator_06_clusteroperator.yaml#L12-L13
[7]: $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/988/pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn-upgrade/1364259570968432640/artifacts/e2e-gcp-ovn-upgrade/gather-extra/artifacts/events.json | jq -r '.items[] | select(.reason == "SetDesiredConfig") | .firstTimestamp + " " + (.count | tostring) + " " + .message' | sort | grep master
2021-02-23T18:40:44Z 1 Targeted node ci-op-m36h5d5b-8aaeb-szd6x-master-1 to config rendered-master-4c88d1c343b9e2e390bbc5fdca7b6095
2021-02-23T18:48:13Z 1 Targeted node ci-op-m36h5d5b-8aaeb-szd6x-master-0 to config rendered-master-4c88d1c343b9e2e390bbc5fdca7b6095
2021-02-23T18:55:33Z 1 Targeted node ci-op-m36h5d5b-8aaeb-szd6x-master-2 to config rendered-master-4c88d1c343b9e2e390bbc5fdca7b6095
[8]: https://github.com/openshift/origin/pull/25922
[9]: https://github.com/openshift/origin/pull/25920

--- Additional comment from jerzhang on 2021-02-24 01:35:58 UTC ---

Hey Trevor, a few questions:

1. Is this a new issue in 4.8 only, and not in previous versions? Do you know if this happens every time (in terms of unavailable + degraded during the upgrade timeframe)?

2. For "a. machine-config should not go Available=False (step 3) or Degraded=True (step 2) during this healthy update. Clayton has [9] up to guard against this." How exactly does this failure manifest? Does it stop the upgrade?

3. For "b. machine-config should delay bumping the 'operator' entry in status.versions (step 9) until the 'master' pool completes (after step 13). Clayton has [8] up to help guard against this." Is the MCO prematurely reporting upgrade completion before master has finished? Is this also new behaviour?

--- Additional comment from wking on 2021-02-24 02:37:07 UTC ---

> ...is this a new issue in 4.8 only and not in previous versions?

No information yet on frequency. Clayton's origin#25920 and 25922 should give us that information. Might be able to query CI search in the meantime, but I haven't looked for a fingerprint to use.

> How exactly does this failure manifest? Does it stop the upgrade?

It temporarily blocks the update. If it lasts for 10+ minutes (it was 6m or less in this update), it would trigger a page-at-midnight ClusterOperatorDown alert. And anyone who polls ClusterOperator during the update and sees Available=False from machine-config might justifiably believe that their cluster was exploding, and then be more likely to ignore a later Available=False that represented real "my component is dead" concerns.

> Is the MCO prematurely reporting upgrade completion before master has finished?

Yes.

> Is this also new behaviour?

Unclear.
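For context on why the premature 'operator' bump in step 9 of comment 0 completes the update from the CVO's point of view: the machine-config ClusterOperator manifest only declares the 'operator' version [6], so the CVO effectively waits on that single entry plus Available/Degraded. Below is a minimal sketch of that gate with made-up types; it is illustrative only and not the actual cluster-version-operator code.

  // Illustrative sketch of the completion gate the CVO effectively applies to
  // machine-config during an update. Not the real cluster-version-operator
  // implementation; the types and helper names here are made up.
  package main

  import "fmt"

  type condition struct {
          Type   string
          Status string
  }

  type clusterOperator struct {
          Versions   map[string]string // status.versions, keyed by name
          Conditions []condition       // status.conditions
  }

  func conditionIs(co clusterOperator, condType, status string) bool {
          for _, c := range co.Conditions {
                  if c.Type == condType {
                          return c.Status == status
                  }
          }
          return false
  }

  // upgradeDone reports whether the CVO would consider the operator level for
  // the target release. Because the machine-config manifest only lists the
  // 'operator' version [6], bumping that single entry (step 9) satisfies the
  // version check even though the master pool is still rolling out.
  func upgradeDone(co clusterOperator, target string) bool {
          return co.Versions["operator"] == target &&
                  conditionIs(co, "Available", "True") &&
                  !conditionIs(co, "Degraded", "True")
  }

  func main() {
          co := clusterOperator{
                  Versions: map[string]string{"operator": "4.8.0-0.ci.test-2021-02-23-170859-ci-op-m36h5d5b"},
                  Conditions: []condition{
                          {Type: "Available", Status: "True"},
                          {Type: "Degraded", Status: "False"},
                  },
          }
          // Prints true even while two of the three masters still run the old config.
          fmt.Println(upgradeDone(co, "4.8.0-0.ci.test-2021-02-23-170859-ci-op-m36h5d5b"))
  }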
--- Additional comment from wking on 2021-02-24 03:29:19 UTC ---

Rough handle on how often this happens:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=build-log&search=Pool+master+is+still+reporting.*Updating:+true' | grep 'failures match' | sort
pull-ci-openshift-cluster-cloud-controller-manager-operator-master-e2e-aws-upgrade - 7 runs, 71% failed, 20% of failures match
pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn-upgrade - 10 runs, 30% failed, 33% of failures match
pull-ci-openshift-cluster-node-tuning-operator-master-e2e-upgrade - 5 runs, 60% failed, 100% of failures match
pull-ci-openshift-cluster-openshift-controller-manager-operator-release-4.7-e2e-upgrade - 2 runs, 50% failed, 100% of failures match
pull-ci-openshift-installer-master-e2e-aws-upgrade - 28 runs, 75% failed, 29% of failures match
pull-ci-openshift-kubernetes-release-4.7-e2e-azure-upgrade - 2 runs, 50% failed, 100% of failures match
pull-ci-openshift-machine-api-operator-master-e2e-aws-upgrade - 9 runs, 78% failed, 29% of failures match
pull-ci-openshift-oc-master-e2e-aws-upgrade - 14 runs, 50% failed, 43% of failures match
pull-ci-openshift-openshift-apiserver-master-e2e-aws-upgrade - 3 runs, 67% failed, 50% of failures match
pull-ci-openshift-ovn-kubernetes-master-e2e-gcp-ovn-upgrade - 4 runs, 25% failed, 100% of failures match
pull-ci-openshift-service-ca-operator-master-e2e-aws-upgrade - 2 runs, 50% failed, 100% of failures match
release-openshift-okd-installer-e2e-aws-upgrade - 11 runs, 36% failed, 50% of failures match
release-openshift-origin-installer-e2e-azure-upgrade-4.7 - 4 runs, 75% failed, 100% of failures match
release-openshift-origin-installer-e2e-azure-upgrade-4.8 - 4 runs, 25% failed, 100% of failures match

Searching with [1] turns up [2] as one of the older entries still in CI-search's 14d history, and [2] has:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/25884/pull-ci-openshift-origin-release-4.7-e2e-gcp-upgrade/1359664951764979712/build-log.txt | grep -1 'Pool master is still reporting.*Updating: true'
Feb 11 02:16:04.771: INFO: Waiting on pools to be upgraded
Feb 11 02:16:04.816: INFO: Pool master is still reporting (Updated: false, Updating: true, Degraded: false)
...
Feb 11 02:33:54.879: INFO: Pool master is still reporting (Updated: false, Updating: true, Degraded: false)
Feb 11 02:34:04.894: INFO: All pools completed upgrade

So that suggests at least 4.7 is impacted, this failure mode is not new, and this failure mode is easy to reproduce.
[1]: https://search.ci.openshift.org/?search=Pool+master+is+still+reporting.*Updating%3A+true&maxAge=336h&context=1&type=build-log&name=upgrade&maxMatches=5&maxBytes=20971520&groupBy=job
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25884/pull-ci-openshift-origin-release-4.7-e2e-gcp-upgrade/1359664951764979712

--- Additional comment from wking on 2021-02-24 04:05:34 UTC ---

4.6.18 -> 4.7.0 CI job [1]:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws/1363777305826037760/artifacts/launch/container-logs/test.log | grep -A1 'Waiting on pools'
Feb 22 10:51:05.232: INFO: Waiting on pools to be upgraded
Feb 22 10:51:05.460: INFO: All pools completed upgrade

4.7.0-0.nightly-2021-02-22-135912 -> 4.7.0-0.nightly-2021-02-22-210958 [2]:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upgrade-4.7/1363960100204580864/build-log.txt | grep -A1 'Waiting on pools to be upgraded'
Feb 22 22:06:11.253: INFO: Waiting on pools to be upgraded
Feb 22 22:06:11.332: INFO: All pools completed upgrade

4.8.0-0.nightly-2021-02-23-160229 -> 4.8.0-0.nightly-2021-02-23-200827 [3]:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upgrade-4.8/1364308785111240704/build-log.txt | grep -A1 'Waiting on pools'
Feb 23 21:57:36.319: INFO: Waiting on pools to be upgraded
Feb 23 21:57:36.372: INFO: Pool worker is still reporting (Updated: false, Updating: true, Degraded: false)
$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upgrade-4.8/1364308785111240704/build-log.txt | grep -c 'Pool master is'
0

So those three release-informers all look good. Not clear to me why the 'failures match' percentages from the previous comment are so high, but the number of matching job names is so limited.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws/1363777305826037760
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upgrade-4.7/1363960100204580864
[3]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-upgrade-4.8/1364308785111240704

--- Additional comment from wking on 2021-02-24 05:55:26 UTC ---

Looking at all e2e failures over the past 4d:

$ curl -s 'https://search.ci.openshift.org/search?maxAge=96h&type=build-log&context=2&search=Waiting+on+pools+to+be+upgraded' | jq -r 'length as $all | ([to_entries[] | tostring | select(contains("Pool master is still reporting"))] | length) as $late_master | ($late_master | tostring) + " / " + ($all | tostring)'
72 / 519

So 72 jobs where we had late masters out of 519 update jobs which failed e2e. Breaking down by job name:

$ curl -s 'https://search.ci.openshift.org/search?maxAge=96h&type=build-log&context=2&search=Waiting+on+pools+to+be+upgraded' | jq -r '. as $data | ([to_entries[].value | to_entries[0].value[0].name]) | unique[] | . as $name | ([$data | to_entries[].value | to_entries[0].value[0] | select(.name == $name)]) as $jobs | ($jobs | length) as $all | ([$jobs[] | tostring | select(contains("Pool master is still reporting"))] | length) as $late_master | ($late_master / $all | tostring) + " " + ($late_master | tostring) + " / " + ($all | tostring) + " " + $name' | sort -n
0 0 / 1 endurance-upgrade-aws-4.6
...
0 0 / 9 periodic-ci-openshift-release-master-origin-4.6-e2e-azure-upgrade
0.08695652173913043 2 / 23 release-openshift-origin-installer-e2e-aws-upgrade
0.125 1 / 8 pull-ci-openshift-origin-master-e2e-gcp-upgrade
0.125 2 / 16 periodic-ci-openshift-release-master-ocp-4.8-e2e-metal-ipi-upgrade
0.16666666666666666 1 / 6 release-openshift-origin-installer-e2e-azure-upgrade-4.9
0.18181818181818182 2 / 11 pull-ci-openshift-service-ca-operator-master-e2e-aws-upgrade
0.2 2 / 10 release-openshift-ocp-installer-e2e-aws-upgrade-4.8
0.25 1 / 4 pull-ci-openshift-cluster-cloud-controller-manager-operator-master-e2e-aws-upgrade
0.2608695652173913 6 / 23 pull-ci-openshift-cluster-version-operator-master-e2e-agnostic-upgrade
0.2682926829268293 11 / 41 pull-ci-openshift-installer-master-e2e-aws-upgrade
0.3333333333333333 1 / 3 pull-ci-openshift-cluster-ingress-operator-master-e2e-upgrade
0.3333333333333333 2 / 6 pull-ci-openshift-machine-api-operator-master-e2e-aws-upgrade
0.3333333333333333 3 / 9 pull-ci-openshift-cluster-kube-apiserver-operator-master-e2e-upgrade
0.36363636363636365 4 / 11 pull-ci-openshift-oc-master-e2e-aws-upgrade
0.5 1 / 2 pull-ci-openshift-openshift-apiserver-master-e2e-aws-upgrade
0.5384615384615384 7 / 13 release-openshift-okd-installer-e2e-aws-upgrade
0.6666666666666666 2 / 3 pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn-upgrade
0.75 3 / 4 pull-ci-openshift-cluster-node-tuning-operator-master-e2e-upgrade
1 1 / 1 pull-ci-openshift-cluster-api-provider-aws-master-e2e-aws-upgrade
1 1 / 1 pull-ci-openshift-cluster-openshift-controller-manager-operator-release-4.7-e2e-upgrade
1 1 / 1 pull-ci-openshift-images-master-e2e-aws-upgrade
1 1 / 1 pull-ci-openshift-kubernetes-release-4.7-e2e-azure-upgrade
1 1 / 1 pull-ci-openshift-openshift-apiserver-release-4.7-e2e-aws-upgrade
1 1 / 1 pull-ci-openshift-ovn-kubernetes-master-e2e-gcp-ovn-upgrade
1 1 / 1 release-openshift-origin-installer-e2e-azure-upgrade-4.8
1 5 / 5 release-openshift-origin-installer-e2e-gcp-upgrade-4.7
1 9 / 9 release-openshift-origin-installer-e2e-azure-upgrade-4.7

So 4.7 OKD Azure and GCP updates seem like reliable reproducers.
Getting a bunch of job links:

$ curl -s 'https://search.ci.openshift.org/search?maxAge=96h&name=^release-openshift-origin-installer-e2e-azure-upgrade-4.7$&type=build-log&context=2&search=Waiting+on+pools+to+be+upgraded' | jq -r 'to_entries[] | .key + "\n " + ([.value | to_entries[].value[].context[]] | join("\n "))' | grep '^http\|Pool master'
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.7/1362996944573566976
 Feb 20 06:56:02.052: INFO: Pool master is still reporting (Updated: false, Updating: true, Degraded: false)
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.7/1363087725145624578
 Feb 20 12:59:13.866: INFO: Pool master is still reporting (Updated: false, Updating: true, Degraded: false)
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.7/1363269331844075520
 Feb 21 01:06:58.210: INFO: Pool master is still reporting (Updated: false, Updating: true, Degraded: false)
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.7/1363632187471040512
 Feb 22 01:03:30.494: INFO: Pool master is still reporting (Updated: false, Updating: true, Degraded: false)
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.7/1363813696526094336
 Feb 22 13:26:27.738: INFO: Pool master is still reporting (Updated: false, Updating: true, Degraded: false)
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.7/1363904442780356608
 Feb 22 19:01:16.860: INFO: Pool master is still reporting (Updated: false, Updating: true, Degraded: false)
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.7/1364085961394229248
 Feb 23 06:58:17.556: INFO: Pool master is still reporting (Updated: false, Updating: true, Degraded: false)
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.7/1364267562774302720
 Feb 23 19:07:57.801: INFO: Pool master is still reporting (Updated: false, Updating: true, Degraded: false)
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.7/1364358329765203968
 Feb 24 01:10:24.861: INFO: Pool master is still reporting (Updated: false, Updating: true, Degraded: false)

And releases used in those updates:

$ curl -s 'https://search.ci.openshift.org/search?maxAge=96h&name=^release-openshift-origin-installer-e2e-azure-upgrade-4.7$&type=build-log&context=1&search=Resolved+release' | jq -r 'to_entries[] | .key + "\n " + ([.value | to_entries[].value[].context[]] | join("\n "))' | grep '^http\|Resolved release'
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.7/1362996944573566976
 2021/02/20 05:26:34 Resolved release initial to registry.ci.openshift.org/ocp/release:4.7.0-0.ci-2021-02-18-184351
 2021/02/20 05:26:34 Resolved release latest to registry.ci.openshift.org/ocp/release:4.7.0-0.ci-2021-02-18-205909
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.7/1363087725145624578
 2021/02/20 11:27:03 Resolved release initial to registry.ci.openshift.org/ocp/release:4.7.0-0.ci-2021-02-18-184351
 2021/02/20 11:27:03 Resolved release latest to registry.ci.openshift.org/ocp/release:4.7.0-0.ci-2021-02-18-205909
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.7/1363269331844075520
 2021/02/20 23:28:40 Resolved release initial to registry.ci.openshift.org/ocp/release:4.7.0-0.ci-2021-02-18-184351
 2021/02/20 23:28:40 Resolved release latest to registry.ci.openshift.org/ocp/release:4.7.0-0.ci-2021-02-18-205909
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.7/1363632187471040512
 2021/02/21 23:30:44 Resolved release initial to registry.ci.openshift.org/ocp/release:4.7.0-0.ci-2021-02-18-184351
 2021/02/21 23:30:44 Resolved release latest to registry.ci.openshift.org/ocp/release:4.7.0-0.ci-2021-02-18-205909
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.7/1363722874543673344
 2021/02/22 05:30:59 Resolved release initial to registry.ci.openshift.org/ocp/release:4.7.0-0.ci-2021-02-18-205909
 2021/02/22 05:30:59 Resolved release latest to registry.ci.openshift.org/ocp/release:4.7.0-0.ci-2021-02-22-030505
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.7/1363813696526094336
 2021/02/22 11:31:44 Resolved release initial to registry.ci.openshift.org/ocp/release:4.7.0-0.ci-2021-02-22-030505
 2021/02/22 11:31:44 Resolved release latest to registry.ci.openshift.org/ocp/release:4.7.0-0.ci-2021-02-22-033505
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.7/1363904442780356608
 2021/02/22 17:32:23 Resolved release initial to registry.ci.openshift.org/ocp/release:4.7.0-0.ci-2021-02-22-033505
 2021/02/22 17:32:23 Resolved release latest to registry.ci.openshift.org/ocp/release:4.7.0-0.ci-2021-02-22-100111
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.7/1364085961394229248
 2021/02/23 05:33:42 Resolved release initial to registry.ci.openshift.org/ocp/release:4.7.0-0.ci-2021-02-22-164056
 2021/02/23 05:33:42 Resolved release latest to registry.ci.openshift.org/ocp/release:4.7.0-0.ci-2021-02-22-185213
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.7/1364267562774302720
 2021/02/23 17:35:20 Resolved release initial to registry.ci.openshift.org/ocp/release:4.7.0-0.ci-2021-02-22-164056
 2021/02/23 17:35:20 Resolved release latest to registry.ci.openshift.org/ocp/release:4.7.0-0.ci-2021-02-22-185213
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.7/1364358329765203968
 2021/02/23 23:35:57 Resolved release initial to registry.ci.openshift.org/ocp/release:4.7.0-0.ci-2021-02-22-164056
 2021/02/23 23:35:57 Resolved release latest to registry.ci.openshift.org/ocp/release:4.7.0-0.ci-2021-02-22-185213

But it's currently:

$ date --utc --iso=m
2021-02-24T05:52+00:00

So maybe something has been fixed in 4.7 in this space in the past two days? Nope, looks like 4.7 is just really quiet [1]. [2] shows the only change between the impacted 4.7.0-0.ci-2021-02-22-185213 and the current tip to be a rebuild of the ovirt-csi-driver container.
[1]: https://amd64.ocp.releases.ci.openshift.org/#4.7.0-0.ci
[2]: https://amd64.ocp.releases.ci.openshift.org/releasestream/4.7.0-0.ci/release/4.7.0-0.ci-2021-02-22-185213?from=4.7.0-0.ci-2021-02-22-171056

--- Additional comment from jerzhang on 2021-02-24 18:22:50 UTC ---

I observed a 4.7.0 to 4.8-nightly upgrade on GCP to see the state of the upgrade, dumping some of my findings here:

The upgrade progresses to 83% (MCO) and the MCO gets to the following state in ClusterOperator status:

Conditions:
  Last Transition Time: 2021-02-24T17:53:59Z
  Message: Working towards 4.8.0-0.nightly-2021-02-24-063313
  Status: True <- correct
  Type: Progressing

  Last Transition Time: 2021-02-24T17:53:59Z
  Status: False <- correct
  Type: Degraded

  Last Transition Time: 2021-02-24T17:52:01Z
  Message: Cluster not available for 4.7.0
  Status: False <- incorrect?
  Type: Available

  Last Transition Time: 2021-02-24T16:04:40Z
  Reason: AsExpected
  Status: True <- also incorrect?
  Type: Upgradeable

The availability of the operator should be true, and in the code,

  cov1helpers.IsStatusConditionTrue(co.Status.Conditions, configv1.OperatorDegraded)

is the only evaluation criteria we seem to have for setting available=false. That is obviously false, so how come available is evaluating false? This ties a bit back into the fact that we're using available==degraded, which is also wrong, but that's another discussion.

The operator does actually not go progressing=True or update its version until master pool is done, which is correct, so this is not consistently reproducible. Will try to look into the above job logs to see what's happening there.

A few more observations: dns, auth, and openshift-apiserver go degraded while the MCO is updating, which in turn causes the ClusterVersion status to report

  Unable to apply 4.8.0-0.nightly-2021-02-24-063313: wait has exceeded 40 minutes for these operators: openshift-apiserver

and then eventually wraps around to:

  Working towards 4.8.0-0.nightly-2021-02-24-063313: 172 of 669 done (25% complete)

I assume this is the CVO attempting to reconcile and starting from the beginning?

After the master MCP completes, the MCO reports complete, all other operators are back to not-degraded, and eventually the ClusterVersion reports:

  Cluster version is 4.8.0-0.nightly-2021-02-24-063313

And the update completes successfully.

It feels like this is a bit of a mess. My first investigation will be why the availability of the operator is not being evaluated correctly. If you have some idea of whether the above is "normal", @Trevor, it'd be appreciated.

--- Additional comment from jerzhang on 2021-02-24 19:32:55 UTC ---

Looking at the status sync, the majority of the code hasn't changed in a while: https://github.com/openshift/machine-config-operator/blob/master/pkg/operator/status.go

The version sync happens here: https://github.com/openshift/machine-config-operator/blob/789951039423aa1745af3b23f06ce00ba5ebe8f9/pkg/operator/sync.go#L95

which would happen only if https://github.com/openshift/machine-config-operator/blob/789951039423aa1745af3b23f06ce00ba5ebe8f9/pkg/operator/sync.go#L66 has finished, which can only happen if https://github.com/openshift/machine-config-operator/blob/789951039423aa1745af3b23f06ce00ba5ebe8f9/pkg/operator/sync.go#L568 has completed.

So theoretically, while the pool is still progressing, the version should never get updated. Now unfortunately, in some of the failing runs I looked at, the MCO pod logs start after the version has been updated, which means a previous MCO pod set the version wrongly and the logs are now lost.
I will try to replicate this to see what exactly is happening, but without the relevant logs I don't quite see how we got to a situation where the version is updated wrongly (and why this wouldn't have been the case 2 releases ago, since the code hasn't really changed).

--- Additional comment from wking on 2021-02-24 23:01:35 UTC ---

(In reply to Yu Qi Zhang from comment #6)

> Last Transition Time: 2021-02-24T17:52:01Z
> Message: Cluster not available for 4.7.0
> Status: False <- incorrect?
> Type: Available

Yeah. You should only go Available=False when an operand (the pool here?) is non-functional or non-available [1].

> Last Transition Time: 2021-02-24T16:04:40Z
> Reason: AsExpected
> Status: True <- also incorrect?
> Type: Upgradeable

Yeah, ideally we'd go Upgradeable=False during 4.7->4.8 to block folks from redirecting their update to 4.9 until the pools had all caught up to 4.8, unless we get [2] or similar to get safe n-2 version skew.

> The availability of the operator should be true, and in the code,
>
> cov1helpers.IsStatusConditionTrue(co.Status.Conditions, configv1.OperatorDegraded)
>
> Is the only evaluation criteria we seem to have for setting available=false.
> That is obviously false, so how come available is evaluating false?

I'm not sure about this. Will poke around some.

> The operator does actually not go progressing=True or update its version
> until master pool is done, which is correct, so this is not consistently
> reproducible. Will try to look into the above job logs to see what's
> happening there.

I think you mean "back to progressing=False". And yeah, I haven't been able to reproduce the operator-leveling-before-master-pool behavior in the few jobs I've launched myself.

> Working towards 4.8.0-0.nightly-2021-02-24-063313: 172 of 669 done (25% complete)
>
> I assume this is the CVO attempting to reconcile and starting from the beginning?

The CVO doesn't persist the % completion, so if the CVO pod is drained, the replacement CVO pod will come back up at 0% and should quickly burn through the bulk of the already-reconciled update before sticking somewhere near where it had been before.

(In reply to Yu Qi Zhang from comment #7)

> Now unfortunately in some of the failing runs I looked at, the MCO pod logs start after the version has been updated, which means a previous MCO pod set the version wrongly and the logs are now lost.

Yeah, this is a pain. We're rolling out some hosted-Loki stuff, e.g. [3], which should help mitigate.

[1]: https://github.com/openshift/api/blob/70b287c0a0a51be5cdc88d40858de59e31222f5f/config/v1/types_cluster_operator.go#L143-L145
[2]: https://github.com/openshift/enhancements/blob/41692eb45ffaddea8bcb8c002462c527fd59c707/enhancements/update/eus-upgrades-mvp.md
[3]: https://github.com/openshift/release/pull/16186

--- Additional comment from jerzhang on 2021-02-24 23:28:55 UTC ---

Still digging into some details, but https://github.com/openshift/machine-config-operator/pull/2433 should be a potential remediation.
Let me know if you are able to reproduce outside of specific CI runs.

--- Additional comment from wking on 2021-02-25 04:24:26 UTC ---

Sanity check showing that the initial CI job discussed in comment 0 does have the required-for-upgrade label set [1]:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/988/pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn-upgrade/1364259570968432640/artifacts/e2e-gcp-ovn-upgrade/gather-extra/artifacts/machineconfigpools.json | jq -r '.items[].metadata | .name + "\n " + (.labels | keys | join("\n "))'
master
 machineconfiguration.openshift.io/mco-built-in
 operator.machineconfiguration.openshift.io/required-for-upgrade
 pools.operator.machineconfiguration.openshift.io/master
worker
 machineconfiguration.openshift.io/mco-built-in
 pools.operator.machineconfiguration.openshift.io/worker

[1]: https://github.com/openshift/machine-config-operator/blame/f53cb9aa4a9c81ec4d0dfe6cb734957d4523676a/pkg/operator/sync.go#L38

--- Additional comment from wking on 2021-02-26 00:51:24 UTC ---

Jerry and I dug into this some more today. Going back to comment 0, this window is just too short:

6. 18:40:39Z, rendered-master-4c88d1c343b9e2e390bbc5fdca7b6095 created...
7. 18:40:44Z, ci-op-m36h5d5b-8aaeb-szd6x-master-1 asked to update to rendered-master-4c88d1c343b9e2e390bbc5fdca7b6095...
8. 18:41:12Z, machine-config sets Available=True...
9. 18:41:12Z, machine-config bumps its 'operator' version...

So we suspect syncRequiredMachineConfigPools, which was bumped in 4.7 with [1], learning to watch all pools instead of just watching required pools. Before that PR, the file had [2]:

  if pool.Generation <= pool.Status.ObservedGeneration &&
      isPoolStatusConditionTrue(pool, mcfgv1.MachineConfigPoolUpdated) && !degraded {
      continue
  }
  lastErr = fmt.Errorf("error pool %s is not ready, retrying. Status: (pool degraded: %v total: %d, ready %d, updated: %d, unavailable: %d)", pool.Name, degraded, pool.Status.MachineCount, pool.Status.ReadyMachineCount, pool.Status.UpdatedMachineCount, pool.Status.UnavailableMachineCount)
  return false, nil

That's accepting the pool as valid only when:

a. The pool controller is caught up (metadata.generation <= status.observedGeneration). I.e. the pool controller has noticed the bumped rendered MachineConfig, because that bump brought a metadata.generation bump, and the controller has passed the metadata.generation bump on into status.observedGeneration.
b. The pool is updated. And we know the pool controller's claim about up-to-date-ness isn't too stale, because of (a).
c. The pool isn't degraded.

Any of those conditions fail, and we set "pool ... not ready" and take another pass through the Poll loop.

After [1]:

a. Any degraded pool will block the MCO from reconciling [3], which is a bit harsh. I'd rather have any degraded pool block future minor updates via Upgradeable=False, but non-required pools should not be able to directly block the current update by going degraded.
b. We grow an isMachineConfigPoolConfigurationValid call [4]. Not quite sure what that's about.
c. If we pass the generation and updated checks, we continue [5], like before.
d. But (this is the main bug), after that continue, there is no longer the "pool ... not ready" error (that error is now only triggered by a degraded pool; it used to be triggered by all not-level-yet required pools).
[1]: https://github.com/openshift/machine-config-operator/pull/2231
[2]: https://github.com/openshift/machine-config-operator/blame/6531aac963b1ef9c2508838f7593ed6e91cdad62/pkg/operator/sync.go#L607-L613
[3]: https://github.com/openshift/machine-config-operator/pull/2231/files#diff-3e205577bec1a1d1711df8bfeff30e63f044b24cfbdc69bbd7d0052fdc881891R595
[4]: https://github.com/openshift/machine-config-operator/pull/2231/files#diff-3e205577bec1a1d1711df8bfeff30e63f044b24cfbdc69bbd7d0052fdc881891R601
[5]: https://github.com/openshift/machine-config-operator/pull/2231/files#diff-3e205577bec1a1d1711df8bfeff30e63f044b24cfbdc69bbd7d0052fdc881891R608
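To make point (d) above concrete: a required pool that merely has not finished updating must keep producing the "pool ... not ready" error so the poll loop retries, not only a degraded pool. The following is an illustrative reconstruction of that gate with simplified types; it is not the MCO's actual sync.go code nor the eventual fix.

  // Illustrative reconstruction of the required-pool gate discussed above,
  // with simplified types. Not the MCO's actual sync.go code.
  package main

  import "fmt"

  type machineConfigPool struct {
          Name               string
          Generation         int64 // metadata.generation
          ObservedGeneration int64 // status.observedGeneration
          Updated            bool  // status condition Updated
          Degraded           bool  // status condition Degraded
  }

  // requiredPoolsReady returns nil only when every required pool's controller
  // has observed the latest generation, the pool reports Updated, and it is
  // not degraded. Anything else keeps the caller's poll loop waiting.
  func requiredPoolsReady(pools []machineConfigPool) error {
          for _, pool := range pools {
                  if pool.Degraded {
                          return fmt.Errorf("error required pool %s is reporting degraded", pool.Name)
                  }
                  if pool.Generation <= pool.ObservedGeneration && pool.Updated {
                          // Controller has seen the new rendered config and all machines are level.
                          continue
                  }
                  // This is the branch that point (d) says went missing after [1]:
                  // without it, a required pool that simply has not finished updating
                  // no longer blocks the operator from bumping its reported version.
                  return fmt.Errorf("error pool %s is not ready, retrying", pool.Name)
          }
          return nil
  }

  func main() {
          pools := []machineConfigPool{
                  {Name: "master", Generation: 3, ObservedGeneration: 3, Updated: false, Degraded: false},
          }
          // Prints: error pool master is not ready, retrying
          fmt.Println(requiredPoolsReady(pools))
  }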
Verified on 4.7.0-0.nightly-2021-03-14-223051. Upgraded from 4.7.1 to 4.7.0-0.nightly-2021-03-14-223051 and watched the progress. The machine-config cluster operator is not available until the master machine config pool is updated and becomes available.

NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.7.1 True True 54m Working towards 4.7.0-0.nightly-2021-03-14-223051: 176 of 668 done (26% complete), waiting up to 40 minutes on openshift-apiserver
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
machine-config 4.7.1 False True False 20m
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-d7b53fee67394972404f20d7abc19546 False True False 3 2 2 0 127m

NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.7.1 True True 55m Working towards 4.7.0-0.nightly-2021-03-14-223051: 176 of 668 done (26% complete), waiting up to 40 minutes on openshift-apiserver
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
machine-config 4.7.1 False True False 20m
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-d7b53fee67394972404f20d7abc19546 False True False 3 2 2 0 128m

NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.7.1 True True 55m Working towards 4.7.0-0.nightly-2021-03-14-223051: 176 of 668 done (26% complete), waiting up to 40 minutes on openshift-apiserver
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
machine-config 4.7.1 False True False 20m
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-d7b53fee67394972404f20d7abc19546 False True False 3 2 2 0 128m

NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.7.1 True True 56m Working towards 4.7.0-0.nightly-2021-03-14-223051: 176 of 668 done (26% complete), waiting up to 40 minutes on openshift-apiserver
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
machine-config 4.7.1 False True False 21m
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-d7b53fee67394972404f20d7abc19546 False True False 3 2 2 0 129m

NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.7.1 True True 56m Working towards 4.7.0-0.nightly-2021-03-14-223051: 176 of 668 done (26% complete), waiting up to 40 minutes on openshift-apiserver
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
machine-config 4.7.0-0.nightly-2021-03-14-223051 True False False 17s
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-683abfce7dccbf54f19b49189cbc1ad7 True False False 3 3 3 0 129m

NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.7.1 True True 56m Working towards 4.7.0-0.nightly-2021-03-14-223051: 310 of 668 done (46% complete)
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
machine-config 4.7.0-0.nightly-2021-03-14-223051 True False False 38s
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-683abfce7dccbf54f19b49189cbc1ad7 True False False 3 3 3 0 129m

NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.7.1 True True 57m Working towards 4.7.0-0.nightly-2021-03-14-223051: 563 of 668 done (84% complete)
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
machine-config 4.7.0-0.nightly-2021-03-14-223051 True False False 60s
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-683abfce7dccbf54f19b49189cbc1ad7 True False False 3 3 3 0 130m

NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.7.0-0.nightly-2021-03-14-223051 True False 14s Cluster version is 4.7.0-0.nightly-2021-03-14-223051
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE
machine-config 4.7.0-0.nightly-2021-03-14-223051 True False False 82s
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-683abfce7dccbf54f19b49189cbc1ad7 True False False 3 3 3 0 130m
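For anyone re-running this verification, below is a minimal sketch of the invariant the watch output above checks for: the machine-config ClusterOperator should not report the target version while the master MachineConfigPool is still updating. This is illustrative only (not the actual origin test); it assumes a kubeconfig that can read ClusterOperators and MachineConfigPools, and the flag names are made up.

  // Sketch: fail if machine-config reports the target version while the master
  // MachineConfigPool has not yet reached Updated=True. Illustrative only.
  package main

  import (
          "context"
          "flag"
          "fmt"
          "os"

          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
          "k8s.io/apimachinery/pkg/runtime/schema"
          "k8s.io/client-go/dynamic"
          "k8s.io/client-go/tools/clientcmd"
  )

  func main() {
          kubeconfig := flag.String("kubeconfig", os.Getenv("KUBECONFIG"), "path to kubeconfig")
          target := flag.String("target", "", "target release, e.g. 4.7.0-0.nightly-2021-03-14-223051")
          flag.Parse()

          cfg, err := clientcmd.BuildConfigFromFlags("", *kubeconfig)
          if err != nil {
                  panic(err)
          }
          client, err := dynamic.NewForConfig(cfg)
          if err != nil {
                  panic(err)
          }

          co, err := client.Resource(schema.GroupVersionResource{
                  Group: "config.openshift.io", Version: "v1", Resource: "clusteroperators",
          }).Get(context.TODO(), "machine-config", metav1.GetOptions{})
          if err != nil {
                  panic(err)
          }
          pool, err := client.Resource(schema.GroupVersionResource{
                  Group: "machineconfiguration.openshift.io", Version: "v1", Resource: "machineconfigpools",
          }).Get(context.TODO(), "master", metav1.GetOptions{})
          if err != nil {
                  panic(err)
          }

          // status.versions entry named "operator" on the ClusterOperator.
          operatorVersion := ""
          versions, _, _ := unstructured.NestedSlice(co.Object, "status", "versions")
          for _, v := range versions {
                  if m, ok := v.(map[string]interface{}); ok && m["name"] == "operator" {
                          operatorVersion, _ = m["version"].(string)
                  }
          }

          // status.conditions entry of type "Updated" on the master pool.
          masterUpdated := false
          conditions, _, _ := unstructured.NestedSlice(pool.Object, "status", "conditions")
          for _, c := range conditions {
                  if m, ok := c.(map[string]interface{}); ok && m["type"] == "Updated" && m["status"] == "True" {
                          masterUpdated = true
                  }
          }

          if operatorVersion == *target && !masterUpdated {
                  fmt.Println("BUG: machine-config reports the target version but the master pool is still updating")
                  os.Exit(1)
          }
          fmt.Printf("ok: operator version %q, master pool updated: %v\n", operatorVersion, masterUpdated)
  }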
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.7.3 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:0821