Starting with 4.10.0-0.nightly-2021-11-03-020416 (https://amd64.ocp.releases.ci.openshift.org/releasestream/4.10.0-0.nightly/release/4.10.0-0.nightly-2021-11-03-020416), release payloads are being rejected. 4.10 micro upgrades are failing with:

    the "master" pool should be updated before the CVO reports available at the new version

Example job: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-upgrade/1455719487931158528

Note that this run also has a node segfault; the other failed runs do not exhibit it, so it is unclear whether the segfault should be treated as noise on this particular prow job. Setting to urgent because nightly payloads are being blocked.
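For anyone checking a live cluster by hand, here is a minimal sketch of the invariant the test enforces (this is not the origin test code itself, just a rough manual equivalent): when the CVO reports the target version as Completed, every MachineConfigPool should already report Updated=True.

# CVO-side view: the most recent history entry should show the new version.
oc get clusterversion version -o jsonpath='{.status.history[0].state} {.status.history[0].version}{"\n"}'
# MCO-side view: every pool should be Updated=True before the line above
# says Completed; during this bug the pools were still Updating.
oc get machineconfigpools -o jsonpath='{range .items[*]}{.metadata.name} Updated={.status.conditions[?(@.type=="Updated")].status}{"\n"}{end}'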
Slack discussion: https://coreos.slack.com/archives/C01CQA76KMX/p1635946313263700
Analysis in the Slack thread is leaning towards a bug where the MCO reported itself at the new version level before it should have. Moving this over to the MCO component and reaching out to that team on Slack.
Poking around in comment 0's [1], the failing test-case:

: [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial] 43m33s
fail [github.com/openshift/origin/test/e2e/upgrade/upgrade.go:160]: during upgrade to registry.build01.ci.openshift.org/ci-op-wi0wdxci/release@sha256:51f2f24726c8153cae4bf9978a5d04088e88479ab4f5618c1111e0e69176ad2f Unexpected error:
    <*errors.errorString | 0xc0027f5d80>: {
        s: "the \"master\" pool should be updated before the CVO reports available at the new version",
    }
    the "master" pool should be updated before the CVO reports available at the new version
occurred

has this stdout:

Nov  3 03:28:35.858: INFO: Completed upgrade to registry.build01.ci.openshift.org/ci-op-wi0wdxci/release@sha256:51f2f24726c8153cae4bf9978a5d04088e88479ab4f5618c1111e0e69176ad2f
Nov  3 03:28:35.931: INFO: Waiting on pools to be upgraded
Nov  3 03:28:36.063: INFO: Pool master is still reporting (Updated: false, Updating: true, Degraded: false)
Nov  3 03:28:36.063: INFO: Invariant violation detected: the "master" pool should be updated before the CVO reports available at the new version
Nov  3 03:28:36.128: INFO: Pool worker is still reporting (Updated: false, Updating: true, Degraded: false)
...
Nov  3 03:36:36.324: INFO: All pools completed upgrade

Because the machine-config operator claimed to have completed its update at 3:23:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-upgrade/1455719487931158528/build-log.txt | grep 'clusteroperator/machine-config '
Nov 03 03:23:03.719 W clusteroperator/machine-config condition/Progressing status/True changed: Working towards 4.10.0-0.nightly-2021-11-03-020416
Nov 03 03:23:03.719 - 6s W clusteroperator/machine-config condition/Progressing status/True reason/Working towards 4.10.0-0.nightly-2021-11-03-020416
Nov 03 03:23:09.856 I clusteroperator/machine-config versions: operator 4.10.0-0.nightly-2021-11-02-191632 -> 4.10.0-0.nightly-2021-11-03-020416
Nov 03 03:23:10.428 W clusteroperator/machine-config condition/Progressing status/False changed: Cluster version is 4.10.0-0.nightly-2021-11-03-020416
Nov 03 03:23:16.899 W clusteroperator/machine-config condition/Upgradeable status/False reason/PoolUpdating changed: One or more machine config pools are updating, please see `oc get mcp` for further details
Nov 03 03:36:29.855 W clusteroperator/machine-config condition/Upgradeable status/True changed:

Neither the machine-config operator nor the machine-os-content changed between those two releases [2].
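To isolate just the ordering that matters here, a small sketch (assuming the monitor line format shown above) that filters the build log down to the machine-config operator's condition transitions; it shows Progressing going False at 03:23:10, roughly thirteen minutes before the pools finished rolling at 03:36:29:

# BUILD_LOG is the build-log.txt URL from [1]; grepping for "changed"
# keeps only the condition transitions, dropping the interval lines.
BUILD_LOG=https://storage.googleapis.com/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-upgrade/1455719487931158528/build-log.txt
curl -s "${BUILD_LOG}" | grep 'clusteroperator/machine-config condition/.*changed'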
But the MCO bumped the rendered configs:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-upgrade/1455719487931158528/artifacts/e2e-aws-upgrade/gather-extra/artifacts/machineconfigs.json | jq -r '.items[].metadata | .creationTimestamp + " " + .name' | sort | tail -n4
2021-11-03T02:30:53Z rendered-master-78d48a9fbc8ba6a746040c11f8335d1a
2021-11-03T02:30:53Z rendered-worker-f4fb9dc4bce02415a779f436ccbd258f
2021-11-03T03:23:11Z rendered-master-ebc4fc8c7b6a155d86684204394e4930
2021-11-03T03:23:11Z rendered-worker-80216c83e551170f7d4be028d2c2414b

Because the pause image changed:

$ cat ~/bin/diff-machine-config.sh
#!/bin/bash

URI="$1"
NAME_A="$2"
NAME_B="$3"
JQ="${4:-.}"

MACHINE_CONFIGS="$(curl -s "${URI}")"

function machine_config() {
  NAME="$1"
  echo "${MACHINE_CONFIGS}" | jq -r --arg name "${NAME}" '.items[] | select(.metadata.name == $name)' | jq -r "${JQ}"
}

CONFIG_A="$(machine_config "${NAME_A}")"
CONFIG_B="$(machine_config "${NAME_B}")"

echo "${NAME_A}: length: $(echo "${CONFIG_A}" | wc -l), sha1: $(echo "${CONFIG_A}" | sha1sum)"
echo "${NAME_B}: length: $(echo "${CONFIG_B}" | wc -l), sha1: $(echo "${CONFIG_B}" | sha1sum)"

diff -u0 <(machine_config "${NAME_A}") <(machine_config "${NAME_B}")

$ diff-machine-config.sh https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-upgrade/1455719487931158528/artifacts/e2e-aws-upgrade/gather-extra/artifacts/machineconfigs.json rendered-master-78d48a9fbc8ba6a746040c11f8335d1a rendered-master-ebc4fc8c7b6a155d86684204394e4930 '.spec.config.systemd.units[] | select(.name == "kubelet.service").contents'
rendered-master-78d48a9fbc8ba6a746040c11f8335d1a: length: 42, sha1: 2df9decb6c194fcd15239b3f5c04e62eefee9a96 -
rendered-master-ebc4fc8c7b6a155d86684204394e4930: length: 42, sha1: 693fb66283aee7d2cd87850a1348c6a859f9227d -
--- /dev/fd/63 2021-11-03 09:21:12.248227981 -0700
+++ /dev/fd/62 2021-11-03 09:21:12.249227981 -0700
@@ -34 +34 @@
-    --pod-infra-container-image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ef6684475a8fc4eb898dc5170dbf883dad2c2883b924cb3d86a4f3b2cb746639 \
+    --pod-infra-container-image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0adaefcf3675fa153e7d3917c7a24563e9da81922cac6c3eed3edd4b57bf98bb \

Which is the "pod" entry in the release image references:

$ oc adm release info --image-for=pod registry.ci.openshift.org/ocp/release:4.10.0-0.nightly-2021-11-02-191632
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ef6684475a8fc4eb898dc5170dbf883dad2c2883b924cb3d86a4f3b2cb746639
$ oc adm release info --image-for=pod registry.ci.openshift.org/ocp/release:4.10.0-0.nightly-2021-11-03-020416
quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0adaefcf3675fa153e7d3917c7a24563e9da81922cac6c3eed3edd4b57bf98bb

See also a similar earlier issue in bug 1970150.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-upgrade/1455719487931158528
[2]: https://amd64.ocp.releases.ci.openshift.org/releasestream/4.10.0-0.nightly/release/4.10.0-0.nightly-2021-11-03-020416?from=4.10.0-0.nightly-2021-11-02-191632
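For future triage of payload pairs, a minimal sketch (assuming credentials that can pull both pullspecs) that checks whether the pause ("pod") image differs between two releases, since that alone is enough to bump the rendered configs and roll the pools even when the MCO and machine-os-content are unchanged:

# Compare the pause-image pullspec embedded in each release; a mismatch
# means the MCO will render new configs and roll every pool.
old=registry.ci.openshift.org/ocp/release:4.10.0-0.nightly-2021-11-02-191632
new=registry.ci.openshift.org/ocp/release:4.10.0-0.nightly-2021-11-03-020416
if [ "$(oc adm release info --image-for=pod "${old}")" != "$(oc adm release info --image-for=pod "${new}")" ]; then
  echo "pod (pause) image changed: expect new rendered configs and a pool rollout"
fi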
The latest payload made it past this problem, presumably because it contained the OS-content changes we normally expect to see. Whatever edge case brought this on still needs a fix, but I'm dropping the severity to High since payload promotion is no longer blocked.
Closing this as we already have another bug https://bugzilla.redhat.com/show_bug.cgi?id=1999556 to track it. *** This bug has been marked as a duplicate of bug 1999556 ***