Bug 2019850
Summary: | 4.10 failing with the "master" pool should be updated before the CVO reports available at the new version |
---|---|
Product: | OpenShift Container Platform |
Reporter: | Stephen Benjamin <stbenjam> |
Component: | Machine Config Operator |
Machine Config Operator sub component: | Machine Config Operator |
Assignee: | MCO Team <team-mco> |
QA Contact: | Rio Liu <rioliu> |
Status: | CLOSED DUPLICATE |
Severity: | high |
Priority: | unspecified |
CC: | aos-bugs, dgoodwin, jack.ottofaro, jokerman, mkrejci, skumari, wking |
Version: | 4.10 |
Hardware: | Unspecified |
OS: | Unspecified |
Last Closed: | 2021-12-07 13:19:29 UTC |
Type: | Bug |
Description
Stephen Benjamin
2021-11-03 13:29:39 UTC
Slack discussion: https://coreos.slack.com/archives/C01CQA76KMX/p1635946313263700

Analysis in the Slack thread is leaning towards a bug where the MCO has reported level before it should have. Moving over to MCO and reaching out to them on Slack.

Poking around in comment 0's [1], the failing test-case:

    : [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial] 43m33s
    fail [github.com/openshift/origin/test/e2e/upgrade/upgrade.go:160]: during upgrade to registry.build01.ci.openshift.org/ci-op-wi0wdxci/release@sha256:51f2f24726c8153cae4bf9978a5d04088e88479ab4f5618c1111e0e69176ad2f
    Unexpected error:
        <*errors.errorString | 0xc0027f5d80>: {
            s: "the \"master\" pool should be updated before the CVO reports available at the new version",
        }
        the "master" pool should be updated before the CVO reports available at the new version
    occurred

has this stdout:

    Nov 3 03:28:35.858: INFO: Completed upgrade to registry.build01.ci.openshift.org/ci-op-wi0wdxci/release@sha256:51f2f24726c8153cae4bf9978a5d04088e88479ab4f5618c1111e0e69176ad2f
    Nov 3 03:28:35.931: INFO: Waiting on pools to be upgraded
    Nov 3 03:28:36.063: INFO: Pool master is still reporting (Updated: false, Updating: true, Degraded: false)
    Nov 3 03:28:36.063: INFO: Invariant violation detected: the "master" pool should be updated before the CVO reports available at the new version
    Nov 3 03:28:36.128: INFO: Pool worker is still reporting (Updated: false, Updating: true, Degraded: false)
    ...
    Nov 3 03:36:36.324: INFO: All pools completed upgrade

Because the machine-config operator claimed to have completed its update at 3:23:

    $ curl -s https://storage.googleapis.com/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-upgrade/1455719487931158528/build-log.txt | grep 'clusteroperator/machine-config '
    Nov 03 03:23:03.719 W clusteroperator/machine-config condition/Progressing status/True changed: Working towards 4.10.0-0.nightly-2021-11-03-020416
    Nov 03 03:23:03.719 - 6s W clusteroperator/machine-config condition/Progressing status/True reason/Working towards 4.10.0-0.nightly-2021-11-03-020416
    Nov 03 03:23:09.856 I clusteroperator/machine-config versions: operator 4.10.0-0.nightly-2021-11-02-191632 -> 4.10.0-0.nightly-2021-11-03-020416
    Nov 03 03:23:10.428 W clusteroperator/machine-config condition/Progressing status/False changed: Cluster version is 4.10.0-0.nightly-2021-11-03-020416
    Nov 03 03:23:16.899 W clusteroperator/machine-config condition/Upgradeable status/False reason/PoolUpdating changed: One or more machine config pools are updating, please see `oc get mcp` for further details
    Nov 03 03:36:29.855 W clusteroperator/machine-config condition/Upgradeable status/True changed:

Neither the machine-config operator nor the machine-os-content changed between those two releases [2].
But the MCO bumped the rendered configs:

    $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-upgrade/1455719487931158528/artifacts/e2e-aws-upgrade/gather-extra/artifacts/machineconfigs.json | jq -r '.items[].metadata | .creationTimestamp + " " + .name' | sort | tail -n4
    2021-11-03T02:30:53Z rendered-master-78d48a9fbc8ba6a746040c11f8335d1a
    2021-11-03T02:30:53Z rendered-worker-f4fb9dc4bce02415a779f436ccbd258f
    2021-11-03T03:23:11Z rendered-master-ebc4fc8c7b6a155d86684204394e4930
    2021-11-03T03:23:11Z rendered-worker-80216c83e551170f7d4be028d2c2414b

Because the pause image changed:

    $ cat ~/bin/diff-machine-config.sh
    #!/bin/bash

    URI="$1"
    NAME_A="$2"
    NAME_B="$3"
    JQ="${4:-.}"

    MACHINE_CONFIGS="$(curl -s "${URI}")"

    function machine_config() {
        NAME="$1"
        echo "${MACHINE_CONFIGS}" | jq -r --arg name "${NAME}" '.items[] | select(.metadata.name == $name)' | jq -r "${JQ}"
    }

    CONFIG_A="$(machine_config ${NAME_A})"
    CONFIG_B="$(machine_config ${NAME_B})"
    echo "${NAME_A}: length: $(echo "${CONFIG_A}" | wc -l), sha1: $(echo "${CONFIG_A}" | sha1sum)"
    echo "${NAME_B}: length: $(echo "${CONFIG_B}" | wc -l), sha1: $(echo "${CONFIG_B}" | sha1sum)"
    diff -u0 <(machine_config "${NAME_A}") <(machine_config "${NAME_B}")
    $ diff-machine-config.sh https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-upgrade/1455719487931158528/artifacts/e2e-aws-upgrade/gather-extra/artifacts/machineconfigs.json rendered-master-78d48a9fbc8ba6a746040c11f8335d1a rendered-master-ebc4fc8c7b6a155d86684204394e4930 '.spec.config.systemd.units[] | select(.name == "kubelet.service").contents'
    rendered-master-78d48a9fbc8ba6a746040c11f8335d1a: length: 42, sha1: 2df9decb6c194fcd15239b3f5c04e62eefee9a96 -
    rendered-master-ebc4fc8c7b6a155d86684204394e4930: length: 42, sha1: 693fb66283aee7d2cd87850a1348c6a859f9227d -
    --- /dev/fd/63 2021-11-03 09:21:12.248227981 -0700
    +++ /dev/fd/62 2021-11-03 09:21:12.249227981 -0700
    @@ -34 +34 @@
    - --pod-infra-container-image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ef6684475a8fc4eb898dc5170dbf883dad2c2883b924cb3d86a4f3b2cb746639 \
    + --pod-infra-container-image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0adaefcf3675fa153e7d3917c7a24563e9da81922cac6c3eed3edd4b57bf98bb \

Which is the "pod" entry in the release image references:

    $ oc adm release info --image-for=pod registry.ci.openshift.org/ocp/release:4.10.0-0.nightly-2021-11-02-191632
    quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ef6684475a8fc4eb898dc5170dbf883dad2c2883b924cb3d86a4f3b2cb746639
    $ oc adm release info --image-for=pod registry.ci.openshift.org/ocp/release:4.10.0-0.nightly-2021-11-03-020416
    quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0adaefcf3675fa153e7d3917c7a24563e9da81922cac6c3eed3edd4b57bf98bb

See also a similar earlier issue in bug 1970150.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-upgrade/1455719487931158528
[2]: https://amd64.ocp.releases.ci.openshift.org/releasestream/4.10.0-0.nightly/release/4.10.0-0.nightly-2021-11-03-020416?from=4.10.0-0.nightly-2021-11-02-191632

Latest payload made it past this problem, so presumably it was because that one contained the changes we normally expect to see in the os content. Still needs a fix for whatever edge case brought this on, dropping to High sev though as payload promotion is no longer blocked.

Closing this as we already have another bug https://bugzilla.redhat.com/show_bug.cgi?id=1999556 to track it.

*** This bug has been marked as a duplicate of bug 1999556 ***
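
For anyone re-checking a run like the one analysed in this bug, the ordering problem can be spotted from the two logs alone. The following is a minimal sketch, assuming the line formats quoted in the comments here; the function names (`to_seconds`, `operator_leveled_at`, `master_lag_seen_at`, `report_lag`) are hypothetical helpers, not part of origin or the MCO, and timestamps are assumed to fall on the same day.

```shell
#!/bin/sh
# Sketch only: helper names are made up for illustration and the
# patterns match the log lines quoted in this bug, nothing more.

# Convert HH:MM:SS.mmm to whole seconds since midnight.
to_seconds() {
    echo "$1" | awk -F'[:.]' '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Read monitor output on stdin; print the time at which
# clusteroperator/machine-config first reported a version change.
operator_leveled_at() {
    awk '/clusteroperator\/machine-config versions: operator / { print $3; exit }'
}

# Read the e2e stdout on stdin; print the time of the first
# "Pool master is still reporting (Updated: false" line seen after
# the "Completed upgrade to" line. Prints nothing if there is none.
master_lag_seen_at() {
    awk '
        /INFO: Completed upgrade to / { completed = 1 }
        completed && /INFO: Pool master is still reporting [(]Updated: false/ {
            sub(/:$/, "", $3)   # drop the trailing colon after the timestamp
            print $3
            exit
        }
    '
}

# Print how long after the operator leveled the master pool was
# still observed updating.
report_lag() {
    leveled_s=$(to_seconds "$1")
    lagging_s=$(to_seconds "$2")
    echo "master pool still updating $((lagging_s - leveled_s))s after the operator reported the new version"
}

# Sample lines copied from the logs quoted in this bug.
MONITOR='Nov 03 03:23:09.856 I clusteroperator/machine-config versions: operator 4.10.0-0.nightly-2021-11-02-191632 -> 4.10.0-0.nightly-2021-11-03-020416'
E2E_STDOUT='Nov 3 03:28:35.858: INFO: Completed upgrade to registry.build01.ci.openshift.org/ci-op-wi0wdxci/release@sha256:51f2f24726c8153cae4bf9978a5d04088e88479ab4f5618c1111e0e69176ad2f
Nov 3 03:28:35.931: INFO: Waiting on pools to be upgraded
Nov 3 03:28:36.063: INFO: Pool master is still reporting (Updated: false, Updating: true, Degraded: false)'

leveled=$(printf '%s\n' "${MONITOR}" | operator_leveled_at)
lagging=$(printf '%s\n' "${E2E_STDOUT}" | master_lag_seen_at)
if [ -n "${lagging}" ]; then
    report_lag "${leveled}" "${lagging}"
fi
```

With the sample lines above this reports a gap of 327 seconds. In practice the two inputs would be the build-log.txt and the test stdout from the job in [1] rather than inline strings.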