Bug 2019850

Summary:	4.10 failing with the "master" pool should be updated before the CVO reports available at the new version
Product:	OpenShift Container Platform
Component:	Machine Config Operator
Machine Config Operator sub component:	Machine Config Operator
Version:	4.10
Status:	CLOSED DUPLICATE
Severity:	high
Priority:	unspecified
Hardware:	Unspecified
OS:	Unspecified
Type:	Bug
Reporter:	Stephen Benjamin <stbenjam>
Assignee:	MCO Team <team-mco>
QA Contact:	Rio Liu <rioliu>
CC:	aos-bugs, dgoodwin, jack.ottofaro, jokerman, mkrejci, skumari, wking
Last Closed:	2021-12-07 13:19:29 UTC

Description Stephen Benjamin 2021-11-03 13:29:39 UTC

Starting with 4.10.0-0.nightly-2021-11-03-020416 (https://amd64.ocp.releases.ci.openshift.org/releasestream/4.10.0-0.nightly/release/4.10.0-0.nightly-2021-11-03-020416), release payloads are being rejected.  4.10 micro upgrades are failing with


     the "master" pool should be updated before the CVO reports available at the new version

Example job:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-upgrade/1455719487931158528

Note that this run also has a node segfault. The other failed runs do not exhibit it, so it is unclear whether we should ignore the segfault on this Prow job.

Setting to urgent because nightly payloads are being blocked.

Comment 1 Jack Ottofaro 2021-11-03 14:42:45 UTC

Slack discussion https://coreos.slack.com/archives/C01CQA76KMX/p1635946313263700

Comment 2 Devan Goodwin 2021-11-03 15:27:09 UTC

Analysis in the Slack thread is leaning towards a bug where the MCO reported being at the new level before it should have. Moving this over to the MCO and reaching out to the team on Slack.

Comment 3 W. Trevor King 2021-11-03 16:24:30 UTC

Poking around in comment 0's [1], the failing test-case:

  : [sig-arch][Feature:ClusterUpgrade] Cluster should remain functional during upgrade [Disruptive] [Serial]	43m33s
  fail [github.com/openshift/origin/test/e2e/upgrade/upgrade.go:160]: during upgrade to registry.build01.ci.openshift.org/ci-op-wi0wdxci/release@sha256:51f2f24726c8153cae4bf9978a5d04088e88479ab4f5618c1111e0e69176ad2f
  Unexpected error:
    <*errors.errorString | 0xc0027f5d80>: {
        s: "the \"master\" pool should be updated before the CVO reports available at the new version",
    }
    the "master" pool should be updated before the CVO reports available at the new version
  occurred

has this stdout:

  Nov  3 03:28:35.858: INFO: Completed upgrade to registry.build01.ci.openshift.org/ci-op-wi0wdxci/release@sha256:51f2f24726c8153cae4bf9978a5d04088e88479ab4f5618c1111e0e69176ad2f
  Nov  3 03:28:35.931: INFO: Waiting on pools to be upgraded
  Nov  3 03:28:36.063: INFO: Pool master is still reporting (Updated: false, Updating: true, Degraded: false)
  Nov  3 03:28:36.063: INFO: Invariant violation detected: the "master" pool should be updated before the CVO reports available at the new version
  Nov  3 03:28:36.128: INFO: Pool worker is still reporting (Updated: false, Updating: true, Degraded: false)
  ...
  Nov  3 03:36:36.324: INFO: All pools completed upgrade
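
The check that fired here can be sketched in a few lines. This is a minimal Python illustration, not the origin test's actual Go code: the function name `invariant_violations`, the `must_be_updated` parameter, and the dictionary layout are hypothetical, and limiting the invariant to the "master" pool is an assumption based on the stdout above, where only that pool is flagged.

```python
def invariant_violations(pools, must_be_updated=("master",)):
    """Return the pools that should already be updated but are not.

    `pools` maps a pool name to its reported conditions. Only the pools
    named in `must_be_updated` are checked (assumption: just "master").
    """
    return [name for name in must_be_updated if not pools[name]["Updated"]]


# Pool state logged at 03:28:36, right after the CVO reported completion.
observed = {
    "master": {"Updated": False, "Updating": True, "Degraded": False},
    "worker": {"Updated": False, "Updating": True, "Degraded": False},
}

for name in invariant_violations(observed):
    print(f'the "{name}" pool should be updated before the CVO reports available')
```

With the 03:28:36 state the sketch flags "master"; once every pool reports Updated: true, as at 03:36:36, it flags nothing.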

The invariant was violated because the machine-config operator claimed to have completed its update at 03:23:

  $ curl -s https://storage.googleapis.com/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-upgrade/1455719487931158528/build-log.txt | grep 'clusteroperator/machine-config '
  Nov 03 03:23:03.719 W clusteroperator/machine-config condition/Progressing status/True changed: Working towards 4.10.0-0.nightly-2021-11-03-020416
  Nov 03 03:23:03.719 - 6s    W clusteroperator/machine-config condition/Progressing status/True reason/Working towards 4.10.0-0.nightly-2021-11-03-020416
  Nov 03 03:23:09.856 I clusteroperator/machine-config versions: operator 4.10.0-0.nightly-2021-11-02-191632 -> 4.10.0-0.nightly-2021-11-03-020416
  Nov 03 03:23:10.428 W clusteroperator/machine-config condition/Progressing status/False changed: Cluster version is 4.10.0-0.nightly-2021-11-03-020416
  Nov 03 03:23:16.899 W clusteroperator/machine-config condition/Upgradeable status/False reason/PoolUpdating changed: One or more machine config pools are updating, please see `oc get mcp` for further details
  Nov 03 03:36:29.855 W clusteroperator/machine-config condition/Upgradeable status/True changed: 
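
To put a number on that window, the timestamps in the event lines can be subtracted. A small Python sketch follows; the helper `parse_event_time` is hypothetical, the year 2021 is supplied because the log lines omit it, and treating the Upgradeable=True transition as the point where the pools finished is an assumption (the test's stdout reports all pools complete a few seconds later, at 03:36:36).

```python
from datetime import datetime


def parse_event_time(line, year=2021):
    """Parse the leading 'Mon DD HH:MM:SS.mmm' timestamp of a build-log event line."""
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S.%f")


# Two event lines copied from the grep output above.
version_bumped = (
    "Nov 03 03:23:09.856 I clusteroperator/machine-config versions: operator "
    "4.10.0-0.nightly-2021-11-02-191632 -> 4.10.0-0.nightly-2021-11-03-020416"
)
pools_settled = (
    "Nov 03 03:36:29.855 W clusteroperator/machine-config "
    "condition/Upgradeable status/True changed: "
)

gap = parse_event_time(pools_settled) - parse_event_time(version_bumped)
print(f"pools kept updating for about {gap.total_seconds() / 60:.1f} minutes "
      "after the operator bumped its version")
```

The gap comes out at a little over 13 minutes, which is the period during which the operator claimed the new version while its pools were still rolling out.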

Neither the machine-config operator nor the machine-os-content changed between those two releases [2].  But the MCO bumped the rendered configs:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-upgrade/1455719487931158528/artifacts/e2e-aws-upgrade/gather-extra/artifacts/machineconfigs.json | jq -r '.items[].metadata | .creationTimestamp + " " + .name' | sort | tail -n4
  2021-11-03T02:30:53Z rendered-master-78d48a9fbc8ba6a746040c11f8335d1a
  2021-11-03T02:30:53Z rendered-worker-f4fb9dc4bce02415a779f436ccbd258f
  2021-11-03T03:23:11Z rendered-master-ebc4fc8c7b6a155d86684204394e4930
  2021-11-03T03:23:11Z rendered-worker-80216c83e551170f7d4be028d2c2414b

The rendered configs changed because the pause image changed:

  $ cat ~/bin/diff-machine-config.sh 
  #!/bin/bash

  URI="$1"
  NAME_A="$2"
  NAME_B="$3"
  JQ="${4:-.}"
  MACHINE_CONFIGS="$(curl -s "${URI}")"

  function machine_config() {
        NAME="$1"
        echo "${MACHINE_CONFIGS}" | jq -r --arg name "${NAME}" '.items[] | select(.metadata.name == $name)' | jq -r "${JQ}"
  }

  # Render each config once, then reuse the result for the summary and the diff.
  CONFIG_A="$(machine_config "${NAME_A}")"
  CONFIG_B="$(machine_config "${NAME_B}")"

  echo "${NAME_A}: length: $(echo "${CONFIG_A}" | wc -l), sha1: $(echo "${CONFIG_A}" | sha1sum)"
  echo "${NAME_B}: length: $(echo "${CONFIG_B}" | wc -l), sha1: $(echo "${CONFIG_B}" | sha1sum)"

  diff -u0 <(echo "${CONFIG_A}") <(echo "${CONFIG_B}")
  $ diff-machine-config.sh https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-upgrade/1455719487931158528/artifacts/e2e-aws-upgrade/gather-extra/artifacts/machineconfigs.json rendered-master-78d48a9fbc8ba6a746040c11f8335d1a rendered-master-ebc4fc8c7b6a155d86684204394e4930 '.spec.config.systemd.units[] | select(.name == "kubelet.service").contents'
  rendered-master-78d48a9fbc8ba6a746040c11f8335d1a: length: 42, sha1: 2df9decb6c194fcd15239b3f5c04e62eefee9a96  -
  rendered-master-ebc4fc8c7b6a155d86684204394e4930: length: 42, sha1: 693fb66283aee7d2cd87850a1348c6a859f9227d  -
  --- /dev/fd/63  2021-11-03 09:21:12.248227981 -0700
  +++ /dev/fd/62  2021-11-03 09:21:12.249227981 -0700
  @@ -34 +34 @@
  -      --pod-infra-container-image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ef6684475a8fc4eb898dc5170dbf883dad2c2883b924cb3d86a4f3b2cb746639 \
  +      --pod-infra-container-image=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0adaefcf3675fa153e7d3917c7a24563e9da81922cac6c3eed3edd4b57bf98bb \
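
The same comparison can be narrowed to just the pause-image flag. A rough Python sketch, assuming the kubelet.service unit contents have already been extracted as strings; the helper name `pod_infra_image` and the constants are hypothetical, and only the two flag lines from the diff above are used as input.

```python
import re

# Only the flag lines below come from the diff above; a real kubelet.service
# unit has many more lines around them.
OLD_UNIT = (
    "      --pod-infra-container-image=quay.io/openshift-release-dev/ocp-v4.0-art-dev"
    "@sha256:ef6684475a8fc4eb898dc5170dbf883dad2c2883b924cb3d86a4f3b2cb746639 \\\n"
)
NEW_UNIT = (
    "      --pod-infra-container-image=quay.io/openshift-release-dev/ocp-v4.0-art-dev"
    "@sha256:0adaefcf3675fa153e7d3917c7a24563e9da81922cac6c3eed3edd4b57bf98bb \\\n"
)


def pod_infra_image(unit_contents):
    """Return the value of --pod-infra-container-image, or None if the flag is absent."""
    match = re.search(r"--pod-infra-container-image=(\S+)", unit_contents)
    return match.group(1) if match else None


old_image = pod_infra_image(OLD_UNIT)
new_image = pod_infra_image(NEW_UNIT)
print("pause image changed:", old_image != new_image)
```

Comparing one flag instead of the whole unit makes it easier to see that the pause image is the only thing driving the new rendered config in this case.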

That pause image is the "pod" entry in the release image references:

  $ oc adm release info --image-for=pod registry.ci.openshift.org/ocp/release:4.10.0-0.nightly-2021-11-02-191632
  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ef6684475a8fc4eb898dc5170dbf883dad2c2883b924cb3d86a4f3b2cb746639
  $ oc adm release info --image-for=pod registry.ci.openshift.org/ocp/release:4.10.0-0.nightly-2021-11-03-020416
  quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0adaefcf3675fa153e7d3917c7a24563e9da81922cac6c3eed3edd4b57bf98bb

See also a similar earlier issue in bug 1970150.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-aws-upgrade/1455719487931158528
[2]: https://amd64.ocp.releases.ci.openshift.org/releasestream/4.10.0-0.nightly/release/4.10.0-0.nightly-2021-11-03-020416?from=4.10.0-0.nightly-2021-11-02-191632

Comment 4 Devan Goodwin 2021-11-03 17:13:14 UTC

The latest payload made it past this problem, presumably because it contained the changes we normally expect to see in the OS content.

This still needs a fix for whatever edge case brought it on, but I am dropping the severity to High because payload promotion is no longer blocked.

Comment 5 Sinny Kumari 2021-12-07 13:19:29 UTC

Closing this because another bug, https://bugzilla.redhat.com/show_bug.cgi?id=1999556, already tracks it.

*** This bug has been marked as a duplicate of bug 1999556 ***