Cluster upgrade.[sig-cluster-lifecycle] Cluster version operator acknowledges upgrade is failing frequently in CI, see:

https://sippy.ci.openshift.org/sippy-ng/tests/4.11/analysis?test=Cluster%20upgrade.%5Bsig-cluster-lifecycle%5D%20Cluster%20version%20operator%20acknowledges%20upgrade

TRT has identified breakage in important metal-ipi jobs beginning right around April 1.

[sig-cluster-lifecycle] Cluster version operator acknowledges upgrade (2m0s)

{Timed out waiting for cluster to acknowledge upgrade: timed out waiting for the condition; observedGeneration: 2; updated.Generation: 3}

Affects the following jobs:

openshift-etcd-113-nightly-4.11-upgrade-from-stable-4.10-e2e-metal-ipi-upgrade-ovn-ipv6 (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.11-upgrade-from-stable-4.10-e2e-metal-ipi-upgrade-ovn-ipv6 (all) - 13 runs, 100% failed, 54% of failures match = 54% impact
openshift-etcd-113-nightly-4.11-e2e-metal-ipi-upgrade-ovn-ipv6 (all) - 2 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.11-upgrade-from-stable-4.10-e2e-openstack-kuryr-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
A bunch of folks have been digging into [1] as an example failure, and it has the following timeline:

* 20:57:10, test suite requests update, as marked by the UpgradeStarted event [2].
* 20:57:11, CVO begins retrieving and validating the update, as marked by the "RetrievePayload: Retrieving..." event [2,3].
* At this point, the CVO drops into the RetrievePayload method [4].
* The CVO hunts around for a release signature, with a 2m timeout [5]. A failure here would generate an "An image was retrieved from ... that failed verification..." log [5], but without Loki [6] or a CVO event, we don't have a timestamp on this failure.
* 20:59:11, failing to find a signature for the target nightly, the CVO creates a Job to retrieve the target payload [7,8].
* 20:59:11, test suite hits its 2m timeout and fails the run, as marked by the UpgradeFailed event [3].

So I suspected an issue with accessing the signature stores from the metal environment. However, launching a 4.10.8 metal cluster with cluster-bot 'launch 4.10.8 metal':

$ oc get -o jsonpath='{.status.desired.version}{"\n"}' clusterversion version
4.10.8
$ oc get -o jsonpath='{.status.platformStatus.type}{"\n"}' infrastructure cluster
BareMetal

Sending it off to a recent CI release, which won't be signed:

$ oc adm upgrade --allow-explicit-upgrade --to-image registry.ci.openshift.org/ocp/release@sha256:7ea3a82de55b7892962ab1d6ff3a37f908fccea9f13398c7d27d187e46dcb707
warning: The requested upgrade image is not one of the available updates.You have used --allow-explicit-upgrade for the update to proceed anyway
Updating to release image registry.ci.openshift.org/ocp/release@sha256:7ea3a82de55b7892962ab1d6ff3a37f908fccea9f13398c7d27d187e46dcb707

Checking on events:

$ oc -n openshift-cluster-version get -o json events | jq -r '[.items[] | select(.metadata.namespace == "openshift-cluster-version")] | sort_by(.metadata.creationTimestamp)[] | .metadata.creationTimestamp + " " + (.count | tostring) + " " + .involvedObject.name + " " + .reason + ": " + .message' | tail -n3
2022-04-05T20:27:30Z 1 version PayloadLoaded: Payload loaded version="4.10.8" image="registry.build01.ci.openshift.org/ci-ln-c4vw0sb/release@sha256:0696e249622b4d07d8f4501504b6c568ed6ba92416176a01a12b7f1882707117"
2022-04-05T20:33:18Z 1 version RetrievePayload: Retrieving and verifying payload version="" image="registry.ci.openshift.org/ocp/release@sha256:7ea3a82de55b7892962ab1d6ff3a37f908fccea9f13398c7d27d187e46dcb707"
2022-04-05T20:33:19Z 1 version RetrievePayloadFailed: Retrieving payload failed version="" image="registry.ci.openshift.org/ocp/release@sha256:7ea3a82de55b7892962ab1d6ff3a37f908fccea9f13398c7d27d187e46dcb707" failure=The update cannot be verified: unable to locate a valid signature for one or more sources

So it failed within a second of the "RetrievePayload: Retrieving..." event.
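(As an aside, the verification failure should also surface somewhere in the ClusterVersion status conditions; a cheap, condition-name-agnostic way to dump them all and look for it:)

$ oc get clusterversion version -o jsonpath='{range .status.conditions[*]}{.type}{"="}{.status}{": "}{.message}{"\n"}{end}'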
Checking the CVO logs:

$ oc -n openshift-cluster-version logs -l k8s-app=cluster-version-operator --tail -1 | grep ' 20:33:1' | grep -A6 7ea3a82de55
I0405 20:33:18.379395 1 cvo.go:546] Desired version from spec is v1.Update{Version:"", Image:"registry.ci.openshift.org/ocp/release@sha256:7ea3a82de55b7892962ab1d6ff3a37f908fccea9f13398c7d27d187e46dcb707", Force:false}
I0405 20:33:18.379411 1 availableupdates.go:46] Retrieving available updates again, because more than 3m57.294819776s has elapsed since 2022-04-05 20:27:29.95544607 +0000 UTC m=+749.852124959
I0405 20:33:18.379418 1 cvo.go:609] Finished syncing upgradeable "openshift-cluster-version/version" (146.141µs)
I0405 20:33:18.379434 1 sync_worker.go:682] Detected while considering cluster version generation 3: version change (from {4.10.8 registry.build01.ci.openshift.org/ci-ln-c4vw0sb/release@sha256:0696e249622b4d07d8f4501504b6c568ed6ba92416176a01a12b7f1882707117 false} to { registry.ci.openshift.org/ocp/release@sha256:7ea3a82de55b7892962ab1d6ff3a37f908fccea9f13398c7d27d187e46dcb707 false})
I0405 20:33:18.379445 1 sync_worker.go:222] syncPayload: registry.ci.openshift.org/ocp/release@sha256:7ea3a82de55b7892962ab1d6ff3a37f908fccea9f13398c7d27d187e46dcb707 (force=false)
I0405 20:33:18.379622 1 event.go:285] Event(v1.ObjectReference{Kind:"ClusterVersion", Namespace:"openshift-cluster-version", Name:"version", UID:"", APIVersion:"config.openshift.io/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'RetrievePayload' Retrieving and verifying payload version="" image="registry.ci.openshift.org/ocp/release@sha256:7ea3a82de55b7892962ab1d6ff3a37f908fccea9f13398c7d27d187e46dcb707"
I0405 20:33:18.386156 1 cvo.go:590] Finished syncing available updates "openshift-cluster-version/version" (6.81631ms)
I0405 20:33:19.141315 1 sync_worker.go:354] loadUpdatedPayload syncPayload err=The update cannot be verified: unable to locate a valid signature for one or more sources

...nothing all that surprising in there. Also not all that much detail, because verification logging is V(4):

$ git grep -oh 'log.V([0-9])' vendor/github.com/openshift/library-go/pkg/verify | sort | uniq -c
     15 log.V(4)

and we've been using v2 since bug 2034493. So possible next steps include:

a. Add Loki access [6] to allow us to look for "that failed verification" log lines [5].
b. Teach the CVO to event on verification failure.
c. Blindly bump the test-suite timeout to 3m, or drop the CVO's force-mode verification timeout to 1m30s.

Because "takes ~2m to get through verification" isn't that terrible, I'm setting this to blocker-.

[1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-upgrade-from-stable-4.10-e2e-metal-ipi-upgrade-ovn-ipv6/1510700525824249856
[2]: $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-upgrade-from-stable-4.10-e2e-metal-ipi-upgrade-ovn-ipv6/1510700525824249856/artifacts/e2e-metal-ipi-upgrade-ovn-ipv6/gather-extra/artifacts/events.json | jq -r '[.items[] | select(.metadata.namespace == "openshift-cluster-version")] | sort_by(.metadata.creationTimestamp)[] | .metadata.creationTimestamp + " " + (.count | tostring) + " " + .involvedObject.name + " " + .reason + ": " + .message'
...
2022-04-03T20:10:50Z 1 version PayloadLoaded: Payload loaded version="4.10.8" image="virthost.ostest.test.metalkube.org:5000/localimages/local-release-image@sha256:0944463374a807808476ac163c2f336d52c11e5e0a75398b65659e8c476079e8"
2022-04-03T20:57:10Z null cluster UpgradeStarted: version/ image/registry.build02.ci.openshift.org/ci-op-6bpdwxhc/release@sha256:2c1df95cf72361c748bdfe4b778ed4fe6a60098def6d817fed02338474c04dcf
2022-04-03T20:57:11Z 1 version RetrievePayload: Retrieving and verifying payload version="" image="registry.build02.ci.openshift.org/ci-op-6bpdwxhc/release@sha256:2c1df95cf72361c748bdfe4b778ed4fe6a60098def6d817fed02338474c04dcf"
2022-04-03T20:59:11Z null cluster UpgradeFailed: failed to acknowledge version: Timed out waiting for cluster to acknowledge upgrade: timed out waiting for the condition; observedGeneration: 2; updated.Generation: 3
2022-04-03T20:59:11Z 1 version--bkcj6 SuccessfulCreate: Created pod: version--bkcj6-qnp28
...
[3]: https://github.com/openshift/cluster-version-operator/blob/66b875250897d20e34da4ad872ea3286b3527873/pkg/cvo/sync_worker.go#L244-L245
[4]: https://github.com/openshift/cluster-version-operator/blob/66b875250897d20e34da4ad872ea3286b3527873/pkg/cvo/sync_worker.go#L252
[5]: https://github.com/openshift/cluster-version-operator/blob/66b875250897d20e34da4ad872ea3286b3527873/pkg/cvo/updatepayload.go#L94-L116
[6]: https://github.com/openshift/release/pull/27561
[7]: https://github.com/openshift/cluster-version-operator/blob/66b875250897d20e34da4ad872ea3286b3527873/pkg/cvo/updatepayload.go#L124
[8]: $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-upgrade-from-stable-4.10-e2e-metal-ipi-upgrade-ovn-ipv6/1510700525824249856/artifacts/e2e-metal-ipi-upgrade-ovn-ipv6/gather-must-gather/artifacts/must-gather.tar | tar xOz quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-61822cf0856f5e25433fd5d7fd915e0a7e38958e5d0de357037902d9ad667fb7/namespaces/openshift-cluster-version/batch/jobs.yaml | yaml2json | jq -r '.items[].metadata | .creationTimestamp + " " + .name'
2022-04-03T20:59:11Z version--bkcj6
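(On a live cluster, a rough equivalent of the must-gather query in [8], for spotting when the CVO created the payload-retrieval Job, would be something like:)

$ oc -n openshift-cluster-version get jobs -o jsonpath='{range .items[*]}{.metadata.creationTimestamp}{" "}{.metadata.name}{"\n"}{end}'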
(In reply to W. Trevor King from comment #2)
> So possible next steps include:
>
> a. Add Loki access [6] to allow us to look for "that failed verification"
> log lines [5]

I'm looking into adding Loki support to the metal-ipi jobs, but my assumption is that it won't work on the IPv6 jobs (if I'm correctly understanding how Loki works), as the cluster won't have IPv6 connectivity back to observatorium.api.openshift.com.
(In reply to W. Trevor King from comment #2)
> So possible next steps include:
>
> a. Add Loki access [6] to allow us to look for "that failed verification"
> log lines [5]
> b. Teach the CVO to event on verification failure.
> c. Blindly bump the test-suite timeout to 3m, or drop the CVO's force-mode
> verification timeout to 1m30s.

Ok, IPv6 Loki is off the table per comment 3, so no (a). [1] has a test-suite timeout increase in-flight, but it's currently blocked on the build01 issues described in [2], so that delays (c). That leaves (b), which we'd want anyway in the absence of Loki for debugging the delay.

[1]: https://github.com/openshift/origin/pull/26974
[2]: https://github.com/openshift/release/pull/27618#issue-1194963508
We modified the test to try to buy the team some time by extending the timeout from 2 minutes to 4 minutes for OpenStack and metal, but the test continues to fail and is hitting the 4-minute timeout very frequently:

https://sippy.ci.openshift.org/sippy-ng/jobs/4.11/analysis?filters=%7B%22items%22%3A%5B%7B%22columnField%22%3A%22name%22%2C%22operatorValue%22%3A%22equals%22%2C%22value%22%3A%22periodic-ci-openshift-release-master-nightly-4.11-upgrade-from-stable-4.10-e2e-metal-ipi-upgrade-ovn-ipv6%22%7D%5D%7D&tests=Cluster%20upgrade.%5Bsig-cluster-lifecycle%5D%20Cluster%20version%20operator%20acknowledges%20upgrade

Sample failures:

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-upgrade-from-stable-4.10-e2e-metal-ipi-upgrade-ovn-ipv6/1516899498356379648
https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-upgrade-from-stable-4.10-e2e-metal-ipi-upgrade-ovn-ipv6/1516766577708503040

This desperately needs to be investigated; did a PR get landed to provide more info via events?
Events for [1]:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-upgrade-from-stable-4.10-e2e-metal-ipi-upgrade-ovn-ipv6/1516899498356379648/artifacts/e2e-metal-ipi-upgrade-ovn-ipv6/gather-extra/artifacts/events.json | jq -r '[.items[] | select(.metadata.namespace == "openshift-cluster-version")] | sort_by(.metadata.creationTimestamp)[] | .metadata.creationTimestamp + " " + .reason + ": " + .message'
...
2022-04-20T23:35:54Z UpgradeStarted: version/ image/registry.build03.ci.openshift.org/ci-op-wxwz83ft/release@sha256:9cdb7706c96ca2c5b02b636404ad132d73825c7998a14597df5860c98eb713a6
2022-04-20T23:35:54Z RetrievePayload: Retrieving and verifying payload version="" image="registry.build03.ci.openshift.org/ci-op-wxwz83ft/release@sha256:9cdb7706c96ca2c5b02b636404ad132d73825c7998a14597df5860c98eb713a6"
2022-04-20T23:37:54Z SuccessfulCreate: Created pod: version--7bwjg-nt6wr
2022-04-20T23:37:57Z AddedInterface: Add eth0 [fd01:0:0:2::39/64] from ovn-kubernetes
2022-04-20T23:37:57Z Pulling: Pulling image "registry.build03.ci.openshift.org/ci-op-wxwz83ft/release@sha256:9cdb7706c96ca2c5b02b636404ad132d73825c7998a14597df5860c98eb713a6"
2022-04-20T23:38:02Z Pulled: Successfully pulled image "registry.build03.ci.openshift.org/ci-op-wxwz83ft/release@sha256:9cdb7706c96ca2c5b02b636404ad132d73825c7998a14597df5860c98eb713a6" in 5.668574082s
2022-04-20T23:38:03Z Created: Created container payload
2022-04-20T23:38:03Z Started: Started container payload
2022-04-20T23:38:06Z Completed: Job completed
2022-04-20T23:38:06Z LoadPayload: Loading payload version="" image="registry.build03.ci.openshift.org/ci-op-wxwz83ft/release@sha256:9cdb7706c96ca2c5b02b636404ad132d73825c7998a14597df5860c98eb713a6"
2022-04-20T23:38:07Z PreconditionWarn: precondition warning for payload loaded version="4.11.0-0.nightly-2022-04-20-215725" image="registry.build03.ci.openshift.org/ci-op-wxwz83ft/release@sha256:9cdb7706c96ca2c5b02b636404ad132d73825c7998a14597df5860c98eb713a6": Forced through blocking failures: Multiple precondition checks failed:
* Precondition "ClusterVersionUpgradeable" failed because of "PoolUpdating": Cluster operator machine-config should not be upgraded between minor versions: One or more machine config pools are updating, please see `oc get mcp` for further details
* Precondition "EtcdRecentBackup" failed because of "ControllerStarted":
2022-04-20T23:38:07Z PreconditionsPassed: preconditions passed for payload loaded version="4.11.0-0.nightly-2022-04-20-215725" image="registry.build03.ci.openshift.org/ci-op-wxwz83ft/release@sha256:9cdb7706c96ca2c5b02b636404ad132d73825c7998a14597df5860c98eb713a6"
2022-04-20T23:38:07Z PayloadLoaded: Payload loaded version="4.11.0-0.nightly-2022-04-20-215725" image="registry.build03.ci.openshift.org/ci-op-wxwz83ft/release@sha256:9cdb7706c96ca2c5b02b636404ad132d73825c7998a14597df5860c98eb713a6"
2022-04-20T23:39:54Z UpgradeFailed: failed to acknowledge version: Timed out waiting for cluster to acknowledge upgrade: timed out waiting for the condition; observedGeneration: 2; updated.Generation: 3

We validate signatures before creating the version-* job, so that's 2m hunting for signatures (23:35:54 to 23:37:54). A few seconds later at 23:38:06 we have finished the download and manifest ingestion. Then... something? And 2m later the test suite gives up at 23:39:54.
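(To make those gaps easier to eyeball, the same events.json can be filtered down to just the milestone reasons; a rough sketch, with EVENTS_URL standing in for the gather-extra events.json URL above:)

$ curl -s "${EVENTS_URL}" | jq -r '[.items[] | select(.metadata.namespace == "openshift-cluster-version") | select(.reason == "UpgradeStarted" or .reason == "RetrievePayload" or .reason == "SuccessfulCreate" or .reason == "LoadPayload" or .reason == "PayloadLoaded" or .reason == "UpgradeFailed")] | sort_by(.metadata.creationTimestamp)[] | .metadata.creationTimestamp + " " + .reason'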
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-upgrade-from-stable-4.10-e2e-metal-ipi-upgrade-ovn-ipv6/1516899498356379648/artifacts/e2e-metal-ipi-upgrade-ovn-ipv6/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-9869db998-s22zh_cluster-version-operator.log | head -n1
I0420 22:47:07.238791 1 start.go:23] ClusterVersionOperator 4.10.0-202204090935.p0.g66b8752.assembly.stream-66b8752

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-upgrade-from-stable-4.10-e2e-metal-ipi-upgrade-ovn-ipv6/1516899498356379648/artifacts/e2e-metal-ipi-upgrade-ovn-ipv6/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-9869db998-s22zh_cluster-version-operator.log | tail -n1
E0420 23:43:27.370084 1 task.go:112] error running apply for clusteroperator "console" (520 of 787): Cluster operator console is updating versions

Huh, same CVO pod still going at gather-time?

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-upgrade-from-stable-4.10-e2e-metal-ipi-upgrade-ovn-ipv6/1516899498356379648/artifacts/e2e-metal-ipi-upgrade-ovn-ipv6/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-9869db998-s22zh_cluster-version-operator.log | grep 'apply.*in state\|result of work'
I0420 22:59:12.049811 1 sync_worker.go:726] apply: 4.10.11 on generation 2 in state Initializing at attempt 0
I0420 23:03:17.044673 1 sync_worker.go:726] apply: 4.10.11 on generation 2 in state Initializing at attempt 1
I0420 23:07:55.282439 1 sync_worker.go:726] apply: 4.10.11 on generation 2 in state Initializing at attempt 2
I0420 23:11:52.357919 1 sync_worker.go:726] apply: 4.10.11 on generation 2 in state Reconciling at attempt 0
I0420 23:15:57.560185 1 sync_worker.go:726] apply: 4.10.11 on generation 2 in state Reconciling at attempt 0
I0420 23:20:02.755194 1 sync_worker.go:726] apply: 4.10.11 on generation 2 in state Reconciling at attempt 0
I0420 23:24:08.000292 1 sync_worker.go:726] apply: 4.10.11 on generation 2 in state Reconciling at attempt 0
I0420 23:28:13.214686 1 sync_worker.go:726] apply: 4.10.11 on generation 2 in state Reconciling at attempt 0
I0420 23:32:18.431654 1 sync_worker.go:726] apply: 4.10.11 on generation 2 in state Reconciling at attempt 0
I0420 23:36:23.647248 1 sync_worker.go:726] apply: on generation 3 in state Updating at attempt 0
I0420 23:40:30.417690 1 sync_worker.go:726] apply: 4.11.0-0.nightly-2022-04-20-215725 on generation 3 in state Reconciling at attempt 0

Certainly by gather time:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-upgrade-from-stable-4.10-e2e-metal-ipi-upgrade-ovn-ipv6/1516899498356379648/artifacts/e2e-metal-ipi-upgrade-ovn-ipv6/gather-extra/artifacts/clusterversion.json | jq -r '.items[].status.observedGeneration'
3

I dunno. Maybe we also need observedGeneration logging, or something...

[1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-upgrade-from-stable-4.10-e2e-metal-ipi-upgrade-ovn-ipv6/1516899498356379648
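(If someone wants to watch that acknowledgement live rather than post-hoc: the failure message above shows the test suite is waiting on status.observedGeneration catching up with the spec generation, which can be polled by hand with something like:)

$ while sleep 5; do oc get clusterversion version -o jsonpath='{.metadata.generation}{" "}{.status.observedGeneration}{"\n"}'; done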
Just wanted to check in and see if there's been any luck with this over the last few weeks? The test still looks permafailing on two metal/ipi/ovn/ipv6 upgrade jobs:

https://sippy.dptools.openshift.org/sippy-ng/tests/4.11/analysis?test=Cluster%20upgrade.%5Bsig-cluster-lifecycle%5D%20Cluster%20version%20operator%20acknowledges%20upgrade
[1] still has the slow signature attempt:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-metal-ipi-upgrade-ovn-ipv6/1529546657899745280/artifacts/e2e-metal-ipi-upgrade-ovn-ipv6/gather-extra/artifacts/events.json | jq -r '[.items[] | select(.metadata.namespace == "openshift-cluster-version")] | sort_by(.metadata.creationTimestamp)[] | .metadata.creationTimestamp + " " + .reason + ": " + .message'
...
2022-05-25T21:07:40Z UpgradeStarted: version/ image/registry.build03.ci.openshift.org/ci-op-8lyycj5b/release@sha256:83ca476a63dfafa49e35cab2ded1fbf3991cc3483875b1bf639eabda31faadfd
2022-05-25T21:07:40Z RetrievePayload: Retrieving and verifying payload version="" image="registry.build03.ci.openshift.org/ci-op-8lyycj5b/release@sha256:83ca476a63dfafa49e35cab2ded1fbf3991cc3483875b1bf639eabda31faadfd"
2022-05-25T21:09:41Z SuccessfulCreate: Created pod: version--4jjhp-zmt5t
...
2022-05-25T21:09:53Z RetrievePayload: Target release version="" image="registry.build03.ci.openshift.org/ci-op-8lyycj5b/release@sha256:83ca476a63dfafa49e35cab2ded1fbf3991cc3483875b1bf639eabda31faadfd" cannot be verified, but continuing anyway because the update was forced: unable to verify sha256:83ca476a63dfafa49e35cab2ded1fbf3991cc3483875b1bf639eabda31faadfd against keyrings: verifier-public-key-redhat [2022-05-25T21:07:41Z: prefix sha256-83ca476a63dfafa49e35cab2ded1fbf3991cc3483875b1bf639eabda31faadfd in config map signatures-managed: no more signatures to check, 2022-05-25T21:09:40Z: Get "https://storage.googleapis.com/openshift-release/official/signatures/openshift/release/sha256=83ca476a63dfafa49e35cab2ded1fbf3991cc3483875b1bf639eabda31faadfd/signature-1": context deadline exceeded, 2022-05-25T21:09:40Z: Get "https://mirror.openshift.com/pub/openshift-v4/signatures/openshift/release/sha256=83ca476a63dfafa49e35cab2ded1fbf3991cc3483875b1bf639eabda31faadfd/signature-1": context deadline exceeded, 2022-05-25T21:09:40Z: context deadline exceeded]
2022-05-25T21:09:53Z LoadPayload: Loading payload version="" image="registry.build03.ci.openshift.org/ci-op-8lyycj5b/release@sha256:83ca476a63dfafa49e35cab2ded1fbf3991cc3483875b1bf639eabda31faadfd"
...

So that's our expected 2m timeout for signature verification when 'Force: true'. And from the new event, we're spending 2m each on the storage.googleapis.com and mirror.openshift.com signature stores, and the networking environment is neither failing our request nor succeeding. Comparing with a lookup from my local machine:

$ time curl -s https://mirror.openshift.com/pub/openshift-v4/signatures/openshift/release/sha256=83ca476a63dfafa49e35cab2ded1fbf3991cc3483875b1bf639eabda31faadfd/signature-1 | gpg --decrypt
{"critical": {"image": {"docker-manifest-digest": "sha256:83ca476a63dfafa49e35cab2ded1fbf3991cc3483875b1bf639eabda31faadfd"}, "type": "atomic container signature", "identity": {"docker-reference": "quay.io/openshift-release-dev/ocp-release-nightly:4.11.0-0.nightly-2022-05-25-193227-x86_64"}}, "optional": {"creator": "Red Hat OpenShift Signing Authority 0.0.1"}}gpg: Signature made Wed 25 May 2022 07:10:19 PM PDT
gpg:                using RSA key 938A80CAF21541EB
gpg: Good signature from "Red Hat, Inc. (beta key 2) <security>" [unknown]
gpg: WARNING: This key is not certified with a trusted signature!
gpg:          There is no indication that the signature belongs to the owner.
Primary key fingerprint: B08B 659E E86A F623 BC90 E8DB 938A 80CA F215 41EB

real	0m0.676s
user	0m0.052s
sys	0m0.042s

I'm going to re-scope this bug as "now we have nice events with lots of detail out of the CVO", and we can clone off a new bug for the metal folks to try and figure out why these lookups are taking so long in their environment.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-metal-ipi-upgrade-ovn-ipv6/1529546657899745280
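(For that follow-up bug, one way to reproduce the slow lookup from inside the affected cluster might be to probe a signature store directly from a node; a rough sketch, with the node name as a placeholder and assuming curl is available on the RHCOS host:)

$ oc debug node/<some-master> -- chroot /host curl -sS -o /dev/null -m 130 -w 'HTTP %{http_code} after %{time_total}s\n' https://mirror.openshift.com/pub/openshift-v4/signatures/openshift/release/sha256=83ca476a63dfafa49e35cab2ded1fbf3991cc3483875b1bf639eabda31faadfd/signature-1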
Going through the bug, it looks more like an environment issue; we did not hit such a CI issue in QE's daily testing. So for this bug's scope, I will verify the CVO's event when forcing through a sig-verification failure.

1. Try to upgrade to an unsigned payload with force.

Before the PR:

2. Check CVO events and logs.

# ./oc get events
...
7m21s Normal RetrievePayload clusterversion/version Retrieving and verifying payload version="" image="registry.ci.openshift.org/ocp/release@sha256:421fc6020ef4dd99da314477507f76b74910cc6ba65fe5f3a1d576c6a5288aee"
7m8s Normal LoadPayload clusterversion/version Loading payload version="" image="registry.ci.openshift.org/ocp/release@sha256:421fc6020ef4dd99da314477507f76b74910cc6ba65fe5f3a1d576c6a5288aee"
7m8s Normal PreconditionsPassed clusterversion/version preconditions passed for payload loaded version="4.11.0-0.nightly-2022-05-24-062131" image="registry.ci.openshift.org/ocp/release@sha256:421fc6020ef4dd99da314477507f76b74910cc6ba65fe5f3a1d576c6a5288aee"
7m8s Normal PayloadLoaded clusterversion/version Payload loaded version="4.11.0-0.nightly-2022-05-24-062131" image="registry.ci.openshift.org/ocp/release@sha256:421fc6020ef4dd99da314477507f76b74910cc6ba65fe5f3a1d576c6a5288aee"
7m Normal LeaderElection configmap/version jliu411a-bsrpr-master-1_db7704ce-da8a-49df-8311-91676d15898b became leader
7m Normal RetrievePayload clusterversion/version Retrieving and verifying payload version="" image="registry.ci.openshift.org/ocp/release@sha256:421fc6020ef4dd99da314477507f76b74910cc6ba65fe5f3a1d576c6a5288aee"
7m Normal LoadPayload clusterversion/version Loading payload version="" image="registry.ci.openshift.org/ocp/release@sha256:421fc6020ef4dd99da314477507f76b74910cc6ba65fe5f3a1d576c6a5288aee"
6m59s Normal PayloadLoaded clusterversion/version Payload loaded version="4.11.0-0.nightly-2022-05-24-062131" image="registry.ci.openshift.org/ocp/release@sha256:421fc6020ef4dd99da314477507f76b74910cc6ba65fe5f3a1d576c6a5288aee"

There is signature verification failure info in the CVO log:

# cat pre|grep -E "Retrieving and verifying|unable to verify|retrieve signature"
I0531 07:48:46.897940 1 event.go:285] Event(v1.ObjectReference{Kind:"ClusterVersion", Namespace:"openshift-cluster-version", Name:"version", UID:"", APIVersion:"config.openshift.io/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'RetrievePayload' Retrieving and verifying payload version="" image="registry.ci.openshift.org/ocp/release@sha256:421fc6020ef4dd99da314477507f76b74910cc6ba65fe5f3a1d576c6a5288aee"
W0531 07:48:47.114291 1 updatepayload.go:116] An image was retrieved from "registry.ci.openshift.org/ocp/release@sha256:421fc6020ef4dd99da314477507f76b74910cc6ba65fe5f3a1d576c6a5288aee" that failed verification: The update cannot be verified: unable to verify sha256:421fc6020ef4dd99da314477507f76b74910cc6ba65fe5f3a1d576c6a5288aee against keyrings: verifier-public-key-redhat

After the PR:

2. Check CVO events and logs.

The verification failure info is now added to the event:

# ./oc get events
...
2m28s Normal RetrievePayload clusterversion/version Retrieving and verifying payload version="" image="registry.ci.openshift.org/ocp/release@sha256:a3d1450802f9842c0780d4592c09435f2fc5ce43c5b0c613426d1d6840a9d19f"
2m16s Warning RetrievePayload clusterversion/version Target release version="" image="registry.ci.openshift.org/ocp/release@sha256:a3d1450802f9842c0780d4592c09435f2fc5ce43c5b0c613426d1d6840a9d19f" cannot be verified, but continuing anyway because the update was forced: unable to verify sha256:a3d1450802f9842c0780d4592c09435f2fc5ce43c5b0c613426d1d6840a9d19f against keyrings: verifier-public-key-redhat...
2m16s Normal LoadPayload clusterversion/version Loading payload version="" image="registry.ci.openshift.org/ocp/release@sha256:a3d1450802f9842c0780d4592c09435f2fc5ce43c5b0c613426d1d6840a9d19f"
2m15s Normal PreconditionsPassed clusterversion/version preconditions passed for payload loaded version="4.11.0-0.nightly-2022-05-26-013101" image="registry.ci.openshift.org/ocp/release@sha256:a3d1450802f9842c0780d4592c09435f2fc5ce43c5b0c613426d1d6840a9d19f"
2m15s Normal PayloadLoaded clusterversion/version Payload loaded version="4.11.0-0.nightly-2022-05-26-013101" image="registry.ci.openshift.org/ocp/release@sha256:a3d1450802f9842c0780d4592c09435f2fc5ce43c5b0c613426d1d6840a9d19f"
2m10s Normal LeaderElection configmap/version jliu411b-gngwf-master-1_e2d18eec-cec7-4f95-af7d-d2e4c88026e1 became leader
2m10s Normal RetrievePayload clusterversion/version Retrieving and verifying payload version="" image="registry.ci.openshift.org/ocp/release@sha256:a3d1450802f9842c0780d4592c09435f2fc5ce43c5b0c613426d1d6840a9d19f"
2m10s Normal LoadPayload clusterversion/version Loading payload version="" image="registry.ci.openshift.org/ocp/release@sha256:a3d1450802f9842c0780d4592c09435f2fc5ce43c5b0c613426d1d6840a9d19f"
2m9s Normal PayloadLoaded clusterversion/version Payload loaded version="4.11.0-0.nightly-2022-05-26-013101" image="registry.ci.openshift.org/ocp/release@sha256:a3d1450802f9842c0780d4592c09435f2fc5ce43c5b0c613426d1d6840a9d19f"

There is signature verification info for each store added into the CVO log:
# cat post|grep -E "Retrieving and verifying|unable to verify|retrieve signature"
I0531 08:36:18.114647 1 event.go:285] Event(v1.ObjectReference{Kind:"ClusterVersion", Namespace:"openshift-cluster-version", Name:"version", UID:"", APIVersion:"config.openshift.io/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'RetrievePayload' Retrieving and verifying payload version="" image="registry.ci.openshift.org/ocp/release@sha256:a3d1450802f9842c0780d4592c09435f2fc5ce43c5b0c613426d1d6840a9d19f"
W0531 08:36:18.334456 1 updatepayload.go:117] Target release version="" image="registry.ci.openshift.org/ocp/release@sha256:a3d1450802f9842c0780d4592c09435f2fc5ce43c5b0c613426d1d6840a9d19f" cannot be verified, but continuing anyway because the update was forced: unable to verify sha256:a3d1450802f9842c0780d4592c09435f2fc5ce43c5b0c613426d1d6840a9d19f against keyrings: verifier-public-key-redhat Target release version="" image="registry.ci.openshift.org/ocp/release@sha256:a3d1450802f9842c0780d4592c09435f2fc5ce43c5b0c613426d1d6840a9d19f" cannot be verified, but continuing anyway because the update was forced: unable to verify sha256:a3d1450802f9842c0780d4592c09435f2fc5ce43c5b0c613426d1d6840a9d19f against keyrings: verifier-public-key-redhat [2022-05-31T08:36:18Z: prefix sha256-a3d1450802f9842c0780d4592c09435f2fc5ce43c5b0c613426d1d6840a9d19f in config map signatures-managed: no more signatures to check, 2022-05-31T08:36:18Z: unable to retrieve signature from https://storage.googleapis.com/openshift-release/official/signatures/openshift/release/sha256=a3d1450802f9842c0780d4592c09435f2fc5ce43c5b0c613426d1d6840a9d19f/signature-1: no more signatures to check, 2022-05-31T08:36:18Z: unable to retrieve signature from https://mirror.openshift.com/pub/openshift-v4/signatures/openshift/release/sha256=a3d1450802f9842c0780d4592c09435f2fc5ce43c5b0c613426d1d6840a9d19f/signature-1: no more signatures to check, 2022-05-31T08:36:18Z: parallel signature store wrapping containers/image signature store under https://storage.googleapis.com/openshift-release/official/signatures/openshift/release, containers/image signature store under https://mirror.openshift.com/pub/openshift-v4/signatures/openshift/release: no more signatures to check, 2022-05-31T08:36:18Z: serial signature store wrapping config maps in openshift-config-managed with label "release.openshift.io/verification-signatures", parallel signature store wrapping containers/image signature store under https://storage.googleapis.com/openshift-release/official/signatures/openshift/release, containers/image signature store under https://mirror.openshift.com/pub/openshift-v4/signatures/openshift/release: no more signatures to check]
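(For reference, the forced update in step 1 amounts to something like the following, pointed at whatever unsigned payload is under test; here the digest from the events above:)

# ./oc adm upgrade --allow-explicit-upgrade --force --to-image registry.ci.openshift.org/ocp/release@sha256:a3d1450802f9842c0780d4592c09435f2fc5ce43c5b0c613426d1d6840a9d19f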
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069