Bug 2071998 - Cluster-version operator should share details of signature verification when it fails in 'Force: true' updates
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.11.0
Assignee: W. Trevor King
QA Contact: liujia
URL:
Whiteboard:
Depends On:
Blocks: 2093987 2093992
 
Reported: 2022-04-05 11:37 UTC by Devan Goodwin
Modified: 2022-08-10 11:03 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-10 11:03:38 UTC
Target Upstream Version:
Embargoed:




Links
- GitHub openshift/cluster-version-operator pull 763 (Merged): Bug 2071998: pkg/cvo/updatepayload: Event when forcing through a sig-verification failure (last updated 2022-05-26 03:06:28 UTC)
- GitHub openshift/library-go pull 1358 (Merged): Bug 2071998: pkg/verify: Expose underlying signature errors (last updated 2022-05-26 03:06:29 UTC)
- GitHub openshift/library-go pull 1371 (Merged): Bug 2071998: pkg/verify: Expose store error details, especially slow access (last updated 2022-05-26 03:06:30 UTC)

Description Devan Goodwin 2022-04-05 11:37:56 UTC
Cluster upgrade.[sig-cluster-lifecycle] Cluster version operator acknowledges upgrade

is failing frequently in CI, see:
https://sippy.ci.openshift.org/sippy-ng/tests/4.11/analysis?test=Cluster%20upgrade.%5Bsig-cluster-lifecycle%5D%20Cluster%20version%20operator%20acknowledges%20upgrade

TRT has identified breakage in important metal-ipi jobs beginning right around April 1.

[sig-cluster-lifecycle] Cluster version operator acknowledges upgrade     2m0s
{Timed out waiting for cluster to acknowledge upgrade: timed out waiting for the condition; observedGeneration: 2; updated.Generation: 3}

Affects the following jobs:

openshift-etcd-113-nightly-4.11-upgrade-from-stable-4.10-e2e-metal-ipi-upgrade-ovn-ipv6 (all) - 2 runs, 100% failed, 100% of failures match = 100% impact

periodic-ci-openshift-release-master-nightly-4.11-upgrade-from-stable-4.10-e2e-metal-ipi-upgrade-ovn-ipv6 (all) - 13 runs, 100% failed, 54% of failures match = 54% impact

openshift-etcd-113-nightly-4.11-e2e-metal-ipi-upgrade-ovn-ipv6 (all) - 2 runs, 100% failed, 100% of failures match = 100% impact

periodic-ci-shiftstack-shiftstack-ci-main-periodic-4.11-upgrade-from-stable-4.10-e2e-openstack-kuryr-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact

Comment 2 W. Trevor King 2022-04-05 21:16:44 UTC
A bunch of folks have been digging into [1] as an example failure, and it has the following timeline:

* 20:57:10, test suite requests update, as marked by the UpgradeStarted event [2].
* 20:57:11, CVO begins retrieving and validating the update, as marked by the "RetrievePayload: Retrieving..." event [2,3].
* At this point, the CVO drops into the RetrievePayload method [4].
* The CVO hunts around for a release signature, with a 2m timeout [5].  A failure here would generate an "An image was retrieved from ... that failed verification..." log line [5], but without Loki [6] or a CVO event, we don't have a timestamp on this failure.
* 20:59:11, failing to find a signature for the target nightly, the CVO creates a Job to retrieve the target payload [7,8].
* 20:59:11, test suite hits its 2m timeout and fails the run, as marked by the UpgradeFailed event [3].

So I suspected an issue with accessing the signature stores from the metal environment.  However, launching a 4.10.8 metal cluster with cluster-bot 'launch 4.10.8 metal':

  $ oc get -o jsonpath='{.status.desired.version}{"\n"}' clusterversion version
  4.10.8
  $ oc get -o jsonpath='{.status.platformStatus.type}{"\n"}' infrastructure cluster
  BareMetal

Sending it off to a recent CI release, which won't be signed:

  $ oc adm upgrade --allow-explicit-upgrade --to-image registry.ci.openshift.org/ocp/release@sha256:7ea3a82de55b7892962ab1d6ff3a37f908fccea9f13398c7d27d187e46dcb707
  warning: The requested upgrade image is not one of the available updates.You have used --allow-explicit-upgrade for the update to proceed anyway
  Updating to release image registry.ci.openshift.org/ocp/release@sha256:7ea3a82de55b7892962ab1d6ff3a37f908fccea9f13398c7d27d187e46dcb707

Checking on events:

  $ oc -n openshift-cluster-version get -o json events | jq -r '[.items[] | select(.metadata.namespace == "openshift-cluster-version")] | sort_by(.metadata.creationTimestamp)[] | .metadata.creationTimestamp + " " + (.count | tostring) + " " + .involvedObject.name + " " + .reason + ": " + .message' | tail -n3
  2022-04-05T20:27:30Z 1 version PayloadLoaded: Payload loaded version="4.10.8" image="registry.build01.ci.openshift.org/ci-ln-c4vw0sb/release@sha256:0696e249622b4d07d8f4501504b6c568ed6ba92416176a01a12b7f1882707117"
  2022-04-05T20:33:18Z 1 version RetrievePayload: Retrieving and verifying payload version="" image="registry.ci.openshift.org/ocp/release@sha256:7ea3a82de55b7892962ab1d6ff3a37f908fccea9f13398c7d27d187e46dcb707"
  2022-04-05T20:33:19Z 1 version RetrievePayloadFailed: Retrieving payload failed version="" image="registry.ci.openshift.org/ocp/release@sha256:7ea3a82de55b7892962ab1d6ff3a37f908fccea9f13398c7d27d187e46dcb707" failure=The update cannot be verified: unable to locate a valid signature for one or more sources

So it failed within a second of the "RetrievePayload: Retrieving..." event.  Checking the CVO logs:

  $ oc -n openshift-cluster-version logs -l k8s-app=cluster-version-operator --tail -1 | grep ' 20:33:1' | grep -A6 7ea3a82de55
  I0405 20:33:18.379395       1 cvo.go:546] Desired version from spec is v1.Update{Version:"", Image:"registry.ci.openshift.org/ocp/release@sha256:7ea3a82de55b7892962ab1d6ff3a37f908fccea9f13398c7d27d187e46dcb707", Force:false}
  I0405 20:33:18.379411       1 availableupdates.go:46] Retrieving available updates again, because more than 3m57.294819776s has elapsed since 2022-04-05 20:27:29.95544607 +0000 UTC m=+749.852124959
  I0405 20:33:18.379418       1 cvo.go:609] Finished syncing upgradeable "openshift-cluster-version/version" (146.141µs)
  I0405 20:33:18.379434       1 sync_worker.go:682] Detected while considering cluster version generation 3: version change (from {4.10.8 registry.build01.ci.openshift.org/ci-ln-c4vw0sb/release@sha256:0696e249622b4d07d8f4501504b6c568ed6ba92416176a01a12b7f1882707117 false} to { registry.ci.openshift.org/ocp/release@sha256:7ea3a82de55b7892962ab1d6ff3a37f908fccea9f13398c7d27d187e46dcb707 false})
  I0405 20:33:18.379445       1 sync_worker.go:222] syncPayload: registry.ci.openshift.org/ocp/release@sha256:7ea3a82de55b7892962ab1d6ff3a37f908fccea9f13398c7d27d187e46dcb707 (force=false)
  I0405 20:33:18.379622       1 event.go:285] Event(v1.ObjectReference{Kind:"ClusterVersion", Namespace:"openshift-cluster-version", Name:"version", UID:"", APIVersion:"config.openshift.io/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'RetrievePayload' Retrieving and verifying payload version="" image="registry.ci.openshift.org/ocp/release@sha256:7ea3a82de55b7892962ab1d6ff3a37f908fccea9f13398c7d27d187e46dcb707"
  I0405 20:33:18.386156       1 cvo.go:590] Finished syncing available updates "openshift-cluster-version/version" (6.81631ms)
  I0405 20:33:19.141315       1 sync_worker.go:354] loadUpdatedPayload syncPayload err=The update cannot be verified: unable to locate a valid signature for one or more sources
  ...

Nothing all that surprising in there, and not all that much detail either, because verification logging is at V(4):

  $ git grep -oh 'log.V([0-9])' vendor/github.com/openshift/library-go/pkg/verify | sort | uniq -c
       15 log.V(4)

and we've been using v2 since bug 2034493.
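
For illustration, here is a minimal, hypothetical Go snippet (not the CVO's actual code) showing why those V(4) verification messages never make it into the pod log at the CVO's -v=2:

  package main

  import (
      "flag"

      "k8s.io/klog/v2"
  )

  func main() {
      klog.InitFlags(nil)
      flag.Parse()
      flag.Set("v", "2") // the CVO's effective verbosity since bug 2034493
      defer klog.Flush()

      klog.V(2).Info("visible at -v=2")
      klog.V(4).Info("signature-store detail: silently dropped at -v=2")
  }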

So possible next steps include:

a. Add Loki access [6] to allow us to look for "that failed verification" log lines [5]
b. Teach the CVO to event on verification failure.
c. Blindly bump the test-suite timeout to 3m, or drop the CVO's force-mode verification timeout to 1m30s.

Because "takes ~2m to get through verification" isn't that terrible, I'm setting this to blocker-.
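
For reference, here is a minimal sketch of option (b), using hypothetical helper names rather than the CVO's actual types, just to show the shape of surfacing the verification error as an event instead of leaving it in a V(4) log:

  package main

  import (
      "context"
      "errors"
      "fmt"
  )

  // verifyFn stands in for the signature verifier; hypothetical for this sketch.
  type verifyFn func(ctx context.Context, releaseDigest string) error

  // eventf stands in for the CVO's event recorder; hypothetical for this sketch.
  func eventf(eventType, reason, format string, args ...interface{}) {
      fmt.Printf("%s %s: %s\n", eventType, reason, fmt.Sprintf(format, args...))
  }

  // retrieveAndVerify sketches option (b): emit an event carrying the verification
  // error whether or not the update is forced, instead of only logging it at V(4).
  func retrieveAndVerify(ctx context.Context, verify verifyFn, image, digest string, force bool) error {
      eventf("Normal", "RetrievePayload", "Retrieving and verifying payload image=%q", image)
      if err := verify(ctx, digest); err != nil {
          if !force {
              eventf("Warning", "RetrievePayloadFailed", "Retrieving payload failed image=%q failure=%v", image, err)
              return fmt.Errorf("The update cannot be verified: %w", err)
          }
          eventf("Warning", "RetrievePayload", "Target release image=%q cannot be verified, but continuing anyway because the update was forced: %v", image, err)
      }
      return nil
  }

  func main() {
      verify := func(ctx context.Context, digest string) error {
          return errors.New("unable to locate a valid signature for one or more sources")
      }
      _ = retrieveAndVerify(context.Background(), verify, "registry.example.com/release@sha256:abc", "sha256:abc", true)
  }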

[1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-upgrade-from-stable-4.10-e2e-metal-ipi-upgrade-ovn-ipv6/1510700525824249856
[2]: $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-upgrade-from-stable-4.10-e2e-metal-ipi-upgrade-ovn-ipv6/1510700525824249856/artifacts/e2e-metal-ipi-upgrade-ovn-ipv6/gather-extra/artifacts/events.json | jq -r '[.items[] | select(.metadata.namespace == "openshift-cluster-version")] | sort_by(.metadata.creationTimestamp)[] | .metadata.creationTimestamp + " " + (.count | tostring) + " " + .involvedObject.name + " " + .reason + ": " + .message'
     ...
     2022-04-03T20:10:50Z 1 version PayloadLoaded: Payload loaded version="4.10.8" image="virthost.ostest.test.metalkube.org:5000/localimages/local-release-image@sha256:0944463374a807808476ac163c2f336d52c11e5e0a75398b65659e8c476079e8"
     2022-04-03T20:57:10Z null cluster UpgradeStarted: version/ image/registry.build02.ci.openshift.org/ci-op-6bpdwxhc/release@sha256:2c1df95cf72361c748bdfe4b778ed4fe6a60098def6d817fed02338474c04dcf
     2022-04-03T20:57:11Z 1 version RetrievePayload: Retrieving and verifying payload version="" image="registry.build02.ci.openshift.org/ci-op-6bpdwxhc/release@sha256:2c1df95cf72361c748bdfe4b778ed4fe6a60098def6d817fed02338474c04dcf"
     2022-04-03T20:59:11Z null cluster UpgradeFailed: failed to acknowledge version: Timed out waiting for cluster to acknowledge upgrade: timed out waiting for the condition; observedGeneration: 2; updated.Generation: 3
     2022-04-03T20:59:11Z 1 version--bkcj6 SuccessfulCreate: Created pod: version--bkcj6-qnp28
     ...
[3]: https://github.com/openshift/cluster-version-operator/blob/66b875250897d20e34da4ad872ea3286b3527873/pkg/cvo/sync_worker.go#L244-L245
[4]: https://github.com/openshift/cluster-version-operator/blob/66b875250897d20e34da4ad872ea3286b3527873/pkg/cvo/sync_worker.go#L252
[5]: https://github.com/openshift/cluster-version-operator/blob/66b875250897d20e34da4ad872ea3286b3527873/pkg/cvo/updatepayload.go#L94-L116
[6]: https://github.com/openshift/release/pull/27561
[7]: https://github.com/openshift/cluster-version-operator/blob/66b875250897d20e34da4ad872ea3286b3527873/pkg/cvo/updatepayload.go#L124
[8]: $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-upgrade-from-stable-4.10-e2e-metal-ipi-upgrade-ovn-ipv6/1510700525824249856/artifacts/e2e-metal-ipi-upgrade-ovn-ipv6/gather-must-gather/artifacts/must-gather.tar | tar xOz quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-61822cf0856f5e25433fd5d7fd915e0a7e38958e5d0de357037902d9ad667fb7/namespaces/openshift-cluster-version/batch/jobs.yaml | yaml2json | jq -r '.items[].metadata | .creationTimestamp + " " + .name'
     2022-04-03T20:59:11Z version--bkcj6

Comment 3 Derek Higgins 2022-04-06 10:58:28 UTC
(In reply to W. Trevor King from comment #2)
> So possible next steps include:
> 
> a. Add Loki access [6] to allow us to look for "that failed verification"
> log lines [5]

I'm looking into adding Loki support to the metal-ipi jobs, but my assumption is that it won't work on the IPv6 jobs (if I'm correctly understanding how Loki works), because the cluster won't have IPv6 connectivity back to observatorium.api.openshift.com.

Comment 4 W. Trevor King 2022-04-07 16:07:55 UTC
(In reply to W. Trevor King from comment #2)
> So possible next steps include:
> 
> a. Add Loki access [6] to allow us to look for "that failed verification"
>    log lines [5]
> b. Teach the CVO to event on verification failure.
> c. Blindly bump the test-suite timeout to 3m, or drop the CVO's force-mode
>    verification timeout to 1m30s.

Ok, IPv6 Loki off the table per comment 3, so no (a).  [1] has a test-suite timeout increase in-flight, but it's currently blocked on the build01 issues described in [2], so that delays (c).  That leaves (b), which we'd want anyway in the absence of Loki for debugging the delay.

[1]: https://github.com/openshift/origin/pull/26974
[2]: https://github.com/openshift/release/pull/27618#issue-1194963508

Comment 6 W. Trevor King 2022-04-21 23:59:59 UTC
Events for [1]:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-upgrade-from-stable-4.10-e2e-metal-ipi-upgrade-ovn-ipv6/1516899498356379648/artifacts/e2e-metal-ipi-upgrade-ovn-ipv6/gather-extra/artifacts/events.json | jq -r '[.items[] | select(.metadata.namespace == "openshift-cluster-version")] | sort_by(.metadata.creationTimestamp)[] | .metadata.creationTimestamp + " " + .reason + ": " + .message'
...
2022-04-20T23:35:54Z UpgradeStarted: version/ image/registry.build03.ci.openshift.org/ci-op-wxwz83ft/release@sha256:9cdb7706c96ca2c5b02b636404ad132d73825c7998a14597df5860c98eb713a6
2022-04-20T23:35:54Z RetrievePayload: Retrieving and verifying payload version="" image="registry.build03.ci.openshift.org/ci-op-wxwz83ft/release@sha256:9cdb7706c96ca2c5b02b636404ad132d73825c7998a14597df5860c98eb713a6"
2022-04-20T23:37:54Z SuccessfulCreate: Created pod: version--7bwjg-nt6wr
2022-04-20T23:37:57Z AddedInterface: Add eth0 [fd01:0:0:2::39/64] from ovn-kubernetes
2022-04-20T23:37:57Z Pulling: Pulling image "registry.build03.ci.openshift.org/ci-op-wxwz83ft/release@sha256:9cdb7706c96ca2c5b02b636404ad132d73825c7998a14597df5860c98eb713a6"
2022-04-20T23:38:02Z Pulled: Successfully pulled image "registry.build03.ci.openshift.org/ci-op-wxwz83ft/release@sha256:9cdb7706c96ca2c5b02b636404ad132d73825c7998a14597df5860c98eb713a6" in 5.668574082s
2022-04-20T23:38:03Z Created: Created container payload
2022-04-20T23:38:03Z Started: Started container payload
2022-04-20T23:38:06Z Completed: Job completed
2022-04-20T23:38:06Z LoadPayload: Loading payload version="" image="registry.build03.ci.openshift.org/ci-op-wxwz83ft/release@sha256:9cdb7706c96ca2c5b02b636404ad132d73825c7998a14597df5860c98eb713a6"
2022-04-20T23:38:07Z PreconditionWarn: precondition warning for payload loaded version="4.11.0-0.nightly-2022-04-20-215725" image="registry.build03.ci.openshift.org/ci-op-wxwz83ft/release@sha256:9cdb7706c96ca2c5b02b636404ad132d73825c7998a14597df5860c98eb713a6": Forced through blocking failures: Multiple precondition checks failed:
* Precondition "ClusterVersionUpgradeable" failed because of "PoolUpdating": Cluster operator machine-config should not be upgraded between minor versions: One or more machine config pools are updating, please see `oc get mcp` for further details
* Precondition "EtcdRecentBackup" failed because of "ControllerStarted": 
2022-04-20T23:38:07Z PreconditionsPassed: preconditions passed for payload loaded version="4.11.0-0.nightly-2022-04-20-215725" image="registry.build03.ci.openshift.org/ci-op-wxwz83ft/release@sha256:9cdb7706c96ca2c5b02b636404ad132d73825c7998a14597df5860c98eb713a6"
2022-04-20T23:38:07Z PayloadLoaded: Payload loaded version="4.11.0-0.nightly-2022-04-20-215725" image="registry.build03.ci.openshift.org/ci-op-wxwz83ft/release@sha256:9cdb7706c96ca2c5b02b636404ad132d73825c7998a14597df5860c98eb713a6"
2022-04-20T23:39:54Z UpgradeFailed: failed to acknowledge version: Timed out waiting for cluster to acknowledge upgrade: timed out waiting for the condition; observedGeneration: 2; updated.Generation: 3

We validate signatures before creating the version-* job, so that's 2m hunting for signatures (23:35:54 to 23:37:54).  A few seconds later at 23:38:06 we have finished the download and manifest ingestion.  Then... something?  And 2m later the test suite gives up at 23:39:54.

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-upgrade-from-stable-4.10-e2e-metal-ipi-upgrade-ovn-ipv6/1516899498356379648/artifacts/e2e-metal-ipi-upgrade-ovn-ipv6/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-9869db998-s22zh_cluster-version-operator.log | head -n1
I0420 22:47:07.238791       1 start.go:23] ClusterVersionOperator 4.10.0-202204090935.p0.g66b8752.assembly.stream-66b8752
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-upgrade-from-stable-4.10-e2e-metal-ipi-upgrade-ovn-ipv6/1516899498356379648/artifacts/e2e-metal-ipi-upgrade-ovn-ipv6/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-9869db998-s22zh_cluster-version-operator.log | tail -n1
E0420 23:43:27.370084       1 task.go:112] error running apply for clusteroperator "console" (520 of 787): Cluster operator console is updating versions

Huh, same CVO pod still going at gather-time?

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-upgrade-from-stable-4.10-e2e-metal-ipi-upgrade-ovn-ipv6/1516899498356379648/artifacts/e2e-metal-ipi-upgrade-ovn-ipv6/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-9869db998-s22zh_cluster-version-operator.log | grep 'apply.*in state\|result of work'
I0420 22:59:12.049811       1 sync_worker.go:726] apply: 4.10.11 on generation 2 in state Initializing at attempt 0
I0420 23:03:17.044673       1 sync_worker.go:726] apply: 4.10.11 on generation 2 in state Initializing at attempt 1
I0420 23:07:55.282439       1 sync_worker.go:726] apply: 4.10.11 on generation 2 in state Initializing at attempt 2
I0420 23:11:52.357919       1 sync_worker.go:726] apply: 4.10.11 on generation 2 in state Reconciling at attempt 0
I0420 23:15:57.560185       1 sync_worker.go:726] apply: 4.10.11 on generation 2 in state Reconciling at attempt 0
I0420 23:20:02.755194       1 sync_worker.go:726] apply: 4.10.11 on generation 2 in state Reconciling at attempt 0
I0420 23:24:08.000292       1 sync_worker.go:726] apply: 4.10.11 on generation 2 in state Reconciling at attempt 0
I0420 23:28:13.214686       1 sync_worker.go:726] apply: 4.10.11 on generation 2 in state Reconciling at attempt 0
I0420 23:32:18.431654       1 sync_worker.go:726] apply: 4.10.11 on generation 2 in state Reconciling at attempt 0
I0420 23:36:23.647248       1 sync_worker.go:726] apply:  on generation 3 in state Updating at attempt 0
I0420 23:40:30.417690       1 sync_worker.go:726] apply: 4.11.0-0.nightly-2022-04-20-215725 on generation 3 in state Reconciling at attempt 0

Certainly by gather time:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-upgrade-from-stable-4.10-e2e-metal-ipi-upgrade-ovn-ipv6/1516899498356379648/artifacts/e2e-metal-ipi-upgrade-ovn-ipv6/gather-extra/artifacts/clusterversion.json | jq -r '.items[].status.observedGeneration'
3

I dunno.  Maybe we also need observedGeneration logging, or something...
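
For context on what the suite is waiting for, a tiny illustrative snippet (not the actual origin test code) of the acknowledgment condition behind that "observedGeneration: 2; updated.Generation: 3" failure text:

package main

import "fmt"

// acknowledged mirrors the condition the test suite polls for: the CVO has
// acknowledged an update once ClusterVersion status.observedGeneration catches
// up with metadata.generation.
func acknowledged(generation, observedGeneration int64) bool {
	return observedGeneration >= generation
}

func main() {
	fmt.Println(acknowledged(3, 2)) // false: the state when the suite timed out
	fmt.Println(acknowledged(3, 3)) // true: the state by gather time
}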

[1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-upgrade-from-stable-4.10-e2e-metal-ipi-upgrade-ovn-ipv6/1516899498356379648

Comment 8 Devan Goodwin 2022-05-09 13:23:35 UTC
Just wanted to check in and see if there's been any luck with this over the last few weeks. The test still looks permafailing on two metal/ipi/ovn/ipv6 upgrade jobs: https://sippy.dptools.openshift.org/sippy-ng/tests/4.11/analysis?test=Cluster%20upgrade.%5Bsig-cluster-lifecycle%5D%20Cluster%20version%20operator%20acknowledges%20upgrade

Comment 11 W. Trevor King 2022-05-26 03:08:18 UTC
[1] still has the slow signature attempt:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-metal-ipi-upgrade-ovn-ipv6/1529546657899745280/artifacts/e2e-metal-ipi-upgrade-ovn-ipv6/gather-extra/artifacts/events.json | jq -r '[.items[] | select(.metadata.namespace == "openshift-cluster-version")] | sort_by(.metadata.creationTimestamp)[] | .metadata.creationTimestamp + " " + .reason + ": " + .message'
...
2022-05-25T21:07:40Z UpgradeStarted: version/ image/registry.build03.ci.openshift.org/ci-op-8lyycj5b/release@sha256:83ca476a63dfafa49e35cab2ded1fbf3991cc3483875b1bf639eabda31faadfd
2022-05-25T21:07:40Z RetrievePayload: Retrieving and verifying payload version="" image="registry.build03.ci.openshift.org/ci-op-8lyycj5b/release@sha256:83ca476a63dfafa49e35cab2ded1fbf3991cc3483875b1bf639eabda31faadfd"
2022-05-25T21:09:41Z SuccessfulCreate: Created pod: version--4jjhp-zmt5t
...
2022-05-25T21:09:53Z RetrievePayload: 
Target release version="" image="registry.build03.ci.openshift.org/ci-op-8lyycj5b/release@sha256:83ca476a63dfafa49e35cab2ded1fbf3991cc3483875b1bf639eabda31faadfd" cannot be verified, but continuing anyway because the update was forced: unable to verify sha256:83ca476a63dfafa49e35cab2ded1fbf3991cc3483875b1bf639eabda31faadfd against keyrings: verifier-public-key-redhat
[2022-05-25T21:07:41Z: prefix sha256-83ca476a63dfafa49e35cab2ded1fbf3991cc3483875b1bf639eabda31faadfd in config map signatures-managed: no more signatures to check, 2022-05-25T21:09:40Z: Get "https://storage.googleapis.com/openshift-release/official/signatures/openshift/release/sha256=83ca476a63dfafa49e35cab2ded1fbf3991cc3483875b1bf639eabda31faadfd/signature-1": context deadline exceeded, 2022-05-25T21:09:40Z: Get "https://mirror.openshift.com/pub/openshift-v4/signatures/openshift/release/sha256=83ca476a63dfafa49e35cab2ded1fbf3991cc3483875b1bf639eabda31faadfd/signature-1": context deadline exceeded, 2022-05-25T21:09:40Z: context deadline exceeded]
2022-05-25T21:09:53Z LoadPayload: Loading payload version="" image="registry.build03.ci.openshift.org/ci-op-8lyycj5b/release@sha256:83ca476a63dfafa49e35cab2ded1fbf3991cc3483875b1bf639eabda31faadfd"
...

So that's our expected 2m timeout for signature verification when 'Force: true'.  And from the new event, we're spending 2m each on the storage.googleapis.com and mirror.openshift.com signature stores: the networking environment is neither failing our requests outright nor letting them succeed, so each lookup hangs until its context deadline.  Comparing with a lookup from my local machine:

$ time curl -s https://mirror.openshift.com/pub/openshift-v4/signatures/openshift/release/sha256=83ca476a63dfafa49e35cab2ded1fbf3991cc3483875b1bf639eabda31faadfd/signature-1 | gpg --decrypt
{"critical": {"image": {"docker-manifest-digest": "sha256:83ca476a63dfafa49e35cab2ded1fbf3991cc3483875b1bf639eabda31faadfd"}, "type": "atomic container signature", "identity": {"docker-reference": "quay.io/openshift-release-dev/ocp-release-nightly:4.11.0-0.nightly-2022-05-25-193227-x86_64"}}, "optional": {"creator": "Red Hat OpenShift Signing Authority 0.0.1"}}gpg: Signature made Wed 25 May 2022 07:10:19 PM PDT
gpg:                using RSA key 938A80CAF21541EB
gpg: Good signature from "Red Hat, Inc. (beta key 2) <security>" [unknown]
gpg: WARNING: This key is not certified with a trusted signature!
gpg:          There is no indication that the signature belongs to the owner.
Primary key fingerprint: B08B 659E E86A F623 BC90  E8DB 938A 80CA F215 41EB

real    0m0.676s
user    0m0.052s
sys     0m0.042s

I'm going to re-scope this bug as "now we have nice events with lots of detail out of the CVO", and we can clone off a new bug for the metal folks to try and figure out why these lookups are taking so long in their environment.
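
As a rough illustration of that failure mode (hypothetical code, not library-go's actual store implementation), a lookup against an endpoint that neither refuses nor answers the connection only returns once its context deadline expires, which is exactly the "context deadline exceeded" the event records:

package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// fetchSignature sketches a single signature-store lookup; the 2m deadline here
// mirrors the observed behavior, not a constant copied from the CVO.
func fetchSignature(parent context.Context, url string) error {
	ctx, cancel := context.WithTimeout(parent, 2*time.Minute)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err // e.g. Get "...": context deadline exceeded when the endpoint hangs
	}
	return resp.Body.Close()
}

func main() {
	err := fetchSignature(context.Background(),
		"https://mirror.openshift.com/pub/openshift-v4/signatures/openshift/release/sha256=83ca476a63dfafa49e35cab2ded1fbf3991cc3483875b1bf639eabda31faadfd/signature-1")
	fmt.Println(err)
}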

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.11-e2e-metal-ipi-upgrade-ovn-ipv6/1529546657899745280

Comment 13 liujia 2022-05-31 08:57:42 UTC
Going through the bug, this looks more like an environment issue; we did not hit such a CI issue in QE's daily testing. So, for this bug's scope, I will verify the CVO's event when forcing through a sig-verification failure.

1. try to upgrade to an unsigned payload with force

Before the PR:
2. check CVO events and logs
# ./oc get events
...
7m21s       Normal    RetrievePayload       clusterversion/version                           Retrieving and verifying payload version="" image="registry.ci.openshift.org/ocp/release@sha256:421fc6020ef4dd99da314477507f76b74910cc6ba65fe5f3a1d576c6a5288aee"
7m8s        Normal    LoadPayload           clusterversion/version                           Loading payload version="" image="registry.ci.openshift.org/ocp/release@sha256:421fc6020ef4dd99da314477507f76b74910cc6ba65fe5f3a1d576c6a5288aee"
7m8s        Normal    PreconditionsPassed   clusterversion/version                           preconditions passed for payload loaded version="4.11.0-0.nightly-2022-05-24-062131" image="registry.ci.openshift.org/ocp/release@sha256:421fc6020ef4dd99da314477507f76b74910cc6ba65fe5f3a1d576c6a5288aee"
7m8s        Normal    PayloadLoaded         clusterversion/version                           Payload loaded version="4.11.0-0.nightly-2022-05-24-062131" image="registry.ci.openshift.org/ocp/release@sha256:421fc6020ef4dd99da314477507f76b74910cc6ba65fe5f3a1d576c6a5288aee"
7m          Normal    LeaderElection        configmap/version                                jliu411a-bsrpr-master-1_db7704ce-da8a-49df-8311-91676d15898b became leader
7m          Normal    RetrievePayload       clusterversion/version                           Retrieving and verifying payload version="" image="registry.ci.openshift.org/ocp/release@sha256:421fc6020ef4dd99da314477507f76b74910cc6ba65fe5f3a1d576c6a5288aee"
7m          Normal    LoadPayload           clusterversion/version                           Loading payload version="" image="registry.ci.openshift.org/ocp/release@sha256:421fc6020ef4dd99da314477507f76b74910cc6ba65fe5f3a1d576c6a5288aee"
6m59s       Normal    PayloadLoaded         clusterversion/version                           Payload loaded version="4.11.0-0.nightly-2022-05-24-062131" image="registry.ci.openshift.org/ocp/release@sha256:421fc6020ef4dd99da314477507f76b74910cc6ba65fe5f3a1d576c6a5288aee"

The signature-verification failure info appears only in the CVO log:
# cat pre|grep -E "Retrieving and verifying|unable to verify|retrieve signature"
I0531 07:48:46.897940       1 event.go:285] Event(v1.ObjectReference{Kind:"ClusterVersion", Namespace:"openshift-cluster-version", Name:"version", UID:"", APIVersion:"config.openshift.io/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'RetrievePayload' Retrieving and verifying payload version="" image="registry.ci.openshift.org/ocp/release@sha256:421fc6020ef4dd99da314477507f76b74910cc6ba65fe5f3a1d576c6a5288aee"
W0531 07:48:47.114291       1 updatepayload.go:116] An image was retrieved from "registry.ci.openshift.org/ocp/release@sha256:421fc6020ef4dd99da314477507f76b74910cc6ba65fe5f3a1d576c6a5288aee" that failed verification: The update cannot be verified: unable to verify sha256:421fc6020ef4dd99da314477507f76b74910cc6ba65fe5f3a1d576c6a5288aee against keyrings: verifier-public-key-redhat

After the PR:
2. check CVO events and logs
The verification-failure info is now included in the event:
# ./oc get events
...
2m28s       Normal    RetrievePayload       clusterversion/version                           Retrieving and verifying payload version="" image="registry.ci.openshift.org/ocp/release@sha256:a3d1450802f9842c0780d4592c09435f2fc5ce43c5b0c613426d1d6840a9d19f"
2m16s       Warning   RetrievePayload       clusterversion/version                           Target release version="" image="registry.ci.openshift.org/ocp/release@sha256:a3d1450802f9842c0780d4592c09435f2fc5ce43c5b0c613426d1d6840a9d19f" cannot be verified, but continuing anyway because the update was forced: unable to verify sha256:a3d1450802f9842c0780d4592c09435f2fc5ce43c5b0c613426d1d6840a9d19f against keyrings: verifier-public-key-redhat...
2m16s       Normal    LoadPayload           clusterversion/version                           Loading payload version="" image="registry.ci.openshift.org/ocp/release@sha256:a3d1450802f9842c0780d4592c09435f2fc5ce43c5b0c613426d1d6840a9d19f"
2m15s       Normal    PreconditionsPassed   clusterversion/version                           preconditions passed for payload loaded version="4.11.0-0.nightly-2022-05-26-013101" image="registry.ci.openshift.org/ocp/release@sha256:a3d1450802f9842c0780d4592c09435f2fc5ce43c5b0c613426d1d6840a9d19f"
2m15s       Normal    PayloadLoaded         clusterversion/version                           Payload loaded version="4.11.0-0.nightly-2022-05-26-013101" image="registry.ci.openshift.org/ocp/release@sha256:a3d1450802f9842c0780d4592c09435f2fc5ce43c5b0c613426d1d6840a9d19f"
2m10s       Normal    LeaderElection        configmap/version                                jliu411b-gngwf-master-1_e2d18eec-cec7-4f95-af7d-d2e4c88026e1 became leader
2m10s       Normal    RetrievePayload       clusterversion/version                           Retrieving and verifying payload version="" image="registry.ci.openshift.org/ocp/release@sha256:a3d1450802f9842c0780d4592c09435f2fc5ce43c5b0c613426d1d6840a9d19f"
2m10s       Normal    LoadPayload           clusterversion/version                           Loading payload version="" image="registry.ci.openshift.org/ocp/release@sha256:a3d1450802f9842c0780d4592c09435f2fc5ce43c5b0c613426d1d6840a9d19f"
2m9s        Normal    PayloadLoaded         clusterversion/version                           Payload loaded version="4.11.0-0.nightly-2022-05-26-013101" image="registry.ci.openshift.org/ocp/release@sha256:a3d1450802f9842c0780d4592c09435f2fc5ce43c5b0c613426d1d6840a9d19f"

Signature-verification details for each store are also added to the CVO log:
# cat post|grep -E "Retrieving and verifying|unable to verify|retrieve signature"
I0531 08:36:18.114647       1 event.go:285] Event(v1.ObjectReference{Kind:"ClusterVersion", Namespace:"openshift-cluster-version", Name:"version", UID:"", APIVersion:"config.openshift.io/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'RetrievePayload' Retrieving and verifying payload version="" image="registry.ci.openshift.org/ocp/release@sha256:a3d1450802f9842c0780d4592c09435f2fc5ce43c5b0c613426d1d6840a9d19f"
W0531 08:36:18.334456       1 updatepayload.go:117] Target release version="" image="registry.ci.openshift.org/ocp/release@sha256:a3d1450802f9842c0780d4592c09435f2fc5ce43c5b0c613426d1d6840a9d19f" cannot be verified, but continuing anyway because the update was forced: unable to verify sha256:a3d1450802f9842c0780d4592c09435f2fc5ce43c5b0c613426d1d6840a9d19f against keyrings: verifier-public-key-redhat
Target release version="" image="registry.ci.openshift.org/ocp/release@sha256:a3d1450802f9842c0780d4592c09435f2fc5ce43c5b0c613426d1d6840a9d19f" cannot be verified, but continuing anyway because the update was forced: unable to verify sha256:a3d1450802f9842c0780d4592c09435f2fc5ce43c5b0c613426d1d6840a9d19f against keyrings: verifier-public-key-redhat
[2022-05-31T08:36:18Z: prefix sha256-a3d1450802f9842c0780d4592c09435f2fc5ce43c5b0c613426d1d6840a9d19f in config map signatures-managed: no more signatures to check, 2022-05-31T08:36:18Z: unable to retrieve signature from https://storage.googleapis.com/openshift-release/official/signatures/openshift/release/sha256=a3d1450802f9842c0780d4592c09435f2fc5ce43c5b0c613426d1d6840a9d19f/signature-1: no more signatures to check, 2022-05-31T08:36:18Z: unable to retrieve signature from https://mirror.openshift.com/pub/openshift-v4/signatures/openshift/release/sha256=a3d1450802f9842c0780d4592c09435f2fc5ce43c5b0c613426d1d6840a9d19f/signature-1: no more signatures to check, 2022-05-31T08:36:18Z: parallel signature store wrapping containers/image signature store under https://storage.googleapis.com/openshift-release/official/signatures/openshift/release, containers/image signature store under https://mirror.openshift.com/pub/openshift-v4/signatures/openshift/release: no more signatures to check, 2022-05-31T08:36:18Z: serial signature store wrapping config maps in openshift-config-managed with label "release.openshift.io/verification-signatures", parallel signature store wrapping containers/image signature store under https://storage.googleapis.com/openshift-release/official/signatures/openshift/release, containers/image signature store under https://mirror.openshift.com/pub/openshift-v4/signatures/openshift/release: no more signatures to check]
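
For readers decoding that last log line, here is a rough sketch (hypothetical types, not library-go's actual pkg/verify/store API) of the layering it names: a serial store that consults the openshift-config-managed config maps first, wrapping a parallel store that queries both HTTPS signature mirrors concurrently:

package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// store is a hypothetical stand-in for a signature source.
type store func(ctx context.Context, digest string) ([][]byte, error)

// serial tries each store in order and returns the first signatures it finds.
func serial(stores ...store) store {
	return func(ctx context.Context, digest string) ([][]byte, error) {
		for _, s := range stores {
			if sigs, err := s(ctx, digest); err == nil && len(sigs) > 0 {
				return sigs, nil
			}
		}
		return nil, errors.New("no more signatures to check")
	}
}

// parallel queries all stores concurrently and gathers whatever they return.
func parallel(stores ...store) store {
	return func(ctx context.Context, digest string) ([][]byte, error) {
		var (
			mu   sync.Mutex
			sigs [][]byte
			wg   sync.WaitGroup
		)
		for _, s := range stores {
			wg.Add(1)
			go func(s store) {
				defer wg.Done()
				if found, err := s(ctx, digest); err == nil {
					mu.Lock()
					sigs = append(sigs, found...)
					mu.Unlock()
				}
			}(s)
		}
		wg.Wait()
		if len(sigs) == 0 {
			return nil, errors.New("no more signatures to check")
		}
		return sigs, nil
	}
}

func main() {
	empty := func(ctx context.Context, digest string) ([][]byte, error) {
		return nil, errors.New("no more signatures to check")
	}
	// Config maps first, then the two mirrors in parallel, as in the log above.
	combined := serial(empty, parallel(empty, empty))
	_, err := combined(context.Background(), "sha256:a3d1450802f9842c0780d4592c09435f2fc5ce43c5b0c613426d1d6840a9d19f")
	fmt.Println(err)
}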

Comment 15 errata-xmlrpc 2022-08-10 11:03:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

