Bug 2090680 - upgrade for a disconnected cluster hangs on retrieving and verifying payload
Summary: upgrade for a disconnected cluster hangs on retrieving and verifying payload
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.13.0
Assignee: Petr Muller
QA Contact: Evgeni Vakhonin
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-05-26 10:46 UTC by Johnny Liu
Modified: 2023-05-17 22:47 UTC
CC List: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-05-17 22:46:53 UTC
Target Upstream Version:
Embargoed:
evakhoni: needinfo-




Links
- Github openshift/cluster-version-operator pull 846 (Merged): Bug 2090680: pkg/cvo/updatepayload.go: timeout payload retrieval (last updated 2023-01-04 12:25:38 UTC)
- Github openshift/cluster-version-operator pull 881 (Merged): Revert "pkg/cvo/updatepayload.go: timeout payload retrieval" (last updated 2023-02-20 09:52:21 UTC)
- Github openshift/cluster-version-operator pull 896 (Merged): Bug 2090680: RetrievePayload: Improve timeouts and cover behavior with tests (last updated 2023-02-20 09:52:24 UTC)
- Red Hat Product Errata RHSA-2023:1326 (last updated 2023-05-17 22:47:12 UTC)

Description Johnny Liu 2022-05-26 10:46:38 UTC
Description of problem:

Version-Release number of the following components:
4.11.0-0.nightly-2022-05-25-123329

How reproducible:
Always

Steps to Reproduce:
1. set up a cluster in a restricted network using 4.11.0-0.nightly-2022-05-25-123329
2. mirror 4.11.0-0.nightly-2022-05-25-193227 to private registry
3. upgrade the cluster to 4.11.0-0.nightly-2022-05-25-193227 without --force option
$ oc adm upgrade --allow-explicit-upgrade --to-image registry.ci.openshift.org/ocp/release@sha256:83ca476a63dfafa49e35cab2ded1fbf3991cc3483875b1bf639eabda31faadfd


Actual results:
Waited for 3+ hours; there is no upgrade history entry in clusterversion, and the event log only shows "Retrieving and verifying payload".

[root@preserve-jialiu-ansible ~]# oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-05-25-123329   True        False         160m    Cluster version is 4.11.0-0.nightly-2022-05-25-123329

[root@preserve-jialiu-ansible ~]# oc get clusterversion -o yaml
apiVersion: v1
items:
- apiVersion: config.openshift.io/v1
  kind: ClusterVersion
  metadata:
    creationTimestamp: "2022-05-26T03:51:28Z"
    generation: 3
    name: version
    resourceVersion: "62069"
    uid: b5674b4b-7295-4287-904c-94fe1112659b
  spec:
    channel: stable-4.11
    clusterID: 027285eb-b4ea-4127-85b6-031c1af7db72
    desiredUpdate:
      force: false
      image: registry.ci.openshift.org/ocp/release@sha256:83ca476a63dfafa49e35cab2ded1fbf3991cc3483875b1bf639eabda31faadfd
      version: ""
  status:
    availableUpdates: null
    capabilities:
      enabledCapabilities:
      - baremetal
      - marketplace
      - openshift-samples
      knownCapabilities:
      - baremetal
      - marketplace
      - openshift-samples
    conditions:
    - lastTransitionTime: "2022-05-26T03:51:31Z"
      message: Capabilities match configured spec
      reason: AsExpected
      status: "False"
      type: ImplicitlyEnabledCapabilities
    - lastTransitionTime: "2022-05-26T03:51:31Z"
      message: Payload loaded version="4.11.0-0.nightly-2022-05-25-123329" image="registry.ci.openshift.org/ocp/release@sha256:13bfc31eb4a284ce691e848c25d9120dbde3f0852d4be64be4b90953ac914bf1"
      reason: PayloadLoaded
      status: "True"
      type: ReleaseAccepted
    - lastTransitionTime: "2022-05-26T04:23:06Z"
      message: Done applying 4.11.0-0.nightly-2022-05-25-123329
      status: "True"
      type: Available
    - lastTransitionTime: "2022-05-26T04:21:21Z"
      status: "False"
      type: Failing
    - lastTransitionTime: "2022-05-26T04:23:06Z"
      message: Cluster version is 4.11.0-0.nightly-2022-05-25-123329
      status: "False"
      type: Progressing
    - lastTransitionTime: "2022-05-26T03:51:31Z"
      message: 'Unable to retrieve available updates: Get "https://api.openshift.com/api/upgrades_info/v1/graph?arch=amd64&channel=stable-4.11&id=027285eb-b4ea-4127-85b6-031c1af7db72&version=4.11.0-0.nightly-2022-05-25-123329":
        dial tcp 34.228.45.157:443: connect: connection timed out'
      reason: RemoteFailed
      status: "False"
      type: RetrievedUpdates
    desired:
      image: registry.ci.openshift.org/ocp/release@sha256:13bfc31eb4a284ce691e848c25d9120dbde3f0852d4be64be4b90953ac914bf1
      version: 4.11.0-0.nightly-2022-05-25-123329
    history:
    - completionTime: "2022-05-26T04:23:06Z"
      image: registry.ci.openshift.org/ocp/release@sha256:13bfc31eb4a284ce691e848c25d9120dbde3f0852d4be64be4b90953ac914bf1
      startedTime: "2022-05-26T03:51:31Z"
      state: Completed
      verified: false
      version: 4.11.0-0.nightly-2022-05-25-123329
    observedGeneration: 2
    versionHash: jOIXVtM5Y-g=
kind: List
metadata:
  resourceVersion: ""

[root@preserve-jialiu-ansible ~]# oc get event -n openshift-cluster-version
LAST SEEN   TYPE      REASON              OBJECT                                          MESSAGE
3h11m       Warning   FailedScheduling    pod/cluster-version-operator-b4b6c5f9b-p7fjq    no nodes available to schedule pods
3h9m        Warning   FailedScheduling    pod/cluster-version-operator-b4b6c5f9b-p7fjq    no nodes available to schedule pods
3h4m        Normal    Scheduled           pod/cluster-version-operator-b4b6c5f9b-p7fjq    Successfully assigned openshift-cluster-version/cluster-version-operator-b4b6c5f9b-p7fjq to jialiu411a-5nb8n-master-2 by jialiu411a-5nb8n-bootstrap
3h2m        Warning   FailedMount         pod/cluster-version-operator-b4b6c5f9b-p7fjq    MountVolume.SetUp failed for volume "serving-cert" : secret "cluster-version-operator-serving-cert" not found
3h1m        Warning   FailedMount         pod/cluster-version-operator-b4b6c5f9b-p7fjq    Unable to attach or mount volumes: unmounted volumes=[serving-cert], unattached volumes=[etc-ssl-certs etc-cvo-updatepayloads serving-cert service-ca kube-api-access]: timed out waiting for the condition
3h1m        Normal    Pulling             pod/cluster-version-operator-b4b6c5f9b-p7fjq    Pulling image "registry.ci.openshift.org/ocp/release@sha256:13bfc31eb4a284ce691e848c25d9120dbde3f0852d4be64be4b90953ac914bf1"
3h1m        Normal    Pulled              pod/cluster-version-operator-b4b6c5f9b-p7fjq    Successfully pulled image "registry.ci.openshift.org/ocp/release@sha256:13bfc31eb4a284ce691e848c25d9120dbde3f0852d4be64be4b90953ac914bf1" in 1.384468759s
3h1m        Normal    Created             pod/cluster-version-operator-b4b6c5f9b-p7fjq    Created container cluster-version-operator
3h1m        Normal    Started             pod/cluster-version-operator-b4b6c5f9b-p7fjq    Started container cluster-version-operator
3h11m       Normal    SuccessfulCreate    replicaset/cluster-version-operator-b4b6c5f9b   Created pod: cluster-version-operator-b4b6c5f9b-p7fjq
3h11m       Normal    ScalingReplicaSet   deployment/cluster-version-operator             Scaled up replica set cluster-version-operator-b4b6c5f9b to 1
3h12m       Normal    LeaderElection      configmap/version                               jialiu411a-5nb8n-bootstrap_0a3ff57f-66cf-4f93-bbe0-484effcc4383 became leader
3h12m       Normal    RetrievePayload     clusterversion/version                          Retrieving and verifying payload version="4.11.0-0.nightly-2022-05-25-123329" image="registry.ci.openshift.org/ocp/release@sha256:13bfc31eb4a284ce691e848c25d9120dbde3f0852d4be64be4b90953ac914bf1"
3h12m       Normal    LoadPayload         clusterversion/version                          Loading payload version="4.11.0-0.nightly-2022-05-25-123329" image="registry.ci.openshift.org/ocp/release@sha256:13bfc31eb4a284ce691e848c25d9120dbde3f0852d4be64be4b90953ac914bf1"
3h12m       Normal    PayloadLoaded       clusterversion/version                          Payload loaded version="4.11.0-0.nightly-2022-05-25-123329" image="registry.ci.openshift.org/ocp/release@sha256:13bfc31eb4a284ce691e848c25d9120dbde3f0852d4be64be4b90953ac914bf1"
166m        Normal    LeaderElection      configmap/version                               jialiu411a-5nb8n-master-2_83752e0b-1ef4-4c69-814f-8eeb54d50781 became leader
166m        Normal    RetrievePayload     clusterversion/version                          Retrieving and verifying payload version="4.11.0-0.nightly-2022-05-25-123329" image="registry.ci.openshift.org/ocp/release@sha256:13bfc31eb4a284ce691e848c25d9120dbde3f0852d4be64be4b90953ac914bf1"
166m        Normal    LoadPayload         clusterversion/version                          Loading payload version="4.11.0-0.nightly-2022-05-25-123329" image="registry.ci.openshift.org/ocp/release@sha256:13bfc31eb4a284ce691e848c25d9120dbde3f0852d4be64be4b90953ac914bf1"
166m        Normal    PayloadLoaded       clusterversion/version                          Payload loaded version="4.11.0-0.nightly-2022-05-25-123329" image="registry.ci.openshift.org/ocp/release@sha256:13bfc31eb4a284ce691e848c25d9120dbde3f0852d4be64be4b90953ac914bf1"
77m         Normal    RetrievePayload     clusterversion/version                          Retrieving and verifying payload version="" image="registry.ci.openshift.org/ocp/release@sha256:83ca476a63dfafa49e35cab2ded1fbf3991cc3483875b1bf639eabda31faadfd"

Expected results:
The CVO and `oc adm upgrade` should clearly tell the user what went wrong, rather than sitting there for a long time without any information.

Additional info:
Trying the same upgrade path against a connected cluster, the upgrade kicks off quickly and there is no such issue.

Comment 2 Johnny Liu 2022-05-26 13:34:42 UTC
After waiting for 7+ hours, the image validation error message was finally printed:

$ oc adm upgrade
<--snip-->
ReleaseAccepted=False

  Reason: RetrievePayload
  Message: Retrieving payload failed version="" image="registry.ci.openshift.org/ocp/release@sha256:83ca476a63dfafa49e35cab2ded1fbf3991cc3483875b1bf639eabda31faadfd" failure=The update cannot be verified: unable to verify sha256:83ca476a63dfafa49e35cab2ded1fbf3991cc3483875b1bf639eabda31faadfd against keyrings: verifier-public-key-redhat

Comment 3 Jack Ottofaro 2022-05-26 14:54:31 UTC
It's not hung. The CVO will continue to attempt to load the desired payload and will update the ReleaseAccepted condition, which indicates that the retrieval failed:

{
                       "lastTransitionTime": "2022-05-26T13:58:55Z",
                        "message": "Retrieving payload failed version=\"\" image=\"registry.ci.openshift.org/ocp/release:4.11.0-0.nightly-2022-05-20-21392\" failure=Unable to download and prepare the update: deadline exceeded, reason: \"DeadlineExceeded\", message: \"Job was active longer than specified deadline\"",
                        "reason": "RetrievePayload",
                        "status": "False",
                        "type": "ReleaseAccepted"
}

If this were a temporary issue, e.g. a network connection flake that clears, the retrieval would succeed and the update would continue.

This also occurs if the user specifies a non-existent release, e.g. due to a typo in the "--to-image" name. The user can simply clear the current upgrade request and then specify the correct upgrade.

Comment 4 Johnny Liu 2022-05-26 15:33:11 UTC
From a user-experience perspective, waiting 7+ hours to get an error message is not much different from a hang.

And this is not a new test: when I was running 4.9 -> 4.10 testing, an image validation failure was returned within 5 minutes and reported to the user. The behaviour has obviously changed.

Comment 5 liujia 2022-05-27 00:27:15 UTC
The waiting time is far too long (from 05:46:29 to 11:01:03), and neither `oc get clusterversion` nor `oc adm upgrade` shows any status indicating that the CVO is attempting to retrieve and verify the payload. So to the user who triggered the upgrade, it looks as if the upgrade was not triggered at all.

# cat cvo.log|grep "ca476a63dfafa49e35cab2ded1fbf3991cc3483875b1bf639eabda31faadfd"|head -n3
I0526 05:46:29.770637       1 cvo.go:553] Desired version from spec is v1.Update{Version:"", Image:"registry.ci.openshift.org/ocp/release@sha256:83ca476a63dfafa49e35cab2ded1fbf3991cc3483875b1bf639eabda31faadfd", Force:false}
I0526 05:46:29.770673       1 sync_worker.go:761] Detected while considering cluster version generation 3: version changed (from {4.11.0-0.nightly-2022-05-25-123329 registry.ci.openshift.org/ocp/release@sha256:13bfc31eb4a284ce691e848c25d9120dbde3f0852d4be64be4b90953ac914bf1 false} to { registry.ci.openshift.org/ocp/release@sha256:83ca476a63dfafa49e35cab2ded1fbf3991cc3483875b1bf639eabda31faadfd false})
I0526 05:46:29.770679       1 sync_worker.go:244] syncPayload: registry.ci.openshift.org/ocp/release@sha256:83ca476a63dfafa49e35cab2ded1fbf3991cc3483875b1bf639eabda31faadfd (force=false)

# cat cvo.log|grep "The update cannot be verified"|head -n1
I0526 11:01:03.176815       1 sync_worker.go:397] loadUpdatedPayload syncPayload err=The update cannot be verified: unable to verify sha256:83ca476a63dfafa49e35cab2ded1fbf3991cc3483875b1bf639eabda31faadfd against keyrings: verifier-public-key-redh

Comment 6 Lalatendu Mohanty 2022-06-01 20:03:17 UTC
Setting this to a medium-severity bug as it will hit a very small percentage of clusters and will not impact the availability of the cluster.

Comment 7 Jack Ottofaro 2022-10-07 13:53:27 UTC
When we separated payload load from payload apply (https://github.com/openshift/cluster-version-operator/pull/683), the context used for the retrieval changed as well. It went from one that was constrained by syncTimeout (2-4 minutes) [1] to the unconstrained shutdownContext [2]. However, if "force" is specified we explicitly set a 2-minute timeout in RetrievePayload. We should create a new context with a reasonable timeout for RetrievePayload regardless of "force".

Not sure what was going on with the connection during the 7+ hours and why it didn't time out or error sooner.

[1] https://github.com/openshift/cluster-version-operator/blob/57ffa7c610fb92ef4ccd9e9c49e75915e86e9296/pkg/cvo/sync_worker.go#L605
[2] https://github.com/openshift/cluster-version-operator/blob/57ffa7c610fb92ef4ccd9e9c49e75915e86e9296/pkg/cvo/cvo.go#L413
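
For illustration, a minimal, self-contained Go sketch of the kind of change suggested above: always deriving a bounded context for payload retrieval regardless of "force". The retrievePayload function, its signature, the image reference, and the durations are all stand-ins for this example, not the actual CVO code in pkg/cvo/updatepayload.go.

package main

import (
	"context"
	"fmt"
	"time"
)

// retrievePayload is a hypothetical stand-in for the CVO's payload retrieval;
// the real implementation lives in pkg/cvo/updatepayload.go.
func retrievePayload(ctx context.Context, image string) error {
	select {
	case <-time.After(10 * time.Second): // simulate a registry pull that stalls on a disconnected cluster
		return nil
	case <-ctx.Done():
		return fmt.Errorf("unable to download and prepare the update: %w", ctx.Err())
	}
}

func main() {
	// Derive a bounded context for the retrieval regardless of --force, instead of
	// inheriting the unconstrained shutdown context. The timeouts discussed in this
	// bug are on the order of 2-4 minutes; seconds are used here so the example
	// finishes quickly.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	if err := retrievePayload(ctx, "registry.example.com/ocp/release@sha256:..."); err != nil {
		// With the bounded context, a stalled retrieval surfaces within the timeout
		// instead of hanging for hours.
		fmt.Println("RetrievePayloadFailed:", err)
		return
	}
	fmt.Println("payload retrieved")
}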

Comment 8 Jack Ottofaro 2022-10-11 13:04:37 UTC
I did not see anything in the must-gather's CVO log to help us understand what was going on during the 7+ hours in the verification step of RetrievePayload. If you still have the complete CVO log file that contained the "The update cannot be verified" log entry referenced in Comment 5, could you please attach it? If not, would it be possible to repeat this same test and then attach the resulting CVO log? Thanks.

Comment 12 Evgeni Vakhonin 2022-12-26 13:14:33 UTC
verifying an upgrade from 4.13.0-0.nightly-2022-12-23-174810 to 4.13.0-0.nightly-2022-12-23-223710 on a disconnected private cluster

mirrored unsigned 4.13.0-0.nightly-2022-12-23-223710 into the registry

sent upgrade command without force (time: 12:16:34)
# oc adm upgrade --allow-explicit-upgrade --to-image registry.ci.openshift.org/ocp/release@sha256:76259072b4df3c6abbfc66f84795b94cd7bd632198783e4a24f8295ffd928f23

monitored "oc adm upgrade" for changes. 4 minutes later (time: 12:20:36) received:
Cluster version is 4.13.0-0.nightly-2022-12-23-174810

ReleaseAccepted=False

  Reason: RetrievePayload
  Message: Retrieving payload failed version="" image="registry.ci.openshift.org/ocp/release@sha256:76259072b4df3c6abbfc66f84795b94cd7bd632198783e4a24f8295ffd928f23" failure=The update cann
ot be verified: unable to verify sha256:76259072b4df3c6abbfc66f84795b94cd7bd632198783e4a24f8295ffd928f23 against keyrings: verifier-public-key-redhat

Upstream is unset, so the cluster will use an appropriate default.
Channel: stable-4.12
warning: Cannot display available updates:
  Reason: RemoteFailed
  Message: Unable to retrieve available updates: Get "https://api.openshift.com/api/upgrades_info/v1/graph?arch=amd64&channel=stable-4.12&id=ea18d1b1-9c41-412c-8ff2-6f8d87bb99bd&version=4.1
3.0-0.nightly-2022-12-23-174810": dial tcp 54.234.217.153:443: connect: connection timed out

in cvo log:

I1226 12:16:36.486777       1 cvo.go:596] Desired version from spec is v1.Update{Architecture:"", Version:"", Image:"registry.ci.openshift.org/ocp/release@sha256:76259072b4df3c6abbfc66f84795b94cd7bd632198783e4a24f8295ffd928f23", Force:false}
I1226 12:16:36.486815       1 sync_worker.go:811] Detected while considering cluster version generation 3: version changed (from { 4.13.0-0.nightly-2022-12-23-174810 registry.ci.openshift.org/ocp/release@sha256:83765f4d554616fd80e1c833855f206c317731cbb498d6098c4897237e77086c false} to {  registry.ci.openshift.org/ocp/release@sha256:76259072b4df3c6abbfc66f84795b94cd7bd632198783e4a24f8295ffd928f23 false})
I1226 12:16:36.486822       1 sync_worker.go:261] syncPayload: registry.ci.openshift.org/ocp/release@sha256:76259072b4df3c6abbfc66f84795b94cd7bd632198783e4a24f8295ffd928f23 (force=false)
I1226 12:16:36.487123       1 event.go:285] Event(v1.ObjectReference{Kind:"ClusterVersion", Namespace:"openshift-cluster-version", Name:"version", UID:"", APIVersion:"config.openshift.io/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'RetrievePayload' Retrieving and verifying payload version="" image="registry.ci.openshift.org/ocp/release@sha256:76259072b4df3c6abbfc66f84795b94cd7bd632198783e4a24f8295ffd928f23"
I1226 12:20:36.487304       1 sync_worker.go:437] loadUpdatedPayload syncPayload err=The update cannot be verified: unable to verify sha256:76259072b4df3c6abbfc66f84795b94cd7bd632198783e4a24f8295ffd928f23 against keyrings: verifier-public-key-redhat

in events
2m1s        Normal    RetrievePayload         clusterversion/version                          Retrieving and verifying payload version="" image="registry.ci.openshift.org/ocp/release@sha256:76259072b4df3c6abbfc66f84795b94cd7bd632198783e4a24f8295ffd928f23"
2m1s        Warning   RetrievePayloadFailed   clusterversion/version                          Retrieving payload failed version="" image="registry.ci.openshift.org/ocp/release@sha256:76259072b4df3c6abbfc66f84795b94cd7bd632198783e4a24f8295ffd928f23" failure=The update cannot be verified: unable to verify sha256:76259072b4df3c6abbfc66f84795b94cd7bd632198783e4a24f8295ffd928f23 against keyrings: verifier-public-key-redhat

as expected

Comment 15 Evgeni Vakhonin 2023-01-03 16:50:21 UTC
oh great! I made a typo while trying to mention Jack, and now Bugzilla is not letting me edit the comment!
anyway, the error was:

ReleaseAccepted=False
  Reason: RetrievePayload
  Message: Retrieving payload failed version="" image="registry.ci.openshift.org/ocp/release@sha256:48f9326c2defcd5a801725246eb3efdc6d3a1013e6cee8f379
aa0a757ebfaba9" failure=Unable to download and prepare the update: client rate limiter Wait returned an error: context deadline exceeded

Comment 17 Johnny Liu 2023-01-04 12:23:26 UTC
Per Evgeni's test result, it seems the fix PR introduced a critical regression that blocks disconnected cluster upgrades even with the --force option. It is going to block many test cases, so I am raising the severity.

Comment 19 Evgeni Vakhonin 2023-01-04 15:21:25 UTC
OK, tested on the same configuration from 4.13.0-0.nightly-2022-12-22-120609 to 4.13.0-0.nightly-2022-12-23-174810 (where 22-120609 predates the PR) and it upgrades successfully. So there is indeed a regression caused by the fix, and it looks like it affects the outgoing CVO.

Comment 20 Petr Muller 2023-02-03 14:22:48 UTC
@evakhoni I finally managed to polish https://github.com/openshift/cluster-version-operator/pull/896 enough. I added unit tests that should simulate both the original (this) bug and the regression from https://bugzilla.redhat.com/show_bug.cgi?id=2090680#c14. Still, it seems this part of the CVO tends to cause surprises, so it would be really great if you could give that PR a shot, targeting both the original bug and the regression.

Comment 21 Evgeni Vakhonin 2023-02-05 16:39:41 UTC
@pmuller np! I'll try to pre-merge verify it, including regression testing.

Comment 22 Evgeni Vakhonin 2023-02-06 13:54:15 UTC
pre-merge verifying using a 4.13 cluster-bot build from https://github.com/openshift/cluster-version-operator/pull/896

mirrored unsigned build and applied upgrade without force
date 
Mon 06 Feb 2023 12:45:48 IST
╰─ oc adm upgrade --to-image registry.build05.ci.openshift.org/ci-ln-dl17ckk/release@sha256:a348e4ecf7d7dc175246aefb80db9711fc4737438822b90676dcb549ad3920ae --allow-explicit-upgrade 
warning: The requested upgrade image is not one of the available updates.You have used --allow-explicit-upgrade for the update to proceed anyway
Requested update to release image registry.build05.ci.openshift.org/ci-ln-dl17ckk/release@sha256:a348e4ecf7d7dc175246aefb80db9711fc4737438822b90676dcb549ad3920ae

╰─ oc adm upgrade
Cluster version is 4.13.0-0.ci.test-2023-02-06-062603-ci-ln-dkrq3mk-latest

Upstream is unset, so the cluster will use an appropriate default.
Channel: stable-4.12
warning: Cannot display available updates:
  Reason: RemoteFailed
  Message: Unable to retrieve available updates: Get "https://api.openshift.com/api/upgrades_info/v1/graph?arch=amd64&channel=stable-4.12&id=080d4192-46bc-496e-876a-01c5ff6d4c4a&version=4.13.0-0.ci.test-2023-02-06-062603-ci-ln-dkrq3mk-latest": dial tcp 44.199.160.73:443: connect: connection timed out

------------------ 2 minutes later --------

Mon 06 Feb 2023 12:47:49 IST
Cluster version is 4.13.0-0.ci.test-2023-02-06-062603-ci-ln-dkrq3mk-latest

ReleaseAccepted=False

  Reason: RetrievePayload
  Message: Retrieving payload failed version="" image="registry.build05.ci.openshift.org/ci-ln-dl17ckk/release@sha256:a348e4ecf7d7dc175246aefb80db9711fc4737438822b90676dcb549ad3920ae" failure=The update cannot be verified: unable to verify sha256:a348e4ecf7d7dc175246aefb80db9711fc4737438822b90676dcb549ad3920ae against keyrings: verifier-public-key-redhat

Upstream is unset, so the cluster will use an appropriate default.
Channel: stable-4.12
warning: Cannot display available updates:
  Reason: RemoteFailed
  Message: Unable to retrieve available updates: Get "https://api.openshift.com/api/upgrades_info/v1/graph?arch=amd64&channel=stable-4.12&id=080d4192-46bc-496e-876a-01c5ff6d4c4a&version=4.13.0-0.ci.test-2023-02-06-062603-ci-ln-dkrq3mk-latest": dial tcp 44.199.160.73:443: connect: connection timed out

as expected.

in cvo log:
I0206 10:45:47.783591       1 event.go:285] Event(v1.ObjectReference{Kind:"ClusterVersion", Namespace:"openshift-cluster-version", Name:"version", UID:"", APIVersion:"config.openshift.io/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'RetrievePayload' Retrieving and verifying payload version="" image="registry.build05.ci.openshift.org/ci-ln-dl17ckk/release@sha256:a348e4ecf7d7dc175246aefb80db9711fc4737438822b90676dcb549ad3920ae"
I0206 10:47:47.783625       1 sync_worker.go:437] loadUpdatedPayload syncPayload err=The update cannot be verified: unable to verify sha256:a348e4ecf7d7dc175246aefb80db9711fc4737438822b90676dcb549ad3920ae against keyrings: verifier-public-key-redhat

events:
oc get event -n openshift-cluster-version
.........
75s         Normal    RetrievePayload         clusterversion/version                           Retrieving and verifying payload version="" image="registry.build05.ci.openshift.org/ci-ln-dl17ckk/release@sha256:a348e4ecf7d7dc175246aefb80db9711fc4737438822b90676dcb549ad3920ae"
75s         Warning   RetrievePayloadFailed   clusterversion/version                           Retrieving payload failed version="" image="registry.build05.ci.openshift.org/ci-ln-dl17ckk/release@sha256:a348e4ecf7d7dc175246aefb80db9711fc4737438822b90676dcb549ad3920ae" failure=The update cannot be verified: unable to verify sha256:a348e4ecf7d7dc175246aefb80db9711fc4737438822b90676dcb549ad3920ae against keyrings: verifier-public-key-redhat


regression testing:
applying the same upgrade target, without an ImageContentSourcePolicy for the target repository path:
date
Mon 06 Feb 2023 12:54:32 IST
╰─ oc adm upgrade --to-image registry.build05.ci.openshift.org/ci-ln-dl17ckk/release@sha256:a348e4ecf7d7dc175246aefb80db9711fc4737438822b90676dcb549ad3920ae --allow-explicit-upgrade --force
warning: The requested upgrade image is not one of the available updates.You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Requested update to release image registry.build05.ci.openshift.org/ci-ln-dl17ckk/release@sha256:a348e4ecf7d7dc175246aefb80db9711fc4737438822b90676dcb549ad3920ae

╰─ oc adm upgrade
Cluster version is 4.13.0-0.ci.test-2023-02-06-062603-ci-ln-dkrq3mk-latest

ReleaseAccepted=False

  Reason: RetrievePayload
  Message: Retrieving payload failed version="" image="registry.build05.ci.openshift.org/ci-ln-dl17ckk/release@sha256:a348e4ecf7d7dc175246aefb80db9711fc4737438822b90676dcb549ad3920ae" failure=The update cannot be verified: unable to verify sha256:a348e4ecf7d7dc175246aefb80db9711fc4737438822b90676dcb549ad3920ae against keyrings: verifier-public-key-redhat

Upstream is unset, so the cluster will use an appropriate default.
Channel: stable-4.12
warning: Cannot display available updates:
  Reason: RemoteFailed
  Message: Unable to retrieve available updates: Get "https://api.openshift.com/api/upgrades_info/v1/graph?arch=amd64&channel=stable-4.12&id=080d4192-46bc-496e-876a-01c5ff6d4c4a&version=4.13.0-0.ci.test-2023-02-06-062603-ci-ln-dkrq3mk-latest": dial tcp 44.209.252.60:443: connect: connection timed out

-------------------- after 5+ minutes
Mon 06 Feb 2023 12:59:48 IST
Cluster version is 4.13.0-0.ci.test-2023-02-06-062603-ci-ln-dkrq3mk-latest

ReleaseAccepted=False

  Reason: RetrievePayload
  Message: Retrieving payload failed version="" image="registry.build05.ci.openshift.org/ci-ln-dl17ckk/release@sha256:a348e4ecf7d7dc175246aefb80db9711fc4737438822b90676dcb549ad3920ae" failure=Unable to download and prepare the update: timed out waiting for the condition

Upstream is unset, so the cluster will use an appropriate default.
Channel: stable-4.12
warning: Cannot display available updates:
  Reason: RemoteFailed
  Message: Unable to retrieve available updates: Get "https://api.openshift.com/api/upgrades_info/v1/graph?arch=amd64&channel=stable-4.12&id=080d4192-46bc-496e-876a-01c5ff6d4c4a&version=4.13.0-0.ci.test-2023-02-06-062603-ci-ln-dkrq3mk-latest": dial tcp 44.209.252.60:443: connect: connection timed out

---------------------- after another 4 minutes
Mon 06 Feb 2023 13:03:48 IST
Failing=True:

  Reason: UpdatePayloadFailed
  Message: Could not update serviceaccount "openshift-apiserver-operator/openshift-apiserver-operator" (445 of 833)

Error while reconciling 4.13.0-0.ci.test-2023-02-06-062603-ci-ln-dkrq3mk-latest: the update could not be applied

ReleaseAccepted=False

  Reason: RetrievePayload
  Message: Retrieving payload failed version="" image="registry.build05.ci.openshift.org/ci-ln-dl17ckk/release@sha256:a348e4ecf7d7dc175246aefb80db9711fc4737438822b90676dcb549ad3920ae" failure=Unable to download and prepare the update: timed out waiting for the condition

Upstream is unset, so the cluster will use an appropriate default.
Channel: stable-4.12
warning: Cannot display available updates:
  Reason: RemoteFailed
  Message: Unable to retrieve available updates: Get "https://api.openshift.com/api/upgrades_info/v1/graph?arch=amd64&channel=stable-4.12&id=080d4192-46bc-496e-876a-01c5ff6d4c4a&version=4.13.0-0.ci.test-2023-02-06-062603-ci-ln-dkrq3mk-latest": dial tcp 44.199.160.73:443: connect: connection timed out
------------------------
updated ImageContentSourcePolicy with proper repository path
Mon 06 Feb 2023 13:07:20 IST
╰─ oc edit ImageContentSourcePolicy image-policy-1
imagecontentsourcepolicy.operator.openshift.io/image-policy-1 edited
---------------------- <3 minutes later
Mon 06 Feb 2023 13:10:02 IST
info: An upgrade is in progress. Working towards 4.13.0-0.ci.test-2023-02-06-071146-ci-ln-dl17ckk-latest: 22 of 833 done (2% complete)

Upstream is unset, so the cluster will use an appropriate default.
Channel: stable-4.12
warning: Cannot display available updates:
  Reason: RemoteFailed
  Message: Unable to retrieve available updates: Get "https://api.openshift.com/api/upgrades_info/v1/graph?arch=amd64&channel=stable-4.12&id=080d4192-46bc-496e-876a-01c5ff6d4c4a&version=4.13.0-0.ci.test-2023-02-06-062603-ci-ln-dkrq3mk-latest": context canceled

upgrade is progressing as expected.

another regression test:
reverted and upgraded straight with ImageContentSourcePolicy already in place
Mon 06 Feb 2023 15:05:47 IST
╰─ oc adm upgrade --to-image registry.build05.ci.openshift.org/ci-ln-dkrq3mk/release@sha256:f650a0e99457836880f2b6a4f451084b612aaf2bf05f30946906f09903389c0f --allow-explicit-upgrade --force
warning: The requested upgrade image is not one of the available updates.You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Requested update to release image registry.build05.ci.openshift.org/ci-ln-dkrq3mk/release@sha256:f650a0e99457836880f2b6a4f451084b612aaf2bf05f30946906f09903389c0f

╰─ oc adm upgrade
Cluster version is 4.13.0-0.ci.test-2023-02-06-071146-ci-ln-dl17ckk-latest

Upstream is unset, so the cluster will use an appropriate default.
Channel: stable-4.12
warning: Cannot display available updates:
  Reason: RemoteFailed
  Message: Unable to retrieve available updates: Get "https://api.openshift.com/api/upgrades_info/v1/graph?arch=amd64&channel=stable-4.12&id=080d4192-46bc-496e-876a-01c5ff6d4c4a&version=4.13.0-0.ci.test-2023-02-06-071146-ci-ln-dl17ckk-latest": dial tcp 44.209.252.60:443: connect: connection timed out
--------------- 2 minutes later:
Mon 06 Feb 2023 15:07:48 IST
info: An upgrade is in progress. Working towards 4.13.0-0.ci.test-2023-02-06-062603-ci-ln-dkrq3mk-latest: 9 of 833 done (1% complete)

Upstream is unset, so the cluster will use an appropriate default.
Channel: stable-4.12
warning: Cannot display available updates:
  Reason: RemoteFailed
  Message: Unable to retrieve available updates: Get "https://api.openshift.com/api/upgrades_info/v1/graph?arch=amd64&channel=stable-4.12&id=080d4192-46bc-496e-876a-01c5ff6d4c4a&version=4.13.0-0.ci.test-2023-02-06-071146-ci-ln-dl17ckk-latest": dial tcp 44.199.160.73:443: connect: connection timed out

upgrade is progressing as expected.


@pmuller the fix looks successful; the previous regression was not reproduced, as expected. However, two questions:
1)
Failing=True: Reason: UpdatePayloadFailed Message: Could not update serviceaccount "openshift-apiserver-operator/openshift-apiserver-operator" (445 of 833)
Is this expected behavior when trying to upgrade to an image whose repository is not in the ImageContentSourcePolicy? I expected to see only ReleaseAccepted=False with "timed out waiting for the condition".
2)
There is a 5:16-minute delay between the two `payload failed version` messages (keyrings to timeout), plus another 4:00 minutes before Failing. Is this an expected delay in 4.13?

Comment 23 Petr Muller 2023-02-07 18:16:01 UTC
Thanks for looking into this! We may be onto something interesting here. This is the way I'm reading the timeline of the second scenario. It shows some unexpected behavior, but likely unrelated to the PR; maybe we are seeing a new bug, or possibly a behavior that is "unlocked" by us proceeding further into the scenario after the fix.

-         Start: Cluster is at <ci-ln-dkrq3mk>
- ~12:54:32 IST: oc adm upgrade --to-image <ci-ln-dl17ckk> --allow-explicit-upgrade --force
- ~12:54:32 IST: oc adm upgrade shows ReleaseAccepted=False Reason: RetrievePayload Message: Retrieving payload failed image="<ci-ln-dl17ckk>" failure=The update cannot be verified...

--> This ReleaseAccepted=False is there as a remnant of the previous test scenario (I assume)

-   Start + ~5m 
- ~12:59:48 IST: Cluster is (still) at <ci-ln-dkrq3mk>
- ~12:59:48 IST: oc adm upgrade shows ReleaseAccepted=False Reason: RetrievePayload Message: Retrieving payload failed image="<ci-ln-dl17ckk>" failure=Unable to download and prepare the update...

--> So after this ~5m, we already went through the 1) try to verify and timeout 2) proceed because --force 3) try to download and timeout
--> IMO, the actual test scenario ends here. We attempted to upgrade, it failed to download, the cluster should go back to business as usual until the download is fixed or upgrade aborted etc. We expected nothing else to happen (right?).

--> But you continued, and things got potentially interesting:

-   Start + ~9m 
- ~13:03:48 IST: oc adm upgrade shows Failing=True Reason: UpdatePayloadFailed Message: Could not update serviceaccount "openshift-apiserver-operator/openshift-apiserver-operator" (445 of 833)

--> This is the interesting part: "Error while reconciling 4.13.0-0.ci.test-2023-02-06-062603-ci-ln-dkrq3mk-latest: the update could not be applied"
                                                                                              ^^^^^^^^^^^^^
--> This is the *current* cluster version, not the update target one. This is not the cluster somehow trying to actually update to the new version. I talked with Lala today and he suspects that the "fail to verify but force, fail to download" scenario can result in CVO running in a confused state / wrong mode where it attempts to forcibly "upgrade" into the current version, which would be a yet undiscovered regression from a past change in CVO.

---

So to answer your questions:

1) Is `Could not update serviceaccount "openshift-apiserver-operator/openshift-apiserver-operator"` expected behavior when trying to upgrade to an image whose repository is not in the ImageContentSourcePolicy?

No, but this happens *after* we try to upgrade to the undownloadable image, fail, and see the expected behavior (correct message in the ReleaseAccepted=False condition). It fails during reconciliation to the *current* payload, which is definitely suspicious.

2) Is the 5:16-minute delay between the two `payload failed version` messages (keyrings to timeout), plus another 4:00 minutes before Failing, an expected delay in 4.13?

The ~5m delay is fine, we saw what we wanted after these 5m. The failure afterwards should not happen, so it is not part of the "expected delay".

And to propose next steps:

1) It would be useful to see the CVO log from the second scenario, to see (a) why the CVO was reconciling the current-version manifest for that OAO ServiceAccount and (b) what the actual failure was that the CVO saw.
2) It would be useful to test whether this behavior is deterministic (can you try to reproduce it?).
3) It would be useful to test this scenario with a past OCP version before a *certain* change: I don't know which one, but Lala knows, so I'm NEEDINFOing him.

Comment 24 Evgeni Vakhonin 2023-02-08 12:10:40 UTC
OK, I have reproduced it again in a build from the same PR.

installed ipi-on-aws disconnected private cluster

Server Version: 4.13.0-0.ci.test-2023-02-08-064507-ci-ln-vtrvh42-latest

patched the CCO (needed because this cluster uses manual-mode cloud credentials)
╰─ oc patch cloudcredential.operator.openshift.io/cluster --patch '{"metadata":{"annotations": {"cloudcredential.openshift.io/upgradeable-to": "99.99"}}}' --type=merge                      
cloudcredential.operator.openshift.io/cluster patched

upgrade without force
╰─ oc adm upgrade --to-image registry.build05.ci.openshift.org/ci-ln-dkrq3mk/release@sha256:0ce4ae5cb1eade573f95edae6c2cec05580f94609d02cd3bf85b450be0d362ca --allow-explicit-upgrade

Wed 08 Feb 2023 13:12:57 IST
ReleaseAccepted=False

  Reason: RetrievePayload
  Message: Retrieving payload failed version="" image="registry.build05.ci.openshift.org/ci-ln-dkrq3mk/release@sha256:0ce4ae5cb1eade573f95edae6c2cec05580f94609d02cd3bf85b450be0d362ca" failure=The update cannot be verified: unable to verify sha256:0ce4ae5cb1eade573f95edae6c2cec05580f94609d02cd3bf85b450be0d362ca against keyrings: verifier-public-key-redhat

with force
╰─ oc adm upgrade --to-image registry.build05.ci.openshift.org/ci-ln-dkrq3mk/release@sha256:0ce4ae5cb1eade573f95edae6c2cec05580f94609d02cd3bf85b450be0d362ca --allow-explicit-upgrade --force
Wed 08 Feb 2023 13:18:49 IST
Cluster version is 4.13.0-0.ci.test-2023-02-08-064507-ci-ln-vtrvh42-latest

ReleaseAccepted=False

  Reason: RetrievePayload
  Message: Retrieving payload failed version="" image="registry.build05.ci.openshift.org/ci-ln-dkrq3mk/release@sha256:0ce4ae5cb1eade573f95edae6c2cec05580f94609d02cd3bf85b450be0d362ca" failure=The update cannot be verified: unable to verify sha256:0ce4ae5cb1eade573f95edae6c2cec05580f94609d02cd3bf85b450be0d362ca against keyrings: verifier-public-key-redhat

Wed 08 Feb 2023 13:22:58 IST
Cluster version is 4.13.0-0.ci.test-2023-02-08-064507-ci-ln-vtrvh42-latest

ReleaseAccepted=False

  Reason: RetrievePayload
  Message: Retrieving payload failed version="" image="registry.build05.ci.openshift.org/ci-ln-dkrq3mk/release@sha256:0ce4ae5cb1eade573f95edae6c2cec05580f94609d02cd3bf85b450be0d362ca" failure=Unable to download and prepare the update: timed out waiting for the condition

Wed 08 Feb 2023 13:26:57 IST
Failing=True:

  Reason: UpdatePayloadFailed
  Message: Could not update role "openshift-marketplace/marketplace-operator" (648 of 834)

Error while reconciling 4.13.0-0.ci.test-2023-02-08-064507-ci-ln-vtrvh42-latest: the update could not be applied

ReleaseAccepted=False

  Reason: RetrievePayload
  Message: Retrieving payload failed version="" image="registry.build05.ci.openshift.org/ci-ln-dkrq3mk/release@sha256:0ce4ae5cb1eade573f95edae6c2cec05580f94609d02cd3bf85b450be0d362ca" failure=Unable to download and prepare the update: timed out waiting for the condition

Comment 27 Petr Muller 2023-02-08 17:14:10 UTC
OK, I think I know what is going on. It seems that the 2m/4m timeouts in `RetrievePayload` are starving out the actual manifest reconciliation operations *somehow*, which makes the whole loop iteration take too much time and eventually get canceled:

115514:I0208 11:10:57.056651       1 sync_worker.go:261] syncPayload: registry.build05.ci.openshift.org/ci-ln-dkrq3mk/release@sha256:0ce4ae5cb1eade573f95edae6c2cec05580f94609d02cd3bf85b450be0d362ca (force=false)
115527:I0208 11:12:57.057460       1 sync_worker.go:437] loadUpdatedPayload syncPayload err=The update cannot be verified: unable to verify sha256:0ce4ae5cb1eade573f95edae6c2cec05580f94609d02cd3bf85b450be0d362ca against keyrings: verifier-public-key-redhat
115539:I0208 11:12:57.070591       1 sync_worker.go:261] syncPayload: registry.build05.ci.openshift.org/ci-ln-dkrq3mk/release@sha256:0ce4ae5cb1eade573f95edae6c2cec05580f94609d02cd3bf85b450be0d362ca (force=false)
115558:I0208 11:14:57.071738       1 sync_worker.go:437] loadUpdatedPayload syncPayload err=The update cannot be verified: unable to verify sha256:0ce4ae5cb1eade573f95edae6c2cec05580f94609d02cd3bf85b450be0d362ca against keyrings: verifier-public-key-redhat
^^^ These are the two 2m-timeout syncPayload runs (this is before --force was issued around 11:18:49Z)

115563:I0208 11:14:57.072226       1 sync_worker.go:888] apply: 4.13.0-0.ci.test-2023-02-08-064507-ci-ln-vtrvh42-latest on generation 3 in state Reconciling at attempt 0
115564:I0208 11:14:57.074037       1 task_graph.go:477] Running 0 on worker 1
115565:I0208 11:14:57.074057       1 task_graph.go:477] Running 2 on worker 1
115566:I0208 11:14:57.074075       1 sync_worker.go:993] Running sync for namespace "openshift-cluster-storage-operator" (253 of 834)
115567:I0208 11:14:57.074130       1 task_graph.go:477] Running 1 on worker 0
115568:I0208 11:14:57.078122       1 sync_worker.go:1013] Done syncing for namespace "openshift-cluster-storage-operator" (253 of 834)
115569:I0208 11:14:57.078149       1 task_graph.go:477] Running 3 on worker 1
115571:I0208 11:14:57.082282       1 sync_worker.go:993] Running sync for namespace "openshift-apiserver-operator" (441 of 834)
115573:I0208 11:14:57.085137       1 sync_worker.go:993] Running sync for namespace "openshift-marketplace" (645 of 834)
115574:I0208 11:14:57.089063       1 sync_worker.go:1013] Done syncing for namespace "openshift-apiserver-operator" (441 of 834)
115575:I0208 11:14:57.089169       1 sync_worker.go:993] Running sync for openshiftapiserver "cluster" (442 of 834)
115576:I0208 11:14:57.089221       1 sync_worker.go:1013] Done syncing for namespace "openshift-marketplace" (645 of 834)
115577:I0208 11:14:57.089299       1 sync_worker.go:993] Running sync for serviceaccount "openshift-marketplace/marketplace-operator" (646 of 834)
^^^ CVO starts reconciling manifests, gets some stuff done

115586:I0208 11:14:57.092496       1 sync_worker.go:261] syncPayload: registry.build05.ci.openshift.org/ci-ln-dkrq3mk/release@sha256:0ce4ae5cb1eade573f95edae6c2cec05580f94609d02cd3bf85b450be0d362ca (force=false)
115591:I0208 11:14:57.093052       1 sync_worker.go:1013] Done syncing for serviceaccount "openshift-marketplace/marketplace-operator" (646 of 834)
115606:I0208 11:16:57.093557       1 sync_worker.go:437] loadUpdatedPayload syncPayload err=The update cannot be verified: unable to verify sha256:0ce4ae5cb1eade573f95edae6c2cec05580f94609d02cd3bf85b450be0d362ca against keyrings: verifier-public-key-redhat
^^^ Another 2m timeout

115609:I0208 11:16:57.093889       1 sync_worker.go:993] Running sync for clusterrole "marketplace-operator" (647 of 834)
^^^ We haven't started any manifest sync while we waited for 2m inside `syncPayload`

115611:I0208 11:16:57.093962       1 sync_worker.go:1013] Done syncing for openshiftapiserver "cluster" (442 of 834)
115612:I0208 11:16:57.094051       1 sync_worker.go:993] Running sync for configmap "openshift-apiserver-operator/openshift-apiserver-operator-config" (443 of 834)
115624:I0208 11:16:57.105504       1 sync_worker.go:261] syncPayload: registry.build05.ci.openshift.org/ci-ln-dkrq3mk/release@sha256:0ce4ae5cb1eade573f95edae6c2cec05580f94609d02cd3bf85b450be0d362ca (force=false)
115628:I0208 11:16:57.109993       1 sync_worker.go:1013] Done syncing for configmap "openshift-apiserver-operator/openshift-apiserver-operator-config" (443 of 834)
115641:I0208 11:18:57.105749       1 sync_worker.go:437] loadUpdatedPayload syncPayload err=The update cannot be verified: unable to verify sha256:0ce4ae5cb1eade573f95edae6c2cec05580f94609d02cd3bf85b450be0d362ca against keyrings: verifier-public-key-redhat
^^^ Another 2m timeout

115645:I0208 11:18:57.105911       1 sync_worker.go:993] Running sync for configmap "openshift-apiserver-operator/trusted-ca-bundle" (444 of 834)
115649:I0208 11:18:57.106115       1 sync_worker.go:1013] Done syncing for clusterrole "marketplace-operator" (647 of 834)
^^^ Again we did not start syncing while waiting in syncPayload. Also, for some reason the marketplace-operator ClusterRole took 2m to sync as well (maybe we log before we start to wait?)

115651:I0208 11:18:57.106448       1 sync_worker.go:261] syncPayload: registry.build05.ci.openshift.org/ci-ln-dkrq3mk/release@sha256:0ce4ae5cb1eade573f95edae6c2cec05580f94609d02cd3bf85b450be0d362ca (force=true)
115720:I0208 11:22:57.107405       1 sync_worker.go:437] loadUpdatedPayload syncPayload err=Unable to download and prepare the update: timed out waiting for the condition
^^^ This ran after --force so it took 4m to finish instead of 2m

115724:I0208 11:22:57.107819       1 sync_worker.go:993] Running sync for role "openshift-marketplace/marketplace-operator" (648 of 834)
^^^ So this waited for 4m until it was allowed to start.

115725:I0208 11:22:57.107887       1 task_graph.go:474] Canceled worker 1 while waiting for work
115726:I0208 11:22:57.107900       1 sync_worker.go:1013] Done syncing for configmap "openshift-apiserver-operator/trusted-ca-bundle" (444 of 834)
115727:I0208 11:22:57.107911       1 task_graph.go:474] Canceled worker 0 while waiting for work
^^^ Something got fed up waiting

115728:I0208 11:22:57.107918       1 task_graph.go:523] Workers finished
115729:I0208 11:22:57.107925       1 task_graph.go:546] Result of work: [update context deadline exceeded at 8 of 834 Could not update role "openshift-marketplace/marketplace-operator" (648 of 834)]
^^^ This was canceled mid-flight?

115744:I0208 11:22:57.117233       1 sync_worker.go:261] syncPayload: registry.build05.ci.openshift.org/ci-ln-dkrq3mk/release@sha256:0ce4ae5cb1eade573f95edae6c2cec05580f94609d02cd3bf85b450be0d362ca (force=true)
115814:I0208 11:26:57.117407       1 sync_worker.go:437] loadUpdatedPayload syncPayload err=Unable to download and prepare the update: timed out waiting for the condition
115829:I0208 11:26:57.128766       1 sync_worker.go:261] syncPayload: registry.build05.ci.openshift.org/ci-ln-dkrq3mk/release@sha256:0ce4ae5cb1eade573f95edae6c2cec05580f94609d02cd3bf85b450be0d362ca (force=true)

I have put an annotated log with further notes here: https://gist.github.com/petr-muller/6ca65f2a3be175937e0a558313b23957

I think this is not a new bug; the timeouts were just never set well enough to keep the CVO out of an even buggier state when it cannot verify and download the image.
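
To make the suspected interaction concrete, here is a minimal Go sketch of the pattern described above: a blocking, timeout-bounded retrieval that shares one deadline with manifest reconciliation, so the later manifest work gets cancelled mid-flight. The syncPayload/syncManifest functions, the manifest names, and the durations are illustrative only and do not match the real sync_worker code.

package main

import (
	"context"
	"fmt"
	"time"
)

// syncPayload stands in for the 2m/4m-bounded verification/download attempt;
// shortened to seconds so the example runs quickly.
func syncPayload(ctx context.Context) error {
	select {
	case <-time.After(3 * time.Second):
		return fmt.Errorf("the update cannot be verified")
	case <-ctx.Done():
		return ctx.Err()
	}
}

// syncManifest stands in for applying one manifest from the payload task graph.
func syncManifest(ctx context.Context, name string) error {
	select {
	case <-time.After(800 * time.Millisecond):
		fmt.Println("Done syncing for", name)
		return nil
	case <-ctx.Done():
		return fmt.Errorf("could not update %s: %w", name, ctx.Err())
	}
}

func main() {
	// One loop iteration runs under a single deadline (seconds here instead of minutes).
	ctx, cancel := context.WithTimeout(context.Background(), 4*time.Second)
	defer cancel()

	// The blocking retrieval consumes most of the iteration's budget...
	if err := syncPayload(ctx); err != nil {
		fmt.Println("loadUpdatedPayload syncPayload err =", err)
	}

	// ...so later manifest work sharing the same context hits the deadline and is
	// cancelled mid-flight, analogous to the "Canceled worker while waiting for
	// work" entries in the log above.
	for _, m := range []string{
		"namespace \"openshift-marketplace\"",
		"role \"openshift-marketplace/marketplace-operator\"",
	} {
		if err := syncManifest(ctx, m); err != nil {
			fmt.Println(err)
		}
	}
}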

Comment 28 Petr Muller 2023-02-08 17:22:30 UTC
The above is Evgeni's CVO log grepped with the command below (lines before 11:10:00, which are irrelevant, were stripped):

$ grep -e 'apply: ' -e 'Running sync for' -e 'Done syncing for' -e syncPayload -e 'task_graph.go'  cluster-version-operator-688849d8d9-8hwnp.log

Comment 29 Petr Muller 2023-02-17 16:43:36 UTC
I have investigated the concerns from the above and filed https://issues.redhat.com/browse/OCPBUGS-7714 for the problem. That behavior is just "uncovered" by this bugfix, not caused by it. The fix actually improves the state for the clusters that hit it - I believe that previously the manifest reconciliation was completely broken while the unbounded retrieval was timing out for hours; we just did not notice.

Comment 32 Evgeni Vakhonin 2023-02-21 17:30:40 UTC
verifying again since there is a commit merged after the last pre-merge verification.
4.13.0-0.nightly-2023-02-20-182956 to 4.13.0-0.nightly-2023-02-21-014524, disconnected private cluster on AWS

❯ date && oc adm upgrade
Tue 21 Feb 2023 19:02:41 IST
Cluster version is 4.13.0-0.nightly-2023-02-20-182956

Upstream is unset, so the cluster will use an appropriate default.
Channel: stable-4.13
warning: Cannot display available updates:
  Reason: RemoteFailed
  Message: Unable to retrieve available updates: Get "https://api.openshift.com/api/upgrades_info/v1/graph?arch=amd64&channel=stable-4.13&id=d1588271-2bff-4b23-835c-b23446f3845e&version=4.13.0-0.nightly-2023-02-20-182956": dial tcp 34.227.168.110:443: connect: connection timed out

mirrored and sent upgrade command without force:
❯ date && oc adm upgrade --allow-explicit-upgrade --to-image registry.ci.openshift.org/ocp/release@sha256:ddb6974a1afe869e79bbaf25fe125e23adc38b552e44a105983fb41fdfc09031
Tue 21 Feb 2023 19:05:34 IST
warning: The requested upgrade image is not one of the available updates.You have used --allow-explicit-upgrade for the update to proceed anyway
Requested update to release image registry.ci.openshift.org/ocp/release@sha256:ddb6974a1afe869e79bbaf25fe125e23adc38b552e44a105983fb41fdfc09031

-----------2 minutes later--------------
Tue 21 Feb 2023 19:07:36 IST
Cluster version is 4.13.0-0.nightly-2023-02-20-182956

ReleaseAccepted=False

  Reason: RetrievePayload
  Message: Retrieving payload failed version="" image="registry.ci.openshift.org/ocp/release@sha256:ddb6974a1afe869e79bbaf25fe125e23adc38b552e44a105983fb41fdfc09031" failure=The update cannot be verified: unable to verify sha256:ddb6974a1afe869e79bbaf25fe125e23adc38b552e44a105983fb41fdfc09031 against keyrings: verifier-public-key-redhat

Upstream is unset, so the cluster will use an appropriate default.
Channel: stable-4.13
warning: Cannot display available updates:
  Reason: RemoteFailed
  Message: Unable to retrieve available updates: Get "https://api.openshift.com/api/upgrades_info/v1/graph?arch=amd64&channel=stable-4.13&id=d1588271-2bff-4b23-835c-b23446f3845e&version=4.13.0-0.nightly-2023-02-20-182956": dial tcp 34.227.168.110:443: connect: connection timed out

as expected.

in events:
90s         Normal    RetrievePayload         clusterversion/version   Retrieving and verifying payload version="" image="registry.ci.openshift.org/ocp/release@sha256:ddb6974a1afe869e79bbaf25fe125e23adc38b552e44a105983fb41fdfc09031"
90s         Warning   RetrievePayloadFailed   clusterversion/version   Retrieving payload failed version="" image="registry.ci.openshift.org/ocp/release@sha256:ddb6974a1afe869e79bbaf25fe125e23adc38b552e44a105983fb41fdfc09031" failure=The update cannot be verified: unable to verify sha256:ddb6974a1afe869e79bbaf25fe125e23adc38b552e44a105983fb41fdfc09031 against keyrings: verifier-public-key-redhat

cvo log:
I0221 17:05:34.855322       1 cvo.go:598] Desired version from spec is v1.Update{Architecture:"", Version:"", Image:"registry.ci.openshift.org/ocp/release@sha256:ddb6974a1afe869e79bbaf25fe125e23adc38b552e44a105983fb41fdfc09031", Force:false}
I0221 17:05:34.855373       1 sync_worker.go:811] Detected while considering cluster version generation 3: version changed (from { 4.13.0-0.nightly-2023-02-20-182956 registry.ci.openshift.org/ocp/release@sha256:9593a6e1b9455e9e388b0ae3d17a90abba4c0eba2c4e2db4184dadda412624de false} to {  registry.ci.openshift.org/ocp/release@sha256:ddb6974a1afe869e79bbaf25fe125e23adc38b552e44a105983fb41fdfc09031 false})
I0221 17:05:34.855382       1 sync_worker.go:261] syncPayload: registry.ci.openshift.org/ocp/release@sha256:ddb6974a1afe869e79bbaf25fe125e23adc38b552e44a105983fb41fdfc09031 (force=false)
I0221 17:05:34.855474       1 event.go:285] Event(v1.ObjectReference{Kind:"ClusterVersion", Namespace:"openshift-cluster-version", Name:"version", UID:"", APIVersion:"config.openshift.io/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'RetrievePayload' Retrieving and verifying payload version="" image="registry.ci.openshift.org/ocp/release@sha256:ddb6974a1afe869e79bbaf25fe125e23adc38b552e44a105983fb41fdfc09031"
I0221 17:07:34.855614       1 sync_worker.go:437] loadUpdatedPayload syncPayload err=The update cannot be verified: unable to verify sha256:ddb6974a1afe869e79bbaf25fe125e23adc38b552e44a105983fb41fdfc09031 against keyrings: verifier-public-key-redhat

as expected.

regression testing: retrying with --force
❯ date && oc adm upgrade --allow-explicit-upgrade --to-image registry.ci.openshift.org/ocp/release@sha256:ddb6974a1afe869e79bbaf25fe125e23adc38b552e44a105983fb41fdfc09031 --force 
Tue 21 Feb 2023 19:08:50 IST
warning: The requested upgrade image is not one of the available updates.You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Requested update to release image registry.ci.openshift.org/ocp/release@sha256:ddb6974a1afe869e79bbaf25fe125e23adc38b552e44a105983fb41fdfc09031

-----------3 minutes later------------------
Tue 21 Feb 2023 19:11:48 IST
info: An upgrade is in progress. Working towards 4.13.0-0.nightly-2023-02-21-014524: 20 of 838 done (2% complete)

Upstream is unset, so the cluster will use an appropriate default.
Channel: stable-4.13
warning: Cannot display available updates:
  Reason: RemoteFailed
  Message: Unable to retrieve available updates: Get "https://api.openshift.com/api/upgrades_info/v1/graph?arch=amd64&channel=stable-4.13&id=d1588271-2bff-4b23-835c-b23446f3845e&version=4.13.0-0.nightly-2023-02-20-182956": context canceled


upgrade is progressing as expected.

Comment 35 errata-xmlrpc 2023-05-17 22:46:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.13.0 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:1326

