Bug 2070805 - ClusterVersion: could not download the update
Summary: ClusterVersion: could not download the update
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.11.0
Assignee: W. Trevor King
QA Contact: Evgeni Vakhonin
URL:
Whiteboard:
Depends On:
Blocks: 2080058
 
Reported: 2022-04-01 03:05 UTC by jroche
Modified: 2022-08-10 11:03 UTC
CC List: 9 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 2080058 (view as bug list)
Environment:
Last Closed: 2022-08-10 11:03:06 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-version-operator pull 760 0 None Merged Bug 2070805: pkg/cvo/updatepayload: Prune previous payload downloads 2022-04-04 22:31:57 UTC
Github openshift cluster-version-operator pull 765 0 None Merged Bug 2070805: pkg/cvo/updatepayload: Shift previous-download removal into the job 2022-04-27 06:01:28 UTC
Github openshift cluster-version-operator pull 767 0 None Merged Bug 2070805: pkg/cvo/updatepayload: Restore shell for rm globbing 2022-06-28 17:55:40 UTC
Red Hat Knowledge Base (Solution) 6965075 0 None None None 2022-07-26 10:10:25 UTC
Red Hat Product Errata RHSA-2022:5069 0 None None None 2022-08-10 11:03:20 UTC

Description jroche 2022-04-01 03:05:20 UTC
Description of problem:

ClusterID: cc782851-976b-494c-90ea-d5125936e134
ClusterVersion: Updating to "4.10.5" from "4.10.4" for 2 hours: Unable to apply 4.10.5: could not download the update
ClusterOperators:
	All healthy and stable

A cluster trying to upgrade to 4.10.5 from 4.10.4 is stuck with the above error reported on the ClusterVersion.

A pod in the openshift-cluster-version namespace keeps being created and erroring.
We managed to grab a log which had:


oc logs -n openshift-cluster-version version-4.10.5-9jv69-4kxs7
mv: inter-device move failed: '/manifests' to '/etc/cvo/updatepayloads/HbO7IDc7tyIg9utw3sd_tg/manifests/manifests'; unable to remove target: Directory not empty


I will attach a must-gather and an adm inspect of the openshift-cluster-version namespace (although the adm inspect seemed to error while grabbing the version-4.10.5 pod details) in a private comment.


This is a GCP cluster.
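Since the version pod is recreated and restarts quickly, watching the namespace makes it easier to catch the failing pod and its log (a sketch using standard oc flags; the pod name will differ per cluster):

  $ oc -n openshift-cluster-version get pods -w
  $ oc -n openshift-cluster-version logs <version-pod-name>
  $ oc -n openshift-cluster-version logs <version-pod-name> --previous   # if the container has already restarted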

Comment 4 W. Trevor King 2022-04-07 13:55:54 UTC
Guess at a test plan:

1. Install a nightly with the patch.
2. exec into the CVO container.  Also figure out which node it's running on for later.
3. Create some noise that looks like leaks from previous releases, including comment 0's 4.10.5 md5:
     $ for X in $(seq 9) HbO7IDc7tyIg9utw3sd_tg; do for DIR in manifests cvo-manifests; do FDIR="/etc/cvo/updatepayloads/${X}/${DIR}" && mkdir -p "${FDIR}" && touch "${FDIR}/test.yaml"; done; done
4. Trigger an update to 4.10.5:
     $ oc adm upgrade --allow-explicit-upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:ee6a9c7a11f883e90489229f6c6dc78b434af12f5646f4f9411d73a98969f02a
5. The cluster-version operator will launch a version Job, with a version-... Pod.

Before the fix, the version-... Pod's logs would include "unable to remove target: Directory not empty", as seen in comment 0, and ClusterVersion conditions would include "could not download the update".

With the fix, the version-... pod should succeed, and the CVO will begin the update (updating from a 4.11 nightly to 4.10.5 will probably blow up, but all we care about here is "did the CVO begin moving towards the target release?" not "do we successfully complete the update to the target release?").

Once the CVO begins updating:

  $ oc debug "node/${NODE_FROM_STEP_2}" -- ls /host/etc/cvo/updatepayloads

Before the fix, all the 1, 2, 3, ... subdirectories we'd created in step 3 would still be there.  With the fix, the CVO will have removed those, and only HbO7IDc7tyIg9utw3sd_tg should remain.
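As a convenience during steps 4 and 5, the ClusterVersion conditions can be dumped in one line (a sketch using jq; any equivalent works):

  $ oc get clusterversion version -o json | jq -r '.status.conditions[] | "\(.lastTransitionTime) \(.type)=\(.status) \(.reason // ""): \(.message // "")"'

Before the fix, one of the conditions should carry the "could not download the update" message from comment 0; with the fix, the conditions should show the CVO progressing towards the target.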

Comment 5 Evgeni Vakhonin 2022-04-07 19:31:15 UTC
Looking for a way to reproduce...
So far, creating the files in /etc/cvo/updatepayloads/ did not reproduce the version pod error from the original report. Upgraded 4.10.4 to 4.10.5 without failure: it populated HbO7IDc7tyIg9utw3sd_tg with manifests, the 1, 2, 3, ... directories remained, and there was no "mv: inter-device move failed" in the CVO log. :(

Comment 6 liujia 2022-04-08 02:15:23 UTC
Just had a try with the following steps, and reproduced.

1. Triggered upgrade from v4.10.4 to v4.10.5.

2. After the upgrade started, aborted the upgrade to roll back to 4.10.4 with --force and --allow-upgrade-with-warnings.
# ./oc adm upgrade --allow-explicit-upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:9f9c3aaca64f62af992bae5de1e984571c8b812f598b74c84dc630b064389fb7 --force --allow-upgrade-with-warnings
warning: The requested upgrade image is not one of the available updates.You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
warning: --allow-upgrade-with-warnings is bypassing: the cluster is already upgrading:

  Reason: 
  Message: Working towards 4.10.5: 83 of 758 done (10% complete)

Updating to release image quay.io/openshift-release-dev/ocp-release@sha256:9f9c3aaca64f62af992bae5de1e984571c8b812f598b74c84dc630b064389fb7

3. Checked that the cluster rolled back to v4.10.4, then logged in to the node to confirm that 4.10.5's manifests were already downloaded.
# ls -la /etc/cvo/updatepayloads/HbO7IDc7tyIg9utw3sd_tg/
total 56
drwxr-xr-x. 5 root root    69 Apr  8 01:02 .
drwxr-xr-x. 3 root root    36 Apr  8 01:13 ..
drwxr-xr-x. 2 root root    23 Apr  8 00:56 cvo-manifests
drwxr-xr-x. 3 root root    40 Apr  8 01:02 manifests
drwxrwxrwx. 2 root root 40960 Mar 14 08:04 release-manifests

4. Removed cvo-manifests and release-manifests to break this directory.
# ls -la /etc/cvo/updatepayloads/HbO7IDc7tyIg9utw3sd_tg/
total 0
drwxr-xr-x. 3 root root 23 Apr  8 01:59 .
drwxr-xr-x. 3 root root 36 Apr  8 01:13 ..
drwxr-xr-x. 3 root root 40 Apr  8 01:02 manifests

5. Upgraded from 4.10.4 to 4.10.5 again, and reproduced the issue.
# ./oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.4    True        True          10m     Unable to apply 4.10.5: could not download the update
# ./oc logs version-4.10.5-hx5fd-lj5fh
mv: inter-device move failed: '/manifests' to '/etc/cvo/updatepayloads/HbO7IDc7tyIg9utw3sd_tg/manifests/manifests'; unable to remove target: Directory not empty
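For reference, step 4 above amounts to something like the following, run as root on the node (from an oc debug pod, prefix the paths with /host instead):

  # rm -rf /etc/cvo/updatepayloads/HbO7IDc7tyIg9utw3sd_tg/cvo-manifests /etc/cvo/updatepayloads/HbO7IDc7tyIg9utw3sd_tg/release-manifests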

Comment 7 Evgeni Vakhonin 2022-04-10 10:01:45 UTC
Reproduced with the correct file placement; it has to be manifests/manifests, as in comment 6 by Jia.


oc get pods -n openshift-cluster-version -owide
NAME                                        READY   STATUS    RESTARTS   AGE   IP         NODE                                                     NOMINATED NODE   READINESS GATES
cluster-version-operator-85dd988454-sjt62   1/1     Running   0          53m   10.0.0.5   evakhoni-100930-8n4jp-master-0.c.openshift-qe.internal   <none>           <none>


oc debug node/evakhoni-100930-8n4jp-master-0.c.openshift-qe.internal -- /bin/bash -c 'for X in $(seq 9) HbO7IDc7tyIg9utw3sd_tg; do FDIR="/host/etc/cvo/updatepayloads/${X}/manifests/manifests" && mkdir -p "${FDIR}" && touch "${FDIR}/test.yaml"; done'
Starting pod/evakhoni-100930-8n4jp-master-0copenshift-qeinternal-debug ...
To use host binaries, run `chroot /host`

Removing debug pod ...


oc adm upgrade --allow-explicit-upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:ee6a9c7a11f883e90489229f6c6dc78b434af12f5646f4f9411d73a98969f02a
warning: The requested upgrade image is not one of the available updates.You have used --allow-explicit-upgrade for the update to proceed anyway
Updating to release image quay.io/openshift-release-dev/ocp-release@sha256:ee6a9c7a11f883e90489229f6c6dc78b434af12f5646f4f9411d73a98969f02a


oc adm upgrade                                                                                                  
info: An upgrade is in progress. Unable to apply quay.io/openshift-release-dev/ocp-release@sha256:ee6a9c7a11f883e90489229f6c6dc78b434af12f5646f4f9411d73a98969f02a: could not download the update


oc logs -n openshift-cluster-version version--hdsf9-k29gm 
mv: inter-device move failed: '/manifests' to '/etc/cvo/updatepayloads/HbO7IDc7tyIg9utw3sd_tg/manifests/manifests'; unable to remove target: Directory not empty

Comment 8 Evgeni Vakhonin 2022-04-10 10:42:12 UTC
However, verifying from 4.11 to 4.10.5 as suggested by wking in comment 4 (https://bugzilla.redhat.com/show_bug.cgi?id=2070805#c4) is impossible, because the version pod being pulled is of the target version (4.10.5 in this case), so the bug reproduces even from a fixed build:

Server Version: 4.11.0-0.nightly-2022-04-06-213816

oc get pods -n openshift-cluster-version -owide
NAME                                        READY   STATUS    RESTARTS   AGE   IP         NODE                                                     NOMINATED NODE   READINESS GATES
cluster-version-operator-6dfd5f57d8-scn65   1/1     Running   0          72m   10.0.0.4   evakhoni-100932-w9snb-master-0.c.openshift-qe.internal   <none>           <none>

oc debug node/evakhoni-100932-w9snb-master-0.c.openshift-qe.internal -- /bin/bash -c 'for X in $(seq 9) HbO7IDc7tyIg9utw3sd_tg; do FDIR="/host/etc/cvo/updatepayloads/${X}/manifests/manifests" && mkdir -p "${FDIR}" && touch "${FDIR}/test.yaml"; done'
Starting pod/evakhoni-100932-w9snb-master-0copenshift-qeinternal-debug ...
To use host binaries, run `chroot /host`

Removing debug pod ...


oc get pods
NAME                                        READY   STATUS    RESTARTS   AGE
cluster-version-operator-6dfd5f57d8-scn65   1/1     Running   0          3h6m
version--mhsd6-qfdqn                        0/1     Error     3          48s


oc logs version--mhsd6-qfdqn 
mv: inter-device move failed: '/manifests' to '/etc/cvo/updatepayloads/HbO7IDc7tyIg9utw3sd_tg/manifests/manifests'; unable to remove target: Directory not empty

Comment 9 Evgeni Vakhonin 2022-04-10 16:46:23 UTC
It also looks like we cannot guarantee that the 'version' pod will be scheduled on the same node as the CVO, as demonstrated here:
oc get pods -owide
NAME                                       READY   STATUS      RESTARTS   AGE   IP            NODE                                                     NOMINATED NODE   READINESS GATES
cluster-version-operator-db946689b-qhvhg   1/1     Running     0          51s   10.0.0.3      evakhoni-100930-8n4jp-master-2.c.openshift-qe.internal   <none>           <none>
version--ldtzk-s5pw2                       0/1     Completed   0          68s   10.129.0.90   evakhoni-100930-8n4jp-master-0.c.openshift-qe.internal   <none>           <none>



So, using a hybrid approach to verify:

1) Obtained the release hash by triggering an upgrade to another fixed version, higher than the one under test, on another cluster,
then observed the downloaded manifests:

for node in $(oc get nodes -l 'node-role.kubernetes.io/master' -ojsonpath='{.items[:].metadata.name}')
do oc debug node/$node -- /bin/bash -c 'ls /host/etc/cvo/updatepayloads/' 
done

Starting pod/evakhoni-101622-q6qqc-master-2copenshift-qeinternal-debug ...
To use host binaries, run `chroot /host`
brRTeZnZSIkZ2J2YMdQozg
(in this case, the hash is for version 4.11.0-0.nightly-2022-04-07-053433)



2) Tested the target cluster from 4.11.0-0.nightly-2022-04-06-213816 (after fix) to 4.11.0-0.nightly-2022-04-07-053433 (after fix).
Generated garbage with the target hash plus dummy directories as suggested in comment 4, with the manifests/manifests correction I used in https://bugzilla.redhat.com/show_bug.cgi?id=2070805#c7,
but this time on all master nodes:

for node in $(oc get nodes -l 'node-role.kubernetes.io/master' -ojsonpath='{.items[:].metadata.name}'); do oc debug node/$node -- /bin/bash -c 'for X in $(seq 9) brRTeZnZSIkZ2J2YMdQozg; do FDIR="/host/etc/cvo/updatepayloads/${X}/manifests/manifests" && mkdir -p "${FDIR}" && touch "${FDIR}/test.yaml"; done'; done
Starting pod/evakhoni-101813-zh5cf-master-0copenshift-qeinternal-debug ...
To use host binaries, run `chroot /host`

Removing debug pod ...
Starting pod/evakhoni-101813-zh5cf-master-1copenshift-qeinternal-debug ...
To use host binaries, run `chroot /host`

Removing debug pod ...
Starting pod/evakhoni-101813-zh5cf-master-2copenshift-qeinternal-debug ...
To use host binaries, run `chroot /host`

Removing debug pod ...



3) Triggered the upgrade and observed the 'version' pod log...
oc adm upgrade --force --allow-explicit-upgrade --to-image registry.ci.openshift.org/ocp/release@sha256:8f4ac72e32e701f6609fe00c4f021670cab8466eb38b98c141764ceb6c3d8ab5
warning: The requested upgrade image is not one of the available updates.You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Updating to release image registry.ci.openshift.org/ocp/release@sha256:8f4ac72e32e701f6609fe00c4f021670cab8466eb38b98c141764ceb6c3d8ab5

oc get pods -owide -n openshift-cluster-version 
NAME                                        READY   STATUS    RESTARTS      AGE   IP            NODE                                                     NOMINATED NODE   READINESS GATES
cluster-version-operator-6dfd5f57d8-j9rsn   1/1     Running   0             36m   10.0.0.4      evakhoni-101813-zh5cf-master-2.c.openshift-qe.internal   <none>           <none>
version--pspnp-72crs                        0/1     Error     2 (16s ago)   22s   10.128.0.49   evakhoni-101813-zh5cf-master-2.c.openshift-qe.internal   <none>           <none>

oc logs version--pspnp-72crs -n openshift-cluster-version 
mv: inter-device move failed: '/manifests' to '/etc/cvo/updatepayloads/brRTeZnZSIkZ2J2YMdQozg/manifests/manifests'; unable to remove target: Directory not empty


This time, however, there was no 'could not download the update' error:
oc get clusterversions.config.openshift.io         
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-04-06-213816   True        False         20m     Cluster version is 4.11.0-0.nightly-2022-04-06-213816

However, digging deeper into the status revealed:
oc get clusterversions.config.openshift.io version -ojson|jq '.status'                                     
{
  "availableUpdates": null,
  "capabilities": {
    "enabledCapabilities": [
      "baremetal",
      "marketplace",
      "openshift-samples"
    ],
    "knownCapabilities": [
      "baremetal",
      "marketplace",
      "openshift-samples"
    ]
  },
  "conditions": [
    {
      "lastTransitionTime": "2022-04-10T15:19:46Z",
      "message": "Unable to retrieve available updates: currently reconciling cluster version 4.11.0-0.nightly-2022-04-06-213816 not found in the \"stable-4.11\" channel",
      "reason": "VersionNotFound",
      "status": "False",
      "type": "RetrievedUpdates"
    },
    {
      "lastTransitionTime": "2022-04-10T15:19:46Z",
      "message": "Capabilities match configured spec",
      "reason": "AsExpected",
      "status": "False",
      "type": "ImplicitlyEnabledCapabilities"
    },
    {
      "lastTransitionTime": "2022-04-10T15:57:47Z",
      "message": "Retrieving payload failed version=\"\" image=\"registry.ci.openshift.org/ocp/release@sha256:8f4ac72e32e701f6609fe00c4f021670cab8466eb38b98c141764ceb6c3d8ab5\" failure=Unable to download and prepare the update: deadline exceeded, reason: \"DeadlineExceeded\", message: \"Job was active longer than specified deadline\"",
      "reason": "RetrievePayload",
      "status": "False",
      "type": "ReleaseAccepted"
    },
    {
      "lastTransitionTime": "2022-04-10T15:37:47Z",
      "message": "Done applying 4.11.0-0.nightly-2022-04-06-213816",
      "status": "True",
      "type": "Available"
    },
    {
      "lastTransitionTime": "2022-04-10T15:37:47Z",
      "status": "False",
      "type": "Failing"
    },
    {
      "lastTransitionTime": "2022-04-10T15:59:02Z",
      "message": "Cluster version is 4.11.0-0.nightly-2022-04-06-213816",
      "status": "False",
      "type": "Progressing"
    }
  ],
  "desired": {
    "image": "registry.ci.openshift.org/ocp/release@sha256:40bd2cfbbd80cc192acc3d9fe047790cf4592beb577d144961cdf465392a5133",
    "version": "4.11.0-0.nightly-2022-04-06-213816"
  },
  "history": [
    {
      "completionTime": "2022-04-10T15:37:47Z",
      "image": "registry.ci.openshift.org/ocp/release@sha256:40bd2cfbbd80cc192acc3d9fe047790cf4592beb577d144961cdf465392a5133",
      "startedTime": "2022-04-10T15:19:46Z",
      "state": "Completed",
      "verified": false,
      "version": "4.11.0-0.nightly-2022-04-06-213816"
    }
  ],
  "observedGeneration": 3,
  "versionHash": "PWepNsbeUMA="
}



@wking what do you think?

Comment 10 liujia 2022-04-11 08:25:02 UTC
With the reproducer from https://bugzilla.redhat.com/show_bug.cgi?id=2070805#c6, I checked that the issue should be fixed now.

Before the upgrade from 4.11.0-0.nightly-2022-04-06-213816 to 4.11.0-0.nightly-2022-04-07-053433, there was a broken dir on both node ip-10-0-164-22.us-east-2.compute.internal and node ip-10-0-196-183.us-east-2.compute.internal.

# for node in $(oc get nodes -l 'node-role.kubernetes.io/master' -ojsonpath='{.items[:].metadata.name}');do oc debug node/$node -- /bin/bash -c 'ls /host/etc/cvo/updatepayloads/brRTeZnZSIkZ2J2YMdQozg -la';done 
Starting pod/ip-10-0-155-70us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
ls: cannot access '/host/etc/cvo/updatepayloads/brRTeZnZSIkZ2J2YMdQozg': No such file or directory

Removing debug pod ...
Starting pod/ip-10-0-164-22us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
total 4
drwxr-xr-x. 3 root root   23 Apr 11 08:02 .
drwxr-xr-x. 4 root root   66 Apr 11 07:59 ..
drwxr-xr-x. 2 root root 4096 Apr  7 03:55 manifests

Removing debug pod ...
Starting pod/ip-10-0-196-183us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
total 4
drwxr-xr-x. 3 root root   23 Apr 11 07:52 .
drwxr-xr-x. 3 root root   36 Apr 11 07:49 ..
drwxr-xr-x. 2 root root 4096 Apr  7 03:55 manifests

Removing debug pod ...

After triggering the upgrade from 4.11.0-0.nightly-2022-04-06-213816 to 4.11.0-0.nightly-2022-04-07-053433:
1) Checked that the latest version pod was scheduled on node ip-10-0-164-22.us-east-2.compute.internal.
# ./oc get po -owide
NAME                                        READY   STATUS      RESTARTS   AGE     IP            NODE                                         NOMINATED NODE   READINESS GATES
cluster-version-operator-6dfd5f57d8-6rk4c   1/1     Running     0          3m24s   10.0.164.22   ip-10-0-164-22.us-east-2.compute.internal    <none>           <none>
version--7fjrv-mbp4t                        0/1     Completed   0          12m     10.129.0.62   ip-10-0-164-22.us-east-2.compute.internal    <none>           <none>
version--ksmdr-26xjw                        0/1     Completed   0          4m34s   10.129.0.68   ip-10-0-164-22.us-east-2.compute.internal    <none>           <none>
version--sh8xz-5ns5h                        0/1     Completed   0          4s      10.129.0.77   ip-10-0-164-22.us-east-2.compute.internal    <none>           <none>
version--zkrnl-6k2ks                        0/1     Completed   0          14m     10.128.0.74   ip-10-0-196-183.us-east-2.compute.internal   <none>           <none>

2) Checked that the payload was re-fetched successfully on the scheduled node.
# for node in $(oc get nodes -l 'node-role.kubernetes.io/master' -ojsonpath='{.items[:].metadata.name}');do oc debug node/$node -- /bin/bash -c 'ls /host/etc/cvo/updatepayloads/brRTeZnZSIkZ2J2YMdQozg -la';done 
Starting pod/ip-10-0-155-70us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
ls: cannot access '/host/etc/cvo/updatepayloads/brRTeZnZSIkZ2J2YMdQozg': No such file or directory

Removing debug pod ...
Starting pod/ip-10-0-164-22us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
total 64
drwxr-xr-x. 4 root root    48 Apr 11 08:03 .
drwxr-xr-x. 4 root root    66 Apr 11 07:59 ..
drwxr-xr-x. 3 root root  4096 Apr 11 08:03 manifests
drwxrwxrwx. 2 root root 45056 Apr  7 03:10 release-manifests

Removing debug pod ...
Starting pod/ip-10-0-196-183us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
total 4
drwxr-xr-x. 3 root root   23 Apr 11 07:52 .
drwxr-xr-x. 3 root root   36 Apr 11 07:49 ..
drwxr-xr-x. 2 root root 4096 Apr  7 03:55 manifests

Removing debug pod ...

3) Checked that the upgrade is in progress:
# ./oc get clusterversion -ojson|jq .items[].status.conditions[]
...

{
  "lastTransitionTime": "2022-04-11T06:34:01Z",
  "message": "Payload loaded version=\"4.11.0-0.nightly-2022-04-07-053433\" image=\"registry.ci.openshift.org/ocp/release@sha256:8f4ac72e32e701f6609fe00c4f021670cab8466eb38b98c141764ceb6c3d8ab5\"",
  "reason": "PayloadLoaded",
  "status": "True",
  "type": "ReleaseAccepted"
}
...
{
  "lastTransitionTime": "2022-04-11T08:03:37Z",
  "message": "Working towards 4.11.0-0.nightly-2022-04-07-053433: 615 of 786 done (78% complete)",
  "status": "True",
  "type": "Progressing"
}
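A more targeted check of just the ReleaseAccepted condition, instead of dumping everything (a sketch; the jq approach used in comment 9 works just as well):

  $ oc get clusterversion version -o jsonpath='{.status.conditions[?(@.type=="ReleaseAccepted")].message}{"\n"}'

With the fix it should print the "Payload loaded version=..." message, as in the output above.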

Comment 11 Evgeni Vakhonin 2022-04-11 21:05:35 UTC
Well, reproduced yet another time from 4.11.0-0.nightly-2022-04-06-213816 to 4.11.0-0.nightly-2022-04-07-053433.

Started an upgrade and immediately reverted, a few times, while deleting 'release-manifests'.

upgrade:
oc adm upgrade  --allow-explicit-upgrade --force --allow-upgrade-with-warnings --to-image registry.ci.openshift.org/ocp/release@sha256:8f4ac72e32e701f6609fe00c4f021670cab8466eb38b98c141764ceb6c3d8ab5 #07

After a few seconds:
#oc adm upgrade  --allow-explicit-upgrade --force --allow-upgrade-with-warnings --to-image registry.ci.openshift.org/ocp/release@sha256:40bd2cfbbd80cc192acc3d9fe047790cf4592beb577d144961cdf465392a5133 #06

Then, after the revert fully completed, deleted 'release-manifests':
#oc get clusterversions.config.openshift.io 
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-04-06-213816   True        False         5m49s   Cluster version is 4.11.0-0.nightly-2022-04-06-213816

#for node in $(oc get nodes -l 'node-role.kubernetes.io/master' -ojsonpath='{.items[:].metadata.name}'); do echo $node; oc debug node/$node -- /bin/bash -c 'rm -rf /host/etc/cvo/updatepayloads/*/release-manifests'; done 2>/dev/null

evakhoni-112100-kwkp4-master-0.c.openshift-qe.internal
evakhoni-112100-kwkp4-master-1.c.openshift-qe.internal
evakhoni-112100-kwkp4-master-2.c.openshift-qe.internal

Then repeated a few more times, until the nodes were filled with manifests/manifests/:
#for node in $(oc get nodes -l 'node-role.kubernetes.io/master' -ojsonpath='{.items[:].metadata.name}'); do echo $node; oc debug node/$node -- /bin/bash -c 'ls -d /host/etc/cvo/updatepayloads/*/manifests/manifests'; done 2>/dev/null

evakhoni-112100-kwkp4-master-0.c.openshift-qe.internal
ls: cannot access '/host/etc/cvo/updatepayloads/*/manifests/manifests': No such file or directory
evakhoni-112100-kwkp4-master-1.c.openshift-qe.internal
/host/etc/cvo/updatepayloads/brRTeZnZSIkZ2J2YMdQozg/manifests/manifests
evakhoni-112100-kwkp4-master-2.c.openshift-qe.internal
/host/etc/cvo/updatepayloads/brRTeZnZSIkZ2J2YMdQozg/manifests/manifests

Then upgraded one more time to 4.11.0-0.nightly-2022-04-07-053433, and got a CrashLoopBackOff:
#oc get pods -w
NAME                                        READY   STATUS      RESTARTS   AGE
cluster-version-operator-6dfd5f57d8-cwq9b   1/1     Running     0          15m
version--9vtkl-jknfv                        0/1     Completed   0          39m
version--b8g6t-kc9xh                        0/1     Completed   0          48m
version--bcqp9-j82mx                        0/1     Completed   0          17m
version--bzklh-rbl7z                        0/1     Completed   0          15m
version--dk8v5-kjx95                        0/1     Completed   0          15m
version--lswr4-8xb8f                        0/1     Completed   0          69m
version--s4gfk-f4qvw                        0/1     Completed   0          69m
version--vrkf8-58wdz                        0/1     Completed   0          73m
version--whc68-ltk58                        0/1     Completed   0          73m
version--jpt5r-9h6wg                        0/1     Pending     0          0s
version--jpt5r-9h6wg                        0/1     ContainerCreating   0          0s
version--jpt5r-9h6wg                        0/1     ContainerCreating   0          2s
version--jpt5r-9h6wg                        0/1     Error               0          2s
version--jpt5r-9h6wg                        0/1     Error               1 (1s ago)   3s
version--jpt5r-9h6wg                        0/1     CrashLoopBackOff    1 (1s ago)   4s
version--jpt5r-9h6wg                        0/1     Error               2 (18s ago)   21s

#oc logs version--jpt5r-9h6wg
mv: inter-device move failed: '/manifests' to '/etc/cvo/updatepayloads/brRTeZnZSIkZ2J2YMdQozg/manifests/manifests'; unable to remove target: Directory not empty

Quickly collected a must-gather; I was able to catch the pod:

#omg get pods
NAME                                       READY  STATUS     RESTARTS  AGE
cluster-version-operator-6dfd5f57d8-cwq9b  1/1    Running    0         17m
version--9vtkl-jknfv                       0/1    Succeeded  0         41m
version--b8g6t-kc9xh                       0/1    Succeeded  0         51m
version--bcqp9-j82mx                       0/1    Succeeded  0         20m
version--bzklh-rbl7z                       0/1    Succeeded  0         17m
version--dk8v5-kjx95                       0/1    Succeeded  0         18m
version--jpt5r-9h6wg                       0/1    Running    1         17s
version--lswr4-8xb8f                       0/1    Succeeded  0         1h12m
version--s4gfk-f4qvw                       0/1    Succeeded  0         1h11m
version--vrkf8-58wdz                       0/1    Succeeded  0         1h15m
version--whc68-ltk58                       0/1    Succeeded  0         1h15m

#omg logs version--jpt5r-9h6wg
/home/evakhoni/93193/must-gather.local.3458720062476842962/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-e47b07aaeabdf633f489780b29c6b92d2447fa33b9627b3f5dfc7301478fb025/namespaces/openshift-cluster-version/pods/version--jpt5r-9h6wg/payload/payload/logs/current.log
2022-04-11T20:42:35.078131997Z mv: inter-device move failed: '/manifests' to '/etc/cvo/updatepayloads/brRTeZnZSIkZ2J2YMdQozg/manifests/manifests'; unable to remove target: Directory not empty
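For the record, the data collection used here is just the standard tooling (a sketch; --dest-dir is optional):

  $ oc adm must-gather
  $ oc adm inspect ns/openshift-cluster-version --dest-dir=inspect.local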

Comment 13 W. Trevor King 2022-04-18 20:54:20 UTC
Evgeni found the:

  2022-04-11T20:42:32.056428784Z W0411 20:42:32.056387       1 updatepayload.go:149] failed to prune update payload directory: unlinkat /etc/cvo/updatepayloads/brRTeZnZSIkZ2J2YMdQozg: read-only file system

issue, and I've filed [1] to avoid that.

[1]: https://github.com/openshift/cluster-version-operator/pull/765
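For anyone checking a cluster for this particular symptom, the prune failure shows up directly in the CVO log (a sketch; the grep pattern is taken from the log line above):

  $ oc -n openshift-cluster-version logs deploy/cluster-version-operator | grep 'failed to prune update payload directory'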

Comment 16 Evgeni Vakhonin 2022-04-26 14:26:13 UTC
Hmm... verifying from 4.11.0-0.nightly-2022-04-26-030643 to 4.11.0-0.nightly-2022-04-26-085341.

Tried the last method of starting and reverting, while deleting release-manifests.
Did one round of:
oc adm upgrade --allow-explicit-upgrade --force --to-image=registry.ci.openshift.org/ocp/release@sha256:f9875a76c9867901d6e441f2eca7130a838255324b701604201152aa2f332e57 #new
oc adm upgrade --allow-explicit-upgrade --allow-upgrade-with-warnings --force --to-image=registry.ci.openshift.org/ocp/release@sha256:eb1de01c387ad7fa9d82ae7249fc3ede1706043ccd8e1d06bcfb67e5a2741b57 #old
Waited for the revert to complete.
Deleted release-manifests, triggered the upgrade again to 26-085341 (without --allow-upgrade-with-warnings, as before), and received:

  - lastTransitionTime: "2022-04-26T13:04:21Z"
    message: 'Retrieving payload failed version="" image="registry.ci.openshift.org/ocp/release@sha256:f9875a76c9867901d6e441f2eca7130a838255324b701604201152aa2f332e57"
      failure=Unable to download and prepare the update: stat /etc/cvo/updatepayloads/dm4kctz_pd--9HfyZ99hVg/release-manifests:
      no such file or directory'
    reason: RetrievePayload
    status: "False"
    type: ReleaseAccepted

Cleared, and tried again with --allow-upgrade-with-warnings.

The version pod is Completed:
version--f89rb-6z4xl                        0/1     Completed   0          4m14s   10.130.0.156   evakhoni-1204-5bb7m-master-1.c.openshift-qe.internal   <none>           <no

Still the same error:

oc adm upgrade --allow-explicit-upgrade --allow-upgrade-with-warnings --force --to-image=registry.ci.openshift.org/ocp/release@sha256:f9875a76c9867901d6e441f2eca7130a838255324b701604201152aa2f332e57 #new 
warning: The requested upgrade image is not one of the available updates.You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Updating to release image registry.ci.openshift.org/ocp/release@sha256:f9875a76c9867901d6e441f2eca7130a838255324b701604201152aa2f332e57

spec:
{
  "channel": "stable-4.11",
  "clusterID": "08f5e725-e263-4b4c-9fb5-ced2feaf70a0",
  "desiredUpdate": {
    "force": true,
    "image": "registry.ci.openshift.org/ocp/release@sha256:f9875a76c9867901d6e441f2eca7130a838255324b701604201152aa2f332e57",
    "version": ""
  }
}


2022-04-26T09:10:15Z RetrievedUpdates=False VersionNotFound: Unable to retrieve available updates: currently reconciling cluster version 4.11.0-0.nightly-2022-04-26-030643 not found in the "stable-4.11" channel
2022-04-26T09:10:15Z ImplicitlyEnabledCapabilities=False AsExpected: Capabilities match configured spec
2022-04-26T13:04:21Z ReleaseAccepted=False RetrievePayload: Retrieving payload failed version="" image="registry.ci.openshift.org/ocp/release@sha256:f9875a76c9867901d6e441f2eca7130a838255324b701604201152aa2f332e57" failure=Unable to download and prepare the update: stat /etc/cvo/updatepayloads/dm4kctz_pd--9HfyZ99hVg/release-manifests: no such file or directory
2022-04-26T09:35:34Z Available=True : Done applying 4.11.0-0.nightly-2022-04-26-030643
2022-04-26T09:34:04Z Failing=False : 
2022-04-26T13:08:14Z Progressing=False : Cluster version is 4.11.0-0.nightly-2022-04-26-030643

in CVO log:
W0426 13:30:58.464679       1 updatepayload.go:116] An image was retrieved from "registry.ci.openshift.org/ocp/release@sha256:f9875a76c9867901d6e441f2eca7130a838255324b701604201152aa2f332e57" that failed verification: The update cannot be verified: unable to locate a valid signature for one or more sources
I0426 13:31:07.520829       1 event.go:285] Event(v1.ObjectReference{Kind:"ClusterVersion", Namespace:"openshift-cluster-version", Name:"version", UID:"", APIVersion:"config.openshift.io/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'RetrievePayloadFailed' Retrieving payload failed version="" image="registry.ci.openshift.org/ocp/release@sha256:f9875a76c9867901d6e441f2eca7130a838255324b701604201152aa2f332e57" failure=Unable to download and prepare the update: stat /etc/cvo/updatepayloads/dm4kctz_pd--9HfyZ99hVg/release-manifests: no such file or directory

Nothing in the version pods' logs.

So, while not exactly our bug from before, it looks like a regression in the CVO manifest-validation mechanism.

@wking WDYT?

Comment 17 W. Trevor King 2022-04-27 04:26:05 UTC
Poking around in a must-gather from the comment 16 cluster:

$ for X in namespaces/openshift-cluster-version/pods/version--*/*.yaml; do yaml2json < "${X}" | jq -r '(.metadata | (.creationTimestamp + " " + .name)) + " " + .status.containerStatuses[].image + " " + .status.phase'; done | sort
2022-04-26T12:48:48Z version--rxf54-n9bjd registry.ci.openshift.org/ocp/release@sha256:f9875a76c9867901d6e441f2eca7130a838255324b701604201152aa2f332e57 Succeeded
2022-04-26T12:49:27Z version--22zzn-m57z2 registry.ci.openshift.org/ocp/release@sha256:eb1de01c387ad7fa9d82ae7249fc3ede1706043ccd8e1d06bcfb67e5a2741b57 Succeeded
2022-04-26T13:02:50Z version--s4p2x-x6tkc registry.ci.openshift.org/ocp/release@sha256:f9875a76c9867901d6e441f2eca7130a838255324b701604201152aa2f332e57 Succeeded
2022-04-26T13:04:12Z version--2cq6b-49bq4 registry.ci.openshift.org/ocp/release@sha256:f9875a76c9867901d6e441f2eca7130a838255324b701604201152aa2f332e57 Succeeded
2022-04-26T13:30:58Z version--f89rb-6z4xl registry.ci.openshift.org/ocp/release@sha256:f9875a76c9867901d6e441f2eca7130a838255324b701604201152aa2f332e57 Succeeded

so all of the jobs were happy.  Commands on that 13:30 pod:

$ yaml2json <namespaces/openshift-cluster-version/pods/version--f89rb-6z4xl/*.yaml | jq -c '.spec | [.initContainers, .containers][][] | .command'
["rm","-fR","/etc/cvo/updatepayloads/*"]
["mkdir","/etc/cvo/updatepayloads/dm4kctz_pd--9HfyZ99hVg-c2dmr"]
["mv","/manifests","/etc/cvo/updatepayloads/dm4kctz_pd--9HfyZ99hVg-c2dmr/manifests"]
["mv","/release-manifests","/etc/cvo/updatepayloads/dm4kctz_pd--9HfyZ99hVg-c2dmr/release-manifests"]
["mv","/etc/cvo/updatepayloads/dm4kctz_pd--9HfyZ99hVg-c2dmr","/etc/cvo/updatepayloads/dm4kctz_pd--9HfyZ99hVg"]

Ah, shell globbing in the 'rm' is probably not going to work now that I've dropped the shell...

And then in the CVO logs:

$ grep '13:3.* loadUpdatedPayload' namespaces/openshift-cluster-version/pods/cluster-version-operator-655f6955b4-mcqd5/cluster-version-operator/cluster-version-operator/logs/current.log
2022-04-26T13:31:07.520741308Z I0426 13:31:07.520674       1 sync_worker.go:376] loadUpdatedPayload syncPayload err=Unable to download and prepare the update: stat /etc/cvo/updatepayloads/dm4kctz_pd--9HfyZ99hVg/release-manifests: no such file or directory

So let me file a v3 pull to fix the 'rm' bit...
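To make the globbing issue concrete, the difference is easy to demonstrate in any shell (a sketch, not the actual CVO code, which assembles the Job spec in Go):

  $ mkdir -p /tmp/payloads/a /tmp/payloads/b
  $ rm -fR '/tmp/payloads/*'         # argv-style invocation, no shell: the literal '*' matches nothing, so nothing is removed
  $ sh -c 'rm -fR /tmp/payloads/*'   # the shell expands the glob, so the leftover directories are removed

A Job command like ["rm","-fR","/etc/cvo/updatepayloads/*"] behaves like the quoted form, which is what PR 767 ("Restore shell for rm globbing") addresses.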

Comment 19 Evgeni Vakhonin 2022-05-01 08:04:30 UTC
Pre-merge verified before a nightly was available:
both from unpatched 4.11.0-0.nightly-2022-04-26-181148 to patched 4.11.0-0.ci.test-2022-04-28-120349-ci-ln-hqnmgjt-latest,
and from patched 4.11.0-0.ci.test-2022-04-28-120349-ci-ln-hqnmgjt-latest to unpatched 4.11.0-0.nightly-2022-04-26-181148,
and also from 4.10.12 to patched 4.11.0-0.ci.test-2022-04-28-120349-ci-ln-hqnmgjt-latest, to pick a release predating the first two PRs to upgrade from.

using the same method as in https://bugzilla.redhat.com/show_bug.cgi?id=2070805#c16

1) started upgrade
2) reverted back
3) invalidated the current payload by deleting release-manifests
4) checked status, pods, and log
5) repeated (one cycle is sketched below)
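For the record, one cycle of the above looked roughly like this (a sketch; NEW_IMAGE and OLD_IMAGE stand for the patched and unpatched release pullspecs listed above):

  $ oc adm upgrade --allow-explicit-upgrade --allow-upgrade-with-warnings --force --to-image="${NEW_IMAGE}"
  $ oc adm upgrade --allow-explicit-upgrade --allow-upgrade-with-warnings --force --to-image="${OLD_IMAGE}"
  # ...wait for the revert to complete...
  $ for node in $(oc get nodes -l 'node-role.kubernetes.io/master' -ojsonpath='{.items[:].metadata.name}'); do oc debug node/$node -- /bin/bash -c 'rm -rf /host/etc/cvo/updatepayloads/*/release-manifests'; done
  $ oc adm upgrade --allow-explicit-upgrade --allow-upgrade-with-warnings --force --to-image="${NEW_IMAGE}"
  $ oc get clusterversion && oc -n openshift-cluster-version get pods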

Did 6 cycles; no version pod crash detected:
oc get pods                         
NAME                                        READY   STATUS      RESTARTS   AGE
cluster-version-operator-5b469f5d6d-s7qjn   1/1     Running     0          118s
version--58p9p-cjcdx                        0/1     Completed   0          59m
version--6q287-x45gm                        0/1     Completed   0          51m
version--9hfs5-th5bj                        0/1     Completed   0          56m
version--cjmk8-b8pjm                        0/1     Completed   0          55m
version--glgvs-9hzmn                        0/1     Completed   0          2m10s
version--hfmm8-j8mrl                        0/1     Completed   0          69m
version--htvkc-5sp4b                        0/1     Completed   0          70m
version--ktvk8-hfpr5                        0/1     Completed   0          39m
version--lpdjz-wkwvl                        0/1     Completed   0          60m
version--lsrdd-r4zls                        0/1     Completed   0          2m59s
version--r7xrx-g8g6s                        0/1     Completed   0          38m
version--z4bwd-5mz2v                        0/1     Completed   0          52m

No messages in the version pods' logs, as expected:
for pod in `oc get pods -n openshift-cluster-version -ojsonpath='{.items[1:].metadata.name}'`; do echo -e "$pod\nlogs:"; oc logs pod/$pod ; done
version--58p9p-cjcdx
logs:
version--6q287-x45gm
logs:
version--9hfs5-th5bj
logs:
version--cjmk8-b8pjm
logs:
version--glgvs-9hzmn
logs:
version--hfmm8-j8mrl
logs:
version--htvkc-5sp4b
logs:
version--ktvk8-hfpr5
logs:
version--lpdjz-wkwvl
logs:
version--lsrdd-r4zls
logs:
version--r7xrx-g8g6s
logs:
version--z4bwd-5mz2v
logs:

No error in the CVO log, as expected.
No error in the CVO status; ReleaseAccepted=True PayloadLoaded: Payload loaded version="4.11.0-0.ci.test-2022-04-28-120349-ci-ln-hqnmgjt-latest"

No manifests/manifests dir:
for node in $(oc get nodes -l 'node-role.kubernetes.io/master' -ojsonpath='{.items[:].metadata.name}');do oc debug node/$node -- /bin/bash -c 'ls /host/etc/cvo/updatepayloads/*/manifests/manifests -lAR';done 
Starting pod/evakhoni-1906-dhpv9-master-0copenshift-qeinternal-debug ...
To use host binaries, run `chroot /host`
ls: cannot access '/host/etc/cvo/updatepayloads/*/manifests/manifests': No such file or directory

Removing debug pod ...
Starting pod/evakhoni-1906-dhpv9-master-1copenshift-qeinternal-debug ...
To use host binaries, run `chroot /host`
ls: cannot access '/host/etc/cvo/updatepayloads/*/manifests/manifests': No such file or directory

Removing debug pod ...
Starting pod/evakhoni-1906-dhpv9-master-2copenshift-qeinternal-debug ...
To use host binaries, run `chroot /host`
ls: cannot access '/host/etc/cvo/updatepayloads/*/manifests/manifests': No such file or directory

All verified as expected.

Comment 20 Evgeni Vakhonin 2022-05-11 05:42:56 UTC
Note: it is still sometimes possible to reproduce while upgrading from an unfixed to a fixed build, which is expected according to the developer. I was able to recover the cluster and upgrade to the fixed build with the following workaround:
1) removed the old stuck manifests from all masters
╰─ for node in $(oc get nodes -l 'node-role.kubernetes.io/master' -ojsonpath='{.items[:].metadata.name}'); do echo $node; oc debug node/$node -- /bin/bash -c 'rm -rf /host/etc/cvo/updatepayloads/*'; done
2) cleared
╰─ oc adm upgrade --clear                                                                                                               
3) applied upgrade again 
╰─ oc adm upgrade --allow-explicit-upgrade --allow-upgrade-with-warnings --force --to-image=...

Comment 22 W. Trevor King 2022-07-26 10:10:26 UTC
Skimming some of the earlier comments here, I see some mentions of --force.  That's a big hammer:

  $ oc adm upgrade --help | grep force
   The cluster may report that the upgrade should not be performed due to a content verification error or update precondition failures such as operators blocking upgrades. Do not upgrade to images that are not appropriately signed without understanding the risks of upgrading your cluster to untrusted code. If you must override this protection use the --force flag.
        --force=false: Forcefully upgrade the cluster even when upgrade release image validation fails and the cluster is reporting errors.

Sometimes you need that hammer, e.g. when verifying bugs by updating to unsigned CI release builds.  But for folks moving between signed releases, it's best to avoid --force if at all possible.  If you're being bitten by this issue, please see the notes and recommended recovery steps in [1].

[1]: https://access.redhat.com/solutions/6965075
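If a forced update was requested by mistake, the pending request can be inspected and cleared before retrying without --force (a sketch; the KCS article above has the full recovery steps):

  $ oc get clusterversion version -o jsonpath='{.spec.desiredUpdate}{"\n"}'
  $ oc adm upgrade --clear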

Comment 25 errata-xmlrpc 2022-08-10 11:03:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

