Description of problem:

ClusterID: cc782851-976b-494c-90ea-d5125936e134
ClusterVersion: Updating to "4.10.5" from "4.10.4" for 2 hours: Unable to apply 4.10.5: could not download the update
ClusterOperators: All healthy and stable

A cluster trying to upgrade to 4.10.5 from 4.10.4 is stuck with the above error reported on the ClusterVersion. A pod in the openshift-cluster-version namespace keeps being created and erroring. We managed to grab a log which had:

oc logs -n openshift-cluster-version version-4.10.5-9jv69-4kxs7
mv: inter-device move failed: '/manifests' to '/etc/cvo/updatepayloads/HbO7IDc7tyIg9utw3sd_tg/manifests/manifests'; unable to remove target: Directory not empty

I will attach a must-gather and an adm inspect of openshift-cluster-version (although the adm inspect seemed to error grabbing the version-4.10.5 pod details) in a private comment. This is a GCP cluster.
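For anyone triaging the same symptom, a minimal sketch of how to spot the failing payload pod and collect the same artifacts (the version-... pod name is generated per attempt and will differ on other clusters):

oc get pods -n openshift-cluster-version          # look for a version-... pod in Error/CrashLoopBackOff
oc logs -n openshift-cluster-version <version-pod-name>
oc adm must-gather                                # general must-gather, as attached here
oc adm inspect ns/openshift-cluster-version       # namespace-scoped inspect, as attached here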
Guess at a test plan:

1. Install a nightly with the patch.

2. Exec into the CVO container. Also figure out which node it's running on for later.

3. Create some noise that looks like leaks from previous releases, including comment 0's 4.10.5 md5:

$ for X in $(seq 9) HbO7IDc7tyIg9utw3sd_tg; do for DIR in manifests cvo-manifests; do FDIR="/etc/cvo/updatepayloads/${X}/${DIR}" && mkdir -p "${FDIR}" && touch "${FDIR}/test.yaml"; done; done

4. Trigger an update to 4.10.5:

$ oc adm upgrade --allow-explicit-upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:ee6a9c7a11f883e90489229f6c6dc78b434af12f5646f4f9411d73a98969f02a

5. The cluster-version operator will launch a version Job, with a version-... Pod (one way to watch this is sketched after this comment). Before the fix, the version-... Pod's logs would include "unable to remove target: Directory not empty", as seen in comment 0, and ClusterVersion conditions would include "could not download the update". With the fix, the version-... pod should succeed, and the CVO will begin the update (updating from a 4.11 nightly to 4.10.5 will probably blow up, but all we care about here is "did the CVO begin moving towards the target release?", not "do we successfully complete the update to the target release?"). Once the CVO begins updating:

$ oc debug "node/${NODE_FROM_STEP_2}" -- ls /host/etc/cvo/updatepayloads

Before the fix, all the 1, 2, 3, ... subdirectories we'd created in step 3 would still be there. With the fix, the CVO will have removed those, and only HbO7IDc7tyIg9utw3sd_tg should remain.
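A minimal sketch for watching step 5 (the version-... pod name is generated, and the jq filter is just a convenience to narrow the conditions output):

$ oc get pods -n openshift-cluster-version -w                           # watch the version-... pod appear and either crash or complete
$ oc get clusterversion version -o json | jq '.status.conditions[]'     # before the fix, look for "could not download the update"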
Looking for a way to reproduce... so far, creating the files in /etc/cvo/updatepayloads/ did not reproduce the version pod error as in the original report. Upgraded 4.10.4 to 4.10.5 without failure, and HbO7IDc7tyIg9utw3sd_tg was populated with manifests; the 1, 2, 3, ... directories remained, and there was no "mv: inter-device move failed" in the CVO log. :(
Just had a try with the following steps, and reproduced.

1. Triggered an upgrade from v4.10.4 to v4.10.5.

2. After the upgrade started, aborted it back to 4.10.4 with --force and --allow-upgrade-with-warnings.

# ./oc adm upgrade --allow-explicit-upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:9f9c3aaca64f62af992bae5de1e984571c8b812f598b74c84dc630b064389fb7 --force --allow-upgrade-with-warnings
warning: The requested upgrade image is not one of the available updates.You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
warning: --allow-upgrade-with-warnings is bypassing: the cluster is already upgrading:

  Reason:
  Message: Working towards 4.10.5: 83 of 758 done (10% complete)

Updating to release image quay.io/openshift-release-dev/ocp-release@sha256:9f9c3aaca64f62af992bae5de1e984571c8b812f598b74c84dc630b064389fb7

3. Checked that the cluster returned to v4.10.4, then logged in to the node to check that 4.10.5's manifests were already downloaded.

# ls -la /etc/cvo/updatepayloads/HbO7IDc7tyIg9utw3sd_tg/
total 56
drwxr-xr-x. 5 root root    69 Apr  8 01:02 .
drwxr-xr-x. 3 root root    36 Apr  8 01:13 ..
drwxr-xr-x. 2 root root    23 Apr  8 00:56 cvo-manifests
drwxr-xr-x. 3 root root    40 Apr  8 01:02 manifests
drwxrwxrwx. 2 root root 40960 Mar 14 08:04 release-manifests

4. Removed cvo-manifests and release-manifests to make this dir broken (a debug-pod variant of this step is sketched after this comment).

# ls -la /etc/cvo/updatepayloads/HbO7IDc7tyIg9utw3sd_tg/
total 0
drwxr-xr-x. 3 root root 23 Apr  8 01:59 .
drwxr-xr-x. 3 root root 36 Apr  8 01:13 ..
drwxr-xr-x. 3 root root 40 Apr  8 01:02 manifests

5. Did the upgrade from 4.10.4 to 4.10.5 again, and reproduced.

# ./oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.4    True        True          10m     Unable to apply 4.10.5: could not download the update

# ./oc logs version-4.10.5-hx5fd-lj5fh
mv: inter-device move failed: '/manifests' to '/etc/cvo/updatepayloads/HbO7IDc7tyIg9utw3sd_tg/manifests/manifests'; unable to remove target: Directory not empty
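For step 4, the same breakage can be introduced without logging in to the node; a sketch using oc debug, assuming the same payload hash directory as above (deliberately destructive to the cached payload, so only do this on a throwaway cluster):

$ oc debug node/<master-node> -- chroot /host rm -rf \
    /etc/cvo/updatepayloads/HbO7IDc7tyIg9utw3sd_tg/cvo-manifests \
    /etc/cvo/updatepayloads/HbO7IDc7tyIg9utw3sd_tg/release-manifests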
Reproduced with the correct file placement; it has to be /manifests/manifests, as in #c6 by Jia.

oc get pods -n openshift-cluster-version -owide
NAME                                        READY   STATUS    RESTARTS   AGE   IP         NODE                                                     NOMINATED NODE   READINESS GATES
cluster-version-operator-85dd988454-sjt62   1/1     Running   0          53m   10.0.0.5   evakhoni-100930-8n4jp-master-0.c.openshift-qe.internal   <none>           <none>

oc debug node/evakhoni-100930-8n4jp-master-0.c.openshift-qe.internal -- /bin/bash -c 'for X in $(seq 9) HbO7IDc7tyIg9utw3sd_tg; do FDIR="/host/etc/cvo/updatepayloads/${X}/manifests/manifests" && mkdir -p "${FDIR}" && touch "${FDIR}/test.yaml"; done'
Starting pod/evakhoni-100930-8n4jp-master-0copenshift-qeinternal-debug ...
To use host binaries, run `chroot /host`
Removing debug pod ...

oc adm upgrade --allow-explicit-upgrade --to-image quay.io/openshift-release-dev/ocp-release@sha256:ee6a9c7a11f883e90489229f6c6dc78b434af12f5646f4f9411d73a98969f02a
warning: The requested upgrade image is not one of the available updates.You have used --allow-explicit-upgrade for the update to proceed anyway
Updating to release image quay.io/openshift-release-dev/ocp-release@sha256:ee6a9c7a11f883e90489229f6c6dc78b434af12f5646f4f9411d73a98969f02a

oc adm upgrade
info: An upgrade is in progress. Unable to apply quay.io/openshift-release-dev/ocp-release@sha256:ee6a9c7a11f883e90489229f6c6dc78b434af12f5646f4f9411d73a98969f02a: could not download the update

oc logs -n openshift-cluster-version version--hdsf9-k29gm
mv: inter-device move failed: '/manifests' to '/etc/cvo/updatepayloads/HbO7IDc7tyIg9utw3sd_tg/manifests/manifests'; unable to remove target: Directory not empty
However, verifying from 4.11 to 4.10.5 as suggested by wking in https://bugzilla.redhat.com/show_bug.cgi?id=2070805#c4 is impossible, because the version pod being pulled is of the target version (4.10.5 in this case), so the bug reproduced even from a fixed build:

Server Version: 4.11.0-0.nightly-2022-04-06-213816

oc get pods -n openshift-cluster-version -owide
NAME                                        READY   STATUS    RESTARTS   AGE   IP         NODE                                                     NOMINATED NODE   READINESS GATES
cluster-version-operator-6dfd5f57d8-scn65   1/1     Running   0          72m   10.0.0.4   evakhoni-100932-w9snb-master-0.c.openshift-qe.internal   <none>           <none>

oc debug node/evakhoni-100932-w9snb-master-0.c.openshift-qe.internal -- /bin/bash -c 'for X in $(seq 9) HbO7IDc7tyIg9utw3sd_tg; do FDIR="/host/etc/cvo/updatepayloads/${X}/manifests/manifests" && mkdir -p "${FDIR}" && touch "${FDIR}/test.yaml"; done'
Starting pod/evakhoni-100932-w9snb-master-0copenshift-qeinternal-debug ...
To use host binaries, run `chroot /host`
Removing debug pod ...

oc get pods
NAME                                        READY   STATUS    RESTARTS   AGE
cluster-version-operator-6dfd5f57d8-scn65   1/1     Running   0          3h6m
version--mhsd6-qfdqn                        0/1     Error     3          48s

oc logs version--mhsd6-qfdqn
mv: inter-device move failed: '/manifests' to '/etc/cvo/updatepayloads/HbO7IDc7tyIg9utw3sd_tg/manifests/manifests'; unable to remove target: Directory not empty
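A quick way to confirm that the version pod really runs the target release's image rather than the source's, a sketch (the pod name below is the one from this attempt and will differ elsewhere):

oc get pod -n openshift-cluster-version version--mhsd6-qfdqn -o jsonpath='{.spec.containers[*].image}{"\n"}'
# prints the target release image, i.e. the mv/rm logic that runs comes from the target payload, not from the fixed source build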
Also, it looks like we cannot guarantee that the 'version' pod is going to be scheduled on the same node as the CVO, as demonstrated here:

oc get pods -owide
NAME                                       READY   STATUS      RESTARTS   AGE   IP            NODE                                                     NOMINATED NODE   READINESS GATES
cluster-version-operator-db946689b-qhvhg   1/1     Running     0          51s   10.0.0.3      evakhoni-100930-8n4jp-master-2.c.openshift-qe.internal   <none>           <none>
version--ldtzk-s5pw2                       0/1     Completed   0          68s   10.129.0.90   evakhoni-100930-8n4jp-master-0.c.openshift-qe.internal   <none>           <none>

So using a hybrid approach to verify:

1) Obtained the release hash by triggering an upgrade to another fixed version, higher than the one under test, on another cluster, then observed the downloaded manifests:

for node in $(oc get nodes -l 'node-role.kubernetes.io/master' -ojsonpath='{.items[:].metadata.name}')
do
  oc debug node/$node -- /bin/bash -c 'ls /host/etc/cvo/updatepayloads/'
done
Starting pod/evakhoni-101622-q6qqc-master-2copenshift-qeinternal-debug ...
To use host binaries, run `chroot /host`
brRTeZnZSIkZ2J2YMdQozg

(in this case, the hash is for version 4.11.0-0.nightly-2022-04-07-053433)

2) Tested the target cluster from 4.11.0-0.nightly-2022-04-06-213816 (after fix) to 4.11.0-0.nightly-2022-04-07-053433 (after fix). Generated garbage with the target hash, manifests/manifests + dummy directories as suggested in #c4, with the manifests/manifests correction I used in https://bugzilla.redhat.com/show_bug.cgi?id=2070805#c7, but this time on all master nodes:

for node in $(oc get nodes -l 'node-role.kubernetes.io/master' -ojsonpath='{.items[:].metadata.name}'); do oc debug node/$node -- /bin/bash -c 'for X in $(seq 9) brRTeZnZSIkZ2J2YMdQozg; do FDIR="/host/etc/cvo/updatepayloads/${X}/manifests/manifests" && mkdir -p "${FDIR}" && touch "${FDIR}/test.yaml"; done'; done
Starting pod/evakhoni-101813-zh5cf-master-0copenshift-qeinternal-debug ...
To use host binaries, run `chroot /host`
Removing debug pod ...
Starting pod/evakhoni-101813-zh5cf-master-1copenshift-qeinternal-debug ...
To use host binaries, run `chroot /host`
Removing debug pod ...
Starting pod/evakhoni-101813-zh5cf-master-2copenshift-qeinternal-debug ...
To use host binaries, run `chroot /host`
Removing debug pod ...

3) Triggered the upgrade and observed the 'version' pod log...

oc adm upgrade --force --allow-explicit-upgrade --to-image registry.ci.openshift.org/ocp/release@sha256:8f4ac72e32e701f6609fe00c4f021670cab8466eb38b98c141764ceb6c3d8ab5
warning: The requested upgrade image is not one of the available updates.You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Updating to release image registry.ci.openshift.org/ocp/release@sha256:8f4ac72e32e701f6609fe00c4f021670cab8466eb38b98c141764ceb6c3d8ab5

oc get pods -owide -n openshift-cluster-version
NAME                                        READY   STATUS    RESTARTS      AGE   IP            NODE                                                     NOMINATED NODE   READINESS GATES
cluster-version-operator-6dfd5f57d8-j9rsn   1/1     Running   0             36m   10.0.0.4      evakhoni-101813-zh5cf-master-2.c.openshift-qe.internal   <none>           <none>
version--pspnp-72crs                        0/1     Error     2 (16s ago)   22s   10.128.0.49   evakhoni-101813-zh5cf-master-2.c.openshift-qe.internal   <none>           <none>

oc logs version--pspnp-72crs -n openshift-cluster-version
mv: inter-device move failed: '/manifests' to '/etc/cvo/updatepayloads/brRTeZnZSIkZ2J2YMdQozg/manifests/manifests'; unable to remove target: Directory not empty

This time, however, with no 'could not download the update' error:

oc get clusterversions.config.openshift.io
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-04-06-213816   True        False         20m     Cluster version is 4.11.0-0.nightly-2022-04-06-213816

However, digging deeper into the status revealed:

oc get clusterversions.config.openshift.io version -ojson|jq '.status'
{
  "availableUpdates": null,
  "capabilities": {
    "enabledCapabilities": [
      "baremetal",
      "marketplace",
      "openshift-samples"
    ],
    "knownCapabilities": [
      "baremetal",
      "marketplace",
      "openshift-samples"
    ]
  },
  "conditions": [
    {
      "lastTransitionTime": "2022-04-10T15:19:46Z",
      "message": "Unable to retrieve available updates: currently reconciling cluster version 4.11.0-0.nightly-2022-04-06-213816 not found in the \"stable-4.11\" channel",
      "reason": "VersionNotFound",
      "status": "False",
      "type": "RetrievedUpdates"
    },
    {
      "lastTransitionTime": "2022-04-10T15:19:46Z",
      "message": "Capabilities match configured spec",
      "reason": "AsExpected",
      "status": "False",
      "type": "ImplicitlyEnabledCapabilities"
    },
    {
      "lastTransitionTime": "2022-04-10T15:57:47Z",
      "message": "Retrieving payload failed version=\"\" image=\"registry.ci.openshift.org/ocp/release@sha256:8f4ac72e32e701f6609fe00c4f021670cab8466eb38b98c141764ceb6c3d8ab5\" failure=Unable to download and prepare the update: deadline exceeded, reason: \"DeadlineExceeded\", message: \"Job was active longer than specified deadline\"",
      "reason": "RetrievePayload",
      "status": "False",
      "type": "ReleaseAccepted"
    },
    {
      "lastTransitionTime": "2022-04-10T15:37:47Z",
      "message": "Done applying 4.11.0-0.nightly-2022-04-06-213816",
      "status": "True",
      "type": "Available"
    },
    {
      "lastTransitionTime": "2022-04-10T15:37:47Z",
      "status": "False",
      "type": "Failing"
    },
    {
      "lastTransitionTime": "2022-04-10T15:59:02Z",
      "message": "Cluster version is 4.11.0-0.nightly-2022-04-06-213816",
      "status": "False",
      "type": "Progressing"
    }
  ],
  "desired": {
    "image": "registry.ci.openshift.org/ocp/release@sha256:40bd2cfbbd80cc192acc3d9fe047790cf4592beb577d144961cdf465392a5133",
    "version": "4.11.0-0.nightly-2022-04-06-213816"
  },
  "history": [
    {
      "completionTime": "2022-04-10T15:37:47Z",
      "image": "registry.ci.openshift.org/ocp/release@sha256:40bd2cfbbd80cc192acc3d9fe047790cf4592beb577d144961cdf465392a5133",
      "startedTime": "2022-04-10T15:19:46Z",
      "state": "Completed",
      "verified": false,
      "version": "4.11.0-0.nightly-2022-04-06-213816"
    }
  ],
  "observedGeneration": 3,
  "versionHash": "PWepNsbeUMA="
}

@wking what do you think?
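A shorter way to pull out just the condition that matters here, a convenience sketch:

oc get clusterversion version -o json | jq '.status.conditions[] | select(.type=="ReleaseAccepted")'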
With the reproducer in https://bugzilla.redhat.com/show_bug.cgi?id=2070805#c6, I checked that the issue should be fixed now.

Before the upgrade from 4.11.0-0.nightly-2022-04-06-213816 to 4.11.0-0.nightly-2022-04-07-053433, there were broken dirs on both node ip-10-0-164-22us-east-2computeinternal and node ip-10-0-196-183us-east-2computeinternal.

# for node in $(oc get nodes -l 'node-role.kubernetes.io/master' -ojsonpath='{.items[:].metadata.name}');do oc debug node/$node -- /bin/bash -c 'ls /host/etc/cvo/updatepayloads/brRTeZnZSIkZ2J2YMdQozg -la';done
Starting pod/ip-10-0-155-70us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
ls: cannot access '/host/etc/cvo/updatepayloads/brRTeZnZSIkZ2J2YMdQozg': No such file or directory
Removing debug pod ...
Starting pod/ip-10-0-164-22us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
total 4
drwxr-xr-x. 3 root root   23 Apr 11 08:02 .
drwxr-xr-x. 4 root root   66 Apr 11 07:59 ..
drwxr-xr-x. 2 root root 4096 Apr  7 03:55 manifests
Removing debug pod ...
Starting pod/ip-10-0-196-183us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
total 4
drwxr-xr-x. 3 root root   23 Apr 11 07:52 .
drwxr-xr-x. 3 root root   36 Apr 11 07:49 ..
drwxr-xr-x. 2 root root 4096 Apr  7 03:55 manifests
Removing debug pod ...

After triggering the upgrade from 4.11.0-0.nightly-2022-04-06-213816 to 4.11.0-0.nightly-2022-04-07-053433:

1) Checked that the latest version pod was scheduled on node ip-10-0-164-22.us-east-2.compute.internal.

# ./oc get po -owide
NAME                                        READY   STATUS      RESTARTS   AGE     IP            NODE                                         NOMINATED NODE   READINESS GATES
cluster-version-operator-6dfd5f57d8-6rk4c   1/1     Running     0          3m24s   10.0.164.22   ip-10-0-164-22.us-east-2.compute.internal    <none>           <none>
version--7fjrv-mbp4t                        0/1     Completed   0          12m     10.129.0.62   ip-10-0-164-22.us-east-2.compute.internal    <none>           <none>
version--ksmdr-26xjw                        0/1     Completed   0          4m34s   10.129.0.68   ip-10-0-164-22.us-east-2.compute.internal    <none>           <none>
version--sh8xz-5ns5h                        0/1     Completed   0          4s      10.129.0.77   ip-10-0-164-22.us-east-2.compute.internal    <none>           <none>
version--zkrnl-6k2ks                        0/1     Completed   0          14m     10.128.0.74   ip-10-0-196-183.us-east-2.compute.internal   <none>           <none>

2) Checked that the payload was re-fetched on the scheduled node successfully.

# for node in $(oc get nodes -l 'node-role.kubernetes.io/master' -ojsonpath='{.items[:].metadata.name}');do oc debug node/$node -- /bin/bash -c 'ls /host/etc/cvo/updatepayloads/brRTeZnZSIkZ2J2YMdQozg -la';done
Starting pod/ip-10-0-155-70us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
ls: cannot access '/host/etc/cvo/updatepayloads/brRTeZnZSIkZ2J2YMdQozg': No such file or directory
Removing debug pod ...
Starting pod/ip-10-0-164-22us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
total 64
drwxr-xr-x. 4 root root    48 Apr 11 08:03 .
drwxr-xr-x. 4 root root    66 Apr 11 07:59 ..
drwxr-xr-x. 3 root root  4096 Apr 11 08:03 manifests
drwxrwxrwx. 2 root root 45056 Apr  7 03:10 release-manifests
Removing debug pod ...
Starting pod/ip-10-0-196-183us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
total 4
drwxr-xr-x. 3 root root   23 Apr 11 07:52 .
drwxr-xr-x. 3 root root   36 Apr 11 07:49 ..
drwxr-xr-x. 2 root root 4096 Apr  7 03:55 manifests
Removing debug pod ...

3) Checked that the upgrade is in progress.

# ./oc get clusterversion -ojson|jq .items[].status.conditions[]
...
{
  "lastTransitionTime": "2022-04-11T06:34:01Z",
  "message": "Payload loaded version=\"4.11.0-0.nightly-2022-04-07-053433\" image=\"registry.ci.openshift.org/ocp/release@sha256:8f4ac72e32e701f6609fe00c4f021670cab8466eb38b98c141764ceb6c3d8ab5\"",
  "reason": "PayloadLoaded",
  "status": "True",
  "type": "ReleaseAccepted"
}
...
{
  "lastTransitionTime": "2022-04-11T08:03:37Z",
  "message": "Working towards 4.11.0-0.nightly-2022-04-07-053433: 615 of 786 done (78% complete)",
  "status": "True",
  "type": "Progressing"
}
Well, reproduced yet another time, from 4.11.0-0.nightly-2022-04-06-213816 to 4.11.0-0.nightly-2022-04-07-053433.

Started an upgrade and immediately reverted, a few times, while deleting 'release-manifests'.

Upgrade:
oc adm upgrade --allow-explicit-upgrade --force --allow-upgrade-with-warnings --to-image registry.ci.openshift.org/ocp/release@sha256:8f4ac72e32e701f6609fe00c4f021670cab8466eb38b98c141764ceb6c3d8ab5 #07

After a few seconds:
#oc adm upgrade --allow-explicit-upgrade --force --allow-upgrade-with-warnings --to-image registry.ci.openshift.org/ocp/release@sha256:40bd2cfbbd80cc192acc3d9fe047790cf4592beb577d144961cdf465392a5133 #06

Then, after it fully reverted, deleted 'release-manifests':

#oc get clusterversions.config.openshift.io
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-04-06-213816   True        False         5m49s   Cluster version is 4.11.0-0.nightly-2022-04-06-213816

#for node in $(oc get nodes -l 'node-role.kubernetes.io/master' -ojsonpath='{.items[:].metadata.name}'); do echo $node; oc debug node/$node -- /bin/bash -c 'rm -rf /host/etc/cvo/updatepayloads/*/release-manifests'; done 2>/dev/null
evakhoni-112100-kwkp4-master-0.c.openshift-qe.internal
evakhoni-112100-kwkp4-master-1.c.openshift-qe.internal
evakhoni-112100-kwkp4-master-2.c.openshift-qe.internal

Then again a few times, until the nodes filled with manifests/manifests/:

#for node in $(oc get nodes -l 'node-role.kubernetes.io/master' -ojsonpath='{.items[:].metadata.name}'); do echo $node; oc debug node/$node -- /bin/bash -c 'ls -d /host/etc/cvo/updatepayloads/*/manifests/manifests'; done 2>/dev/null
evakhoni-112100-kwkp4-master-0.c.openshift-qe.internal
ls: cannot access '/host/etc/cvo/updatepayloads/*/manifests/manifests': No such file or directory
evakhoni-112100-kwkp4-master-1.c.openshift-qe.internal
/host/etc/cvo/updatepayloads/brRTeZnZSIkZ2J2YMdQozg/manifests/manifests
evakhoni-112100-kwkp4-master-2.c.openshift-qe.internal
/host/etc/cvo/updatepayloads/brRTeZnZSIkZ2J2YMdQozg/manifests/manifests

Then upgraded one more time to 4.11.0-0.nightly-2022-04-07-053433, and received CrashLoopBackOff:

#oc get pods -w
NAME                                        READY   STATUS              RESTARTS      AGE
cluster-version-operator-6dfd5f57d8-cwq9b   1/1     Running             0             15m
version--9vtkl-jknfv                        0/1     Completed           0             39m
version--b8g6t-kc9xh                        0/1     Completed           0             48m
version--bcqp9-j82mx                        0/1     Completed           0             17m
version--bzklh-rbl7z                        0/1     Completed           0             15m
version--dk8v5-kjx95                        0/1     Completed           0             15m
version--lswr4-8xb8f                        0/1     Completed           0             69m
version--s4gfk-f4qvw                        0/1     Completed           0             69m
version--vrkf8-58wdz                        0/1     Completed           0             73m
version--whc68-ltk58                        0/1     Completed           0             73m
version--jpt5r-9h6wg                        0/1     Pending             0             0s
version--jpt5r-9h6wg                        0/1     ContainerCreating   0             0s
version--jpt5r-9h6wg                        0/1     ContainerCreating   0             2s
version--jpt5r-9h6wg                        0/1     Error               0             2s
version--jpt5r-9h6wg                        0/1     Error               1 (1s ago)    3s
version--jpt5r-9h6wg                        0/1     CrashLoopBackOff    1 (1s ago)    4s
version--jpt5r-9h6wg                        0/1     Error               2 (18s ago)   21s

#oc logs version--jpt5r-9h6wg
mv: inter-device move failed: '/manifests' to '/etc/cvo/updatepayloads/brRTeZnZSIkZ2J2YMdQozg/manifests/manifests'; unable to remove target: Directory not empty

Quickly collected a must-gather; I was able to catch the pod:

#omg get pods
NAME                                        READY   STATUS      RESTARTS   AGE
cluster-version-operator-6dfd5f57d8-cwq9b   1/1     Running     0          17m
version--9vtkl-jknfv                        0/1     Succeeded   0          41m
version--b8g6t-kc9xh                        0/1     Succeeded   0          51m
version--bcqp9-j82mx                        0/1     Succeeded   0          20m
version--bzklh-rbl7z                        0/1     Succeeded   0          17m
version--dk8v5-kjx95                        0/1     Succeeded   0          18m
version--jpt5r-9h6wg                        0/1     Running     1          17s
version--lswr4-8xb8f                        0/1     Succeeded   0          1h12m
version--s4gfk-f4qvw                        0/1     Succeeded   0          1h11m
version--vrkf8-58wdz                        0/1     Succeeded   0          1h15m
version--whc68-ltk58                        0/1     Succeeded   0          1h15m

#omg logs version--jpt5r-9h6wg
/home/evakhoni/93193/must-gather.local.3458720062476842962/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-e47b07aaeabdf633f489780b29c6b92d2447fa33b9627b3f5dfc7301478fb025/namespaces/openshift-cluster-version/pods/version--jpt5r-9h6wg/payload/payload/logs/current.log
2022-04-11T20:42:35.078131997Z mv: inter-device move failed: '/manifests' to '/etc/cvo/updatepayloads/brRTeZnZSIkZ2J2YMdQozg/manifests/manifests'; unable to remove target: Directory not empty
Evgeni found the issue:

2022-04-11T20:42:32.056428784Z W0411 20:42:32.056387       1 updatepayload.go:149] failed to prune update payload directory: unlinkat /etc/cvo/updatepayloads/brRTeZnZSIkZ2J2YMdQozg: read-only file system

and I've filed [1] to avoid that.

[1]: https://github.com/openshift/cluster-version-operator/pull/765
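If you want to check a cluster for that prune failure directly, a sketch (assumes the standard deployment name in the openshift-cluster-version namespace):

oc logs -n openshift-cluster-version deployment/cluster-version-operator | grep 'failed to prune update payload'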
Hmm... verifying on 4.11.0-0.nightly-2022-04-26-030643 to 4.11.0-0.nightly-2022-04-26-085341.

Tried the last method of starting and reverting, while deleting release-manifests. Did one round of:

oc adm upgrade --allow-explicit-upgrade --force --to-image=registry.ci.openshift.org/ocp/release@sha256:f9875a76c9867901d6e441f2eca7130a838255324b701604201152aa2f332e57 #new
oc adm upgrade --allow-explicit-upgrade --allow-upgrade-with-warnings --force --to-image=registry.ci.openshift.org/ocp/release@sha256:eb1de01c387ad7fa9d82ae7249fc3ede1706043ccd8e1d06bcfb67e5a2741b57 #old

Waited for the revert to complete, deleted release-manifests, and triggered the upgrade again to 26-085341 (without allow-upgrade-with-warnings as before), and received:

  - lastTransitionTime: "2022-04-26T13:04:21Z"
    message: 'Retrieving payload failed version="" image="registry.ci.openshift.org/ocp/release@sha256:f9875a76c9867901d6e441f2eca7130a838255324b701604201152aa2f332e57" failure=Unable to download and prepare the update: stat /etc/cvo/updatepayloads/dm4kctz_pd--9HfyZ99hVg/release-manifests: no such file or directory'
    reason: RetrievePayload
    status: "False"
    type: ReleaseAccepted

Cleared, and tried again with allow-upgrade-with-warnings. The version pod is Completed:

version--f89rb-6z4xl   0/1   Completed   0   4m14s   10.130.0.156   evakhoni-1204-5bb7m-master-1.c.openshift-qe.internal   <none>   <none>

Still the same error:

oc adm upgrade --allow-explicit-upgrade --allow-upgrade-with-warnings --force --to-image=registry.ci.openshift.org/ocp/release@sha256:f9875a76c9867901d6e441f2eca7130a838255324b701604201152aa2f332e57 #new
warning: The requested upgrade image is not one of the available updates.You have used --allow-explicit-upgrade for the update to proceed anyway
warning: --force overrides cluster verification of your supplied release image and waives any update precondition failures.
Updating to release image registry.ci.openshift.org/ocp/release@sha256:f9875a76c9867901d6e441f2eca7130a838255324b701604201152aa2f332e57

spec:
{
  "channel": "stable-4.11",
  "clusterID": "08f5e725-e263-4b4c-9fb5-ced2feaf70a0",
  "desiredUpdate": {
    "force": true,
    "image": "registry.ci.openshift.org/ocp/release@sha256:f9875a76c9867901d6e441f2eca7130a838255324b701604201152aa2f332e57",
    "version": ""
  }
}

2022-04-26T09:10:15Z RetrievedUpdates=False VersionNotFound: Unable to retrieve available updates: currently reconciling cluster version 4.11.0-0.nightly-2022-04-26-030643 not found in the "stable-4.11" channel
2022-04-26T09:10:15Z ImplicitlyEnabledCapabilities=False AsExpected: Capabilities match configured spec
2022-04-26T13:04:21Z ReleaseAccepted=False RetrievePayload: Retrieving payload failed version="" image="registry.ci.openshift.org/ocp/release@sha256:f9875a76c9867901d6e441f2eca7130a838255324b701604201152aa2f332e57" failure=Unable to download and prepare the update: stat /etc/cvo/updatepayloads/dm4kctz_pd--9HfyZ99hVg/release-manifests: no such file or directory
2022-04-26T09:35:34Z Available=True : Done applying 4.11.0-0.nightly-2022-04-26-030643
2022-04-26T09:34:04Z Failing=False :
2022-04-26T13:08:14Z Progressing=False : Cluster version is 4.11.0-0.nightly-2022-04-26-030643

In the CVO log:

W0426 13:30:58.464679       1 updatepayload.go:116] An image was retrieved from "registry.ci.openshift.org/ocp/release@sha256:f9875a76c9867901d6e441f2eca7130a838255324b701604201152aa2f332e57" that failed verification: The update cannot be verified: unable to locate a valid signature for one or more sources
I0426 13:31:07.520829       1 event.go:285] Event(v1.ObjectReference{Kind:"ClusterVersion", Namespace:"openshift-cluster-version", Name:"version", UID:"", APIVersion:"config.openshift.io/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'RetrievePayloadFailed' Retrieving payload failed version="" image="registry.ci.openshift.org/ocp/release@sha256:f9875a76c9867901d6e441f2eca7130a838255324b701604201152aa2f332e57" failure=Unable to download and prepare the update: stat /etc/cvo/updatepayloads/dm4kctz_pd--9HfyZ99hVg/release-manifests: no such file or directory

Nothing in the version pods' logs.

So, while not exactly our bug from before, this looks like a regression of the CVO manifest validation mechanism. @wking WDYT?
Poking around in a must-gather from the comment 16 cluster:

$ for X in namespaces/openshift-cluster-version/pods/version--*/*.yaml; do yaml2json < "${X}" | jq -r '(.metadata | (.creationTimestamp + " " + .name)) + " " + .status.containerStatuses[].image + " " + .status.phase'; done | sort
2022-04-26T12:48:48Z version--rxf54-n9bjd registry.ci.openshift.org/ocp/release@sha256:f9875a76c9867901d6e441f2eca7130a838255324b701604201152aa2f332e57 Succeeded
2022-04-26T12:49:27Z version--22zzn-m57z2 registry.ci.openshift.org/ocp/release@sha256:eb1de01c387ad7fa9d82ae7249fc3ede1706043ccd8e1d06bcfb67e5a2741b57 Succeeded
2022-04-26T13:02:50Z version--s4p2x-x6tkc registry.ci.openshift.org/ocp/release@sha256:f9875a76c9867901d6e441f2eca7130a838255324b701604201152aa2f332e57 Succeeded
2022-04-26T13:04:12Z version--2cq6b-49bq4 registry.ci.openshift.org/ocp/release@sha256:f9875a76c9867901d6e441f2eca7130a838255324b701604201152aa2f332e57 Succeeded
2022-04-26T13:30:58Z version--f89rb-6z4xl registry.ci.openshift.org/ocp/release@sha256:f9875a76c9867901d6e441f2eca7130a838255324b701604201152aa2f332e57 Succeeded

so all of the jobs were happy. Commands on that 13:30 pod:

$ yaml2json <namespaces/openshift-cluster-version/pods/version--f89rb-6z4xl/*.yaml | jq -c '.spec | [.initContainers, .containers][][] | .command'
["rm","-fR","/etc/cvo/updatepayloads/*"]
["mkdir","/etc/cvo/updatepayloads/dm4kctz_pd--9HfyZ99hVg-c2dmr"]
["mv","/manifests","/etc/cvo/updatepayloads/dm4kctz_pd--9HfyZ99hVg-c2dmr/manifests"]
["mv","/release-manifests","/etc/cvo/updatepayloads/dm4kctz_pd--9HfyZ99hVg-c2dmr/release-manifests"]
["mv","/etc/cvo/updatepayloads/dm4kctz_pd--9HfyZ99hVg-c2dmr","/etc/cvo/updatepayloads/dm4kctz_pd--9HfyZ99hVg"]

Ah, shell globbing in the 'rm' is probably not going to work now that I've dropped the shell... And then in the CVO logs:

$ grep '13:3.* loadUpdatedPayload' namespaces/openshift-cluster-version/pods/cluster-version-operator-655f6955b4-mcqd5/cluster-version-operator/cluster-version-operator/logs/current.log
2022-04-26T13:31:07.520741308Z I0426 13:31:07.520674       1 sync_worker.go:376] loadUpdatedPayload syncPayload err=Unable to download and prepare the update: stat /etc/cvo/updatepayloads/dm4kctz_pd--9HfyZ99hVg/release-manifests: no such file or directory

So let me file a v3 pull to fix the 'rm' bit...
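To spell out the globbing point (an illustrative sketch of the failure mode, not necessarily the shape of the v3 fix): when the command array is exec'ed directly, no shell expands the '*', so rm is handed a literal path named '*' and, with -f, silently removes nothing.

["rm","-fR","/etc/cvo/updatepayloads/*"]               # no shell: rm sees the literal path '/etc/cvo/updatepayloads/*' and prunes nothing
["/bin/sh","-c","rm -fR /etc/cvo/updatepayloads/*"]    # hypothetical shell-wrapped form in which the glob would expand again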
Pre-merge verified before a nightly was available, both from unpatched 4.11.0-0.nightly-2022-04-26-181148 to patched 4.11.0-0.ci.test-2022-04-28-120349-ci-ln-hqnmgjt-latest, and from patched 4.11.0-0.ci.test-2022-04-28-120349-ci-ln-hqnmgjt-latest to unpatched 4.11.0-0.nightly-2022-04-26-181148, and also from 4.10.12 to patched 4.11.0-0.ci.test-2022-04-28-120349-ci-ln-hqnmgjt-latest, to pick a release from before the first 2 PRs to upgrade from. Using the same method as in https://bugzilla.redhat.com/show_bug.cgi?id=2070805#c16:

1) started an upgrade
2) reverted back
3) invalidated the current payload by deleting release-manifests
4) checked status, pods, and logs
5) repeated

Did 6 cycles, no version pod crash detected:

oc get pods
NAME                                        READY   STATUS      RESTARTS   AGE
cluster-version-operator-5b469f5d6d-s7qjn   1/1     Running     0          118s
version--58p9p-cjcdx                        0/1     Completed   0          59m
version--6q287-x45gm                        0/1     Completed   0          51m
version--9hfs5-th5bj                        0/1     Completed   0          56m
version--cjmk8-b8pjm                        0/1     Completed   0          55m
version--glgvs-9hzmn                        0/1     Completed   0          2m10s
version--hfmm8-j8mrl                        0/1     Completed   0          69m
version--htvkc-5sp4b                        0/1     Completed   0          70m
version--ktvk8-hfpr5                        0/1     Completed   0          39m
version--lpdjz-wkwvl                        0/1     Completed   0          60m
version--lsrdd-r4zls                        0/1     Completed   0          2m59s
version--r7xrx-g8g6s                        0/1     Completed   0          38m
version--z4bwd-5mz2v                        0/1     Completed   0          52m

No messages in the version pods' logs, as expected:

for pod in `oc get pods -n openshift-cluster-version -ojsonpath='{.items[1:].metadata.name}'`; do echo -e "$pod\nlogs:"; oc logs pod/$pod ; done
version--58p9p-cjcdx
logs:
version--6q287-x45gm
logs:
version--9hfs5-th5bj
logs:
version--cjmk8-b8pjm
logs:
version--glgvs-9hzmn
logs:
version--hfmm8-j8mrl
logs:
version--htvkc-5sp4b
logs:
version--ktvk8-hfpr5
logs:
version--lpdjz-wkwvl
logs:
version--lsrdd-r4zls
logs:
version--r7xrx-g8g6s
logs:
version--z4bwd-5mz2v
logs:

No error in the CVO log, as expected. No error in the CVO status; ReleaseAccepted=True PayloadLoaded: Payload loaded version="4.11.0-0.ci.test-2022-04-28-120349-ci-ln-hqnmgjt-latest".

No manifests/manifests dir:

for node in $(oc get nodes -l 'node-role.kubernetes.io/master' -ojsonpath='{.items[:].metadata.name}');do oc debug node/$node -- /bin/bash -c 'ls /host/etc/cvo/updatepayloads/*/manifests/manifests -lAR';done
Starting pod/evakhoni-1906-dhpv9-master-0copenshift-qeinternal-debug ...
To use host binaries, run `chroot /host`
ls: cannot access '/host/etc/cvo/updatepayloads/*/manifests/manifests': No such file or directory
Removing debug pod ...
Starting pod/evakhoni-1906-dhpv9-master-1copenshift-qeinternal-debug ...
To use host binaries, run `chroot /host`
ls: cannot access '/host/etc/cvo/updatepayloads/*/manifests/manifests': No such file or directory
Removing debug pod ...
Starting pod/evakhoni-1906-dhpv9-master-2copenshift-qeinternal-debug ...
To use host binaries, run `chroot /host`
ls: cannot access '/host/etc/cvo/updatepayloads/*/manifests/manifests': No such file or directory

All verified as expected.
Note: it is still sometimes possible to reproduce this while upgrading from an unfixed build to a fixed build, which is expected according to dev, and I was able to recover the cluster so it could upgrade to the fixed build with the following workaround:

1) removed the old stuck manifests from all masters

╰─ for node in $(oc get nodes -l 'node-role.kubernetes.io/master' -ojsonpath='{.items[:].metadata.name}'); do echo $node; oc debug node/$node -- /bin/bash -c 'rm -rf /host/etc/cvo/updatepayloads/*'; done

2) cleared the requested update

╰─ oc adm upgrade --clear

3) applied the upgrade again

╰─ oc adm upgrade --allow-explicit-upgrade --allow-upgrade-with-warnings --force --to-image=...
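A follow-up sanity check after that workaround, a sketch reusing the same loop style as the comments above (only confirms the payload cache and the top-level update status):

for node in $(oc get nodes -l 'node-role.kubernetes.io/master' -ojsonpath='{.items[:].metadata.name}'); do echo $node; oc debug node/$node -- ls /host/etc/cvo/updatepayloads; done   # should show only the new target's hash dir once re-downloaded
oc get clusterversion   # PROGRESSING should go True without "could not download the update"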
Skimming some of the earlier comments here, I see some mentions of --force. That's a big hammer:

$ oc adm upgrade --help | grep force
    The cluster may report that the upgrade should not be performed due to a content verification error or update precondition failures such as operators blocking upgrades. Do not upgrade to images that are not appropriately signed without understanding the risks of upgrading your cluster to untrusted code. If you must override this protection use the --force flag.
      --force=false: Forcefully upgrade the cluster even when upgrade release image validation fails and the cluster is reporting errors.

Sometimes you need that hammer, e.g. when verifying bugs by updating to unsigned CI release builds. But for folks moving between signed releases, it's best to avoid --force if at all possible.

If you're being bit by this issue, please see the notes and recommended recovery steps in [1].

[1]: https://access.redhat.com/solutions/6965075
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069