Description of problem:

We have had a number of clusters on 4.10.15 attempt an upgrade (setting desiredUpdate.version to 4.10.16). The CVO gets into a state where it reports this error in its status:

  - lastTransitionTime: "2022-05-26T14:06:47Z"
    message: 'Retrieving payload failed version="4.10.16" image="quay.io/openshift-release-dev/ocp-release@sha256:a546cd80eae8f94ea0779091e978a09ad47ea94f0769b153763881edb2f5056e" failure=Unable to download and prepare the update: deadline exceeded, reason: "DeadlineExceeded", message: "Job was active longer than specified deadline"'
    reason: RetrievePayload
    status: "False"
    type: ReleaseAccepted

There will be a "version" pod in CrashLoopBackOff in openshift-cluster-version at the same time.

The CVO never seems to recover from this state, and no alerts seem to be generated that would allow detection of clusters in this state. Deleting the cluster-version-operator pod seems to allow the cluster to download the payload and progress the upgrade.

Version-Release number of the following components:
OCP 4.10.15

How reproducible:
We have observed this on at least 4 clusters.

Expected results:
CVO should be able to self-recover from this situation.
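For anyone triaging, a quick way to spot a cluster in this state is to look at the ReleaseAccepted condition and the version pod directly. This is a convenience sketch I'm adding here (not part of the original report); the jsonpath condition filter is standard oc/kubectl syntax:

$ oc get clusterversion version -o jsonpath='{.status.conditions[?(@.type=="ReleaseAccepted")].message}{"\n"}'
$ oc -n openshift-cluster-version get pods   # look for a version-* pod stuck in Init:Error / CrashLoopBackOff

On an affected cluster the first command should return the "Job was active longer than specified deadline" message quoted above.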
Bug 2080058 shipped in 4.10.14 in this space, and bug 2083370 shipped in 4.10.15 also in this space. Poking around in the must-gather from comment 1 (sorry, external folks):

$ tar xOz must-gather/namespaces/openshift-cluster-version/pods/cluster-version-operator-7768c7f9f5-ddl44/cluster-version-operator/cluster-version-operator/logs/current.log <02527285_must-gather-20220531_022021Z.tar.gz >cvo.log
$ grep 'Job version' cvo.log | head -n2
2022-05-30T23:51:27.016921458Z I0530 23:51:27.016897 1 batch.go:24] Job version-4.10.16-vz4f5 in namespace openshift-cluster-version is not ready, continuing to wait.
2022-05-30T23:51:30.018124208Z I0530 23:51:30.018086 1 batch.go:24] Job version-4.10.16-vz4f5 in namespace openshift-cluster-version is not ready, continuing to wait.
$ grep 'Job version' cvo.log | tail -n2
2022-05-31T02:13:47.471864036Z I0531 02:13:47.471829 1 batch.go:24] Job version-4.10.16-dk76g in namespace openshift-cluster-version is not ready, continuing to wait.
2022-05-31T02:13:50.474119919Z I0531 02:13:50.474083 1 batch.go:24] Job version-4.10.16-dk76g in namespace openshift-cluster-version is not ready, continuing to wait.

That logged line is from [1]. Checking on the Job:

$ tar -xOz must-gather/namespaces/openshift-cluster-version/batch/jobs.yaml <02527285_must-gather-20220531_022021Z.tar.gz | yaml2json | jq -r '.items[] | {spec: (.spec | {activeDeadlineSeconds}), status}'
{
  "spec": {
    "activeDeadlineSeconds": 120
  },
  "status": {
    "active": 1,
    "startTime": "2022-05-31T02:12:35Z"
  }
}

And checking on the backing pod:

$ tar -xOz must-gather/namespaces/openshift-cluster-version/pods/version-4.10.16-dk76g-6qdx2/version-4.10.16-dk76g-6qdx2.yaml <02527285_must-gather-20220531_022021Z.tar.gz | yaml2json | jq '.status.initContainerStatuses[] | select(.restartCount > 0)'
{
  "containerID": "cri-o://877d6542d0b6bec0319783afb0faaa0dc2c16eea8b94231f4d63de54f0de9423",
  "image": "quay.io/openshift-release-dev/ocp-release@sha256:a546cd80eae8f94ea0779091e978a09ad47ea94f0769b153763881edb2f5056e",
  "imageID": "quay.io/openshift-release-dev/ocp-release@sha256:a546cd80eae8f94ea0779091e978a09ad47ea94f0769b153763881edb2f5056e",
  "lastState": {
    "terminated": {
      "containerID": "cri-o://877d6542d0b6bec0319783afb0faaa0dc2c16eea8b94231f4d63de54f0de9423",
      "exitCode": 1,
      "finishedAt": "2022-05-31T02:13:19Z",
      "reason": "Error",
      "startedAt": "2022-05-31T02:13:19Z"
    }
  },
  "name": "cleanup",
  "ready": false,
  "restartCount": 3,
  "state": {
    "waiting": {
      "message": "back-off 40s restarting failed container=cleanup pod=version-4.10.16-dk76g-6qdx2_openshift-cluster-version(c5773446-72dc-4454-97b7-fdd42cb228ea)",
      "reason": "CrashLoopBackOff"
    }
  }
}
$ tar -xOz must-gather/namespaces/openshift-cluster-version/pods/version-4.10.16-dk76g-6qdx2/cleanup/cleanup/logs/current.log <02527285_must-gather-20220531_022021Z.tar.gz
2022-05-31T02:13:19.710822157Z rm: invalid option -- 'c'
2022-05-31T02:13:19.710822157Z Try 'rm ./-cgXkuYo_RfOyhs3_AZGxQ' to remove the file '-cgXkuYo_RfOyhs3_AZGxQ'.
2022-05-31T02:13:19.710822157Z Try 'rm --help' for more information.
$ tar -xOz must-gather/namespaces/openshift-cluster-version/pods/version-4.10.16-dk76g-6qdx2/version-4.10.16-dk76g-6qdx2.yaml <02527285_must-gather-20220531_022021Z.tar.gz | yaml2json | jq -c '.spec.initContainers[] | select(.name == "cleanup").command'
["sh","-c","rm -fR *"]

Ah, there's a local filename starting with - coming out of that * expansion.
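As a side note, the failure mode and the usual shell guards against option-looking filenames can be reproduced locally. This is my own illustration, not the shipped fix (see cluster-version-operator#783 for what actually lands in the product):

$ mkdir /tmp/payloads && cd /tmp/payloads
$ touch ./-cgXkuYo_RfOyhs3_AZGxQ
$ rm -fR *            # the glob expands to '-cgXkuYo...', which rm parses as options
rm: invalid option -- 'c'
$ rm -fR -- *         # '--' ends option parsing, so the expansion is treated as an operand
$ touch ./-cgXkuYo_RfOyhs3_AZGxQ
$ rm -fR ./*          # a './' prefix also keeps the name from looking like an option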
That will be a regression injected by bug 2080058 in 4.10.14, which will only bite the subset of clusters that had previously moved through a version which hashed to a leading hyphen. Or something? It's unclear to me how deleting the CVO pod would have unstuck this...

[1]: https://github.com/openshift/cluster-version-operator/blob/b3da2d3eba82adcd53198d662607f21641817c4a/lib/resourcebuilder/batch.go#L24
The following statement (or a link to this section) can be pasted into bugs when adding ImpactStatementRequested:

We're asking the following questions to evaluate whether or not this bug warrants changing update recommendations from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the ImpactStatementRequested label has been added to this bug. When responding, please remove ImpactStatementRequested and set the ImpactStatementProposed label. The expectation is that the assignee answers these questions.

Which 4.y.z to 4.y'.z' updates increase vulnerability? Which types of clusters?
* reasoning: This allows us to populate from, to, and matchingRules in conditional update recommendations for "the $SOURCE_RELEASE to $TARGET_RELEASE update is not recommended for clusters like $THIS".
* example: Customers upgrading from 4.y.z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet. Check your vulnerability with oc ... or the following PromQL count (...) > 0.
* example: All customers upgrading from 4.y.z to 4.y+1.z fail. Check your vulnerability with oc adm upgrade to show your current cluster version.

What is the impact? Is it serious enough to warrant removing update recommendations?
* reasoning: This allows us to populate name and message in conditional update recommendations for "...because if you update, $THESE_CONDITIONS may cause $THESE_UNFORTUNATE_SYMPTOMS".
* example: Around 2 minute disruption in edge routing for 10% of clusters. Check with oc ....
* example: Up to 90 seconds of API downtime. Check with curl ....
* example: etcd loses quorum and you have to restore from backup. Check with ssh ....

How involved is remediation?
* reasoning: This allows administrators who are already vulnerable, or who chose to waive conditional-update risks, to recover their cluster. And even moderately serious impacts might be acceptable if they are easy to mitigate.
* example: Issue resolves itself after five minutes.
* example: Admin can run a single: oc ....
* example: Admin must SSH to hosts, restore from backups, or perform other non-standard admin activities.

Is this a regression?
* reasoning: Updating between two vulnerable releases may not increase exposure (unless rebooting during the update increases vulnerability, etc.). We only qualify update recommendations if the update increases exposure.
* example: No, it has always been like this; we just never noticed.
* example: Yes, from 4.y.z to 4.y+1.z, or from 4.y.z to 4.y.z+1.
Created attachment 1885899 [details]
Image pullspec hashing script

Looking at impacted releases with the attached hashing script:

$ go build hasher.go

The following releases will hash with a leading hyphen:

$ for Y in $(seq 2 10); do curl -s "https://api.openshift.com/api/upgrades_info/graph?channel=candidate-4.${Y}" | jq -r '.nodes[].payload' | ./hasher; done | sort | uniq | grep ' -> -' | while read IMAGE SEP HASH; do VERSION="$(oc adm release info "${IMAGE}" 2>/dev/null | sed -n 's/Name:[[:space:]]*//p')"; echo "${VERSION} ${IMAGE} ${HASH}"; done | sort -V
4.2.11 quay.io/openshift-release-dev/ocp-release@sha256:49ee20ee3102b15a7cf4c019fd8875134fda41ccda1dc27b6e4483ded2aa8a5c -5C-jd960zwGw8g3SDVeUg
4.2.25 quay.io/openshift-release-dev/ocp-release@sha256:dfbe59ca5dcc017475a0e1c703f51750c1bde63f12c725fbe4b7a599e36eb725 --Ufil8iVGp4v0fh3OnQXg
4.3.31 quay.io/openshift-release-dev/ocp-release@sha256:6395ddd44276c4a1d760c77f9f5d8dabf302df7b84afd7b3147c97bdf268ab0f -yIabYNEY2V8At6XIA8_9g
4.4.25 quay.io/openshift-release-dev/ocp-release@sha256:6f544f0159d20d18ab54619caa82983684497225e2a2fcf0e74ad60ca74b1871 -_QdHCJD-Ev76EVGHDYZyw
4.5.15 quay.io/openshift-release-dev/ocp-release@sha256:1df294ebe5b84f0eeceaa85b2162862c390143f5e84cda5acc22cc4529273c4c -cgXkuYo_RfOyhs3_AZGxQ
4.5.22 quay.io/openshift-release-dev/ocp-release@sha256:38d0bcb5443666b93a0c117f41ce5d5d8b3602b411c574f4e164054c43408a01 -0bX7BjpLoBa1j1hWXegtA
4.6.0-rc.4 quay.io/openshift-release-dev/ocp-release@sha256:2c22e1c56831935a24efb827d2df572855ccd555c980070f77c39729526037d5 -QRDRzNeshuxExJctkKaiw
4.6.46 quay.io/openshift-release-dev/ocp-release@sha256:08180bc0b4765240beb07f9ee037a89442f90ca6cca9a4a682e73fd208ab2330 -e6aNUW32cbdCSYzgfvXVg

Vulnerable clusters are expected to be clusters that updated into one of those leading-hyphen versions, continued on to 4.10.14 or later, and then tried to update out to any later release, although I'm still working on confirming that expectation and the recovery process.

Folks might also have image pullspecs that hash with a leading hyphen if they have been using 'oc adm upgrade --to-image registry.example.com/...' or a similar synonym, or hotfixes, etc. I only hashed the pullspecs that showed up in candidate-4.y channels for 4.2 and later.
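For reference, the directory names above look like an unpadded, URL-safe base64 encoding of a 16-byte digest of the pullspec (22 characters, using '-' and '_'), which suggests MD5. Assuming that is what the attached hasher computes, a rough single-pullspec equivalent in shell (using only openssl and coreutils) would be:

$ IMAGE=quay.io/openshift-release-dev/ocp-release@sha256:08180bc0b4765240beb07f9ee037a89442f90ca6cca9a4a682e73fd208ab2330
$ # MD5 of the pullspec, standard base64, then translate to the URL-safe alphabet and drop padding
$ printf '%s' "${IMAGE}" | openssl md5 -binary | base64 | tr '+/' '-_' | tr -d '='
$ # expected (if the MD5 assumption holds): -e6aNUW32cbdCSYzgfvXVg, matching the 4.6.46 row above

If the assumption holds, any pullspec whose digest starts with the six bits 111110 (base64 index 62, '-') lands in a hyphen-prefixed directory, so roughly 1 in 64 pullspecs would be affected.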
Cluster bot 'launch 4.6.45 aws' [1]:

$ oc get -o jsonpath='{.status.desired.version}{"\n"}' clusterversion version
4.6.45
$ oc version --client  # only 4.9 and later clients support 'oc adm upgrade channel ...'
Client Version: 4.11.0-0.nightly-2022-05-20-213928
Kustomize Version: v4.5.4

Kick off the first hop to 4.6.46:

$ oc adm upgrade channel stable-4.7
$ oc adm upgrade --to 4.6.46
$ watch oc adm upgrade
...wait for update to complete...

Kick off the second hop:

$ oc adm upgrade --to 4.7.49

Check to confirm that passing through 4.6.46 did pick up the troublesome file:

$ for NODE in $(oc get -l node-role.kubernetes.io/master= -o jsonpath='{.items[*].metadata.name}' nodes); do oc debug --as-root "node/${NODE}" -- bash -c 'ls /host/etc/cvo/updatepayloads/'; done
Starting pod/ip-10-0-145-44ec2internal-debug ...
To use host binaries, run `chroot /host`
ls: cannot access '/host/etc/cvo/updatepayloads/': No such file or directory
Removing debug pod ...
error: non-zero exit code from debug container
Starting pod/ip-10-0-157-157ec2internal-debug ...
To use host binaries, run `chroot /host`
bz9-9v43JSTAOW6SuWQNnQ
Removing debug pod ...
Starting pod/ip-10-0-253-68ec2internal-debug ...
To use host binaries, run `chroot /host`
-e6aNUW32cbdCSYzgfvXVg
Removing debug pod ...

Note that the troublesome -e6... directory is only on one node. If the vulnerable 4.10.14 or 4.10.15 CVO is launching version pods on a node that doesn't happen to have the hyphen-starting release, we won't trigger the bug (see the node check after this comment). Which is presumably how the comment 2 cluster made it from 4.10.14 to 4.10.15 and only hit this issue going from 4.10.15 to 4.10.16.

Back to waiting out the 4.7 update:

$ watch oc adm upgrade
...wait for update to complete...

Now on to 4.8:

$ oc adm upgrade channel stable-4.8
$ oc adm upgrade --to 4.8.39
$ watch oc adm upgrade
...wait for update to complete...

On to 4.9:

$ oc adm upgrade channel eus-4.10
$ oc -n openshift-config patch configmap admin-acks --patch '{"data":{"ack-4.8-kube-1.22-api-removals-in-4.9":"true"}}' --type=merge
$ oc adm upgrade --to 4.9.33

I wanted to continue on to 4.10.14 and then on from there to 4.10.15 to see if I could reproduce the issue, but hit the 3h-post-install cluster-bot timeout. I'll try again later by synthetically injecting hyphenated directories with 'oc debug' so I can skip the chained-update setup.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws-modern/1532205770345549824
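As a convenience (my addition, not part of the original reproduction), '-o wide' shows which control-plane node the cluster-version-operator and version pods landed on, which is how you can tell whether the CVO happens to be on the one node carrying the hyphen-prefixed directory:

$ oc -n openshift-cluster-version get pods -o wide   # the NODE column shows where the CVO and version-* pods were scheduled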
Filling out the impact-statement template, based on my current understanding:

Which 4.y.z to 4.y'.z' updates increase vulnerability?
* Customers updating from 4.10.14 and later 4.10.z, until we fix this bug.

Which types of clusters?
* Clusters which have passed through 4.6.46 or other versions listed in comment 4 in the past. Check with:

  $ for NODE in $(oc get -l node-role.kubernetes.io/master= -o jsonpath='{.items[*].metadata.name}' nodes); do oc debug --as-root "node/${NODE}" -- bash -c 'ls /host/etc/cvo/updatepayloads/'; done

  If the output contains any hyphenated entries, that control-plane node is at risk (a filtered variant of this check is sketched after this comment).

What is the impact?
* If the cluster-version operator pod is on an at-risk control-plane node, it will fail with ReleaseAccepted=False with reason=DeadlineExceeded and "Job was active longer than specified deadline".
* Unless the cluster passed through a number of versions from comment 4, it is likely that the other control-plane nodes are not at risk.

Is it serious enough to warrant removing update recommendations?
* We expect that few clusters which have touched the comment 4 versions will update to the vulnerable 4.10.14 and later 4.10.z before we fix this bug. The exposure is basically "really old clusters that update each week". And while we have a bunch of those, the bulk of the fleet is newer or updates less frequently.

How involved is remediation?
* If you delete the cluster-version operator pod with:

  $ oc -n openshift-cluster-version delete pod -l k8s-app=cluster-version-operator

  the replacement pod (created by the Deployment controller) may be scheduled to a different, not-at-risk control-plane node, which will allow that update to proceed. Subsequent updates may have the cluster-version operator back on an at-risk control-plane node, so this is not a long-term fix.
* Remove the hyphenated content (and all other past release content) from all control-plane nodes with:

  $ for NODE in $(oc get -l node-role.kubernetes.io/master= -o jsonpath='{.items[*].metadata.name}' nodes); do oc debug --as-root "node/${NODE}" -- bash -c 'rm -fR /host/etc/cvo/updatepayloads/*'; done

  This is a permanent fix, essentially giving yourself the cleanup that is in-flight to ship with the product in cluster-version-operator#783.

Is this a regression?
* Yes, we regressed in 4.10.14 via bug 2080058, which did not consider this hyphen-starting base64 case.
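Here is the filtered variant mentioned above: a small wrapper around the comment 6 check that only prints control-plane nodes which actually have a hyphen-prefixed payload directory. This is a convenience sketch, not an official tool, and assumes the same oc debug access as the commands above:

$ for NODE in $(oc get -l node-role.kubernetes.io/master= -o jsonpath='{.items[*].metadata.name}' nodes); do
    # list payload directories on the node, suppressing debug-pod chatter and missing-directory errors
    FOUND="$(oc debug --as-root "node/${NODE}" -- bash -c 'ls /host/etc/cvo/updatepayloads/ 2>/dev/null' 2>/dev/null | grep '^-')"
    if [ -n "${FOUND}" ]; then
      echo "${NODE} is at risk: ${FOUND}"
    fi
  done

Nodes that print nothing have no hyphen-prefixed entries and should not trip the cleanup container.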
Reproducing with synthetic dashed-directory injection in a 'launch 4.10.14 aws' cluster-bot cluster [1]:

$ oc get -o jsonpath='{.status.desired.version}{"\n"}' clusterversion version
4.10.14
$ for NODE in $(oc get -l node-role.kubernetes.io/master= -o jsonpath='{.items[*].metadata.name}' nodes); do oc debug --as-root "node/${NODE}" -- mkdir -p /host/etc/cvo/updatepayloads/-cccccccc; done

Update to 4.10.15:

$ oc adm upgrade channel stable-4.10
$ oc adm upgrade --to 4.10.15
$ oc -n openshift-cluster-version get pods | grep ^version
version-4.10.15-s8std-hmch4                 0/1   Init:Error   3 (29s ago)   51s
$ oc -n openshift-cluster-version logs -c cleanup version-4.10.15-s8std-hmch4
rm: invalid option -- 'c'
Try 'rm ./-cccccccc' to remove the file '-cccccccc'.
Try 'rm --help' for more information.

So successfully reproduced. Test the mitigation recommendation from comment 6:

$ for NODE in $(oc get -l node-role.kubernetes.io/master= -o jsonpath='{.items[*].metadata.name}' nodes); do oc debug --as-root "node/${NODE}" -- bash -c 'rm -fR /host/etc/cvo/updatepayloads/*'; done

That seems to have recovered the version pod, since it's gone:

$ oc -n openshift-cluster-version get pods
NAME                                        READY   STATUS    RESTARTS   AGE
cluster-version-operator-67c6dc764b-gdtl8   1/1     Running   0          50m

But the update is still stuck, presumably because bug 2083370 was only fixed in 4.10.15, so 4.10.14 is vulnerable to that. Recovering by clearing the update and coming in again:

$ oc adm upgrade --clear
$ watch oc adm upgrade
...wait for the cluster to realize it's still happy on 4.10.14...
$ oc adm upgrade --to 4.10.15

And shortly thereafter:

$ oc adm upgrade
info: An upgrade is in progress. Working towards 4.10.15: 95 of 771 done (12% complete)
...

So mitigation confirmed (even if you have to mix in a bug 2083370 mitigation as well if you're leaving 4.10.14).
As mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=2091770#c7, we are not considering this bug an upgrade blocker.
Reproducing on a later build:

Server Version: 4.10.17

Injecting a dashed dir as in https://bugzilla.redhat.com/show_bug.cgi?id=2091770#c7:

$ for NODE in $(oc get -l node-role.kubernetes.io/master= -o jsonpath='{.items[*].metadata.name}' nodes); do oc debug --as-root "node/${NODE}" -- mkdir -p /host/etc/cvo/updatepayloads/-cccccccc; done

Upgrading to 4.11.0-0.nightly-2022-05-25-123329:

$ oc adm upgrade --allow-explicit-upgrade --to-image registry.ci.openshift.org/ocp/release@sha256:13bfc31eb4a284ce691e848c25d9120dbde3f0852d4be64be4b90953ac914bf1 --force

The version pod crashed:

$ oc get -n openshift-cluster-version pods
NAME                                        READY   STATUS       RESTARTS      AGE
cluster-version-operator-6cddff4f74-b6h55   1/1     Running      0             139m
version--l9vbc-mpcqw                        0/1     Init:Error   2 (27s ago)   30s
$ oc logs -n openshift-cluster-version pod/version--l9vbc-mpcqw
Error from server (BadRequest): container "rename-to-final-location" in pod "version--l9vbc-mpcqw" is waiting to start: PodInitializing
$ oc -n openshift-cluster-version logs version--l9vbc-mpcqw -c cleanup
rm: invalid option -- 'c'
Try 'rm ./-cccccccc' to remove the file '-cccccccc'.
Try 'rm --help' for more information.

Verifying:

Server Version: 4.11.0-0.nightly-2022-06-04-014713

$ for NODE in $(oc get -l node-role.kubernetes.io/master= -o jsonpath='{.items[*].metadata.name}' nodes); do oc debug --as-root "node/${NODE}" -- mkdir -p /host/etc/cvo/updatepayloads/-cccccccc; done

Upgrading to 4.11.0-0.nightly-2022-06-04-180008:

$ oc adm upgrade --allow-explicit-upgrade --to-image registry.ci.openshift.org/ocp/release@sha256:54f775170ea8323770ba9501a9556ffb570de05856f28abde58a540e94be8903 --force

The pod looks good:

$ oc get -n openshift-cluster-version pods
NAME                                        READY   STATUS      RESTARTS   AGE
cluster-version-operator-5b9d8495c8-q745c   1/1     Running     0          10s
version--69szg-4tpwn                        0/1     Completed   0          24s

The logs look good:

$ oc -n openshift-cluster-version logs version--69szg-4tpwn
Defaulted container "rename-to-final-location" out of: rename-to-final-location, cleanup (init), make-temporary-directory (init), move-operator-manifests-to-temporary-directory (init), move-release-manifests-to-temporary-directory (init)
$ oc -n openshift-cluster-version logs version--69szg-4tpwn -c cleanup
(no output)

The upgrade started and is progressing:

info: An upgrade is in progress. Working towards 4.11.0-0.nightly-2022-06-04-180008: 678 of 802 done (84% complete), waiting up to 40 minutes on machine-config
I have hit this issue (logged for OCP 4.10.15) on my "sandbox" AWS ROSA cluster while trying to upgrade it from 4.10.13 to 4.10.14. I have tried the cleanup steps from comment 6 (thank you @wking for documenting them in such detail), but unfortunately wasn't able to "roll back" the stuck upgrade to 4.10.14:

1. Cleanup of the updatepayloads directories:

$ for NODE in $(oc get -l node-role.kubernetes.io/master= -o jsonpath='{.items[*].metadata.name}' nodes); do oc debug --as-root "node/${NODE}" -- bash -c 'rm -fR /host/etc/cvo/updatepayloads/*'; done
Creating debug namespace/openshift-debug-node-s8frn ...
Starting pod/ip-10-0-146-203us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Removing debug pod ...
Removing debug namespace/openshift-debug-node-s8frn ...
Creating debug namespace/openshift-debug-node-m52v4 ...
Starting pod/ip-10-0-197-147us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Removing debug pod ...
Removing debug namespace/openshift-debug-node-m52v4 ...
Creating debug namespace/openshift-debug-node-trww5 ...
Starting pod/ip-10-0-223-67us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Removing debug pod ...
Removing debug namespace/openshift-debug-node-trww5 ...

2. Checking the cluster-version pods:

$ oc -n openshift-cluster-version get pods
NAME                                        READY   STATUS    RESTARTS   AGE
cluster-version-operator-68c9dcdd8d-fp9mv   1/1     Running   0          160m

3. Attempting to roll back the ongoing cluster upgrade:

$ oc adm upgrade --clear
error: Unable to cancel current rollout: admission webhook "regular-user-validation.managed.openshift.io" denied the request: Prevented from accessing Red Hat managed resources. This is in an effort to prevent harmful actions that may cause unintended consequences or affect the stability of the cluster. If you have any questions about this, please reach out to Red Hat support at https://access.redhat.com/support

My understanding is that this is a ROSA-specific issue, as such a webhook is not activated for self-managed OCP deployments?

Thanks!
I'd like to "port" some comments from the Support case https://access.redhat.com/support/cases/#/case/03237175 I opened and just OKd to close (since my cluster is ROSA). 1. I previously had installed ACS Central & Collector components into this (ROSA) and another (on-prem self managed) OCP clusters. In both cases the issue seems to be related to SCCs installed in the cluster by ACS operator (there is an option whether install it or not when its configured). In particular 'stackrox-collector,' SCC seems to be adding an SCC policy to the 'version-*' pod that is supposed to be privileged and downloads the update image. The version yaml had this SCC: securityContext: privileged: true readOnlyRootFilesystem: true with the following annotation in metadata: openshift.io/scc: stackrox-collector This produces logs from the version-* pod like: mv: cannot remove '/manifests/0000_00_cluster-version-operator_01_adminack_configmap.yaml': Read-only file system mv: cannot remove '/manifests/0000_00_cluster-version-operator_01_admingate_configmap.yaml': Read-only file system mv: cannot remove '/manifests/0000_00_cluster-version-operator_01_clusteroperator.crd.yaml': Read-only file system mv: cannot remove '/manifests/0000_00_cluster-version-operator_01_clusterversion.crd.yaml': Read-only file system mv: cannot remove '....' 2. I confirmed that stackrox-collector SCC is indeed enabled clusterwide: oc get scc | grep stackrox stackrox-admission-control false [] RunAsAny RunAsAny RunAsAny RunAsAny 0 true ["configMap","downwardAPI","emptyDir","secret"] stackrox-collector true [] RunAsAny RunAsAny RunAsAny RunAsAny 0 true ["configMap","downwardAPI","emptyDir","hostPath","secret"] stackrox-sensor false [] RunAsAny RunAsAny RunAsAny RunAsAny 0 true ["configMap","downwardAPI","emptyDir","secret"] and then deleted 'stackrox-collector'one 3. I have applied "clean-up script" for master nodes from: https://bugzilla.redhat.com/show_bug.cgi?id=2091770#c6 to clean up lingering uploads from master node(s) I deleted the cluster-version-operator pod and observed that the recreated Job pod is finally able to complete the update image pull job: oc -n openshift-cluster-version get pod NAME READY STATUS RESTARTS AGE cluster-version-operator-7488488946-zgvv5 1/1 Running 0 8m20s version-4.10.15-nx46f-dnpjw 0/1 Completed 0 49m. <== was failing due to read-only file system before Same root cause and same approach worked for my other, on-prem OCP 4.10.5 cluster. Just wanted to share this as this "SCC injection" into generated pods via annotations (openshift.io/scc: stackrox-collector) caused its filesystem to be read-only and was failing upgrades.
> I deleted the cluster-version-operator pod and observed that the recreated Job pod was finally able to complete the update image pull job:

This SCC-injection issue is separate from this bug's hyphenated-directory issue, although both of them cause the version-... pod to fail, so they have similar downstream effects. We're tracking SCC-injection reporting as an RFE in [1].

[1]: https://issues.redhat.com/browse/OTA-680
*** Bug 2098219 has been marked as a duplicate of this bug. ***
Expanding comment 4's list of releases that hash to a hyphen prefix to other architectures:

$ for ARCH in amd64 ppc64le s390x; do echo "${ARCH}"; for Y in $(seq 2 10); do curl -s "https://api.openshift.com/api/upgrades_info/graph?arch=${ARCH}&channel=candidate-4.${Y}" | jq -r '.nodes[].payload' | ./hasher; done | sort | uniq | grep ' -> -' | while read IMAGE SEP HASH; do VERSION="$(oc adm release info "${IMAGE}" 2>/dev/null | sed -n 's/Name:[[:space:]]*//p')"; echo "${VERSION} ${IMAGE} ${HASH}"; done | sort -V; done
amd64
4.2.11 quay.io/openshift-release-dev/ocp-release@sha256:49ee20ee3102b15a7cf4c019fd8875134fda41ccda1dc27b6e4483ded2aa8a5c -5C-jd960zwGw8g3SDVeUg
4.2.25 quay.io/openshift-release-dev/ocp-release@sha256:dfbe59ca5dcc017475a0e1c703f51750c1bde63f12c725fbe4b7a599e36eb725 --Ufil8iVGp4v0fh3OnQXg
4.3.31 quay.io/openshift-release-dev/ocp-release@sha256:6395ddd44276c4a1d760c77f9f5d8dabf302df7b84afd7b3147c97bdf268ab0f -yIabYNEY2V8At6XIA8_9g
4.4.25 quay.io/openshift-release-dev/ocp-release@sha256:6f544f0159d20d18ab54619caa82983684497225e2a2fcf0e74ad60ca74b1871 -_QdHCJD-Ev76EVGHDYZyw
4.5.15 quay.io/openshift-release-dev/ocp-release@sha256:1df294ebe5b84f0eeceaa85b2162862c390143f5e84cda5acc22cc4529273c4c -cgXkuYo_RfOyhs3_AZGxQ
4.5.22 quay.io/openshift-release-dev/ocp-release@sha256:38d0bcb5443666b93a0c117f41ce5d5d8b3602b411c574f4e164054c43408a01 -0bX7BjpLoBa1j1hWXegtA
4.6.0-rc.4 quay.io/openshift-release-dev/ocp-release@sha256:2c22e1c56831935a24efb827d2df572855ccd555c980070f77c39729526037d5 -QRDRzNeshuxExJctkKaiw
4.6.46 quay.io/openshift-release-dev/ocp-release@sha256:08180bc0b4765240beb07f9ee037a89442f90ca6cca9a4a682e73fd208ab2330 -e6aNUW32cbdCSYzgfvXVg
ppc64le
s390x
4.4.31 quay.io/openshift-release-dev/ocp-release@sha256:82b710ad9b4be8e03476e35e8a020f9aea4f6cf3c4ef1a2fe44185416c7f5f44 -UivIFK1okDl2gNxI0Mg-g
4.8.1 quay.io/openshift-release-dev/ocp-release@sha256:7dc99696cdd7cfe1b2c3cf685cbf6dcdaac9c210f17dd694881501808114145b -pb-bxUk4VFhlfvGwK1cvw
4.9.11 quay.io/openshift-release-dev/ocp-release@sha256:21fc2e5429882e17e444704aa46da3ca65478bf78379e0bb56c676a7a138b529 -Z3v6Xwil50NxkVEk615Jw

So ppc64le has no exposure, and s390x has more recent exposure than amd64.
We dropped UpgradeBlocker in comment 8, based on the not-too-awkward mitigation from comment 7 and the impact statement from comment 6. I'm adding it back now, along with UpdateRecommendationsBlocked, because we decided that this was still awkward enough to be worth conditional-risk declarations for 4.10.* -> 4.10.(14 <= z < 20) [1]. We've also published a KCS so folks don't have to poke around in this bug to get the user-facing details [2].

[1]: https://github.com/openshift/cincinnati-graph-data/pull/2118
[2]: https://access.redhat.com/solutions/6965075
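With the conditional-risk declarations in place, clusters evaluating those edges should see the affected targets listed as updates with known issues rather than as recommended updates. As a hedged convenience note (assuming a 4.10-or-later oc client, which supports this flag), administrators can review those conditional updates and the associated risk text with:

$ oc adm upgrade --include-not-recommended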
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069