Bug 2091770 - CVO gets stuck downloading an upgrade, with the version pod complaining about invalid options
Summary: CVO gets stuck downloading an upgrade, with the version pod complaining about invalid options
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.11.0
Assignee: W. Trevor King
QA Contact: Evgeni Vakhonin
URL:
Whiteboard: UpdateRecommendationsBlocked
Duplicates: 2098219
Depends On:
Blocks: 2094078
Reported: 2022-05-31 03:10 UTC by Matt Bargenquast
Modified: 2022-08-10 11:15 UTC
CC List: 6 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-10 11:15:13 UTC
Target Upstream Version:
Embargoed:


Attachments
Image pullspec hashing script (452 bytes, text/plain), attached 2022-06-01 23:41 UTC by W. Trevor King


Links
Github openshift/cluster-version-operator pull 783 (Merged): Bug 2091770: pkg/cvo/updatepayload: Guard against 'rm -fR -whatever' with ./* (last updated 2022-06-06 18:44:33 UTC)
Red Hat Knowledge Base Solution 6965075 (last updated 2022-06-29 18:52:27 UTC)
Red Hat Product Errata RHSA-2022:5069 (last updated 2022-08-10 11:15:34 UTC)

Description Matt Bargenquast 2022-05-31 03:10:18 UTC
Description of problem:

We have had a number of clusters on 4.10.15 attempt an upgrade (setting desiredUpdate.version to 4.10.16).

CVO gets into a state where it reports this error in its Status:

    - lastTransitionTime: "2022-05-26T14:06:47Z"
      message: 'Retrieving payload failed version="4.10.16" image="quay.io/openshift-release-dev/ocp-release@sha256:a546cd80eae8f94ea0779091e978a09ad47ea94f0769b153763881edb2f5056e" failure=Unable to download and prepare the update: deadline exceeded, reason: "DeadlineExceeded", message: "Job was active longer than specified deadline"'
      reason: RetrievePayload
      status: "False"
      type: ReleaseAccepted

There will be a "version" pod in CrashLoopBackOff in openshift-cluster-version at the same time.

The CVO never seems to recover from this state, and no alerts seem to be generated that would allow clusters in this state to be detected.

Deleting the cluster-version-operator pod seems to allow the cluster to download the payload and proceed with the upgrade.
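
For reference, a minimal sketch of that workaround (the same command appears later in comment 6; the replacement pod created by the Deployment controller may land on a different control-plane node):

  $ oc -n openshift-cluster-version delete pod -l k8s-app=cluster-version-operator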

Version-Release number of the following components:

OCP 4.10.15

How reproducible:

We have observed this on several clusters.

Expected results:

CVO should be able to self-recover from this situation.

Comment 2 W. Trevor King 2022-06-01 06:08:51 UTC
Bug 2080058 shipped in 4.10.14 in this space, and bug 2083370 shipped in 4.10.15 also in this space.  Poking around in the must-gather from comment 1 (sorry, external folks):

$ tar xOz must-gather/namespaces/openshift-cluster-version/pods/cluster-version-operator-7768c7f9f5-ddl44/cluster-version-operator/cluster-version-operator/logs/current.log <02527285_must-gather-20220531_022021Z.tar.gz >cvo.log
$ grep 'Job version' cvo.log | head -n2
2022-05-30T23:51:27.016921458Z I0530 23:51:27.016897       1 batch.go:24] Job version-4.10.16-vz4f5 in namespace openshift-cluster-version is not ready, continuing to wait.
2022-05-30T23:51:30.018124208Z I0530 23:51:30.018086       1 batch.go:24] Job version-4.10.16-vz4f5 in namespace openshift-cluster-version is not ready, continuing to wait.
$ grep 'Job version' cvo.log | tail -n2
2022-05-31T02:13:47.471864036Z I0531 02:13:47.471829       1 batch.go:24] Job version-4.10.16-dk76g in namespace openshift-cluster-version is not ready, continuing to wait.
2022-05-31T02:13:50.474119919Z I0531 02:13:50.474083       1 batch.go:24] Job version-4.10.16-dk76g in namespace openshift-cluster-version is not ready, continuing to wait.

That logged line is from [1].  Checking on the job:

  $ tar -xOz must-gather/namespaces/openshift-cluster-version/batch/jobs.yaml <02527285_must-gather-20220531_022021Z.tar.gz | yaml2json | jq -r '.items[] | {spec: (.spec | {activeDeadlineSeconds}), status}'
  {
    "spec": {
      "activeDeadlineSeconds": 120
    },
    "status": {
      "active": 1,
      "startTime": "2022-05-31T02:12:35Z"
    }
  }

And checking on the backing pod:

  $ tar -xOz must-gather/namespaces/openshift-cluster-version/pods/version-4.10.16-dk76g-6qdx2/version-4.10.16-dk76g-6qdx2.yaml <02527285_must-gather-20220531_022021Z.tar.gz | yaml2json | jq '.status.initContainerStatuses[] | select(.restartCount > 0)'
  {
    "containerID": "cri-o://877d6542d0b6bec0319783afb0faaa0dc2c16eea8b94231f4d63de54f0de9423",
    "image": "quay.io/openshift-release-dev/ocp-release@sha256:a546cd80eae8f94ea0779091e978a09ad47ea94f0769b153763881edb2f5056e",
    "imageID": "quay.io/openshift-release-dev/ocp-release@sha256:a546cd80eae8f94ea0779091e978a09ad47ea94f0769b153763881edb2f5056e",
    "lastState": {
      "terminated": {
        "containerID": "cri-o://877d6542d0b6bec0319783afb0faaa0dc2c16eea8b94231f4d63de54f0de9423",
        "exitCode": 1,
        "finishedAt": "2022-05-31T02:13:19Z",
        "reason": "Error",
        "startedAt": "2022-05-31T02:13:19Z"
      }
    },
    "name": "cleanup",
    "ready": false,
    "restartCount": 3,
    "state": {
      "waiting": {
        "message": "back-off 40s restarting failed container=cleanup pod=version-4.10.16-dk76g-6qdx2_openshift-cluster-version(c5773446-72dc-4454-97b7-fdd42cb228ea)",
        "reason": "CrashLoopBackOff"
      }
    }
  }
  $ tar -xOz must-gather/namespaces/openshift-cluster-version/pods/version-4.10.16-dk76g-6qdx2/cleanup/cleanup/logs/current.log <02527285_must-gather-20220531_022021Z.tar.gz
  2022-05-31T02:13:19.710822157Z rm: invalid option -- 'c'
  2022-05-31T02:13:19.710822157Z Try 'rm ./-cgXkuYo_RfOyhs3_AZGxQ' to remove the file '-cgXkuYo_RfOyhs3_AZGxQ'.
  2022-05-31T02:13:19.710822157Z Try 'rm --help' for more information.
  $ tar -xOz must-gather/namespaces/openshift-cluster-version/pods/version-4.10.16-dk76g-6qdx2/version-4.10.16-dk76g-6qdx2.yaml <02527285_must-gather-20220531_022021Z.tar.gz | yaml2json | jq -c '.spec.initContainers[] | select(.name == "cleanup").command'
  ["sh","-c","rm -fR *"]

Ah, there's a local filename starting with '-' coming out of that * expansion.  That would be a regression injected by bug 2080058 in 4.10.14, which should only bite the subset of clusters that had previously moved through a version whose pullspec hashes to a leading hyphen.  It's still unclear to me how deleting the CVO pod would have unstuck this, though...

[1]: https://github.com/openshift/cluster-version-operator/blob/b3da2d3eba82adcd53198d662607f21641817c4a/lib/resourcebuilder/batch.go#L24
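
A minimal local sketch (not from the must-gather) of the failure mode and the ./* guard that cluster-version-operator#783 applies:

  $ mkdir -p /tmp/updatepayloads-demo/-cgXkuYo_RfOyhs3_AZGxQ
  $ cd /tmp/updatepayloads-demo
  $ rm -fR *     # the glob expands to '-cgXkuYo_RfOyhs3_AZGxQ', which rm parses as options
  rm: invalid option -- 'c'
  $ rm -fR ./*   # the ./ prefix keeps the expansion from looking like an option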

Comment 3 Lalatendu Mohanty 2022-06-01 13:56:41 UTC
The following statement (or a link to this section) can be pasted into bugs when adding ImpactStatementRequested:

We're asking the following questions to evaluate whether or not this bug warrants changing update recommendations from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context and the ImpactStatementRequested label has been added to this bug. When responding, please remove ImpactStatementRequested and set the ImpactStatementProposed label. The expectation is that the assignee answers these questions.

Which 4.y.z to 4.y'.z' updates increase vulnerability? Which types of clusters?

    reasoning: This allows us to populate from, to, and matchingRules in conditional update recommendations for "the $SOURCE_RELEASE to $TARGET_RELEASE update is not recommended for clusters like $THIS".
    example: Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet. Check your vulnerability with oc ... or the following PromQL count (...) > 0.
    example: All customers upgrading from 4.y.z to 4.y+1.z fail. Check your vulnerability with oc adm upgrade to show your current cluster version.

What is the impact? Is it serious enough to warrant removing update recommendations?

    reasoning: This allows us to populate name and message in conditional update recommendations for "...because if you update, $THESE_CONDITIONS may cause $THESE_UNFORTUNATE_SYMPTOMS".
    example: Around 2 minute disruption in edge routing for 10% of clusters. Check with oc ....
    example: Up to 90 seconds of API downtime. Check with curl ....
    example: etcd loses quorum and you have to restore from backup. Check with ssh ....

How involved is remediation?

    reasoning: This allows administrators who are already vulnerable, or who chose to waive conditional-update risks, to recover their cluster. And even moderately serious impacts might be acceptable if they are easy to mitigate.
    example: Issue resolves itself after five minutes.
    example: Admin can run a single: oc ....
    example: Admin must SSH to hosts, restore from backups, or other non standard admin activities.

Is this a regression?

    reasoning: Updating between two vulnerable releases may not increase exposure (unless rebooting during the update increases vulnerability, etc.). We only qualify update recommendations if the update increases exposure.
    example: No, it has always been like this we just never noticed.
    example: Yes, from 4.y.z to 4.y+1.z Or 4.y.z to 4.y.z+1.

Comment 4 W. Trevor King 2022-06-01 23:41:50 UTC
Created attachment 1885899 [details]
Image pullspec hashing script

Looking at impacted releases with the attached hashing script:

  $ go build hasher.go

The following releases will hash with a leading hyphen:

$ for Y in $(seq 2 10); do curl -s "https://api.openshift.com/api/upgrades_info/graph?channel=candidate-4.${Y}" | jq -r '.nodes[].payload' | ./hasher; done | sort | uniq | grep ' -> -' | while read IMAGE SEP HASH; do VERSION="$(oc adm release info "${IMAGE}" 2>/dev/null | sed -n s'/Name:[[:space:]]*//p')"; echo "${VERSION} ${IMAGE} ${HASH}"; done | sort -V
4.2.11 quay.io/openshift-release-dev/ocp-release@sha256:49ee20ee3102b15a7cf4c019fd8875134fda41ccda1dc27b6e4483ded2aa8a5c -5C-jd960zwGw8g3SDVeUg
4.2.25 quay.io/openshift-release-dev/ocp-release@sha256:dfbe59ca5dcc017475a0e1c703f51750c1bde63f12c725fbe4b7a599e36eb725 --Ufil8iVGp4v0fh3OnQXg
4.3.31 quay.io/openshift-release-dev/ocp-release@sha256:6395ddd44276c4a1d760c77f9f5d8dabf302df7b84afd7b3147c97bdf268ab0f -yIabYNEY2V8At6XIA8_9g
4.4.25 quay.io/openshift-release-dev/ocp-release@sha256:6f544f0159d20d18ab54619caa82983684497225e2a2fcf0e74ad60ca74b1871 -_QdHCJD-Ev76EVGHDYZyw
4.5.15 quay.io/openshift-release-dev/ocp-release@sha256:1df294ebe5b84f0eeceaa85b2162862c390143f5e84cda5acc22cc4529273c4c -cgXkuYo_RfOyhs3_AZGxQ
4.5.22 quay.io/openshift-release-dev/ocp-release@sha256:38d0bcb5443666b93a0c117f41ce5d5d8b3602b411c574f4e164054c43408a01 -0bX7BjpLoBa1j1hWXegtA
4.6.0-rc.4 quay.io/openshift-release-dev/ocp-release@sha256:2c22e1c56831935a24efb827d2df572855ccd555c980070f77c39729526037d5 -QRDRzNeshuxExJctkKaiw
4.6.46 quay.io/openshift-release-dev/ocp-release@sha256:08180bc0b4765240beb07f9ee037a89442f90ca6cca9a4a682e73fd208ab2330 -e6aNUW32cbdCSYzgfvXVg

Vulnerable clusters are expected to be those that updated into one of those leading-hyphen versions, continued on to 4.10.14 or later, and then tried to update out to any later release, although I'm still working on confirming that expectation and the recovery process.

And folks might also have image pullspecs that hash with a leading hyphen if they have been using 'oc adm upgrade --to-image registry.example.com/...' or similar synonym, or hotfixes, etc.  I only hashed the pullspecs that showed up in candidate-4.y channels for 4.2 and later.
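
A hedged convenience check (not part of the analysis above): list the release pullspecs recorded in this cluster's update history and compare them against the leading-hyphen list.  Note that the history can be pruned on long-lived clusters, so the on-node ls check below remains authoritative.

  $ oc get clusterversion version -o jsonpath='{range .status.history[*]}{.image}{"\n"}{end}' | sort -u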

Comment 5 W. Trevor King 2022-06-02 07:22:24 UTC
Cluster bot 'launch 4.6.45 aws' [1]:

  $ oc get -o jsonpath='{.status.desired.version}{"\n"}' clusterversion version
  4.6.45
  $ oc version --client  # only 4.9 and later clients support 'oc adm upgrade channel ...'
  Client Version: 4.11.0-0.nightly-2022-05-20-213928
  Kustomize Version: v4.5.4

Kick off the first hop to 4.6.46:

  $ oc adm upgrade channel stable-4.7
  $ oc adm upgrade --to 4.6.46
  $ watch oc adm upgrade
  ...wait for update to complete...

Kick off the second hop:

  $ oc adm upgrade --to 4.7.49

Check to confirm that passing through 4.6.46 did pick up the troublesome file:

  $ for NODE in $(oc get -l node-role.kubernetes.io/master= -o jsonpath='{.items[*].metadata.name}' nodes); do oc debug --as-root "node/${NODE}" -- bash -c 'ls /host/etc/cvo/updatepayloads/'; done
  Starting pod/ip-10-0-145-44ec2internal-debug ...
  To use host binaries, run `chroot /host`
  ls: cannot access '/host/etc/cvo/updatepayloads/': No such file or directory

  Removing debug pod ...
  error: non-zero exit code from debug container
  Starting pod/ip-10-0-157-157ec2internal-debug ...
  To use host binaries, run `chroot /host`
  bz9-9v43JSTAOW6SuWQNnQ

  Removing debug pod ...
  Starting pod/ip-10-0-253-68ec2internal-debug ...
  To use host binaries, run `chroot /host`
  -e6aNUW32cbdCSYzgfvXVg

  Removing debug pod ...

Note that the troublesome -e6... is only on one node.  If the vulnerable 4.10.14 or 4.10.15 CVO is launching version pods on a node that doesn't happen to have the hyphen-starting release, we won't trigger the bug.  Which is presumably how the comment 2 cluster made it from 4.10.14 to 4.10.15 and only hit this issue going from 4.10.15 to 4.10.16.  Back to waiting out the 4.7 update:

  $ watch oc adm upgrade
  ...wait for update to complete...
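
(Not part of the original walkthrough: a quick way to see which control-plane node is currently hosting the cluster-version operator pod, to compare against the per-node ls output above.)

  $ oc -n openshift-cluster-version get pods -l k8s-app=cluster-version-operator -o wide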

Now on to 4.8:

  $ oc adm upgrade channel stable-4.8
  $ oc adm upgrade --to 4.8.39
  $ watch oc adm upgrade
  ...wait for update to complete...

On to 4.9:

  $ oc adm upgrade channel eus-4.10
  $ oc -n openshift-config patch configmap admin-acks --patch '{"data":{"ack-4.8-kube-1.22-api-removals-in-4.9":"true"}}' --type=merge
  $ oc adm upgrade --to 4.9.33

I wanted to continue on to 4.10.14 and then on from there to 4.10.15 to see if I could reproduce the issue, but hit the 3h-post-install cluster-bot timeout.  I'll try again later by synthetically injecting hyphenated directories with 'oc debug' so I can skip the chained-update setup.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws-modern/1532205770345549824

Comment 6 W. Trevor King 2022-06-02 07:35:51 UTC
Filling out the impact-statement template, based on my current understanding:

Which 4.y.z to 4.y'.z' updates increase vulnerability?
* Customers updating from 4.10.14 and later 4.10.z until we fix this bug.

Which types of clusters?
* Clusters which have passed through 4.6.46 or other versions listed in comment 4 in the past.  Check with:

    $ for NODE in $(oc get -l node-role.kubernetes.io/master= -o jsonpath='{.items[*].metadata.name}' nodes); do oc debug --as-root "node/${NODE}" -- bash -c 'ls /host/etc/cvo/updatepayloads/'; done

  If the output contains any hyphenated entries, that control-plane node is at risk.
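
  A hedged variant of that check (not in the original impact statement) which prints only the at-risk, hyphen-prefixed entries per node:

    $ for NODE in $(oc get -l node-role.kubernetes.io/master= -o jsonpath='{.items[*].metadata.name}' nodes); do echo "== ${NODE}"; oc debug --as-root "node/${NODE}" -- bash -c 'ls /host/etc/cvo/updatepayloads/' | grep '^-'; done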

What is the impact?
* If the cluster-version operator pod is on an at-risk control-plane node, it will fail with ReleaseAccepted=False with reason=DeadlineExceeded and "Job was active longer than specified deadline".
* Unless the cluster passed through a number of versions from comment 4, it is likely that the other control-plane nodes are not at risk.

Is it serious enough to warrant removing update recommendations?
* We expect that few clusters which have touched the comment 4 versions will update to the vulnerable 4.10.14 and later 4.10.z before we fix this bug.  The exposure is basically "really old clusters that update each week".  And while we have a bunch of those, the bulk of the fleet is newer or updates less frequently.

How involved is remediation?
* If you delete the cluster-version operator with:

    $ oc -n openshift-cluster-version delete pod -l k8s-app=cluster-version-operator

  the replacement pod (created by the Deployment controller) may be scheduled to a different, not-at-risk control-plane node, which will allow that update to proceed.  Subsequent updates may put the cluster-version operator back on an at-risk control-plane node, so this is not a long-term fix.

* Remove the hyphenated content (and all other past release content) from all control-plane nodes with:

    $ for NODE in $(oc get -l node-role.kubernetes.io/master= -o jsonpath='{.items[*].metadata.name}' nodes); do oc debug --as-root "node/${NODE}" -- bash -c 'rm -fR /host/etc/cvo/updatepayloads/*'; done

  This is a permanent fix, essentially giving yourself the cleanup that is in-flight to ship with the product in cluster-version-operator#783.

Is this a regression?
* Yes, we regressed in 4.10.14 via bug 2080058, which did not consider this hyphen-starting base64 case.

Comment 7 W. Trevor King 2022-06-02 16:10:47 UTC
Reproducing with synthetic dashed-directory injection in a 'launch 4.10.14 aws' cluster-bot cluster [1]:

  $ oc get -o jsonpath='{.status.desired.version}{"\n"}' clusterversion version
  4.10.14
  $ for NODE in $(oc get -l node-role.kubernetes.io/master= -o jsonpath='{.items[*].metadata.name}' nodes); do oc debug --as-root "node/${NODE}" -- mkdir -p /host/etc/cvo/updatepayloads/-cccccccc; done

Update to 4.10.15:

  $ oc adm upgrade channel stable-4.10
  $ oc adm upgrade --to 4.10.15
  $ oc -n openshift-cluster-version get pods | grep ^version
  version-4.10.15-s8std-hmch4                 0/1     Init:Error   3 (29s ago)   51s
  $ oc -n openshift-cluster-version logs -c cleanup version-4.10.15-s8std-hmch4
  rm: invalid option -- 'c'
  Try 'rm ./-cccccccc' to remove the file '-cccccccc'.
  Try 'rm --help' for more information.

So successfully reproduced.  Test the mitigation recommendation from comment 6:

  $ for NODE in $(oc get -l node-role.kubernetes.io/master= -o jsonpath='{.items[*].metadata.name}' nodes); do oc debug --as-root "node/${NODE}" -- bash -c 'rm -fR /host/etc/cvo/updatepayloads/*'; done

That seems to have recovered the version pod, since it's gone:

  $ oc -n openshift-cluster-version get pods
  NAME                                        READY   STATUS    RESTARTS   AGE
  cluster-version-operator-67c6dc764b-gdtl8   1/1     Running   0          50m

But the update is still stuck, presumably because bug 2083370 was only fixed in 4.10.15, so 4.10.14 is vulnerable to that.  Recovering by clearing the update and coming in again:

  $ oc adm upgrade --clear
  $ watch oc adm upgrade
  ...wait for the cluster to realize it's still happy on 4.10.14...
  $ oc adm upgrade --to 4.10.15

And shortly thereafter:

  $ oc adm upgrade
  info: An upgrade is in progress. Working towards 4.10.15: 95 of 771 done (12% complete)
  ...

So mitigation confirmed (even if you have to mix in a bug 2083370 mitigation as well if you're leaving 4.10.14).
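
Pulling those steps together, a consolidated mitigation sketch (the --clear/--to cycle is only needed when leaving 4.10.14, where bug 2083370 also applies):

  $ for NODE in $(oc get -l node-role.kubernetes.io/master= -o jsonpath='{.items[*].metadata.name}' nodes); do oc debug --as-root "node/${NODE}" -- bash -c 'rm -fR /host/etc/cvo/updatepayloads/*'; done
  $ oc adm upgrade --clear
  $ watch oc adm upgrade   # wait for the cluster to settle back on its current version
  $ oc adm upgrade --to 4.10.15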

Comment 8 Lalatendu Mohanty 2022-06-02 16:43:35 UTC
As mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=2091770#c7, we are not considering this bug an upgrade blocker.

Comment 11 Evgeni Vakhonin 2022-06-06 17:59:20 UTC
Reproducing on a later build:
Server Version: 4.10.17
Injecting a dashed dir as in https://bugzilla.redhat.com/show_bug.cgi?id=2091770#c7:
$ for NODE in $(oc get -l node-role.kubernetes.io/master= -o jsonpath='{.items[*].metadata.name}' nodes); do oc debug --as-root "node/${NODE}" -- mkdir -p /host/etc/cvo/updatepayloads/-cccccccc; done
upgrading to 4.11.0-0.nightly-2022-05-25-123329
$ oc adm upgrade --allow-explicit-upgrade --to-image registry.ci.openshift.org/ocp/release@sha256:13bfc31eb4a284ce691e848c25d9120dbde3f0852d4be64be4b90953ac914bf1 --force
The version pod crashed:
$ oc get -n openshift-cluster-version pods
NAME                                        READY   STATUS       RESTARTS      AGE
cluster-version-operator-6cddff4f74-b6h55   1/1     Running      0             139m
version--l9vbc-mpcqw                        0/1     Init:Error   2 (27s ago)   30s

$ oc logs -n openshift-cluster-version pod/version--l9vbc-mpcqw
Error from server (BadRequest): container "rename-to-final-location" in pod "version--l9vbc-mpcqw" is waiting to start: PodInitializing
$ oc -n openshift-cluster-version logs version--l9vbc-mpcqw -c cleanup
rm: invalid option -- 'c'
Try 'rm ./-cccccccc' to remove the file '-cccccccc'.
Try 'rm --help' for more information.



Verifying:
Server Version: 4.11.0-0.nightly-2022-06-04-014713
$ for NODE in $(oc get -l node-role.kubernetes.io/master= -o jsonpath='{.items[*].metadata.name}' nodes); do oc debug --as-root "node/${NODE}" -- mkdir -p /host/etc/cvo/updatepayloads/-cccccccc; done
upgrading to 4.11.0-0.nightly-2022-06-04-180008
$ oc adm upgrade --allow-explicit-upgrade --to-image registry.ci.openshift.org/ocp/release@sha256:54f775170ea8323770ba9501a9556ffb570de05856f28abde58a540e94be8903 --force
The pod looks good:
$ oc get -n openshift-cluster-version pods
NAME                                        READY   STATUS      RESTARTS   AGE
cluster-version-operator-5b9d8495c8-q745c   1/1     Running     0          10s
version--69szg-4tpwn                        0/1     Completed   0          24s
The logs look good:
$ oc -n openshift-cluster-version logs version--69szg-4tpwn
Defaulted container "rename-to-final-location" out of: rename-to-final-location, cleanup (init), make-temporary-directory (init), move-operator-manifests-to-temporary-directory (init), move-release-manifests-to-temporary-directory (init)
$ oc -n openshift-cluster-version logs version--69szg-4tpwn -c cleanup
(no output)

The upgrade started and is progressing:
info: An upgrade is in progress. Working towards 4.11.0-0.nightly-2022-06-04-180008: 678 of 802 done (84% complete), waiting up to 40 minutes on machine-config

Comment 12 Daniel Zilberman 2022-06-07 00:32:21 UTC
I have hit this issue (logged for OCP 4.10.15) on my "sandbox" AWS ROSA cluster, trying to upgrade it from 4.10.13 to 4.10.14.
I have tried the cleanup steps from comment 6 (thank you @wking for documenting them in such detail), but unfortunately wasn't able to "roll back" the stuck upgrade to 4.10.14:

1. Cleanup of the /etc/cvo/updatepayloads directories:

for NODE in $(oc get -l node-role.kubernetes.io/master= -o jsonpath='{.items[*].metadata.name}' nodes); do oc debug --as-root "node/${NODE}" -- bash -c 'rm -fR /host/etc/cvo/updatepayloads/*'; done
Creating debug namespace/openshift-debug-node-s8frn ...
Starting pod/ip-10-0-146-203us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`

Removing debug pod ...
Removing debug namespace/openshift-debug-node-s8frn ...
Creating debug namespace/openshift-debug-node-m52v4 ...
Starting pod/ip-10-0-197-147us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`

Removing debug pod ...
Removing debug namespace/openshift-debug-node-m52v4 ...
Creating debug namespace/openshift-debug-node-trww5 ...
Starting pod/ip-10-0-223-67us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`

Removing debug pod ...
Removing debug namespace/openshift-debug-node-trww5 ...

2. Checking the cluster-version-operator pod:
oc -n openshift-cluster-version get pods
NAME                                        READY   STATUS    RESTARTS   AGE
cluster-version-operator-68c9dcdd8d-fp9mv   1/1     Running   0          160m

3. Attempt to roll back the ongoing cluster upgrade:
 
 oc adm upgrade --clear

error: Unable to cancel current rollout: admission webhook "regular-user-validation.managed.openshift.io" denied the request: Prevented from accessing Red Hat managed resources. This is in an effort to prevent harmful actions that may cause unintended consequences or affect the stability of the cluster. If you have any questions about this, please reach out to Red Hat support at https://access.redhat.com/support

My understanding is that this is a ROSA-specific issue, since such a webhook is not activated for self-managed OCP deployments? Thanks!

Comment 13 Daniel Zilberman 2022-06-09 22:03:19 UTC
I'd like to "port" some comments from the Support case https://access.redhat.com/support/cases/#/case/03237175 I opened and just OKd to close (since my cluster is ROSA).

1. I had previously installed the ACS Central & Collector components into this (ROSA) cluster and another (on-prem, self-managed) OCP cluster.

In both cases the issue seems to be related to the SCCs installed in the cluster by the ACS operator (there is an option to install them or not when it is configured).
In particular, the 'stackrox-collector' SCC seems to be getting applied to the 'version-*' pod that downloads the update image, making it privileged with a read-only root filesystem. The version pod YAML had this securityContext:
 
     securityContext:
       privileged: true
       readOnlyRootFilesystem: true

 with the following annotation in metadata:
 
 openshift.io/scc: stackrox-collector
 
 This produces logs from the version-* pod like:
 
 mv: cannot remove '/manifests/0000_00_cluster-version-operator_01_adminack_configmap.yaml': Read-only file system
 mv: cannot remove '/manifests/0000_00_cluster-version-operator_01_admingate_configmap.yaml': Read-only file system
 mv: cannot remove '/manifests/0000_00_cluster-version-operator_01_clusteroperator.crd.yaml': Read-only file system
 mv: cannot remove '/manifests/0000_00_cluster-version-operator_01_clusterversion.crd.yaml': Read-only file system
 mv: cannot remove '....'

2. I confirmed that stackrox-collector SCC is indeed enabled clusterwide:

oc get scc | grep stackrox
stackrox-admission-control        false   []                RunAsAny    RunAsAny           RunAsAny    RunAsAny    0            true             ["configMap","downwardAPI","emptyDir","secret"]
stackrox-collector                true    []                RunAsAny    RunAsAny           RunAsAny    RunAsAny    0            true             ["configMap","downwardAPI","emptyDir","hostPath","secret"]
stackrox-sensor                   false   []                RunAsAny    RunAsAny           RunAsAny    RunAsAny    0            true             ["configMap","downwardAPI","emptyDir","secret"]

and then deleted the 'stackrox-collector' one.

3. I applied the clean-up script for master nodes from https://bugzilla.redhat.com/show_bug.cgi?id=2091770#c6 to clean up lingering payloads from the master node(s).
I deleted the cluster-version-operator pod and observed that the recreated Job pod is finally able to complete the update image pull job:

oc -n openshift-cluster-version get  pod
NAME                                           READY   STATUS      RESTARTS   AGE
cluster-version-operator-7488488946-zgvv5      1/1     Running     0          8m20s
version-4.10.15-nx46f-dnpjw                    0/1     Completed   0          49m   <== was failing due to read-only file system before

Same root cause and same approach worked for my other, on-prem OCP 4.10.5 cluster.

Just wanted to share this, since this "SCC injection" into generated pods via the openshift.io/scc: stackrox-collector annotation made the version pod's filesystem read-only and was failing upgrades.
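
A hedged way to spot this condition (not from the support case): check which SCC was applied to the failing version pod via its openshift.io/scc annotation, substituting the pod name from 'oc -n openshift-cluster-version get pods':

  $ oc -n openshift-cluster-version get pod version-4.10.15-nx46f-dnpjw -o yaml | grep 'openshift.io/scc'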

Comment 14 W. Trevor King 2022-06-09 22:12:26 UTC
> I deleted the cluster-version-operator pod and observed that the recreated Job pod is finally able to complete the update image pull job:

This SCC injection issue is separate from this bug's hyphenated-directory issue, although both of them cause the version-... pod to fail, so they have similar downstream effects.  We're tracking SCC-injection reporting as an RFE in [1].

[1]: https://issues.redhat.com/browse/OTA-680

Comment 16 Jack Ottofaro 2022-06-24 17:24:27 UTC
*** Bug 2098219 has been marked as a duplicate of this bug. ***

Comment 17 W. Trevor King 2022-06-28 19:29:50 UTC
Expanding comment 4's list of releases that hash to a hyphen prefix to other architectures:

$ for ARCH in amd64 ppc46le s390x; do echo "${ARCH}"; for Y in $(seq 2 10); do curl -s "https://api.openshift.com/api/upgrades_info/graph?arch=${ARCH}&channel=candidate-4.${Y}" | jq -r '.nodes[].payload' | ./hasher; done | sort | uniq | grep ' -> -' | while read IMAGE SEP HASH; do VERSION="$(oc adm release info "${IMAGE}" 2>/dev/null | sed -n s'/Name:[[:space:]]*//p')"; echo "${VERSION} ${IMAGE} ${HASH}"; done | sort -V; done
amd64
4.2.11 quay.io/openshift-release-dev/ocp-release@sha256:49ee20ee3102b15a7cf4c019fd8875134fda41ccda1dc27b6e4483ded2aa8a5c -5C-jd960zwGw8g3SDVeUg
4.2.25 quay.io/openshift-release-dev/ocp-release@sha256:dfbe59ca5dcc017475a0e1c703f51750c1bde63f12c725fbe4b7a599e36eb725 --Ufil8iVGp4v0fh3OnQXg
4.3.31 quay.io/openshift-release-dev/ocp-release@sha256:6395ddd44276c4a1d760c77f9f5d8dabf302df7b84afd7b3147c97bdf268ab0f -yIabYNEY2V8At6XIA8_9g
4.4.25 quay.io/openshift-release-dev/ocp-release@sha256:6f544f0159d20d18ab54619caa82983684497225e2a2fcf0e74ad60ca74b1871 -_QdHCJD-Ev76EVGHDYZyw
4.5.15 quay.io/openshift-release-dev/ocp-release@sha256:1df294ebe5b84f0eeceaa85b2162862c390143f5e84cda5acc22cc4529273c4c -cgXkuYo_RfOyhs3_AZGxQ
4.5.22 quay.io/openshift-release-dev/ocp-release@sha256:38d0bcb5443666b93a0c117f41ce5d5d8b3602b411c574f4e164054c43408a01 -0bX7BjpLoBa1j1hWXegtA
4.6.0-rc.4 quay.io/openshift-release-dev/ocp-release@sha256:2c22e1c56831935a24efb827d2df572855ccd555c980070f77c39729526037d5 -QRDRzNeshuxExJctkKaiw
4.6.46 quay.io/openshift-release-dev/ocp-release@sha256:08180bc0b4765240beb07f9ee037a89442f90ca6cca9a4a682e73fd208ab2330 -e6aNUW32cbdCSYzgfvXVg
ppc46le
s390x
4.4.31 quay.io/openshift-release-dev/ocp-release@sha256:82b710ad9b4be8e03476e35e8a020f9aea4f6cf3c4ef1a2fe44185416c7f5f44 -UivIFK1okDl2gNxI0Mg-g
4.8.1 quay.io/openshift-release-dev/ocp-release@sha256:7dc99696cdd7cfe1b2c3cf685cbf6dcdaac9c210f17dd694881501808114145b -pb-bxUk4VFhlfvGwK1cvw
4.9.11 quay.io/openshift-release-dev/ocp-release@sha256:21fc2e5429882e17e444704aa46da3ca65478bf78379e0bb56c676a7a138b529 -Z3v6Xwil50NxkVEk615Jw

So ppc64le has no exposure, and s390x has more recent exposure than amd64.

Comment 18 W. Trevor King 2022-06-29 18:52:28 UTC
We dropped UpgradeBlocker in comment 8, based on the not-too-awkward mitigation from comment 7 and the impact statement from comment 6.  I'm adding it back now, along with UpdateRecommendationsBlocked, because we decided that this was still awkward enough to be worth conditional-risk declarations for 4.10.* -> 4.10.(14 <= z < 20) [1].  We've also published a KCS so folks don't have to poke around in this bug to get the user-facing details [2].

[1]: https://github.com/openshift/cincinnati-graph-data/pull/2118
[2]: https://access.redhat.com/solutions/6965075
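
With a 4.10 or later oc client, declared conditional risks like this one should also surface directly in the client output; a hedged pointer (general oc behavior, not specific to this bug):

  $ oc adm upgrade --include-not-recommended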

Comment 19 errata-xmlrpc 2022-08-10 11:15:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

