Description of problem:
ODF 4.10 deployment fails with all PVCs in the openshift-monitoring namespace stuck in Pending state.

[aditi@nx142 4.10-4]$ oc get pvc -n openshift-monitoring
NAME                                        STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS                  AGE
my-alertmanager-claim-alertmanager-main-0   Pending                                      ocs-storagecluster-ceph-rbd   26m
my-alertmanager-claim-alertmanager-main-1   Pending                                      ocs-storagecluster-ceph-rbd   26m
my-prometheus-claim-prometheus-k8s-0        Pending                                      ocs-storagecluster-ceph-rbd   26m
my-prometheus-claim-prometheus-k8s-1        Pending                                      ocs-storagecluster-ceph-rbd   26m
[aditi@nx142 4.10-4]$

The rook-ceph-csi-detect-version pod is also stuck in Pending state because it cannot pull quay.io/cephcsi/cephcsi:v3.4.0 for ppc64le.

Normal   Pulling  3m33s (x4 over 5m6s)  kubelet  Pulling image "quay.io/cephcsi/cephcsi:v3.4.0"
Warning  Failed   3m30s (x4 over 5m3s)  kubelet  Failed to pull image "quay.io/cephcsi/cephcsi:v3.4.0": rpc error: code = Unknown desc = choosing image instance: no image found in manifest list for architecture ppc64le, variant "", OS linux
Warning  Failed   3m30s (x4 over 5m3s)  kubelet  Error: ErrImagePull

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
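For anyone triaging a similar pull failure: the "no image found in manifest list for architecture ppc64le" error means the image's manifest list has no entry for that platform. Below is a minimal Python sketch of that check, assuming the raw manifest list JSON has already been saved (for example from something like `skopeo inspect --raw`). The sample data, digest, and function names are made up for illustration and are not taken from the real cephcsi image.

```python
import json

# Hypothetical, trimmed manifest list; real ones carry more fields per entry.
SAMPLE_MANIFEST_LIST = json.loads("""
{
  "schemaVersion": 2,
  "manifests": [
    {"digest": "sha256:aaaa", "platform": {"architecture": "amd64", "os": "linux"}}
  ]
}
""")


def available_platforms(manifest_list):
    """Return the set of (os, architecture) pairs listed in a manifest list."""
    platforms = set()
    for entry in manifest_list.get("manifests", []):
        platform = entry.get("platform", {})
        platforms.add((platform.get("os"), platform.get("architecture")))
    return platforms


def supports(manifest_list, architecture, os_name="linux"):
    """True if the manifest list has an entry for the given OS and architecture."""
    return (os_name, architecture) in available_platforms(manifest_list)


if __name__ == "__main__":
    for arch in ("amd64", "ppc64le"):
        print(arch, supports(SAMPLE_MANIFEST_LIST, arch))
```

A False result for ppc64le here corresponds to the ErrImagePull seen on the node.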
I took a look here and this is actually an issue with the ocs-operator repo, which for some reason no longer generates the ROOK_CSI_* overrides in the CSV, so our downstream builds are using the upstream images (which are built for amd64 only). Just to reiterate -- this is not just the cephcsi image missing its override; there is no override for any of the sidecar images in the ocs-operator-bundle CSV in ODF 4.10. Retargeting.
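A quick way to confirm whether a given bundle still carries the overrides is to scan the CSV for ROOK_CSI_* environment variables. The following is a rough Python sketch, assuming the CSV has been exported as JSON and follows the usual ClusterServiceVersion layout (spec.install.spec.deployments). The two sample CSVs and the image references in them are placeholders, not the contents of any real build.

```python
def rook_csi_overrides(csv):
    """Collect ROOK_CSI_* env vars from every container in a CSV's deployments."""
    found = {}
    install_spec = csv.get("spec", {}).get("install", {}).get("spec", {})
    for deployment in install_spec.get("deployments", []):
        pod_spec = deployment.get("spec", {}).get("template", {}).get("spec", {})
        for container in pod_spec.get("containers", []):
            for env in container.get("env", []):
                name = env.get("name", "")
                if name.startswith("ROOK_CSI_"):
                    found[name] = env.get("value")
    return found


def make_csv(env):
    """Build a tiny, hypothetical CSV with one deployment and one container."""
    container = {"name": "operator", "env": env}
    template = {"spec": {"containers": [container]}}
    deployment = {"name": "operator", "spec": {"template": template}}
    return {"spec": {"install": {"spec": {"deployments": [deployment]}}}}


# Placeholder image references, for illustration only.
CSV_WITH_OVERRIDES = make_csv([
    {"name": "ROOK_CSI_CEPH_IMAGE", "value": "registry.example/cephcsi:placeholder"},
    {"name": "ROOK_CSI_SNAPSHOTTER_IMAGE", "value": "registry.example/snapshotter:placeholder"},
    {"name": "UNRELATED_SETTING", "value": "x"},
])
CSV_WITHOUT_OVERRIDES = make_csv([{"name": "UNRELATED_SETTING", "value": "x"}])

if __name__ == "__main__":
    print(sorted(rook_csi_overrides(CSV_WITH_OVERRIDES)))
    print(sorted(rook_csi_overrides(CSV_WITHOUT_OVERRIDES)))
```

An empty result for a downstream bundle would match the symptom described in this comment.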
Looking at the git log, this has been the case at least since 4.10.0-32, which we built on December 7th. Build 4.10.0-29 still had the ROOK_CSI_* overrides in the CSV, so I suspect the regression occurred somewhere around that time. Commit-wise this is:
2667179611d88ec00bf136b74a9f424e9d76be1d -- working
bf0946e7b4d47b799d209c491f0fcfbd71a4295c -- broken
I can only see one commit in that range and it does not seem to be related to the CSV generation. I suspect this started occurring because of changes in rook around that time.
The changes in this commit, https://github.com/red-hat-storage/ocs-operator/commit/bf0946e7b4d47b799d209c491f0fcfbd71a4295c, were required for SCC in OpenShift. I don't have any idea about CSV generation, but I don't think this commit should impact the CSV.
@branto We probably stopped adding env vars like "ROOK_CSI_SNAPSHOTTER_IMAGE" etc. while generating the CSV. I tried generating the CSV with the master and release-4.9 branches and it's working fine for me. Can you check if something changed in the downstream script?
Testing the latest rook image, it looks like the issue is due to a change of the CSV name in Rook, which is causing the environment variable injection to fail. It's a quick fix. Providing devel_ack+.
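The generation script itself is not shown in this thread, so the following is only a simplified Python sketch of the general failure mode: when injection is keyed on an expected name, a rename makes the lookup miss and nothing gets injected unless the miss is reported. The CSV names, the flattened "env" layout, and the function are all hypothetical.

```python
import copy


def inject_overrides(csvs_by_name, expected_name, overrides):
    """Inject env overrides into the CSV named expected_name.

    Returns (updated_csvs, injected); injected is False when the name is
    not found, so the caller can fail loudly instead of shipping defaults.
    """
    updated = copy.deepcopy(csvs_by_name)
    csv = updated.get(expected_name)
    if csv is None:
        return updated, False  # lookup missed: nothing was injected
    env = csv.setdefault("env", [])
    for name, value in sorted(overrides.items()):
        env.append({"name": name, "value": value})
    return updated, True


# Placeholder override value, for illustration only.
OVERRIDES = {"ROOK_CSI_CEPH_IMAGE": "registry.example/cephcsi:placeholder"}

if __name__ == "__main__":
    before_rename = {"rook-csv-old": {"env": []}}
    after_rename = {"rook-csv-new": {"env": []}}
    for csvs in (before_rename, after_rename):
        result, injected = inject_overrides(csvs, "rook-csv-old", OVERRIDES)
        print(injected, result)
```

Checking the returned flag (or raising on a miss) would turn a silent fallback to upstream images into a build-time failure.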
I have fixed the issue in Rook and resynced 4.10 with latest rook v1.8 here https://github.com/red-hat-storage/rook/pull/325 yesterday.
*** Bug 2037718 has been marked as a duplicate of this bug. ***
I am able to deploy ODF 4.10 (4.10.0-73) now. Thanks.
Thanks Aditi for confirming. Verified that downstream csi builds are present.
Verified in version:
ODF 4.10.0-79
OCP 4.10.0-0.nightly-2022-01-10-144202
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.10.0 enhancement, security & bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:1372