I believe this relates to the GChat discussion here: https://chat.google.com/room/AAAAREGEba8/UhB2zicUymw The described errors seem the same to me. I believe Rakshith is working on a fix.
Rakshith, if this is the BZ you are working on a fix for, would you set yourself as assignee? If you think it's a separate issue, feel free to comment and I will keep looking.
Blaine, this is a separate issue. Rakshith is working on the deployment failure caused by CSV issues in vrc. This BZ is about OCP restricting pods from starting when they run in privileged mode. Three pods are affected:
1. rook operator
2. noobaa
3. ocs-metrics-exporter
So I have created one BZ for each operator. This may help in understanding the problem: https://kubernetes.io/blog/2021/12/09/pod-security-admission-beta/#privileged-level-and-workload. There is a workaround, as mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=2124379#c4, but we need a permanent fix.
From what I can read on the web, we either have to set the allowPrivilegeEscalation value to false or label the namespace so that pod security admission has the information it needs:
https://connect.redhat.com/en/blog/important-openshift-changes-pod-security-standards
https://kubernetes.io/blog/2021/12/09/pod-security-admission-beta/#privileged-level-and-workload
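For the first option, here is a minimal Go sketch of what setting that field looks like on a container spec (the container name and image are hypothetical placeholders, not the actual operator manifests):

```
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	allowEscalation := false
	c := corev1.Container{
		Name:  "metrics-exporter",           // hypothetical container name
		Image: "example.io/exporter:latest", // hypothetical image
		SecurityContext: &corev1.SecurityContext{
			// Explicitly disallow privilege escalation so the pod can pass
			// the stricter pod security admission checks.
			AllowPrivilegeEscalation: &allowEscalation,
		},
	}
	fmt.Printf("allowPrivilegeEscalation=%v\n", *c.SecurityContext.AllowPrivilegeEscalation)
}
```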
Rook requires running privileged pods, so the workaround mentioned in [1] isn't just a workaround, it's a requirement. Of course we should avoid running pods privileged wherever possible, but that is not an option for Rook pods that need access to the underlying storage. Moving this to the ocs-operator component for setting the namespace labels.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=2124379#c7
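Since ocs-operator would own this, a rough sketch of the direction (assumed controller-runtime wiring; the function and its name are illustrative, not the shipped fix), using the same label values as the workaround:

```
// Illustrative only -- not the actual ocs-operator code.
package reconcile

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// ensurePodSecurityLabels applies the pod security admission labels so that
// privileged Rook pods are admitted in the given namespace.
func ensurePodSecurityLabels(ctx context.Context, c client.Client, nsName string) error {
	ns := &corev1.Namespace{}
	if err := c.Get(ctx, types.NamespacedName{Name: nsName}, ns); err != nil {
		return err
	}
	if ns.Labels == nil {
		ns.Labels = map[string]string{}
	}
	// Keep the SCC label syncer from overwriting the explicit labels below.
	ns.Labels["security.openshift.io/scc.podSecurityLabelSync"] = "false"
	ns.Labels["pod-security.kubernetes.io/enforce"] = "privileged"
	ns.Labels["pod-security.kubernetes.io/warn"] = "baseline"
	ns.Labels["pod-security.kubernetes.io/audit"] = "baseline"
	return c.Update(ctx, ns)
}
```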
BTW, we also hit this in an upgrade scenario with ODF 4.11 installed when upgrading OCP to 4.12, which means we need a backport to 4.11 as well.
Hi,

I tested this on ODF 4.12 + LSO 4.12 and the ocs-metrics-exporter pod is stuck in CrashLoopBackOff state. I added the PodSecurity admission labels on "openshift-storage" and "openshift-local-storage".

SetUp:
ODF Version: 4.12.0-44
OCP Version: 4.12.0-0.nightly-2022-09-08-114806
Provider: VMware

Test Process:

1. Install the LSO 4.12 operator:
```
$ oc get csv -n openshift-local-storage
NAME                                         DISPLAY         VERSION               REPLACES                                     PHASE
local-storage-operator.4.12.0-202209010624   Local Storage   4.12.0-202209010624   local-storage-operator.4.11.0-202208291725   Succeeded
```

2. Label PodSecurity admission on "openshift-storage":
```
oc label namespace openshift-storage security.openshift.io/scc.podSecurityLabelSync=false pod-security.kubernetes.io/enforce=privileged pod-security.kubernetes.io/warn=baseline pod-security.kubernetes.io/audit=baseline --overwrite
```

3. Label PodSecurity admission on "openshift-local-storage":
```
oc label namespace openshift-local-storage security.openshift.io/scc.podSecurityLabelSync=false pod-security.kubernetes.io/enforce=privileged pod-security.kubernetes.io/warn=baseline pod-security.kubernetes.io/audit=baseline --overwrite
```

4. Create ODF via the UI:
```
$ oc get csv -A
NAMESPACE                              NAME                                         DISPLAY                       VERSION               REPLACES                                     PHASE
openshift-local-storage                local-storage-operator.4.12.0-202209010624   Local Storage                 4.12.0-202209010624   local-storage-operator.4.11.0-202208291725   Succeeded
openshift-operator-lifecycle-manager   packageserver                                Package Server                0.19.0                                                             Succeeded
openshift-storage                      mcg-operator.v4.12.0                         NooBaa Operator               4.12.0                                                             Succeeded
openshift-storage                      ocs-operator.v4.12.0                         OpenShift Container Storage   4.12.0                                                             Succeeded
openshift-storage                      odf-csi-addons-operator.v4.12.0              CSI Addons                    4.12.0                                                             Succeeded
openshift-storage                      odf-operator.v4.12.0                         OpenShift Data Foundation     4.12.0                                                             Succeeded

$ oc describe csv odf-operator.v4.12.0 -n openshift-storage | grep full
Labels:       full_version=4.12.0-44
```

5. Add disks to worker nodes [VMware].

6. Install the Storage System via the UI; the storagecluster gets stuck in Progressing state.

7. The storagecluster is stuck in Progressing for more than 20 minutes:
```
$ oc get storageclusters.ocs.openshift.io -n openshift-storage
NAME                 AGE   PHASE         EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   22m   Progressing              2022-09-11T12:30:54Z   4.12.0

Status:
  Conditions:
    Last Heartbeat Time:   2022-09-11T12:54:26Z
    Last Transition Time:  2022-09-11T12:30:55Z
    Message:               Error while reconciling: some StorageClasses were skipped while waiting for pre-requisites to be met: [ocs-storagecluster-cephfs,ocs-storagecluster-ceph-rbd]
    Reason:                ReconcileFailed
    Status:                False
    Type:                  ReconcileComplete

$ oc get storageclusters.ocs.openshift.io -n openshift-storage
NAME                 AGE   PHASE   EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   39m   Ready              2022-09-11T12:30:54Z   4.12.0
```

8. Check the status of the ocs-metrics-exporter pod:
```
$ oc get pods -n openshift-storage | grep ocs-metrics-exporter
ocs-metrics-exporter-8874fffd-2f6ft   0/1   CrashLoopBackOff   5 (2m19s ago)   146m

$ oc get pods ocs-metrics-exporter-8874fffd-2f6ft -n openshift-storage
NAME                                  READY   STATUS             RESTARTS        AGE
ocs-metrics-exporter-8874fffd-2f6ft   0/1     CrashLoopBackOff   5 (2m36s ago)   147m

$ oc logs ocs-metrics-exporter-8874fffd-2f6ft -n openshift-storage
I0911 13:01:17.936183       1 main.go:29] using options: &{Apiserver: KubeconfigPath: Host:0.0.0.0 Port:8080 ExporterHost:0.0.0.0 ExporterPort:8081 Help:false AllowedNamespaces:[openshift-storage] flags:0xc000220a00 StopCh:<nil> Kubeconfig:<nil>}
W0911 13:01:17.936366       1 client_config.go:617] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
I0911 13:01:17.941131       1 main.go:70] Running metrics server on 0.0.0.0:8080
I0911 13:01:17.941154       1 main.go:71] Running telemetry server on 0.0.0.0:8081
I0911 13:01:17.953225       1 rbd-mirror.go:213] skipping rbd mirror status update for pool openshift-storage/ocs-storagecluster-cephblockpool because mirroring is disabled
I0911 13:01:17.955836       1 pv.go:55] Skipping non Ceph CSI RBD volume local-pv-a4aedfd0
I0911 13:01:17.955860       1 pv.go:55] Skipping non Ceph CSI RBD volume local-pv-6a3c119b
I0911 13:01:17.955865       1 pv.go:55] Skipping non Ceph CSI RBD volume local-pv-163aafec
E0911 13:01:46.997911       1 ceph-block-pool.go:137] Invalid image health for pool ocs-storagecluster-cephblockpool. Must be OK, UNKNOWN, WARNING or ERROR
panic: interface conversion: interface {} is *v1.CephCluster, not *v1.CephObjectStore

goroutine 195 [running]:
github.com/rook/rook/pkg/client/listers/ceph.rook.io/v1.cephObjectStoreNamespaceLister.List.func1({0x19d9280, 0xc0001cd900})
	/remote-source/app/vendor/github.com/rook/rook/pkg/client/listers/ceph.rook.io/v1/cephobjectstore.go:84 +0xc5
k8s.io/client-go/tools/cache.ListAllByNamespace({0x1d33e90, 0xc00000ce88}, {0x7ffd580e3e7a, 0x11}, {0x1d19730, 0xc00052b680}, 0xc0004dcd60)
	/remote-source/app/vendor/k8s.io/client-go/tools/cache/listers.go:96 +0x39c
github.com/rook/rook/pkg/client/listers/ceph.rook.io/v1.cephObjectStoreNamespaceLister.List({{0x1d33e90, 0xc00000ce88}, {0x7ffd580e3e7a, 0x18}}, {0x1d19730, 0xc00052b680})
	/remote-source/app/vendor/github.com/rook/rook/pkg/client/listers/ceph.rook.io/v1/cephobjectstore.go:83 +0x6f
github.com/red-hat-storage/ocs-operator/metrics/internal/collectors.getAllObjectStores({0x1ce2bb8, 0xc00051b1a0}, {0xc0000c39a0, 0x1, 0xc00078d718})
	/remote-source/app/metrics/internal/collectors/ceph-object-store.go:87 +0x1c2
github.com/red-hat-storage/ocs-operator/metrics/internal/collectors.(*ClusterAdvanceFeatureCollector).Collect(0xc00022bec0, 0xc00078d760)
	/remote-source/app/metrics/internal/collectors/cluster-advance-feature-use.go:87 +0x11e
github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1()
	/remote-source/app/vendor/github.com/prometheus/client_golang/prometheus/registry.go:446 +0x102
created by github.com/prometheus/client_golang/prometheus.(*Registry).Gather
	/remote-source/app/vendor/github.com/prometheus/client_golang/prometheus/registry.go:538 +0xb4d

$ oc describe pods -n openshift-storage ocs-metrics-exporter-8874fffd-2f6ft
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  kube-api-access-kccjp:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                             node.ocs.openshift.io/storage=true:NoSchedule
Events:
  Type     Reason          Age                     From               Message
  ----     ------          ----                    ----               -------
  Normal   Scheduled       148m                    default-scheduler  Successfully assigned openshift-storage/ocs-metrics-exporter-8874fffd-2f6ft to compute-1 by control-plane-1
  Normal   AddedInterface  148m                    multus             Add eth0 [10.131.0.33/23] from ovn-kubernetes
  Normal   Pulling         148m                    kubelet            Pulling image "quay.io/rhceph-dev/odf4-ocs-metrics-exporter-rhel8@sha256:72c02ff0dbf796fe821ab0358c294af19daa2023347b7f50d9a856d32a2e84b1"
  Normal   Pulled          148m                    kubelet            Successfully pulled image "quay.io/rhceph-dev/odf4-ocs-metrics-exporter-rhel8@sha256:72c02ff0dbf796fe821ab0358c294af19daa2023347b7f50d9a856d32a2e84b1" in 23.9199064s
  Normal   Created         6m35s (x5 over 148m)    kubelet            Created container ocs-metrics-exporter
  Normal   Started         6m35s (x5 over 148m)    kubelet            Started container ocs-metrics-exporter
  Warning  BackOff         4m57s (x13 over 9m16s)  kubelet            Back-off restarting failed container
  Normal   Pulled          4m46s (x5 over 10m)     kubelet            Container image "quay.io/rhceph-dev/odf4-ocs-metrics-exporter-rhel8@sha256:72c02ff0dbf796fe821ab0358c294af19daa2023347b7f50d9a856d32a2e84b1" already present on machine
```
Thanks Oded for the detailed comment. In `ClusterAdvanceFeatureCollector` we are using a single cache.Indexer for all types of objects (CephCluster, CephObjectStore, StorageClass, etc.), and when calling the `List` function on the CephObjectStore lister (or the CephObjectStore namespace lister) we get back already-cached `CephCluster` objects. Taking this BZ.
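For illustration, a minimal stand-alone sketch of why a shared indexer produces exactly this interface-conversion panic (the types here are stand-ins, not the actual rook API types or collector code):

```
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/cache"
)

// Stand-in types; the real ones live in rook's ceph.rook.io/v1 API.
type CephCluster struct{ metav1.ObjectMeta }
type CephObjectStore struct{ metav1.ObjectMeta }

func main() {
	// One indexer shared by every kind, as in the collector.
	indexer := cache.NewIndexer(cache.MetaNamespaceKeyFunc, cache.Indexers{})
	_ = indexer.Add(&CephCluster{ObjectMeta: metav1.ObjectMeta{Name: "cluster", Namespace: "openshift-storage"}})
	_ = indexer.Add(&CephObjectStore{ObjectMeta: metav1.ObjectMeta{Name: "store", Namespace: "openshift-storage"}})

	// A typed object-store lister walks the shared cache and type-asserts
	// every item, so the cached *CephCluster entry triggers the same
	// "interface conversion" panic seen in the exporter log above.
	for _, obj := range indexer.List() {
		s := obj.(*CephObjectStore)
		fmt.Println(s.Name)
	}
	// The fix direction would be a dedicated indexer (or informer) per kind.
}
```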
Oded, we should raise a separate BZ for the issue Arun is working on. Let this BZ remain for the original issue.
@tnielsen, the link [1] in the above comment actually creates the namespace only in the ocs-operator e2e tests; it doesn't come into play during operator installation. During operator installation, the user creates the namespace if installing via the CLI, or the UI creates it if installing via the UI. So we don't have control over the namespace itself. Tagging Nitin as well for better clarification on this. @nigoyal
(In reply to Malay Kumar parida from comment #17)
> @tnielsen, the link [1] in the above comment actually creates the namespace
> only in the ocs-operator e2e tests; it doesn't come into play during
> operator installation. During operator installation, the user creates the
> namespace if installing via the CLI, or the UI creates it if installing via
> the UI. So we don't have control over the namespace itself. Tagging Nitin
> as well for better clarification on this. @nigoyal

Got it. If that's only for testing, it won't help the product to update the namespace labels there. Thanks for the explanation.
Starting from OCP 4.12.0-0.nightly-2022-10-05-053337, the payload contains https://issues.redhat.com/browse/OLM-2695, with which OLM itself enables the label syncer even on namespaces with the openshift- prefix. I am on OCP 4.12.0-0.nightly-2022-10-20-104328, and I can successfully install the odf-operator/ocs-operator without the need for any explicit namespace labeling. All the CSVs, deployments, and pods succeed. The same changes are now also included in CI builds. Someone from QE, please confirm, and we can decide on the bug accordingly.
Steps to reproduce are incomplete: after creating the catalog source for the dev builds/images with ODF 4.12, the output of `oc get csv -A` is just:
```
$ oc get csv -A
NAMESPACE                              NAME            DISPLAY          VERSION   REPLACES   PHASE
openshift-operator-lifecycle-manager   packageserver   Package Server   0.19.0               Succeeded
```
while the output in the original bug report seems to show a state where the ODF operator is already installed and the StorageCluster created.
Verifying on vSphere platform with:

OCP 4.12.0-0.nightly-2022-10-25-210451
ODF 4.12.0-82

After manual UI-driven installation of the ODF operator, I see that both the ocs and mcg operators were installed without any problems:
```
$ oc get csv -A
NAMESPACE                              NAME                              DISPLAY                       VERSION   REPLACES   PHASE
openshift-operator-lifecycle-manager   packageserver                     Package Server                0.19.0               Succeeded
openshift-storage                      mcg-operator.v4.12.0              NooBaa Operator               4.12.0               Succeeded
openshift-storage                      ocs-operator.v4.12.0              OpenShift Container Storage   4.12.0               Succeeded
openshift-storage                      odf-csi-addons-operator.v4.12.0   CSI Addons                    4.12.0               Succeeded
openshift-storage                      odf-operator.v4.12.0              OpenShift Data Foundation     4.12.0               Succeeded
```
```
$ oc describe csv/ocs-operator.v4.12.0 -n openshift-storage | tail
Events:
  Type    Reason               Age                    From                        Message
  ----    ------               ----                   ----                        -------
  Normal  RequirementsUnknown  9m24s (x2 over 9m24s)  operator-lifecycle-manager  requirements not yet checked
  Normal  RequirementsNotMet   9m20s (x2 over 9m22s)  operator-lifecycle-manager  one or more requirements couldn't be found
  Normal  AllRequirementsMet   9m                     operator-lifecycle-manager  all requirements found, attempting install
  Normal  InstallSucceeded     8m59s                  operator-lifecycle-manager  waiting for install components to report healthy
  Normal  InstallWaiting       8m59s                  operator-lifecycle-manager  installing: waiting for deployment ocs-operator to become ready: deployment "ocs-operator" not available: Deployment does not have minimum availability.
  Normal  InstallWaiting       8m27s                  operator-lifecycle-manager  installing: waiting for deployment rook-ceph-operator to become ready: deployment "rook-ceph-operator" not available: Deployment does not have minimum availability.
  Normal  InstallSucceeded     8m17s                  operator-lifecycle-manager  install strategy completed with no errors
```