Bug 2124593 - [Rook clone] ODF4.12 Installation, ocs-operator.v4.12.0 and mcg-operator.v4.12.0 failed [NEEDINFO]
Summary: [Rook clone] ODF4.12 Installation, ocs-operator.v4.12.0 and mcg-operator.v4.12.0 failed
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ocs-operator
Version: 4.12
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ODF 4.12.0
Assignee: Malay Kumar parida
QA Contact: Martin Bukatovic
URL:
Whiteboard:
Depends On: 2124379
Blocks: 2124591
 
Reported: 2022-09-06 14:59 UTC by Mudit Agarwal
Modified: 2023-08-09 17:00 UTC
CC: 15 users

Fixed In Version: 4.12.0-80
Doc Type: No Doc Update
Doc Text:
Clone Of: 2124379
Environment:
Last Closed: 2023-02-08 14:06:28 UTC
Embargoed:
muagarwa: needinfo? (oviner)



Comment 3 Blaine Gardner 2022-09-06 21:20:05 UTC
I believe this relates to the GChat discussion here: https://chat.google.com/room/AAAAREGEba8/UhB2zicUymw

The described errors seem the same to me. I believe Rakshith is working on a fix.

Comment 4 Blaine Gardner 2022-09-06 21:23:20 UTC
Rakshith, if this is the BZ you are working on a fix for, would you set yourself as assignee? If you think it's a separate issue, feel free to comment and I will keep looking.

Comment 5 Mudit Agarwal 2022-09-07 03:57:37 UTC
Blaine, this is a separate issue. Rakshith is working on the deployment failure caused by CSV issues in vrc.
This BZ is about OCP preventing a pod from starting if the pod runs in privileged mode.

3 pods are affected:
1. rook operator
2. noobaa
3. ocs-metrics exporter

So, I have created one BZ for each operator.

This may help in understanding the problem: https://kubernetes.io/blog/2021/12/09/pod-security-admission-beta/#privileged-level-and-workload.
There is a workaround, as mentioned here: https://bugzilla.redhat.com/show_bug.cgi?id=2124379#c4, but we need a permanent fix.

Comment 6 Subham Rai 2022-09-07 12:16:29 UTC
From what I can read on the web, it seems we either have to set allowPrivilegeEscalation to false or label the namespace so that pod security admission has the information it needs (a rough sketch of the first option follows the links below).

https://connect.redhat.com/en/blog/important-openshift-changes-pod-security-standards
https://kubernetes.io/blog/2021/12/09/pod-security-admission-beta/#privileged-level-and-workload
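For reference, a minimal Go sketch of what the first option could look like on a container spec (hypothetical values, not the actual operator manifests, just the usual restricted-profile settings):

```
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// Hypothetical values; the real container specs live in the operator
	// deployments and are not reproduced here.
	allowPrivilegeEscalation := false
	runAsNonRoot := true

	sc := &corev1.SecurityContext{
		// Explicitly disallow privilege escalation so the pod can pass a
		// "restricted" pod security admission profile.
		AllowPrivilegeEscalation: &allowPrivilegeEscalation,
		RunAsNonRoot:             &runAsNonRoot,
		Capabilities: &corev1.Capabilities{
			Drop: []corev1.Capability{"ALL"},
		},
		SeccompProfile: &corev1.SeccompProfile{
			Type: corev1.SeccompProfileTypeRuntimeDefault,
		},
	}
	fmt.Printf("container securityContext: %+v\n", sc)
}
```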

Comment 7 Travis Nielsen 2022-09-07 21:20:15 UTC
Rook requires running privileged pods, so the workaround mentioned in [1] isn't just a workaround, it's a requirement. Of course we should avoid running pods as privileged wherever possible, but that is not an option for the Rook pods that need access to the underlying storage.

Moving this to the ocs-operator component for setting the namespace labels (a rough sketch of what that could look like follows the footnote below).

[1] https://bugzilla.redhat.com/show_bug.cgi?id=2124379#c7
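As a rough illustration only (assumed in-cluster config and RBAC, not the actual ocs-operator change), applying the namespace labels programmatically with client-go could look like this, using the same labels as the CLI workaround quoted in comment 9 below:

```
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Assumes the code runs in-cluster with permission to update namespaces.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	ns, err := client.CoreV1().Namespaces().Get(context.TODO(), "openshift-storage", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	if ns.Labels == nil {
		ns.Labels = map[string]string{}
	}
	// Same labels as the `oc label namespace ...` workaround.
	ns.Labels["security.openshift.io/scc.podSecurityLabelSync"] = "false"
	ns.Labels["pod-security.kubernetes.io/enforce"] = "privileged"
	ns.Labels["pod-security.kubernetes.io/warn"] = "baseline"
	ns.Labels["pod-security.kubernetes.io/audit"] = "baseline"

	if _, err := client.CoreV1().Namespaces().Update(context.TODO(), ns, metav1.UpdateOptions{}); err != nil {
		panic(err)
	}
}
```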

Comment 8 Petr Balogh 2022-09-08 10:56:09 UTC
BTW, we also hit this in an upgrade scenario with ODF 4.11 installed when upgrading OCP to 4.12, which means we need a backport to 4.11 as well.

Comment 9 Oded 2022-09-11 13:22:50 UTC
Hi, I tested it on ODF 4.12 + LSO 4.12 and the ocs-metrics-exporter pod got stuck in the CrashLoopBackOff state.
I added the PodSecurity admission labels on "openshift-storage" and "openshift-local-storage".

ODF 4.12 + LSO 4.12 deployment: ocs-metrics-exporter pod stuck in CrashLoopBackOff state

Setup:
ODF Version: 4.12.0-44
OCP Version: 4.12.0-0.nightly-2022-09-08-114806
Provider: VMware
Test Process:
1. Install the LSO 4.12 operator:
$ oc get csv -n  openshift-local-storage
NAME                                         DISPLAY         VERSION               REPLACES                                     PHASE
local-storage-operator.4.12.0-202209010624   Local Storage   4.12.0-202209010624   local-storage-operator.4.11.0-202208291725   Succeeded

2. Apply the PodSecurity admission labels on "openshift-storage":
 oc label namespace openshift-storage security.openshift.io/scc.podSecurityLabelSync=false pod-security.kubernetes.io/enforce=privileged pod-security.kubernetes.io/warn=baseline pod-security.kubernetes.io/audit=baseline --overwrite
 
3. Apply the PodSecurity admission labels on "openshift-local-storage":
 oc label namespace openshift-local-storage security.openshift.io/scc.podSecurityLabelSync=false pod-security.kubernetes.io/enforce=privileged pod-security.kubernetes.io/warn=baseline pod-security.kubernetes.io/audit=baseline --overwrite
 
4. Install ODF via the UI:
$ oc get csv -A
NAMESPACE                              NAME                                         DISPLAY                       VERSION               REPLACES                                     PHASE
openshift-local-storage                local-storage-operator.4.12.0-202209010624   Local Storage                 4.12.0-202209010624   local-storage-operator.4.11.0-202208291725   Succeeded
openshift-operator-lifecycle-manager   packageserver                                Package Server                0.19.0                                                             Succeeded
openshift-storage                      mcg-operator.v4.12.0                         NooBaa Operator               4.12.0                                                             Succeeded
openshift-storage                      ocs-operator.v4.12.0                         OpenShift Container Storage   4.12.0                                                             Succeeded
openshift-storage                      odf-csi-addons-operator.v4.12.0              CSI Addons                    4.12.0                                                             Succeeded
openshift-storage                      odf-operator.v4.12.0                         OpenShift Data Foundation     4.12.0                                                             Succeeded

$  oc describe csv odf-operator.v4.12.0 -n openshift-storage | grep full
Labels:       full_version=4.12.0-44

5. Add disks to worker nodes [VMware]

6. Install the Storage System via the UI
(the StorageCluster got stuck in the Progressing state)

7. The StorageCluster stayed in the Progressing state for more than 20 minutes:
$ oc get storageclusters.ocs.openshift.io -n openshift-storage  
NAME                 AGE   PHASE         EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   22m   Progressing              2022-09-11T12:30:54Z   4.12.0

Status:
  Conditions:
    Last Heartbeat Time:   2022-09-11T12:54:26Z
    Last Transition Time:  2022-09-11T12:30:55Z
    Message:               Error while reconciling: some StorageClasses were skipped while waiting for pre-requisites to be met: [ocs-storagecluster-cephfs,ocs-storagecluster-ceph-rbd]
    Reason:                ReconcileFailed
    Status:                False
    Type:                  ReconcileComplete

$ oc get storageclusters.ocs.openshift.io -n openshift-storage  
NAME                 AGE   PHASE   EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   39m   Ready              2022-09-11T12:30:54Z   4.12.0


8. Check the status of the ocs-metrics-exporter pod:
$ oc get pods -n openshift-storage | grep ocs-metrics-exporter
ocs-metrics-exporter-8874fffd-2f6ft                               0/1     CrashLoopBackOff   5 (2m19s ago)   146m

[oviner@fedora auth]$ oc get pods ocs-metrics-exporter-8874fffd-2f6ft -n openshift-storage 
NAME                                  READY   STATUS             RESTARTS        AGE
ocs-metrics-exporter-8874fffd-2f6ft   0/1     CrashLoopBackOff   5 (2m36s ago)   147m

[oviner@fedora auth]$ oc logs ocs-metrics-exporter-8874fffd-2f6ft -n openshift-storage 
I0911 13:01:17.936183       1 main.go:29] using options: &{Apiserver: KubeconfigPath: Host:0.0.0.0 Port:8080 ExporterHost:0.0.0.0 ExporterPort:8081 Help:false AllowedNamespaces:[openshift-storage] flags:0xc000220a00 StopCh:<nil> Kubeconfig:<nil>}
W0911 13:01:17.936366       1 client_config.go:617] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0911 13:01:17.941131       1 main.go:70] Running metrics server on 0.0.0.0:8080
I0911 13:01:17.941154       1 main.go:71] Running telemetry server on 0.0.0.0:8081
I0911 13:01:17.953225       1 rbd-mirror.go:213] skipping rbd mirror status update for pool openshift-storage/ocs-storagecluster-cephblockpool because mirroring is disabled
I0911 13:01:17.955836       1 pv.go:55] Skipping non Ceph CSI RBD volume local-pv-a4aedfd0
I0911 13:01:17.955860       1 pv.go:55] Skipping non Ceph CSI RBD volume local-pv-6a3c119b
I0911 13:01:17.955865       1 pv.go:55] Skipping non Ceph CSI RBD volume local-pv-163aafec
E0911 13:01:46.997911       1 ceph-block-pool.go:137] Invalid image health for pool ocs-storagecluster-cephblockpool. Must be OK, UNKNOWN, WARNING or ERROR
panic: interface conversion: interface {} is *v1.CephCluster, not *v1.CephObjectStore

goroutine 195 [running]:
github.com/rook/rook/pkg/client/listers/ceph.rook.io/v1.cephObjectStoreNamespaceLister.List.func1({0x19d9280, 0xc0001cd900})
	/remote-source/app/vendor/github.com/rook/rook/pkg/client/listers/ceph.rook.io/v1/cephobjectstore.go:84 +0xc5
k8s.io/client-go/tools/cache.ListAllByNamespace({0x1d33e90, 0xc00000ce88}, {0x7ffd580e3e7a, 0x11}, {0x1d19730, 0xc00052b680}, 0xc0004dcd60)
	/remote-source/app/vendor/k8s.io/client-go/tools/cache/listers.go:96 +0x39c
github.com/rook/rook/pkg/client/listers/ceph.rook.io/v1.cephObjectStoreNamespaceLister.List({{0x1d33e90, 0xc00000ce88}, {0x7ffd580e3e7a, 0x18}}, {0x1d19730, 0xc00052b680})
	/remote-source/app/vendor/github.com/rook/rook/pkg/client/listers/ceph.rook.io/v1/cephobjectstore.go:83 +0x6f
github.com/red-hat-storage/ocs-operator/metrics/internal/collectors.getAllObjectStores({0x1ce2bb8, 0xc00051b1a0}, {0xc0000c39a0, 0x1, 0xc00078d718})
	/remote-source/app/metrics/internal/collectors/ceph-object-store.go:87 +0x1c2
github.com/red-hat-storage/ocs-operator/metrics/internal/collectors.(*ClusterAdvanceFeatureCollector).Collect(0xc00022bec0, 0xc00078d760)
	/remote-source/app/metrics/internal/collectors/cluster-advance-feature-use.go:87 +0x11e
github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1()
	/remote-source/app/vendor/github.com/prometheus/client_golang/prometheus/registry.go:446 +0x102
created by github.com/prometheus/client_golang/prometheus.(*Registry).Gather
	/remote-source/app/vendor/github.com/prometheus/client_golang/prometheus/registry.go:538 +0xb4d
	
$ oc describe pods -n openshift-storage  ocs-metrics-exporter-8874fffd-2f6ft
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  kube-api-access-kccjp:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                             node.ocs.openshift.io/storage=true:NoSchedule
Events:
  Type     Reason          Age                     From               Message
  ----     ------          ----                    ----               -------
  Normal   Scheduled       148m                    default-scheduler  Successfully assigned openshift-storage/ocs-metrics-exporter-8874fffd-2f6ft to compute-1 by control-plane-1
  Normal   AddedInterface  148m                    multus             Add eth0 [10.131.0.33/23] from ovn-kubernetes
  Normal   Pulling         148m                    kubelet            Pulling image "quay.io/rhceph-dev/odf4-ocs-metrics-exporter-rhel8@sha256:72c02ff0dbf796fe821ab0358c294af19daa2023347b7f50d9a856d32a2e84b1"
  Normal   Pulled          148m                    kubelet            Successfully pulled image "quay.io/rhceph-dev/odf4-ocs-metrics-exporter-rhel8@sha256:72c02ff0dbf796fe821ab0358c294af19daa2023347b7f50d9a856d32a2e84b1" in 23.9199064s
  Normal   Created         6m35s (x5 over 148m)    kubelet            Created container ocs-metrics-exporter
  Normal   Started         6m35s (x5 over 148m)    kubelet            Started container ocs-metrics-exporter
  Warning  BackOff         4m57s (x13 over 9m16s)  kubelet            Back-off restarting failed container
  Normal   Pulled          4m46s (x5 over 10m)     kubelet            Container image "quay.io/rhceph-dev/odf4-ocs-metrics-exporter-rhel8@sha256:72c02ff0dbf796fe821ab0358c294af19daa2023347b7f50d9a856d32a2e84b1" already present on machine

Comment 10 arun kumar mohan 2022-09-12 06:42:54 UTC
Thanks Oded for the detailed comment.
In `ClusterAdvanceFeatureCollector` we are using a single cache.Indexer for all object types (CephCluster, CephObjectStore, StorageClass, etc.), so when the `List` function is called on the CephObjectStoreLister (or the CephObjectStore namespace lister) we get back the already cached `CephCluster` objects, which is what causes the interface conversion panic above. Taking this BZ.
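To make the failure mode concrete, here is a minimal self-contained sketch (hypothetical stand-in types, not the actual collector code) of how a shared store produces exactly this interface-conversion panic:

```
package main

import (
	"fmt"

	"k8s.io/client-go/tools/cache"
)

// Hypothetical stand-ins for the Rook API types cached by the collector.
type CephCluster struct{ Name string }
type CephObjectStore struct{ Name string }

func main() {
	keyFunc := func(obj interface{}) (string, error) {
		switch o := obj.(type) {
		case *CephCluster:
			return "cephcluster/" + o.Name, nil
		case *CephObjectStore:
			return "cephobjectstore/" + o.Name, nil
		}
		return "", fmt.Errorf("unknown type %T", obj)
	}

	// One indexer shared by every resource type, as in the collector.
	shared := cache.NewIndexer(keyFunc, cache.Indexers{})

	// The CephCluster watch populates the shared store...
	_ = shared.Add(&CephCluster{Name: "ocs-storagecluster-cephcluster"})

	// ...and code that assumes the store only holds CephObjectStore objects
	// later lists it and type-asserts each item, just like
	// cephObjectStoreNamespaceLister.List does in the trace above:
	for _, obj := range shared.List() {
		s := obj.(*CephObjectStore) // panic: interface {} is *CephCluster, not *CephObjectStore
		fmt.Println(s.Name)
	}
}
```

Backing each lister with its own informer/store avoids mixing types in one cache.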

Comment 11 Mudit Agarwal 2022-09-12 06:45:44 UTC
Oded, we should raise a separate BZ for the issue Arun is working on.
Let this BZ remain for the original issue.

Comment 17 Malay Kumar parida 2022-10-06 05:29:30 UTC
@tnielsen, in the above comment the link [1] actually creates the namespace only in the ocs-operator e2e tests. It doesn't come into play during operator installation. During operator installation, if done via the CLI the user creates the Namespace, and if done via the UI the UI creates the Namespace.
So we don't have control over the NS itself. Tagging Nitin as well for better clarification on this. @nigoyal.

Comment 19 Travis Nielsen 2022-10-06 19:32:35 UTC
(In reply to Malay Kumar parida from comment #17)
> @tnielsen, in the above comment the link[1] actually creates the
> ns in the ocs operator e2e tests only. It doesn't come into play during
> operator installation. During operator installation, if done via CLI the
> user creates the Namespace, or if done via the UI the UI creates the
> Namespace.
> So we don't have control over the NS itself. Tagging Nitin also for better
> clarification on this. @nigoyal .

Got it. If that's only for testing, updating the namespace labels there won't help the product. Thanks for the explanation.

Comment 21 Malay Kumar parida 2022-10-21 20:38:12 UTC
Starting from OCP 4.12.0-0.nightly-2022-10-05-053337, the payload contains https://issues.redhat.com/browse/OLM-2695, where OLM itself enables the label syncer even on namespaces with the openshift- prefix.

I am on OCP 4.12.0-0.nightly-2022-10-20-104328, and I can successfully install the odf-operator/ocs-operator without any explicit namespace labeling. All the CSVs, deployments, and pods succeed.

The same changes are now also included in CI builds. Someone from QE, please confirm, and then we can decide on this bug accordingly.

Comment 22 Martin Bukatovic 2022-10-27 18:23:09 UTC
Steps to reproduce are incomplete: after creating the catalog source for the dev builds/images with ODF 4.12, the output of `oc get csv -A` is just:

```
$ oc get csv -A
NAMESPACE                              NAME            DISPLAY          VERSION   REPLACES   PHASE
openshift-operator-lifecycle-manager   packageserver   Package Server   0.19.0               Succeeded
```

The output in the original bug report, by contrast, seems to show a state where the ODF operator is already installed and a StorageCluster has been created.

Comment 23 Martin Bukatovic 2022-10-27 18:35:07 UTC
Verifying on vSphere platform with:

OCP 4.12.0-0.nightly-2022-10-25-210451
ODF 4.12.0-82

After a manual UI-driven installation of the ODF operator, I see that both the ocs and mcg operators were installed without any problems:

```
$  oc get csv -A
NAMESPACE                              NAME                              DISPLAY                       VERSION   REPLACES   PHASE
openshift-operator-lifecycle-manager   packageserver                     Package Server                0.19.0               Succeeded
openshift-storage                      mcg-operator.v4.12.0              NooBaa Operator               4.12.0               Succeeded
openshift-storage                      ocs-operator.v4.12.0              OpenShift Container Storage   4.12.0               Succeeded
openshift-storage                      odf-csi-addons-operator.v4.12.0   CSI Addons                    4.12.0               Succeeded
openshift-storage                      odf-operator.v4.12.0              OpenShift Data Foundation     4.12.0               Succeeded
```

```
$ oc describe csv/ocs-operator.v4.12.0 -n openshift-storage | tail 
Events:
  Type    Reason               Age                    From                        Message
  ----    ------               ----                   ----                        -------
  Normal  RequirementsUnknown  9m24s (x2 over 9m24s)  operator-lifecycle-manager  requirements not yet checked
  Normal  RequirementsNotMet   9m20s (x2 over 9m22s)  operator-lifecycle-manager  one or more requirements couldn't be found
  Normal  AllRequirementsMet   9m                     operator-lifecycle-manager  all requirements found, attempting install
  Normal  InstallSucceeded     8m59s                  operator-lifecycle-manager  waiting for install components to report healthy
  Normal  InstallWaiting       8m59s                  operator-lifecycle-manager  installing: waiting for deployment ocs-operator to become ready: deployment "ocs-operator" not available: Deployment does not have minimum availability.
  Normal  InstallWaiting       8m27s                  operator-lifecycle-manager  installing: waiting for deployment rook-ceph-operator to become ready: deployment "rook-ceph-operator" not available: Deployment does not have minimum availability.
  Normal  InstallSucceeded     8m17s                  operator-lifecycle-manager  install strategy completed with no errors
```

