Created attachment 1759068 [details]
log of the csi-rbdplugin container failing inside the pod csi-rbdplugin-provisioner

Description of problem (please be detailed as possible and provide log snippets):

OCP 4.6.17 installed on top of VMware
OCS 4.6.2 Operator installed
ibm-block-csi installed

Version of all relevant components (if applicable):

OCS 4.6.2
ibm-block-csi installed
  productName: ibm-block-csi-driver
  productVersion: 1.4.0

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Issue in the csi-rbdplugin container inside the csi-rbdplugin-provisioner pod:

openshift-storage  csi-rbdplugin-provisioner-79cffcd6df-6zn78  5/6  CrashLoopBackOff  46  3h32m
openshift-storage  csi-rbdplugin-provisioner-79cffcd6df-szgz4  5/6  CrashLoopBackOff  46

OCS is not installed correctly. I uploaded the log of the csi-rbdplugin container inside the csi-rbdplugin-provisioner pod.

Is there any workaround available to the best of your knowledge?

No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

1

Can this issue reproducible?

Certainly

Can this issue reproduce from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Created attachment 1759069 [details]
csi-rbdplugin-provisioner deployment yaml file used

Source of the code that generated the issue:

https://github.com/ceph/ceph-csi/blob/49cf5efdd5663986c1c23c66163576bf77fdccb4/cmd/cephcsi.go#L170
https://github.com/ceph/ceph-csi/blob/49cf5efdd5663986c1c23c66163576bf77fdccb4/internal/rbd/driver.go#L102
https://github.com/ceph/ceph-csi/blob/f4d5fdf11484fef4e5c0d7ffe7beb8ab9531e8b5/internal/util/cephconf.go#L36
Content of the csi-rbdplugin container log file of the csi-rbdplugin-provisioner pod:

I0224 13:23:24.065229       1 cephcsi.go:124] Driver version: release-4.6 and Git version: 49cf5efdd5663986c1c23c66163576bf77fdccb4
I0224 13:23:24.065470       1 cephcsi.go:142] Initial PID limit is set to 1024
E0224 13:23:24.065517       1 cephcsi.go:145] Failed to set new PID limit to -1: open /sys/fs/cgroup/pids/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod513172a2_eac9_4e7e_8201_6a99f60a47ce.slice/crio-7b84fec8bc7da6f3a815fce003915caffa20cbc2f783318da311c64295bbb996.scope/pids.max: permission denied
I0224 13:23:24.065546       1 cephcsi.go:170] Starting driver type: rbd with name: openshift-storage.rbd.csi.ceph.com
F0224 13:23:24.065588       1 driver.go:102] failed to write ceph configuration file (open /etc/ceph/ceph.conf: permission denied)

goroutine 1 [running]:
k8s.io/klog/v2.stacks(0xc00013a001, 0xc0003da820, 0x83, 0xc7)
	/remote-source/deps/gomod/pkg/mod/k8s.io/klog/v2.0/klog.go:996 +0xb8
k8s.io/klog/v2.(*loggingT).output(0x253b980, 0xc000000003, 0x0, 0x0, 0xc0003c6d20, 0x22944e5, 0x9, 0x66, 0x414900)
	/remote-source/deps/gomod/pkg/mod/k8s.io/klog/v2.0/klog.go:945 +0x19d
k8s.io/klog/v2.(*loggingT).printDepth(0x253b980, 0x3, 0x0, 0x0, 0x1, 0xc0001bbc30, 0x1, 0x1)
	/remote-source/deps/gomod/pkg/mod/k8s.io/klog/v2.0/klog.go:718 +0x15e
k8s.io/klog/v2.FatalDepth(...)
	/remote-source/deps/gomod/pkg/mod/k8s.io/klog/v2.0/klog.go:1449
github.com/ceph/ceph-csi/internal/util.FatalLogMsg(0x1662450, 0x2c, 0xc0001bbd18, 0x1, 0x1)
	/remote-source/app/internal/util/log.go:58 +0xe5
github.com/ceph/ceph-csi/internal/rbd.(*Driver).Run(0xc0001bbeb8, 0x253b880)
	/remote-source/app/internal/rbd/driver.go:102 +0x9b
main.main()
	/remote-source/app/cmd/cephcsi.go:176 +0x4bb

goroutine 22 [chan receive]:
k8s.io/klog.(*loggingT).flushDaemon(0x253b7a0)
	/remote-source/deps/gomod/pkg/mod/k8s.io/klog.0/klog.go:1010 +0x8b
created by k8s.io/klog.init.0
	/remote-source/deps/gomod/pkg/mod/k8s.io/klog.0/klog.go:411 +0xd6

goroutine 23 [chan receive]:
k8s.io/klog/v2.(*loggingT).flushDaemon(0x253b980)
	/remote-source/deps/gomod/pkg/mod/k8s.io/klog/v2.0/klog.go:1131 +0x8b
created by k8s.io/klog/v2.init.0
	/remote-source/deps/gomod/pkg/mod/k8s.io/klog/v2.0/klog.go:416 +0xd6
Created attachment 1759070 [details] storageclass v7k-dedup used by the OCS instance created
Created attachment 1759081 [details] StorageCluster ocs-storagecluster yaml file created using the storageclass v7k-dedup
Madhu, is this some pod privilege issue specific to this platform?
Hi @Madhu, the famous `disk.EnableUUID` parameter on the VM in vSphere?
I did a simple mkdir test inside the pod using:

$ oc rsh csi-rbdplugin-provisioner-79cffcd6df-6zn78
Defaulting container name to csi-provisioner.
Use 'oc describe pod/csi-rbdplugin-provisioner-79cffcd6df-6zn78 -n openshift-storage' to see all of the containers in this pod.
sh-4.4$ cd /etc
sh-4.4$ id
uid=1001(1001) gid=0(root) groups=0(root),1000670000
sh-4.4$ mkdir ceph
mkdir: cannot create directory 'ceph': Permission denied
sh-4.4$
sh-4.4$ cd /
sh-4.4$ ls -ltr
total 0
drwxr-xr-x.   2 root root          6 Aug 12  2018 srv
lrwxrwxrwx.   1 root root          8 Aug 12  2018 sbin -> usr/sbin
drwxr-xr-x.   2 root root          6 Aug 12  2018 opt
drwxr-xr-x.   2 root root          6 Aug 12  2018 mnt
drwxr-xr-x.   2 root root          6 Aug 12  2018 media
lrwxrwxrwx.   1 root root          9 Aug 12  2018 lib64 -> usr/lib64
lrwxrwxrwx.   1 root root          7 Aug 12  2018 lib -> usr/lib
drwxr-xr-x.   2 root root          6 Aug 12  2018 home
dr-xr-xr-x.   2 root root          6 Aug 12  2018 boot
lrwxrwxrwx.   1 root root          7 Aug 12  2018 bin -> usr/bin
drwx------.   2 root root          6 Dec 16 15:38 lost+found
drwxr-xr-x.   1 root root         17 Dec 16 15:38 usr
drwxr-xr-x.   1 root root         52 Dec 16 15:38 var
dr-xr-x---.   1 root root         23 Dec 16 15:42 root
drwxr-xr-x.   1 root root         18 Jan  9 00:43 run
drwxrwxrwt.   1 root root          6 Jan  9 01:37 tmp
drwxr-xr-x.   1 root root         25 Jan  9 01:55 etc
dr-xr-xr-x.  13 root root          0 Feb 23 18:02 sys
drwxrwsrwt.   2 root 1000670000   40 Feb 24 09:47 csi
dr-xr-xr-x. 483 root root          0 Feb 24 09:47 proc
drwxr-xr-x.   5 root root        360 Feb 24 09:47 dev
sh-4.4$

1000670000 is the id of the openshift-storage namespace, see:

$ oc describe project openshift-storage
Name:         openshift-storage
Created:      18 hours ago
Labels:       olm.operatorgroup.uid/34c298b3-e17d-4d5e-88a7-62720ddfce2b=
Annotations:  openshift.io/sa.scc.mcs=s0:c26,c10
              openshift.io/sa.scc.supplemental-groups=1000670000/10000
              openshift.io/sa.scc.uid-range=1000670000/10000
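To illustrate what the uid-range annotation above means: it is a "start/size" pair describing the UID block allocated to the project. The helper below is my own sketch of that interpretation (not OpenShift code); it shows that the container's uid 1001 falls outside the project's allocated range, while 1000670000 falls inside it.

```shell
# in_uid_range UID START/SIZE  -> succeeds if UID lies inside the block
in_uid_range() {
  local uid=$1 start=${2%/*} size=${2#*/}
  [ "$uid" -ge "$start" ] && [ "$uid" -lt $((start + size)) ]
}

# Annotation from the project: openshift.io/sa.scc.uid-range=1000670000/10000
in_uid_range 1001 1000670000/10000 && echo inside || echo outside        # -> outside
in_uid_range 1000670000 1000670000/10000 && echo inside || echo outside  # -> inside
```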
I did a similar test in another pod that is part of the same openshift-storage project:

$ oc rsh rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-556759b69b7t2
sh-4.4# id
uid=0(root) gid=0(root) groups=0(root),1000670000
sh-4.4# cd /etc
sh-4.4# mkdir test1
sh-4.4#

In this second test the id of the user is:

sh-4.4# id
uid=0(root) gid=0(root) groups=0(root),1000670000

while in the pod csi-rbdplugin-provisioner-79cffcd6df-6zn78 it is:

sh-4.4$ id
uid=1001(1001) gid=0(root) groups=0(root),1000670000

So: 1001 for csi-rbdplugin-provisioner and 0(root) for rook-ceph-mds-ocs-storagecluster-cephfilesystem. Maybe that is the reason for the problem.
Created attachment 1759095 [details] csi-rbdplugin-provisioner file with securityContext: privileged: true
I found a workaround by adding

securityContext:
  privileged: true

to the csi-rbdplugin-provisioner deployment YAML file (file attached). Now:

csi-rbdplugin-provisioner-7bf6fb5596-6wgbk  6/6  Running  0  2m6s
csi-rbdplugin-provisioner-7bf6fb5596-qb7ts  6/6  Running  0  2m6s
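For reference, the fragment sits at the container level of the Deployment spec. A minimal sketch of the placement follows; I'm assuming it goes on the csi-rbdplugin container, and the image, args, and other fields are elided here:

```yaml
spec:
  template:
    spec:
      containers:
        - name: csi-rbdplugin
          # workaround: run privileged so the container can write /etc/ceph
          securityContext:
            privileged: true
```

Note this is a workaround only: running the container privileged is broader access than the rook-ceph-csi SCC normally grants through admission.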
In this link
https://github.com/rook/rook/commit/47cc1adf62a4b5c8f0ff3e1cb29f629da759acb6#diff-94612bfa354b036a2f57b1771f8b73f02d287d6a72e62b3dc21bcb3a0771d7ed
I saw that running the plugin container as privileged was removed, but without root privilege we can create a directory inside /etc/, like /etc/ceph, as expected.

By adding privileged: true to the securityContext of the csi-rbdplugin container in the csi-rbdplugin-provisioner deployment:

$ oc rsh --container csi-rbdplugin csi-rbdplugin-provisioner-7bf6fb5596-6wgbk
sh-4.4# id
uid=0(root) gid=0(root) groups=0(root) context=system_u:system_r:spc_t:s0
sh-4.4# cd /etc/ceph
sh-4.4# ls -ltr
total 8
-rw-r--r--. 1 root root  92 Dec 17 20:23 rbdmap
-rw-r--r--. 1 root root   0 Feb 24 15:28 keyring
-rw-------. 1 root root 182 Feb 24 15:28 ceph.conf

The keyring and ceph.conf files are created now.
Sorry, we can't create a directory inside /etc/ without root privilege. I wrote above that we can.
Are you able to reproduce this reliably, intermittently, or only once thus far? This seems to be a duplicate of an issue we've seen before. In that instance, it turned out that there had been modifications to the SCC that was being used on the RBD Provisioner Pod. The previous attachment you provided was for the Deployments, not the Pods. Please inspect the failing Pods and look in the Pod Annotations for `openshift.io/scc`. The value should be `rook-ceph-csi`, and there should be a corresponding SCC created with that same name.
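To pull just that annotation from a failing Pod, something like the following should work (the pod name is a placeholder; the bracket form of jsonpath is used because the annotation key contains dots):

```shell
oc get pod <provisioner-pod-name> -n openshift-storage \
  -o jsonpath="{.metadata.annotations['openshift\.io/scc']}"
```

The expected value is `rook-ceph-csi`; `oc get scc rook-ceph-csi` should then show the matching SCC.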
When securityContext: privileged: false is set in the csi-rbdplugin-provisioner deployment YAML file, the value is openshift.io/scc: ibm-spectrum-scale-restricted:

$ oc get pod csi-rbdplugin-provisioner-7bfdd6d98-6zbd7 -n openshift-storage -o yaml | grep 'openshift.io/scc'
    openshift.io/scc: ibm-spectrum-scale-restricted

and when securityContext: privileged: true, the result is:

$ oc get pod csi-rbdplugin-provisioner-b9b8f9647-fd2k7 -n openshift-storage -o yaml | grep 'openshift.io/scc'
    openshift.io/scc: rook-ceph-csi
When the value is false, with openshift.io/scc: ibm-spectrum-scale-restricted, the pod is failing:

csi-rbdplugin-provisioner-7bfdd6d98-6zbd7  5/6  CrashLoopBackOff  5  3m22s
On my environment, see the values of the 2 SCCs:

$ oc get scc ibm-spectrum-scale-restricted
NAME                            PRIV    CAPS         SELINUX     RUNASUSER   FSGROUP     SUPGROUP   PRIORITY     READONLYROOTFS   VOLUMES
ibm-spectrum-scale-restricted   false   <no value>   MustRunAs   MustRunAs   MustRunAs   RunAsAny   <no value>   false            ["configMap","downwardAPI","emptyDir","hostPath","persistentVolumeClaim","projected","secret"]

$ oc get scc rook-ceph-csi
NAME            PRIV   CAPS    SELINUX    RUNASUSER   FSGROUP    SUPGROUP   PRIORITY     READONLYROOTFS   VOLUMES
rook-ceph-csi   true   ["*"]   RunAsAny   RunAsAny    RunAsAny   RunAsAny   <no value>   false            ["*"]
I see... what is the default value for "privileged" in the Pod Spec, true or false? I would imagine, since the rook-ceph-csi SCC is privileged, that it *must* be true. Is it also true for all other Pods running with the rook-ceph-csi SCC?
When the OCP Operators were installed, this part did not exist in the csi-rbdplugin-provisioner yaml file:

securityContext:
  privileged: true

I added it as a workaround for my issue. I'm working with an IBMer, and he is working with the development team of ibm-spectrum-scale. I will send you their answer.
The last news I received this morning: the IBM lab solved the problem. The problem was generated by the IBM Spectrum Scale part; see the email sent yesterday. This ticket can be closed.
Hi, I'm facing the same problem in my OCS deployment. Could I know what causes this issue?
@Shirisha Do you have a cluster I can use to check a few things?
Attaching the must-gather for this cluster
Created attachment 1798608 [details] ocs must gather Attaching must-gather of the cluster
Shirisha, the SCC here is `openshift.io/scc: hostaccess`, which causes the failure.

Can you tell me what accesses are on it? And, if possible, revert the SCC to `rook-ceph-csi`?
(In reply to Humble Chirammal from comment #29)
> Shirisha, The SCC here is `openshift.io/scc: hostaccess` which cause the
> failure.
>
> Can you tell me what accesses are on it ? and if possible revert the SCC to
> `rook-ceph-csi`?

Considering the `default` hostaccess is the SCC mentioned here, below are the default permissions set on it:

hostaccess   false   []   MustRunAs   MustRunAsRange   MustRunAs   RunAsAny   <none>   false   [configMap downwardAPI emptyDir hostPath persistentVolumeClaim projected secret]

This looks like a replica of the `ibm-spectrum-scale-restricted` SCC, where a similar issue got reported in this bugzilla.
Hi, these commands were run on the cluster before the install:

oc adm policy add-scc-to-group anyuid system:authenticated
oc adm policy add-scc-to-group hostaccess system:authenticated
oc adm policy add-scc-to-user anyuid system:serviceaccount:myproject:mysvcacct

Could this be the cause of the issue?
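If those broad grants turn out to be the cause, the inverse commands should revert them. This is a sketch under the assumption that no other workloads on the cluster rely on these grants; verify that before removing them:

```shell
# Revert the cluster-wide SCC grants made before the install
oc adm policy remove-scc-from-group anyuid system:authenticated
oc adm policy remove-scc-from-group hostaccess system:authenticated
oc adm policy remove-scc-from-user anyuid system:serviceaccount:myproject:mysvcacct
```

After reverting, delete the failing provisioner Pods so they are re-admitted and pick up the expected `rook-ceph-csi` SCC annotation.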