Bug 1932312 - csi-rbdplugin-provisioner CrashLoopBackOff [driver.go:102] failed to write ceph configuration file (open /etc/ceph/ceph.conf: permission denied)
Summary: csi-rbdplugin-provisioner CrashLoopBackOff [driver.go:102] failed to write cep...
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: ocs-operator
Version: 4.6
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Jose A. Rivera
QA Contact: Raz Tamir
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-02-24 13:27 UTC by didier wojciechowski
Modified: 2021-07-12 14:22 UTC
CC: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-03-15 11:52:28 UTC
Embargoed:


Attachments
log of the csi-rbdplugin container failing inside the pod csi-rbdplugin-provisioner (2.19 KB, text/plain)
2021-02-24 13:27 UTC, didier wojciechowski
csi-rbdplugin-provisioner deployment yaml file used (18.61 KB, application/octet-stream)
2021-02-24 13:29 UTC, didier wojciechowski
storageclass v7k-dedup used by the OCS instance created (2.87 KB, application/octet-stream)
2021-02-24 13:35 UTC, didier wojciechowski
StorageCluster ocs-storagecluster yaml file created using the storageclass v7k-dedup (4.12 KB, application/octet-stream)
2021-02-24 13:36 UTC, didier wojciechowski
csi-rbdplugin-provisioner file with securityContext: privileged: true (19.73 KB, application/octet-stream)
2021-02-24 15:31 UTC, didier wojciechowski
ocs must gather (9.34 MB, application/zip)
2021-07-06 12:48 UTC, Shirisha S Rao

Description didier wojciechowski 2021-02-24 13:27:01 UTC
Created attachment 1759068 [details]
log of the csi-rbdplugin container failing inside the pod csi-rbdplugin-provisioner

Description of problem (please be as detailed as possible and provide log
snippets):

OCP 4.6.17 installed on top of VMware
OCS 4.6.2 Operator installed 
ibm-block-csi installed 


Version of all relevant components (if applicable):
OCS 4.6.2 

ibm-block-csi installed 
productName: ibm-block-csi-driver
productVersion: 1.4.0

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

Issue in the csi-rbdplugin container inside csi-rbdplugin-provisioner pod 
openshift-storage                                  csi-rbdplugin-provisioner-79cffcd6df-6zn78                        5/6     CrashLoopBackOff   46         3h32m
openshift-storage                                  csi-rbdplugin-provisioner-79cffcd6df-szgz4                        5/6     CrashLoopBackOff   46

OCS is not installed correctly

I uploaded the log of the container csi-rbdplugin inside the pod csi-rbdplugin-provisioner

Is there any workaround available to the best of your knowledge?
No 

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1 

Is this issue reproducible?
Certainly

Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1.
2.
3.


Actual results:


Expected results:


Additional info:

Comment 3 didier wojciechowski 2021-02-24 13:31:12 UTC
Content of the csi-rbdplugin container log file from the csi-rbdplugin-provisioner pod:

I0224 13:23:24.065229       1 cephcsi.go:124] Driver version: release-4.6 and Git version: 49cf5efdd5663986c1c23c66163576bf77fdccb4
I0224 13:23:24.065470       1 cephcsi.go:142] Initial PID limit is set to 1024
E0224 13:23:24.065517       1 cephcsi.go:145] Failed to set new PID limit to -1: open /sys/fs/cgroup/pids/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod513172a2_eac9_4e7e_8201_6a99f60a47ce.slice/crio-7b84fec8bc7da6f3a815fce003915caffa20cbc2f783318da311c64295bbb996.scope/pids.max: permission denied
I0224 13:23:24.065546       1 cephcsi.go:170] Starting driver type: rbd with name: openshift-storage.rbd.csi.ceph.com
F0224 13:23:24.065588       1 driver.go:102] failed to write ceph configuration file (open /etc/ceph/ceph.conf: permission denied)
goroutine 1 [running]:
k8s.io/klog/v2.stacks(0xc00013a001, 0xc0003da820, 0x83, 0xc7)
	/remote-source/deps/gomod/pkg/mod/k8s.io/klog/v2.0/klog.go:996 +0xb8
k8s.io/klog/v2.(*loggingT).output(0x253b980, 0xc000000003, 0x0, 0x0, 0xc0003c6d20, 0x22944e5, 0x9, 0x66, 0x414900)
	/remote-source/deps/gomod/pkg/mod/k8s.io/klog/v2.0/klog.go:945 +0x19d
k8s.io/klog/v2.(*loggingT).printDepth(0x253b980, 0x3, 0x0, 0x0, 0x1, 0xc0001bbc30, 0x1, 0x1)
	/remote-source/deps/gomod/pkg/mod/k8s.io/klog/v2.0/klog.go:718 +0x15e
k8s.io/klog/v2.FatalDepth(...)
	/remote-source/deps/gomod/pkg/mod/k8s.io/klog/v2.0/klog.go:1449
github.com/ceph/ceph-csi/internal/util.FatalLogMsg(0x1662450, 0x2c, 0xc0001bbd18, 0x1, 0x1)
	/remote-source/app/internal/util/log.go:58 +0xe5
github.com/ceph/ceph-csi/internal/rbd.(*Driver).Run(0xc0001bbeb8, 0x253b880)
	/remote-source/app/internal/rbd/driver.go:102 +0x9b
main.main()
	/remote-source/app/cmd/cephcsi.go:176 +0x4bb

goroutine 22 [chan receive]:
k8s.io/klog.(*loggingT).flushDaemon(0x253b7a0)
	/remote-source/deps/gomod/pkg/mod/k8s.io/klog.0/klog.go:1010 +0x8b
created by k8s.io/klog.init.0
	/remote-source/deps/gomod/pkg/mod/k8s.io/klog.0/klog.go:411 +0xd6

goroutine 23 [chan receive]:
k8s.io/klog/v2.(*loggingT).flushDaemon(0x253b980)
	/remote-source/deps/gomod/pkg/mod/k8s.io/klog/v2.0/klog.go:1131 +0x8b
created by k8s.io/klog/v2.init.0
	/remote-source/deps/gomod/pkg/mod/k8s.io/klog/v2.0/klog.go:416 +0xd6

Comment 4 didier wojciechowski 2021-02-24 13:35:23 UTC
Created attachment 1759070 [details]
storageclass v7k-dedup used by the OCS instance created

Comment 5 didier wojciechowski 2021-02-24 13:36:59 UTC
Created attachment 1759081 [details]
StorageCluster ocs-storagecluster yaml file created using the storageclass v7k-dedup

Comment 6 Mudit Agarwal 2021-02-24 13:39:17 UTC
Madhu, is this some pod privilege issue specific to this platform?

Comment 7 didier wojciechowski 2021-02-24 13:42:10 UTC
Hi @Madhu,
could this be the famous 'disk.EnableUUID' parameter on the VM in vSphere?

Comment 8 didier wojciechowski 2021-02-24 13:46:48 UTC
I did a simple mkdir test inside the pod using:
$ oc rsh csi-rbdplugin-provisioner-79cffcd6df-6zn78
Defaulting container name to csi-provisioner.
Use 'oc describe pod/csi-rbdplugin-provisioner-79cffcd6df-6zn78 -n openshift-storage' to see all of the containers in this pod.
sh-4.4$ cd /etc
sh-4.4$ id
uid=1001(1001) gid=0(root) groups=0(root),1000670000
sh-4.4$ mkdir ceph
mkdir: cannot create directory 'ceph': Permission denied
sh-4.4$

sh-4.4$ cd /
sh-4.4$ ls -ltr
total 0
drwxr-xr-x.   2 root root         6 Aug 12  2018 srv
lrwxrwxrwx.   1 root root         8 Aug 12  2018 sbin -> usr/sbin
drwxr-xr-x.   2 root root         6 Aug 12  2018 opt
drwxr-xr-x.   2 root root         6 Aug 12  2018 mnt
drwxr-xr-x.   2 root root         6 Aug 12  2018 media
lrwxrwxrwx.   1 root root         9 Aug 12  2018 lib64 -> usr/lib64
lrwxrwxrwx.   1 root root         7 Aug 12  2018 lib -> usr/lib
drwxr-xr-x.   2 root root         6 Aug 12  2018 home
dr-xr-xr-x.   2 root root         6 Aug 12  2018 boot
lrwxrwxrwx.   1 root root         7 Aug 12  2018 bin -> usr/bin
drwx------.   2 root root         6 Dec 16 15:38 lost+found
drwxr-xr-x.   1 root root        17 Dec 16 15:38 usr
drwxr-xr-x.   1 root root        52 Dec 16 15:38 var
dr-xr-x---.   1 root root        23 Dec 16 15:42 root
drwxr-xr-x.   1 root root        18 Jan  9 00:43 run
drwxrwxrwt.   1 root root         6 Jan  9 01:37 tmp
drwxr-xr-x.   1 root root        25 Jan  9 01:55 etc
dr-xr-xr-x.  13 root root         0 Feb 23 18:02 sys
drwxrwsrwt.   2 root 1000670000  40 Feb 24 09:47 csi
dr-xr-xr-x. 483 root root         0 Feb 24 09:47 proc
drwxr-xr-x.   5 root root       360 Feb 24 09:47 dev
sh-4.4$

1000670000 is the ID range assigned to the openshift-storage namespace; see:
$ oc describe project openshift-storage
Name:			openshift-storage
Created:		18 hours ago
Labels:			olm.operatorgroup.uid/34c298b3-e17d-4d5e-88a7-62720ddfce2b=
Annotations:		openshift.io/sa.scc.mcs=s0:c26,c10
			openshift.io/sa.scc.supplemental-groups=1000670000/10000
			openshift.io/sa.scc.uid-range=1000670000/10000

Comment 9 didier wojciechowski 2021-02-24 13:53:39 UTC
I did a similar test in another pod that is part of the same openshift-storage project:
$ oc rsh rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-556759b69b7t2
sh-4.4# id
uid=0(root) gid=0(root) groups=0(root),1000670000
sh-4.4# cd /etc
sh-4.4# mkdir test1
sh-4.4#

In the second test the user ID is:
sh-4.4# id
uid=0(root) gid=0(root) groups=0(root),1000670000

and in the pod csi-rbdplugin-provisioner-79cffcd6df-6zn78:
sh-4.4$ id
uid=1001(1001) gid=0(root) groups=0(root),1000670000

The UID is 1001 for csi-rbdplugin-provisioner and 0 (root) for rook-ceph-mds-ocs-storagecluster-cephfilesystem.
This may be the reason for the problem.
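
To confirm how the user is assigned, the securityContext rendered into the failing pod's spec can also be inspected. A possible command (the pod name is the one from this report; the jsonpath fields are standard, the output layout is only a sketch):

# pod name taken from this report; prints the pod-level and per-container securityContext
$ oc get pod csi-rbdplugin-provisioner-79cffcd6df-6zn78 -n openshift-storage \
    -o jsonpath='{.spec.securityContext}{"\n"}{range .spec.containers[*]}{.name}{": "}{.securityContext}{"\n"}{end}'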

Comment 12 didier wojciechowski 2021-02-24 15:31:59 UTC
Created attachment 1759095 [details]
csi-rbdplugin-provisioner file with securityContext: privileged: true

Comment 13 didier wojciechowski 2021-02-24 15:33:48 UTC
I found a workaround by adding the following to the csi-rbdplugin container in the csi-rbdplugin-provisioner deployment YAML file (file attached):

securityContext:
  privileged: true
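
A minimal sketch of where this fragment sits in the deployment (only the securityContext lines are the workaround; the image value is a placeholder and the surrounding fields are abbreviated):

# csi-rbdplugin-provisioner Deployment (abbreviated sketch)
spec:
  template:
    spec:
      containers:
      - name: csi-rbdplugin
        image: <rbd-plugin-image>   # placeholder, unchanged
        securityContext:
          privileged: true          # workaround added here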
Now:

csi-rbdplugin-provisioner-7bf6fb5596-6wgbk                        6/6     Running   0          2m6s
csi-rbdplugin-provisioner-7bf6fb5596-qb7ts                        6/6     Running   0          2m6s

Comment 14 didier wojciechowski 2021-02-24 15:51:33 UTC
In this link https://github.com/rook/rook/commit/47cc1adf62a4b5c8f0ff3e1cb29f629da759acb6#diff-94612bfa354b036a2f57b1771f8b73f02d287d6a72e62b3dc21bcb3a0771d7ed I saw that we removed running the plugin container as privileged, but without root privilege we can create a directory inside /etc/, like /etc/ceph, as expected. Below is the result after adding privileged: true in the securityContext of the csi-rbdplugin container in the csi-rbdplugin-provisioner deployment.

$ oc rsh --container csi-rbdplugin csi-rbdplugin-provisioner-7bf6fb5596-6wgbk
sh-4.4# id
uid=0(root) gid=0(root) groups=0(root) context=system_u:system_r:spc_t:s0
sh-4.4# cd /etc/ceph
sh-4.4# ls -ltr
total 8
-rw-r--r--. 1 root root  92 Dec 17 20:23 rbdmap
-rw-r--r--. 1 root root   0 Feb 24 15:28 keyring
-rw-------. 1 root root 182 Feb 24 15:28 ceph.conf

keyring and ceph.conf files are created now.

Comment 15 didier wojciechowski 2021-02-24 15:52:40 UTC
Sorry, we can't create a directory inside /etc/; in comment 14 I wrote that we can.

Comment 17 Jose A. Rivera 2021-03-04 15:29:48 UTC
Are you able to reproduce this reliably, intermittently, or only once thus far?

This seems to be a duplicate of an issue we've seen before. In that instance, it turned out that there had been modifications to the SCC that was being used on the RBD Provisioner Pod. The previous attachment you provided was for the Deployments, not the Pods. Please inspect the failing Pods and look in the Pod Annotations for `openshift.io/scc`. The value should be `rook-ceph-csi`, and there should be a corresponding SCC created with that same name.

Comment 18 didier wojciechowski 2021-03-05 13:44:09 UTC
When

         securityContext:
            privileged: false

is set in the csi-rbdplugin-provisioner deployment YAML file, the value is:

  openshift.io/scc: ibm-spectrum-scale-restricted

$ oc get pod csi-rbdplugin-provisioner-7bfdd6d98-6zbd7 -n openshift-storage -o yaml | grep 'openshift.io/scc'
    openshift.io/scc: ibm-spectrum-scale-restricted

and when

         securityContext:
            privileged: true

the result is:

oc get pod csi-rbdplugin-provisioner-b9b8f9647-fd2k7 -n openshift-storage -o yaml | grep 'openshift.io/scc'
    openshift.io/scc: rook-ceph-csi

Comment 19 didier wojciechowski 2021-03-05 13:45:34 UTC
When the value is false and the SCC is openshift.io/scc: ibm-spectrum-scale-restricted, the pod fails:
csi-rbdplugin-provisioner-7bfdd6d98-6zbd7                         5/6     CrashLoopBackOff   5          3m22s

Comment 20 didier wojciechowski 2021-03-05 13:47:44 UTC
In my environment, see the values of the two SCCs:
$ oc get scc  ibm-spectrum-scale-restricted
NAME                            PRIV    CAPS         SELINUX     RUNASUSER   FSGROUP     SUPGROUP   PRIORITY     READONLYROOTFS   VOLUMES
ibm-spectrum-scale-restricted   false   <no value>   MustRunAs   MustRunAs   MustRunAs   RunAsAny   <no value>   false            ["configMap","downwardAPI","emptyDir","hostPath","persistentVolumeClaim","projected","secret"]

$ oc get scc  rook-ceph-csi
NAME            PRIV   CAPS    SELINUX    RUNASUSER   FSGROUP    SUPGROUP   PRIORITY     READONLYROOTFS   VOLUMES
rook-ceph-csi   true   ["*"]   RunAsAny   RunAsAny    RunAsAny   RunAsAny   <no value>   false            ["*"]
[didier@console ~]$
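
For reference, SCC admission picks among the SCCs available to the pod's service account by priority and then by restrictiveness, so comparing the priority and privilege fields of the two SCCs can help explain which one gets applied. A possible way to list them (field paths are the standard SCC fields):

# compare privilege, priority and runAsUser strategy of the two SCCs
$ oc get scc ibm-spectrum-scale-restricted rook-ceph-csi \
    -o custom-columns=NAME:.metadata.name,PRIV:.allowPrivilegedContainer,PRIORITY:.priority,RUNASUSER:.runAsUser.type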

Comment 21 Jose A. Rivera 2021-03-11 18:09:54 UTC
I see... What is the default value for "privileged" in the Pod Spec, true or false? I would imagine, since the rook-ceph-csi SCC is privileged, that it *must* be true. Is it also true for all other Pods running with the rook-ceph-csi SCC?

Comment 22 didier wojciechowski 2021-03-11 18:44:47 UTC
When the OCP Operators were installed, this part did not exist in the csi-rbdplugin-provisioner YAML file:

securityContext:
            privileged: true


I added it as a workaround for my issue.

I'm working with an IBMer who is working with the ibm-spectrum-scale development team. I will send you their answer.

Comment 23 didier wojciechowski 2021-03-12 09:36:28 UTC
The latest news I received this morning: the IBM lab solved the problem. The problem was caused by the IBM Spectrum Scale part. See the email sent yesterday. This ticket can be closed.

Comment 24 Shirisha S Rao 2021-07-06 12:08:19 UTC
Hi, I'm facing the same problem in my OCS deployment. Could I know what causes this issue?

Comment 26 Madhu Rajanna 2021-07-06 12:32:42 UTC
@Shirisha, do you have the cluster available so that I can check a few things?

Comment 27 Shirisha S Rao 2021-07-06 12:41:32 UTC
Attaching the must-gather for this cluster

Comment 28 Shirisha S Rao 2021-07-06 12:48:21 UTC
Created attachment 1798608 [details]
ocs must gather

Attaching must-gather of the cluster

Comment 29 Humble Chirammal 2021-07-07 13:08:32 UTC
Shirisha, the SCC here is `openshift.io/scc: hostaccess`, which causes the failure.

Can you tell me what accesses are granted on it? And, if possible, can you revert the SCC to `rook-ceph-csi`?

Comment 30 Humble Chirammal 2021-07-07 13:14:56 UTC
(In reply to Humble Chirammal from comment #29)
> Shirisha, The SCC here is `openshift.io/scc: hostaccess` which cause the
> failure. 
> 
> Can you tell me what accesses are on it ? and if possible revert the SCC to
> `rook-ceph-csi`?


Assuming the default `hostaccess` SCC is the one in use here, below is the default permission set on it:

hostaccess         false   []     MustRunAs   MustRunAsRange     MustRunAs   RunAsAny    <none>     false            [configMap downwardAPI emptyDir hostPath persistentVolumeClaim projected secret]

This looks like a replica of the `ibm-spectrum-scale-restricted` SCC, for which a similar issue was reported earlier in this Bugzilla.

Comment 31 Shirisha S Rao 2021-07-12 14:22:05 UTC
Hi, these commands were run on the cluster before the install:

oc adm policy add-scc-to-group anyuid system:authenticated
oc adm policy add-scc-to-group hostaccess system:authenticated
oc adm policy add-scc-to-user anyuid system:serviceaccount:myproject:mysvcacct

Could this be the cause of the issue?
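
For reference, if these broad grants turn out to be the cause, the reverse of them (a sketch, assuming the grants can safely be removed from this cluster) would be:

# undo the group/user-wide SCC grants listed above
$ oc adm policy remove-scc-from-group anyuid system:authenticated
$ oc adm policy remove-scc-from-group hostaccess system:authenticated
$ oc adm policy remove-scc-from-user anyuid system:serviceaccount:myproject:mysvcacct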

