Bug 1936545 - [Tracker for BZ #1938669] setuid and setgid file bits are not retained after an OCS CephFS CSI restore
Summary: [Tracker for BZ #1938669] setuid and setgid file bits are not retained after an OCS CephFS CSI restore
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: ceph
Version: 4.6
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: OCS 4.7.0
Assignee: Venky Shankar
QA Contact: Jilju Joy
URL:
Whiteboard:
Depends On: 1938669
Blocks: 1951600
 
Reported: 2021-03-08 17:42 UTC by Yip Ng
Modified: 2021-09-24 09:47 UTC
CC List: 19 users

Fixed In Version: 4.7.0-344.ci
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1938669 1951600
Environment:
Last Closed: 2021-05-19 09:20:08 UTC
Embargoed:
khiremat: needinfo-




Links
Red Hat Product Errata RHSA-2021:2041 (last updated 2021-05-19 09:20:59 UTC)

Description Yip Ng 2021-03-08 17:42:08 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

We have SUID binaries in our persistent volumes that we restore using OADP with OCS 4.6.3 CephFS CSI snapshot. However, after the restore, the setuid and setgid bits are cleared. 

Version of all relevant components (if applicable):
OADP 0.2.0 with CSI plugin
Local Storage 4.6.0-202102200141.p0
OpenShift Container Storage(OCS) 4.6.3

Kubernetes Version: v1.19.0+7070803


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

This is a critical blocker for us. Our services cannot start up properly since the setuid/setgid bits are cleared. 


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
1 - very simple

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:

0. Follow the instructions here to create an instance of Velero in the oadp-operator namespace:
   - https://github.com/konveyor/oadp-operator/blob/master/README.md
   The default_velero_plugins list should include openshift, csi and aws (a minimal sketch of this spec fragment follows below).
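   For illustration only, here is a minimal sketch of that plugin-list fragment of the oadp-operator Velero custom resource. Apart from the default_velero_plugins field named above, everything (apiVersion, kind, metadata, other spec fields) is omitted or assumed and is not taken from this report:

   # Hypothetical spec fragment; only default_velero_plugins is named in step 0.
   spec:
     default_velero_plugins:
     - openshift
     - csi
     - aws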
 
1. Create a Kubernetes namespace called testns
 
2. Create a PVC with CephFS as its storage class (use for CSI snapshotting later), e.g.:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-cephfs-pvc
spec:
  storageClassName: ocs-storagecluster-cephfs
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi

3. Create a Kubernetes Pod that mounts CephFS persistent volume in step 2, e.g.:
apiVersion: v1
kind: Pod
metadata:
  name: testpod
spec:
  containers:
  - command:
    - sleep
    - infinity
    image: registry.access.redhat.com/ubi8/ubi:latest
    imagePullPolicy: Always
    name: main
    resources: {}
    volumeMounts:
    - mountPath: /mnt
      name: cpd-data-vol
  restartPolicy: Never
  volumes:
  - name: cpd-data-vol
    persistentVolumeClaim:
      claimName: demo-cephfs-pvc

4. Use oc cp to copy an executable (with the setuid and setgid permission bits set) to the pod's mount path /mnt, e.g.:
    Here we have an executable named testsuid:

   chmod +s testsuid
   -rwsr-sr-x 1 root root 2103852 Mar  3 08:55 testsuid
 
    oc cp testsuid testpod:/mnt
 
5. Exec into the pod and ensure the executable still has the setuid and setgid bits set:
 
    oc rsh testpod
    ls -l /mnt
    -rwsr-sr-x 1 root root 2103852 Mar  3 08:55 testsuid
 
6. Back up the testns namespace using the velero CLI, e.g.:
 
    velero backup create -n oadp-operator --include-namespaces=testns --exclude-resources='Event,Event.events.k8s.io' testbackup1
 
7. After a successful backup, delete the testns namespace to prepare for the restore.
 
8. Restore the testns namespace using the velero CLI, e.g.:
   
    velero restore create -n oadp-operator --from-backup testbackup1 --exclude-resources='ImageTag' testrestore1
 
9. After the restore succeeds, exec into the pod; the setuid and setgid 's' bits are cleared.

 
Actual results:
   The setuid and setgid 's' bits are cleared
   -rwxr-xr-x 1 root root 2103852 Mar  3 09:00 testsuid

Expected results:

We expect that CephFS CSI snapshot restore operation preserves the setuid and setgid permission bits of the executable binaries.

Additional info:

Comment 3 Humble Chirammal 2021-03-09 07:29:40 UTC
>velero restore create -n oadp-operator --from-backup testbackup1 --exclude-resources='ImageTag' testrestore1

I am not sure about the backend actions Velero performs at restore time. Can you please explain the internals? Does it create a new PVC and then fill the contents from the backup location (e.g. an object store)? How does the content restoration happen (which pod mounts the volume and fills it, etc.)?

FYI, at mount time the Ceph CSI driver does the following:

----------- snip --------------------------------- 



	if !csicommon.MountOptionContains(kernelMountOptions, readOnly) && !csicommon.MountOptionContains(fuseMountOptions, readOnly) {
		// #nosec - allow anyone to write inside the stagingtarget path
		err = os.Chmod(stagingTargetPath, 0777) // ----> [1]
		if err != nil {
			util.ErrorLog(ctx, "failed to change stagingtarget path %s permission for volume %s: %v", stagingTargetPath, volID, err)
			uErr := unmountVolume(ctx, stagingTargetPath)
			if uErr != nil {
				util.ErrorLog(ctx, "failed to umount stagingtarget path %s for volume %s: %v", stagingTargetPath, volID, uErr)
			}
			return status.Error(codes.Internal, err.Error())
		}
	}
	return nil

-------------------- /snip-------------------- 

[1] So the staging path of the volume is changed to the above permission (0777).

Comment 4 Humble Chirammal 2021-03-09 07:45:17 UTC
Kubernetes also does a recursive ownership change at mount time; maybe that could be the issue.

Comment 5 Humble Chirammal 2021-03-09 07:52:51 UTC
(In reply to Humble Chirammal from comment #4)
> Kubernetes also does a recursive ownership change at mount time; maybe that
> could be the issue.

https://github.com/kubernetes/kubernetes/blob/aa28a3563b5161a7a07d1a07afedc4d1d248bc47/pkg/volume/volume_linux.go#L96
https://github.com/kubernetes/kubernetes/blob/aa28a3563b5161a7a07d1a07afedc4d1d248bc47/pkg/volume/volume_linux.go#L141

Comment 6 Niels de Vos 2021-03-09 08:01:27 UTC
The code in comment #3 should not be a problem; it only changes the permissions of the root directory of the volume.

The behaviour in comment #4 might be a problem, as the following procedure would drop the s-bit too:

    # cp /bin/true /mnt/testsuid
    # chgrp yipng /mnt/testsuid
    # chmod g+s /mnt/testsuid
    # ls -l /mnt/testsuid
    -rwxr-sr-x. 1 root yipng 28680 Feb 18 15:54 /mnt/testsuid


The s-bit will be set at this point in time. But when the PVC gets mounted to another pod, with different uids/gids, Kubelet changes the ownership of the files (outside the container namespace). This is similar to

    # chgrp hchiramm /mnt/testsuid
    -rwxr-xr-x. 1 root hchiramm 28680 Feb 18 15:54 /mnt/testsuid

As comment #5 points out, Kubelet does a recursive ownership change. A quick glance over those pieces of code does not show any special handling for the s-bit on non-directories.

Comment 7 Humble Chirammal 2021-03-09 08:23:07 UTC
I am setting needinfo on Hemant (OCP storage team) for his input too, as he worked on fsGroup handling, which aims to address these areas of code/problems in Kubernetes.
https://kubernetes.io/blog/2020/12/14/kubernetes-release-1.20-fsgroupchangepolicy-fsgrouppolicy/
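For context, the feature referenced above is the pod-level fsGroupChangePolicy field. A minimal, illustrative pod fragment follows; it is only a sketch of the upstream feature, not a change proposed in this bug, and the pod name and fsGroup value are arbitrary examples:

apiVersion: v1
kind: Pod
metadata:
  name: fsgroup-policy-demo        # hypothetical name
spec:
  securityContext:
    fsGroup: 1000                  # example group ID, not from this report
    # Perform the recursive ownership/permission change only when the volume
    # root does not already match the expected fsGroup, instead of on every mount.
    fsGroupChangePolicy: "OnRootMismatch"
  containers:
  - name: main
    image: registry.access.redhat.com/ubi8/ubi:latest
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: data
      mountPath: /mnt
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: demo-cephfs-pvc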

Comment 8 Humble Chirammal 2021-03-09 08:32:45 UTC
Out of curiosity, is this issue reproducible with other CSI PVCs too, or are you hitting this issue only with the Ceph CSI driver?

Comment 10 Yip Ng 2021-03-09 15:07:35 UTC
(In reply to Humble Chirammal from comment #8)
> Out of curiosity, is this issue reproducible with other CSI PVCs too, or are
> you hitting this issue only with the Ceph CSI driver?

We are hitting this when using CephFS CSI but have not encountered the issue with Ceph RBD CSI.

Comment 11 Humble Chirammal 2021-03-10 07:22:33 UTC
(In reply to Yip Ng from comment #10)
> (In reply to Humble Chirammal from comment #8)
> > Out of curiosity, is this issue reproducible with other CSI PVCs too, or are
> > you hitting this issue only with the Ceph CSI driver?
> 
> We are hitting this when using CephFS CSI but have not encountered the issue
> with Ceph RBD CSI.

Thanks, however I was mainly asking about `other CSI` drivers/PVCs than Ceph (FS/RBD). The reason is that, if it is a Kubernetes issue/bug/behaviour, I would expect it to be hit on restore of other filesystem-type PVCs too. This could confirm or help us isolate the issue to a component/layer.

Also, for RBD, can you please confirm which volumeMode (Filesystem/Block) is referred to here (as the issue has not been hit)?

Comment 12 Hemant Kumar 2021-03-10 16:29:51 UTC
Given that this is CephFS, which is a RWX type with no filesystem on it, kubelet should not be performing a recursive chown/chmod of the said volume and hence should not be the one that is removing the suid/sgid bits.

Having said that - for Ceph RBD volumes, kubelet *may* remove suid bits (I am not sure yet, I have not tested it).

Comment 13 Yip Ng 2021-03-10 18:55:52 UTC
(In reply to Humble Chirammal from comment #11)
> (In reply to Yip Ng from comment #10)
> > (In reply to Humble Chirammal from comment #8)
> > > Out of curiosity, is this issue reproducible with other CSI PVCs too, or
> > > are you hitting this issue only with the Ceph CSI driver?
> > 
> > We are hitting this when using CephFS CSI but have not encountered the issue
> > with Ceph RBD CSI.
> 
> Thanks, however I was mainly asking about `other CSI` drivers/PVCs than Ceph
> (FS/RBD). The reason is that, if it is a Kubernetes issue/bug/behaviour, I
> would expect it to be hit on restore of other filesystem-type PVCs too. This
> could confirm or help us isolate the issue to a component/layer.
>

This is on my colleague's system and it only has Ceph and NFS storage; hence, he only worked with the Ceph CSI driver.


> Also, for RBD, can you please confirm which volumeMode (Filesystem/Block) is
> referred to here (as the issue has not been hit)?

The volumeMode is Filesystem.
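For reference, volumeMode is set on the PVC spec and defaults to Filesystem when omitted. A minimal illustrative RBD PVC follows; the PVC name, storage class name, and size are assumptions, not taken from this report:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-rbd-pvc                              # hypothetical name
spec:
  storageClassName: ocs-storagecluster-ceph-rbd   # assumed OCS RBD storage class
  volumeMode: Filesystem                          # the mode confirmed above
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi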

Comment 20 Mudit Agarwal 2021-03-15 07:22:16 UTC
I have created a Ceph BZ.

Comment 21 Mudit Agarwal 2021-03-16 07:02:18 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1938669 is targeted for RHCS 4.2z2, which means it won't make it into OCS 4.7, as OCS 4.7 is consuming RHCS 4.2z1.
I am moving this BZ to OCS 4.8; we can pick it up in the next z-stream of OCS 4.7, which might use RHCS 4.2z2.

Comment 22 Mudit Agarwal 2021-03-17 01:50:24 UTC
Moving this back to 4.7 as we need to deliver this fix in OCS 4.7.

Comment 25 James Espy 2021-03-24 16:59:58 UTC
@khiremat any update on the PR?

Comment 27 James Espy 2021-03-25 13:08:36 UTC
Please see https://bugzilla.redhat.com/show_bug.cgi?id=1938669#c7.

Comment 28 Mudit Agarwal 2021-03-29 11:32:46 UTC
Because the fix for Ceph BZ https://bugzilla.redhat.com/show_bug.cgi?id=1938669 is already there in 4.2z1, we can take this BZ back.
The fix should be available with the next 4.7 build.

Hope this is the last ping pong :)

Comment 31 Jilju Joy 2021-04-15 12:36:18 UTC
Verified in version:

OCS 4.7.0-351.ci
OCP 4.7.0-0.nightly-2021-04-15-035247
ceph version 14.2.11-147.el8cp (1f54d52f20d93c1b91f1ec6af4c67a4b81402800) nautilus (stable)

Steps performed:
1. Create CephFS PVC 'pvc1' and attach it to an app pod running on node1
2. Copy two executable files to the app pod.
Output from app pod:
# ls -lrt print_hello.py calc_one_plus_five.py
-rwsrwsr-x. 1 1000 1000 15 Apr 15 11:43 print_hello.py
-rwsrwsr-x. 1 1000 1000 11 Apr 15 11:43 calc_one_plus_five.py

3. Create a snapshot of the PVC 'pvc1' (see the YAML sketch at the end of this comment)
4. Restore the snapshot to create a PVC 'pvc1-snapshot-restore' and attach it to a pod 'pod2' running on node2.
Output from pod2:
# ls -lrt print_hello.py calc_one_plus_five.py
-rwsrwsr-x. 1 1000 1000 15 Apr 15 11:43 print_hello.py
-rwsrwsr-x. 1 1000 1000 11 Apr 15 11:43 calc_one_plus_five.py

5. Clone the restored PVC 'pvc1-snapshot-restore' to create PVC 'pvc1-snapshot-restore-clone'.
6. Attach the cloned PVC 'pvc1-snapshot-restore-clone' to a pod 'pod3' running on node3.
Output from pod3:
# ls -lrt print_hello.py calc_one_plus_five.py
-rwsrwsr-x. 1 1000 1000 15 Apr 15 11:43 print_hello.py
-rwsrwsr-x. 1 1000 1000 11 Apr 15 11:43 calc_one_plus_five.py

's' bits are retained in the restored (step 4) and cloned (step 6) volumes. Moving this bug to the verified state.
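For reference, a minimal YAML sketch of the objects such a snapshot/restore/clone test might use. This is illustrative only: the snapshot name, volumeSnapshotClassName, API version, and size are assumptions; the PVC names match the steps above.

# Hypothetical VolumeSnapshot of 'pvc1' (step 3); the snapshot class name is assumed.
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: pvc1-snapshot
spec:
  volumeSnapshotClassName: ocs-storagecluster-cephfsplugin-snapclass
  source:
    persistentVolumeClaimName: pvc1
---
# PVC restored from the snapshot (step 4).
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc1-snapshot-restore
spec:
  storageClassName: ocs-storagecluster-cephfs
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
  dataSource:
    name: pvc1-snapshot
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
---
# Clone of the restored PVC (steps 5-6); a PVC dataSource clones an existing PVC.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc1-snapshot-restore-clone
spec:
  storageClassName: ocs-storagecluster-cephfs
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
  dataSource:
    name: pvc1-snapshot-restore
    kind: PersistentVolumeClaim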

Comment 37 errata-xmlrpc 2021-05-19 09:20:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2041

