Description of problem (please be as detailed as possible and provide log snippets):

We have SUID binaries in our persistent volumes that we back up and restore using OADP with OCS 4.6.3 CephFS CSI snapshots. However, after the restore, the setuid and setgid bits are cleared.

Version of all relevant components (if applicable):
OADP 0.2.0 with CSI plugin
Local Storage 4.6.0-202102200141.p0
OpenShift Container Storage (OCS) 4.6.3
Kubernetes Version: v1.19.0+7070803

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
This is a critical blocker for us. Our services cannot start up properly because the setuid/setgid bits are cleared.

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1 - very simple

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
0. Follow the instructions at https://github.com/konveyor/oadp-operator/blob/master/README.md to create an instance of Velero in the oadp-operator namespace. default_velero_plugins should include openshift, csi and aws.
1. Create a Kubernetes namespace called testns.
2. Create a PVC with CephFS as its storage class (used for CSI snapshotting later), e.g.:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-cephfs-pvc
spec:
  storageClassName: ocs-storagecluster-cephfs
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi

3. Create a Kubernetes Pod that mounts the CephFS persistent volume from step 2, e.g.:

apiVersion: v1
kind: Pod
metadata:
  name: testpod
spec:
  containers:
  - command:
    - sleep
    - infinity
    image: registry.access.redhat.com/ubi8/ubi:latest
    imagePullPolicy: Always
    name: main
    resources: {}
    volumeMounts:
    - mountPath: /mnt
      name: cpd-data-vol
  restartPolicy: Never
  volumes:
  - name: cpd-data-vol
    persistentVolumeClaim:
      claimName: demo-cephfs-pvc

4. oc cp an executable (with the setuid and setgid bits set) to the pod's mount path /mnt. Here we have an executable named testsuid:

chmod +s testsuid
-rwsr-sr-x 1 root root 2103852 Mar 3 08:55 testsuid
oc cp testsuid testpod:/mnt

5. Exec into the pod and confirm the executable still has the setuid and setgid bits set:

oc rsh testpod
ls -l /mnt
-rwsr-sr-x 1 root root 2103852 Mar 3 08:55 testsuid

6. Back up the testns namespace using the velero CLI, e.g.:

velero backup create -n oadp-operator --include-namespaces=testns --exclude-resources='Event,Event.events.k8s.io' testbackup1

7. After a successful backup, delete namespace testns to prepare for the restore.
8. Restore testns using the velero CLI, e.g.:

velero restore create -n oadp-operator --from-backup testbackup1 --exclude-resources='ImageTag' testrestore1

9. After the restore succeeds, exec into the pod; the setuid and setgid 's' bits are cleared.

Actual results:
The setuid and setgid 's' bits are cleared:

-rwxr-xr-x 1 root root 2103852 Mar 3 09:00 testsuid

Expected results:
We expect the CephFS CSI snapshot restore operation to preserve the setuid and setgid permission bits of executable binaries.

Additional info:
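The steps above copy a prebuilt executable named testsuid into the volume; the report does not include its source, and any executable works for checking the mode bits. As a purely hypothetical stand-in, a sketch like the following can be built with `go build -o testsuid testsuid.go` and given the s-bits with `chmod +s testsuid`; it prints the real and effective ids so the effect of the bits is also visible when it is run:

// testsuid.go - hypothetical stand-in for the "testsuid" binary in step 4.
// When run from a setuid/setgid copy, euid/egid differ from the real ids.
package main

import (
	"fmt"
	"os"
)

func main() {
	fmt.Printf("uid=%d euid=%d gid=%d egid=%d\n",
		os.Getuid(), os.Geteuid(), os.Getgid(), os.Getegid())
}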
> velero restore create -n oadp-operator --from-backup testbackup1 --exclude-resources='ImageTag' testrestore1

I am not sure about the backend actions Velero performs at restore time. Can you please explain the internals? Is it creating a new PVC and then filling its contents from the backup (e.g. object store) location? How does the content restoration happen (which pod mounts the volume and fills it, etc.)?

FYI, at mount time the Ceph CSI driver does the following:

----------- snip ---------------------------------
	if !csicommon.MountOptionContains(kernelMountOptions, readOnly) && !csicommon.MountOptionContains(fuseMountOptions, readOnly) {
		// #nosec - allow anyone to write inside the stagingtarget path
		err = os.Chmod(stagingTargetPath, 0777) // ----------------------> [1]
		if err != nil {
			util.ErrorLog(ctx, "failed to change stagingtarget path %s permission for volume %s: %v", stagingTargetPath, volID, err)
			uErr := unmountVolume(ctx, stagingTargetPath)
			if uErr != nil {
				util.ErrorLog(ctx, "failed to umount stagingtarget path %s for volume %s: %v", stagingTargetPath, volID, uErr)
			}
			return status.Error(codes.Internal, err.Error())
		}
	}
	return nil
-------------------- /snip --------------------

[1] The staging path of the volume is converted to the above permission.
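To illustrate why [1] by itself should not clear the s-bits of files inside the volume (a later comment makes the same point), here is a minimal sketch run outside of any CSI code; the temporary directory and file names are made up for the illustration:

// chmod_dir.go - minimal sketch (not CSI code): a chmod on a directory changes
// only the directory's own mode bits; a setuid/setgid file beneath it keeps its bits.
package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"
)

func main() {
	staging, err := os.MkdirTemp("", "stagingtargetpath")
	if err != nil {
		log.Fatal(err)
	}
	defer os.RemoveAll(staging)

	exe := filepath.Join(staging, "testsuid")
	if err := os.WriteFile(exe, []byte("#!/bin/true\n"), 0o755); err != nil {
		log.Fatal(err)
	}
	// Give the file the setuid and setgid bits, as in the reproducer.
	if err := os.Chmod(exe, 0o755|os.ModeSetuid|os.ModeSetgid); err != nil {
		log.Fatal(err)
	}

	// Equivalent of [1]: open up the staging directory itself.
	if err := os.Chmod(staging, 0o777); err != nil {
		log.Fatal(err)
	}

	fi, err := os.Stat(exe)
	if err != nil {
		log.Fatal(err)
	}
	// Both print true: the chmod on the parent directory did not touch the file.
	fmt.Printf("setuid kept: %v, setgid kept: %v\n",
		fi.Mode()&os.ModeSetuid != 0, fi.Mode()&os.ModeSetgid != 0)
}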
Kubernetes also does a recursive ownership change at mount time; maybe that could be the issue.
(In reply to Humble Chirammal from comment #4)
> Kubernetes also does a recursive ownership change at mount time; maybe
> that could be the issue.

https://github.com/kubernetes/kubernetes/blob/aa28a3563b5161a7a07d1a07afedc4d1d248bc47/pkg/volume/volume_linux.go#L96
https://github.com/kubernetes/kubernetes/blob/aa28a3563b5161a7a07d1a07afedc4d1d248bc47/pkg/volume/volume_linux.go#L141
Comment #3 should not be a problem; that only changes the permissions of the root directory of the volume.

Comment #4 might be a problem, as the following procedure would drop the s-bit too:

# cp /bin/true /mnt/testsuid
# chgrp yipng /mnt/testsuid
# chmod g+s /mnt/testsuid
# ls -l /mnt/testsuid
-rwxr-sr-x. 1 root root 28680 Feb 18 15:54 /mnt/testsuid

The s-bit will be set at this point in time. But when the PVC gets mounted to another pod with different uids/gids, kubelet changes the ownership of the files (outside the container namespace). This is similar to:

# chgrp hchiramm /mnt/testsuid
-rwxr-xr-x. 1 root root 28680 Feb 18 15:54 /mnt/testsuid

As comment #5 points out, kubelet does a recursive ownership change. A quick glance over those pieces of the code does not show any special handling for the s-bit on non-directories.
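A minimal sketch of that mechanism, run outside Kubernetes: this is not the kubelet code (which also adjusts permission bits), and the temporary directory merely stands in for the volume. On Linux, a chown(2) on a regular executable clears the setuid/setgid bits even when the owner and group do not actually change, so a recursive ownership pass drops the s-bits and nothing puts them back:

// recursive_pass.go - simplified sketch of a recursive ownership pass, NOT the
// kubelet code; it shows that such a pass drops the s-bits along the way.
package main

import (
	"fmt"
	"io/fs"
	"log"
	"os"
	"path/filepath"
)

func main() {
	vol, err := os.MkdirTemp("", "vol") // stand-in for the mounted volume
	if err != nil {
		log.Fatal(err)
	}
	defer os.RemoveAll(vol)

	exe := filepath.Join(vol, "testsuid")
	if err := os.WriteFile(exe, []byte("#!/bin/true\n"), 0o755); err != nil {
		log.Fatal(err)
	}
	if err := os.Chmod(exe, 0o755|os.ModeSetuid|os.ModeSetgid); err != nil {
		log.Fatal(err)
	}

	// Walk the volume and chown every entry, roughly what a recursive
	// ownership change amounts to. We chown to our own uid/gid so the
	// sketch runs unprivileged; the kernel still clears the s-bits.
	uid, gid := os.Getuid(), os.Getgid()
	err = filepath.WalkDir(vol, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		return os.Lchown(path, uid, gid)
	})
	if err != nil {
		log.Fatal(err)
	}

	fi, _ := os.Stat(exe)
	fmt.Printf("after recursive chown: setuid kept=%v setgid kept=%v\n", // both false
		fi.Mode()&os.ModeSetuid != 0, fi.Mode()&os.ModeSetgid != 0)
}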
I am setting needinfo on Hemanth (OCP storage team) for his input too, as he worked on fsGroup, which aims to address these areas of code/problems in Kubernetes:
https://kubernetes.io/blog/2020/12/14/kubernetes-release-1.20-fsgroupchangepolicy-fsgrouppolicy/
Out of curiosity, is this issue reproducible with other CSI PVCs too, or are you hitting it only with the Ceph CSI driver?
(In reply to Humble Chirammal from comment #8)
> Out of curiosity, is this issue reproducible with other CSI PVCs too, or
> are you hitting it only with the Ceph CSI driver?

We are hitting this when using CephFS CSI but have not encountered the issue with Ceph RBD CSI.
(In reply to Yip Ng from comment #10)
> (In reply to Humble Chirammal from comment #8)
> > Out of curiosity, is this issue reproducible with other CSI PVCs too, or
> > are you hitting it only with the Ceph CSI driver?
>
> We are hitting this when using CephFS CSI but have not encountered the
> issue with Ceph RBD CSI.

Thanks. However, I was mainly asking about CSI drivers/PVCs other than Ceph (FS/RBD): if this is a Kubernetes issue/bug/behaviour, I would expect it to be hit on restore of other filesystem-type PVCs as well. That would confirm the behaviour or help us isolate the issue to a component/layer.

Also, for RBD, can you please confirm which volumeMode (Filesystem/Block) is being used, since the issue has not been hit there?
Given that this is CephFS, which is an RWX type with no filesystem on it, kubelet should not be performing a recursive chown/chmod of the said volume, and hence should not be the one removing the suid/sgid bits. Having said that, for Ceph RBD volumes kubelet *may* remove suid bits (I am not sure yet; I have not tested it).
(In reply to Humble Chirammal from comment #11)
> (In reply to Yip Ng from comment #10)
> > (In reply to Humble Chirammal from comment #8)
> > > Out of curiosity, is this issue reproducible with other CSI PVCs too,
> > > or are you hitting it only with the Ceph CSI driver?
> >
> > We are hitting this when using CephFS CSI but have not encountered the
> > issue with Ceph RBD CSI.
>
> Thanks. However, I was mainly asking about CSI drivers/PVCs other than
> Ceph (FS/RBD): if this is a Kubernetes issue/bug/behaviour, I would expect
> it to be hit on restore of other filesystem-type PVCs as well. That would
> confirm the behaviour or help us isolate the issue to a component/layer.

This is on my colleague's system, and it only has Ceph and NFS storage; hence he only worked with the Ceph CSI driver.

> Also, for RBD, can you please confirm which volumeMode (Filesystem/Block)
> is being used, since the issue has not been hit there?

The volumeMode is Filesystem.
Have created a ceph BZ.
https://bugzilla.redhat.com/show_bug.cgi?id=1938669 is targeted for RHCS 4.2z2, which means it won't make it into OCS 4.7, as OCS 4.7 is consuming RHCS 4.2z1. I am moving this BZ to OCS 4.8; we can pick it up in the next z-stream of OCS 4.7, which might use RHCS 4.2z2.
Moving this back to 4.7, as we need to deliver this fix in OCS 4.7.
@khiremat any update on the PR?
Please see https://bugzilla.redhat.com/show_bug.cgi?id=1938669#c7.
Because the fix for Ceph BZ https://bugzilla.redhat.com/show_bug.cgi?id=1938669 is already there in 4.2z1, we can take this BZ back. The fix should be available with the next 4.7 build. Hope this is the last ping-pong :)
Verified in version:
OCS 4.7.0-351.ci
OCP 4.7.0-0.nightly-2021-04-15-035247
ceph version 14.2.11-147.el8cp (1f54d52f20d93c1b91f1ec6af4c67a4b81402800) nautilus (stable)

Steps performed:

1. Create CephFS PVC 'pvc1' and attach it to an app pod running on node1.
2. Copy two executable files to the app pod. Output from the app pod:

# ls -lrt print_hello.py calc_one_plus_five.py
-rwsrwsr-x. 1 1000 1000 15 Apr 15 11:43 print_hello.py
-rwsrwsr-x. 1 1000 1000 11 Apr 15 11:43 calc_one_plus_five.py

3. Create a snapshot of the PVC 'pvc1'.
4. Restore the snapshot to create a PVC 'pvc1-snapshot-restore' and attach it to a pod 'pod2' running on node2. Output from pod2:

# ls -lrt print_hello.py calc_one_plus_five.py
-rwsrwsr-x. 1 1000 1000 15 Apr 15 11:43 print_hello.py
-rwsrwsr-x. 1 1000 1000 11 Apr 15 11:43 calc_one_plus_five.py

5. Clone the restored PVC 'pvc1-snapshot-restore' to create PVC 'pvc1-snapshot-restore-clone'.
6. Attach the cloned PVC 'pvc1-snapshot-restore-clone' to a pod 'pod3' running on node3. Output from pod3:

# ls -lrt print_hello.py calc_one_plus_five.py
-rwsrwsr-x. 1 1000 1000 15 Apr 15 11:43 print_hello.py
-rwsrwsr-x. 1 1000 1000 11 Apr 15 11:43 calc_one_plus_five.py

The 's' bits are retained in the restored (step 4) and cloned (step 6) volumes. Moving this bug to verified state.
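Not part of the original verification, but this kind of check can be scripted with a small hypothetical helper run inside each pod; it exits non-zero if any file given on the command line is missing the setuid or setgid bit (file names and paths are whatever the volume mount contains, e.g. ./checksbits print_hello.py calc_one_plus_five.py):

// checksbits.go - hypothetical helper (not used in the verification above):
// exits non-zero if any listed path lacks the setuid or setgid bit.
package main

import (
	"fmt"
	"os"
)

func main() {
	failed := false
	for _, path := range os.Args[1:] {
		fi, err := os.Stat(path)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			failed = true
			continue
		}
		if fi.Mode()&os.ModeSetuid == 0 || fi.Mode()&os.ModeSetgid == 0 {
			fmt.Printf("%s: setuid/setgid bit missing (mode %v)\n", path, fi.Mode())
			failed = true
		}
	}
	if failed {
		os.Exit(1)
	}
}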
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2041