Created attachment 1772550 [details]
Example showing rbd does not do this

Description of problem (please be detailed as possible and provide log snippets):

Each time a cephfs PV is mounted, all files' ctime changes. This behavior causes problems for backup software, specifically Restic: incremental backups generate a lot of metadata changes and take longer than expected since they must save the (now updated) ctime attribute.

This behavior is not seen for RBD volumes, nor for gp2 or gp2-csi volumes.

Version of all relevant components (if applicable):
OCS 4.6 running in OSD
OCP:
  Server Version: 4.7.5
  Kubernetes Version: v1.20.0+bafe72f

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
It makes backup applications much less efficient to use with cephfs than with other file systems.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
N/A

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Create a cephfs PVC w/ OCS
2. Start a pod using the PVC
3. Touch a file on the mounted volume
4. `stat` the file and note the ctime
5. Stop the pod
6. Start the pod again
7. `stat` the file created previously and note the updated ctime
8. Repeat steps 5-7, noting an increasing ctime with each pod restart

Additional info:
RBD and gp2 (EBS) volumes will show an updated ctime on the 1st remount of the volume, but not on subsequent remounts.

Example of reproducing steps w/ cephfs:

[jstrunk osd]$ oc -n cephfs apply -f cephfs-test.yaml
persistentvolumeclaim/datavol created
pod/centos created
[jstrunk osd]$ oc -n cephfs wait --for=condition=Ready po/centos
pod/centos condition met
[jstrunk osd]$ oc -n cephfs exec centos -- touch /mnt/testfile
[jstrunk osd]$ oc -n cephfs exec centos -- stat /mnt/testfile
  File: /mnt/testfile
  Size: 0               Blocks: 0          IO Block: 4194304 regular empty file
Device: 30002fh/3145775d        Inode: 1099511628792  Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2021-04-16 17:07:39.688115623 +0000
Modify: 2021-04-16 17:07:39.688115623 +0000
Change: 2021-04-16 17:07:39.688115623 +0000
 Birth: -
[jstrunk osd]$ oc -n cephfs delete po/centos
pod "centos" deleted
[jstrunk osd]$ oc -n cephfs apply -f cephfs-test.yaml
persistentvolumeclaim/datavol unchanged
pod/centos created
[jstrunk osd]$ oc -n cephfs wait --for=condition=Ready po/centos
pod/centos condition met
[jstrunk osd]$ oc -n cephfs exec centos -- stat /mnt/testfile
  File: /mnt/testfile
  Size: 0               Blocks: 0          IO Block: 4194304 regular empty file
Device: 30002fh/3145775d        Inode: 1099511628792  Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2021-04-16 17:07:39.688115623 +0000
Modify: 2021-04-16 17:07:39.688115623 +0000
Change: 2021-04-16 17:08:18.637972717 +0000
 Birth: -
[jstrunk osd]$ oc -n cephfs delete po/centos
pod "centos" deleted
[jstrunk osd]$ oc -n cephfs apply -f cephfs-test.yaml
persistentvolumeclaim/datavol unchanged
pod/centos created
[jstrunk osd]$ oc -n cephfs wait --for=condition=Ready po/centos
pod/centos condition met
[jstrunk osd]$ oc -n cephfs exec centos -- stat /mnt/testfile
  File: /mnt/testfile
  Size: 0               Blocks: 0          IO Block: 4194304 regular empty file
Device: 30002fh/3145775d        Inode: 1099511628792  Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2021-04-16 17:07:39.688115623 +0000
Modify: 2021-04-16 17:07:39.688115623 +0000
Change: 2021-04-16 17:08:44.853875749 +0000
 Birth: -
[jstrunk osd]$ oc -n cephfs delete po/centos
pod "centos" deleted
[jstrunk osd]$ oc -n cephfs apply -f cephfs-test.yaml
persistentvolumeclaim/datavol unchanged
pod/centos created
[jstrunk osd]$ oc -n cephfs wait --for=condition=Ready po/centos
pod/centos condition met
[jstrunk osd]$ oc -n cephfs exec centos -- stat /mnt/testfile
  File: /mnt/testfile
  Size: 0               Blocks: 0          IO Block: 4194304 regular empty file
Device: 30002fh/3145775d        Inode: 1099511628792  Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2021-04-16 17:07:39.688115623 +0000
Modify: 2021-04-16 17:07:39.688115623 +0000
Change: 2021-04-16 17:09:15.849765124 +0000
 Birth: -
Created attachment 1772551 [details]
Example showing gp2 does not do this
Looks like a Ceph problem to me. Not a blocker for 4.7, as it exists in 4.5 also. Moving it out; if required, we can take it in 4.7.z.

Niels, can you please take a look?
I neglected to post the yamls I was using to test. Below is what I have been using:

---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: datavol
spec:
  storageClassName: ocs-storagecluster-cephfs
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
---
kind: Pod
apiVersion: v1
metadata:
  name: centos
spec:
  containers:
    - name: centos
      image: centos:8
      command: ["/bin/bash", "-c"]
      args: ["sleep 999999"]
      volumeMounts:
        - name: data
          mountPath: "/mnt"
  terminationGracePeriodSeconds: 5
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: datavol
I can reproduce this on CephFS, with both ReadWriteMany and ReadWriteOnce volumes.

Without any changes to the provided yaml, I can see differences in the extended attribute that relates to SELinux. It seems that the file is relabeled every time a Pod is started.

$ kubectl exec -ti centos-rwo -- ls -Z /mnt/bz1950495
system_u:object_r:container_file_t:s0:c187,c377 /mnt/bz1950495
$ kubectl delete pod/centos-rwo
pod "centos-rwo" deleted
$ kubectl apply -f cephfs-rwo.yaml
persistentvolumeclaim/bz1950495-rwo unchanged
pod/centos-rwo created
$ kubectl exec -ti centos-rwo -- ls -Z /mnt/bz1950495
system_u:object_r:container_file_t:s0:c374,c563 /mnt/bz1950495

Setting a securityContext with seLinuxOptions for the container should remove the need to relabel the contents of the volume (and update the xattrs/ctime):

      securityContext:
        seLinuxOptions:
          level: "s0:c123,c321"

$ kubectl exec -ti centos-selinux -- ls -Z /mnt/bz1950495
system_u:object_r:container_file_t:s0:c123,c321 /mnt/bz1950495

The output shows that the change has the expected result in the SELinux xattr. Unfortunately, this does not prevent the modification of the ctime; it still gets updated.

Manually running `chcon --range=....` on a file that already has the same SELinux context does not update the ctime (on Fedora 33). However, running `chmod` repeatedly on the file, where the attributes are the same on every invocation, causes the ctime to be updated. So, at the moment I am guessing that somewhere in the stack, (l)chmod is executed.
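For reference, a minimal sketch of where the securityContext snippet above would sit in the test pod posted earlier (same pod/PVC names as that yaml; the level value is the illustrative one from above, not a recommendation):

---
kind: Pod
apiVersion: v1
metadata:
  name: centos
spec:
  containers:
    - name: centos
      image: centos:8
      command: ["/bin/bash", "-c"]
      args: ["sleep 999999"]
      # Pin the SELinux MCS level so every pod start uses the same label
      # (value is illustrative; pick one per namespace/workload).
      securityContext:
        seLinuxOptions:
          level: "s0:c123,c321"
      volumeMounts:
        - name: data
          mountPath: "/mnt"
  terminationGracePeriodSeconds: 5
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: datavol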
Niels, what is the next action for this bug?
I suspect this runs into pkg/volume/volume_linux.go:SetVolumeOwnership in the kubelet. fsGroup support is a relatively new feature that CSI drivers can use to indicate whether ownership of mounted volumes needs to be changed. Currently, Ceph-CSI does not seem to handle this well.

Upstream testing suggests that Ceph-CSI needs to improve its fsGroup support: https://github.com/ceph/ceph-csi/issues/2017
fsGroup Support: https://kubernetes-csi.github.io/docs/support-fsgroup.html
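The page above documents the fsGroupPolicy field on the CSIDriver object, which is how a driver tells the kubelet whether to run the recursive ownership/permission change on mount. A minimal sketch only; the driver name is my assumption of what Ceph-CSI registers for CephFS in an OCS cluster, the object is normally managed by the operator rather than edited by hand, and "None" is shown purely to illustrate the knob:

apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: openshift-storage.cephfs.csi.ceph.com   # assumed OCS driver name
spec:
  # "None" tells the kubelet to skip the recursive chown/chmod for this driver.
  fsGroupPolicy: None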
No plans to work on this feature in 4.9
I do believe this is a bug, since neither RBD nor EBS requires disabling SELinux to avoid the metadata changes. While you are probably right that there's nothing the CSI driver can do, this should be investigated by the cephfs folks since it presents a usability problem with the product.
I agree this is a major issue, and I believe it has come up in several places, but changing SELinux labels is in fact a ctime-causing change, and this is not behavior CephFS is doing anything to induce, AFAIK. Issues with relabeling on mount need to be addressed in ODF or possibly OSP proper.

Madhu, I'm happy to discuss the details if needed, but I'm not sure why you kicked this to us.
I'm not really sure why this is needinfo on me...

The situation: RBD & GP2 don't have this behavior; cephfs does. So, there's something different in the storage stack between cephfs and xfs/ext4 (I don't remember which we're using these days) that leads to ctime changes. The result is a poor user experience.

There is acknowledgement in kube upstream that relabeling on mount is time consuming. Note that their focus is on the mount delay caused, not the ctime change. It's also unclear whether the upstream solution [1] would be usable by cephfs when it becomes available, due to sub-volume mounting.

Short of figuring out the storage stack difference, this seems to just leave disabling selinux for cephfs volumes.

[1] https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/1710-selinux-relabeling
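For context, the KEP linked as [1] proposes mounting volumes with a fixed SELinux context instead of relabeling every file. Something in that spirit can in principle be attempted today via mount options, though whether Ceph-CSI forwards them and whether the CephFS client accepts context= would need to be verified. Sketch only: the class name is invented and the usual Ceph-CSI parameters/secrets are omitted:

# Hypothetical StorageClass: force a single SELinux context at mount time so
# no per-file relabel (and thus no ctime change) would be needed. Verify that
# the CSI driver forwards mountOptions and that the CephFS mount accepts
# the context= option before relying on this.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ocs-storagecluster-cephfs-fixed-context   # invented name
provisioner: openshift-storage.cephfs.csi.ceph.com
# parameters/secrets as in the existing ocs-storagecluster-cephfs class (omitted)
mountOptions:
  - context=system_u:object_r:container_file_t:s0
reclaimPolicy: Delete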
Stumbled across this again...

(In reply to John Strunk from comment #19)
> I'm not really sure why this is needinfo on me...
>
> The situation: RBD & GP2 don't have this behavior, cephfs does.

The original report states:

> Additional info:
> RBD and gp2 (EBS) volumes will show an updated ctime on the 1st remount of
> the volume, but not on subsequent remounts.

So this is true:

> So, there's
> something different in the storage stack between cephfs and xfs/ext4 (I
> don't remember which we're using these days) that leads to ctime changes.
> The result is poor user experience.

But putting that together, it seems like OpenShift/ODF somehow knows that it has already done the relabel on rbd/gp2, but does not know that on CephFS. I have no idea how that's possible or what the difference is; the most obvious distinction is that CephFS is used for RWX pods and RBD is used for RWO, so perhaps there's different logic in Kubernetes itself somewhere.

So if this needs improvement beyond the other ongoing SELinux relabel improvements, it will either need another Kubernetes-side fix, or the OpenShift team will need to tell us how CephFS is behaving differently in a way that induces this behavior. :)

> There is acknowledgement in kube upstream that relabeling on mount is time
> consuming. Note that their focus is on the mount delay caused, not a ctime
> change. It's also unclear whether the upstream solution [1] would be usable
> by cephfs when it becomes available due to sub-volume mounting.
>
> Short of figuring out the storage stack difference, this seems to just leave
> disabling selinux for cephfs volumes.
>
> [1]
> https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/1710-
> selinux-relabeling
Madhu, PTAL
I can see ctime is changed on AWS EBS volumes too, most probably during SELinux relabeling.

In these cases it heavily depends on who creates a pod and how:

- Unprivileged Pods (i.e. with the `restricted` SCC), typically created by "regular users", get fsGroup + SELinux context from the namespaces where they're running. And since those are constant, there are ways to skip the chown / chmod / chcon, see https://access.redhat.com/solutions/6221251.

- Pods with the `privileged` SCC, typically created by kubeadmin or a privileged operator, may not get fsGroup / SELinux context from the namespace. fsGroup won't be applied (that's good in this case), but an empty SELinux context means that CRI-O will allocate a new one + recursively chcon all files on all volumes. Since this context is chosen randomly for each Pod startup, there is no optimization possible. Use an explicit SELinux context in such pods, e.g. spc_t, or run the pods as privileged (which implies spc_t too) + the workarounds linked above (see the sketch below).
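A minimal sketch of the two options mentioned in the second bullet, for pods running under the `privileged` SCC. The pod/container/image names are placeholders; the PVC name reuses the test claim from earlier in this bug, and the two securityContext options are alternatives rather than something to combine blindly:

apiVersion: v1
kind: Pod
metadata:
  name: privileged-workload    # placeholder name
spec:
  containers:
    - name: app                # placeholder container
      image: registry.example.com/app:latest   # placeholder image
      securityContext:
        # Option 1: pin an explicit SELinux type so CRI-O does not pick a
        # random MCS label and relabel the volume on every pod start.
        seLinuxOptions:
          type: spc_t
        # Option 2 (alternative): run the container as privileged, which
        # implies spc_t as well.
        # privileged: true
      volumeMounts:
        - name: data
          mountPath: /mnt
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: datavol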