Bug 1950495

Summary: ctime for files in cephfs PV is updated each time volume is mounted
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: John Strunk <jstrunk>
Component: unclassified
Assignee: Mudit Agarwal <muagarwa>
Status: CLOSED WONTFIX
QA Contact: Elad <ebenahar>
Severity: medium
Priority: medium
Version: 4.10
CC: bniver, ceph-eng-bugs, gfarnum, hekumar, hyelloji, jsafrane, madam, mrajanna, muagarwa, ndevos, ocs-bugs, odf-bz-bot, rar, sostapov, tmuthami
Target Milestone: ODF Feature Freeze
Keywords: AutomationBackLog, Reopened
Flags: mrajanna: needinfo? (hekumar)
Hardware: x86_64
OS: Linux
Type: Bug
Last Closed: 2022-09-30 11:13:02 UTC

Attachments:
  - Example showing rbd does not do this
  - Example showing gp2 does not do this

Description John Strunk 2021-04-16 18:09:00 UTC
Created attachment 1772550 [details]
Example showing rbd does not do this

Description of problem (please be detailed as possible and provide log
snippets):

Each time a cephfs PV is mounted, the ctime of every file on the volume changes. This causes problems for backup software, specifically Restic: incremental backups generate a large amount of metadata churn and take longer than expected, since they must re-save the (now updated) ctime attribute.

This behavior is not seen for RBD volumes, nor for gp2 or gp2-csi volumes.



Version of all relevant components (if applicable):
OCS 4.6 running in OSD
OCP:
  Server Version: 4.7.5
  Kubernetes Version: v1.20.0+bafe72f



Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?

It makes backup applications much less efficient to use with cephfs than with other file systems.



Is there any workaround available to the best of your knowledge?
no

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
yes

Can this issue be reproduced from the UI?
N/A

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Create cephfs PVC w/ OCS
2. Start pod using PVC
3. Touch a file on the mounted volume
4. `stat` the file and note the ctime
5. Stop the pod
6. Start the pod again
7. `stat` the file created previously and note updated ctime
8. Repeat 5-7, noting increasing ctime w/ each pod restart


Additional info:

RBD and gp2 (EBS) volumes will show an updated ctime on the 1st remount of the volume, but not on subsequent remounts.


Example of reproducing steps w/ cephfs:

[jstrunk osd]$ oc -n cephfs apply -f cephfs-test.yaml
persistentvolumeclaim/datavol created
pod/centos created

[jstrunk osd]$ oc -n cephfs wait --for=condition=Ready po/centos
pod/centos condition met

[jstrunk osd]$ oc -n cephfs exec centos -- touch /mnt/testfile

[jstrunk osd]$ oc -n cephfs exec centos -- stat /mnt/testfile
  File: /mnt/testfile
  Size: 0         	Blocks: 0          IO Block: 4194304 regular empty file
Device: 30002fh/3145775d	Inode: 1099511628792  Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2021-04-16 17:07:39.688115623 +0000
Modify: 2021-04-16 17:07:39.688115623 +0000
Change: 2021-04-16 17:07:39.688115623 +0000
 Birth: -

[jstrunk osd]$ oc -n cephfs delete po/centos
pod "centos" deleted

[jstrunk osd]$ oc -n cephfs apply -f cephfs-test.yaml
persistentvolumeclaim/datavol unchanged
pod/centos created

[jstrunk osd]$ oc -n cephfs wait --for=condition=Ready po/centos
pod/centos condition met

[jstrunk osd]$ oc -n cephfs exec centos -- stat /mnt/testfile
  File: /mnt/testfile
  Size: 0         	Blocks: 0          IO Block: 4194304 regular empty file
Device: 30002fh/3145775d	Inode: 1099511628792  Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2021-04-16 17:07:39.688115623 +0000
Modify: 2021-04-16 17:07:39.688115623 +0000
Change: 2021-04-16 17:08:18.637972717 +0000
 Birth: -

[jstrunk osd]$ oc -n cephfs delete po/centos
pod "centos" deleted

[jstrunk osd]$ oc -n cephfs apply -f cephfs-test.yaml
persistentvolumeclaim/datavol unchanged
pod/centos created

[jstrunk osd]$ oc -n cephfs wait --for=condition=Ready po/centos
pod/centos condition met

[jstrunk osd]$ oc -n cephfs exec centos -- stat /mnt/testfile
  File: /mnt/testfile
  Size: 0         	Blocks: 0          IO Block: 4194304 regular empty file
Device: 30002fh/3145775d	Inode: 1099511628792  Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2021-04-16 17:07:39.688115623 +0000
Modify: 2021-04-16 17:07:39.688115623 +0000
Change: 2021-04-16 17:08:44.853875749 +0000
 Birth: -

[jstrunk osd]$ oc -n cephfs delete po/centos
pod "centos" deleted

[jstrunk osd]$ oc -n cephfs apply -f cephfs-test.yaml
persistentvolumeclaim/datavol unchanged
pod/centos created

[jstrunk osd]$ oc -n cephfs wait --for=condition=Ready po/centos
pod/centos condition met

[jstrunk osd]$ oc -n cephfs exec centos -- stat /mnt/testfile
  File: /mnt/testfile
  Size: 0         	Blocks: 0          IO Block: 4194304 regular empty file
Device: 30002fh/3145775d	Inode: 1099511628792  Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2021-04-16 17:07:39.688115623 +0000
Modify: 2021-04-16 17:07:39.688115623 +0000
Change: 2021-04-16 17:09:15.849765124 +0000
 Birth: -

Comment 2 John Strunk 2021-04-16 18:10:21 UTC
Created attachment 1772551 [details]
Example showing gp2 does not do this

Comment 3 Mudit Agarwal 2021-04-19 04:30:33 UTC
Looks like a Ceph problem to me. Not a blocker for 4.7, since it exists in 4.5 as well; moving it out. If required, we can take it in 4.7.z.

Niels, can you please take a look?

Comment 4 John Strunk 2021-04-19 13:51:10 UTC
I neglected to post the yamls I was using to test. Below is what I have been using:

---

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: datavol
spec:
  storageClassName: ocs-storagecluster-cephfs
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi

---

kind: Pod
apiVersion: v1
metadata:
  name: centos
spec:
  containers:
    - name: centos
      image: centos:8
      command: ["/bin/bash", "-c"]
      args: ["sleep 999999"]
      volumeMounts:
        - name: data
          mountPath: "/mnt"
  terminationGracePeriodSeconds: 5
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: datavol

Comment 5 Niels de Vos 2021-04-20 14:13:26 UTC
I can reproduce this, on CephFS, both with ReadWriteMany and ReadWriteOnce volumes.

Without any changes to the provided yaml, I can see differences in the extended attribute that relates to SELinux. It seems that the file is relabeled every time a Pod is started.

$ kubectl exec -ti centos-rwo -- ls -Z /mnt/bz1950495
system_u:object_r:container_file_t:s0:c187,c377 /mnt/bz1950495
$ kubectl delete pod/centos-rwo 
pod "centos-rwo" deleted
$ kubectl apply -f cephfs-rwo.yaml
persistentvolumeclaim/bz1950495-rwo unchanged
pod/centos-rwo created
$ kubectl exec -ti centos-rwo -- ls -Z /mnt/bz1950495
system_u:object_r:container_file_t:s0:c374,c563 /mnt/bz1950495


Setting a securityContext with seLinuxOptions for the container should remove the need to relabel the contents of the volume (and thus avoid the xattr/ctime updates):

      securityContext:
        seLinuxOptions:
          level: "s0:c123,c321"

$ kubectl exec -ti centos-selinux -- ls -Z /mnt/bz1950495
system_u:object_r:container_file_t:s0:c123,c321 /mnt/bz1950495

The output shows that the change has the expected effect on the SELinux xattr.

Unfortunately this does not prevent the ctime modification; it still gets updated.

Manually running `chcon --range=....` on a file that already has the same SELinux context does not update the ctime (on Fedora 33).

However, running `chmod` repeatedly on the file, with identical attributes on every invocation, does cause the ctime to be updated. So at the moment I am guessing that somewhere in the stack, (l)chmod is executed.
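That chmod behavior is easy to confirm: chmod(2) marks the ctime for update on every successful call, even when the mode bits are unchanged. A minimal sketch (assumes GNU coreutils `stat` on Linux):

```shell
# chmod bumps ctime even when the mode does not actually change.
f=$(mktemp)
chmod 644 "$f"
c1=$(stat -c %Z "$f")   # ctime in whole seconds
sleep 1.1
chmod 644 "$f"          # identical mode, but still a ctime-updating syscall
c2=$(stat -c %Z "$f")
if [ "$c2" -gt "$c1" ]; then echo "ctime updated"; fi
rm -f "$f"
```

This matches the hypothesis above: any recursive (l)chmod in the mount path will touch the ctime of each file it visits, regardless of whether the permissions differ.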

Comment 6 Mudit Agarwal 2021-05-27 03:15:34 UTC
Niels, what is the next action for this bug?

Comment 7 Niels de Vos 2021-06-02 16:33:19 UTC
I have the suspicion that this runs into pkg/volume/volume_linux.go:SetVolumeOwnership in Kubelet.

fsGroup support is a new feature that CSI drivers can use to indicate whether the ownership of mounted volumes needs to be changed. Currently Ceph-CSI does not seem to handle this well; upstream testing suggests that Ceph-CSI needs to improve its fsGroup support: https://github.com/ceph/ceph-csi/issues/2017

Comment 8 Niels de Vos 2021-06-02 16:40:26 UTC
fsGroup Support: https://kubernetes-csi.github.io/docs/support-fsgroup.html
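For context, a CSI driver signals its fsGroup handling to kubelet through the `fsGroupPolicy` field of its CSIDriver object, which kubelet consults before running the recursive ownership pass. A hedged sketch (the policy value here is illustrative, not what Ceph-CSI ships by default):

```yaml
# Illustrative only: a CSIDriver object that tells kubelet to skip the
# recursive chown/chmod (SetVolumeOwnership) on volumes of this driver.
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: cephfs.csi.ceph.com
spec:
  fsGroupPolicy: None   # other values: "File", "ReadWriteOnceWithFSType"
```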

Comment 9 Mudit Agarwal 2021-09-21 11:48:00 UTC
No plans to work on this feature in 4.9

Comment 13 John Strunk 2022-01-12 12:39:29 UTC
I do believe this is a bug, since neither RBD nor EBS requires disabling SELinux to avoid the metadata changes. While you are probably right that there's nothing the CSI driver can do, this should be investigated by the cephfs folks since it presents a usability problem with the product.

Comment 15 Greg Farnum 2022-03-09 15:46:46 UTC
I agree this is a major issue and I believe it's come up in several places, but changing SELinux labels is in fact a ctime-causing change and this is not behavior CephFS is doing anything to induce AFAIK. Issues with relabel on mount need to be addressed in ODF or possibly OSP proper.

Madhu, I'm happy to discuss the details if needed, but I'm not sure why you kicked this to us.

Comment 19 John Strunk 2022-03-15 15:49:49 UTC
I'm not really sure why this is needinfo on me...

The situation: RBD & GP2 don't have this behavior, cephfs does. So, there's something different in the storage stack between cephfs and xfs/ext4 (I don't remember which we're using these days) that leads to ctime changes. The result is poor user experience.

There is acknowledgement in kube upstream that relabeling on mount is time consuming. Note that their focus is on the mount delay caused, not a ctime change. It's also unclear whether the upstream solution [1] would be usable by cephfs when it becomes available due to sub-volume mounting.

Short of figuring out the storage stack difference, this seems to just leave disabling selinux for cephfs volumes.

[1] https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/1710-selinux-relabeling

Comment 20 Greg Farnum 2022-05-18 15:52:21 UTC
Stumbled across this again...

(In reply to John Strunk from comment #19)
> I'm not really sure why this is needinfo on me...
> 
> The situation: RBD & GP2 don't have this behavior, cephfs does.

The original report states:
> Additional info:

>RBD and gp2 (EBS) volumes will show an updated ctime on the 1st remount of the volume, but not on subsequent remounts.

So this is true:
> So, there's
> something different in the storage stack between cephfs and xfs/ext4 (I
> don't remember which we're using these days) that leads to ctime changes.
> The result is poor user experience.

But putting that together, it seems that OpenShift/ODF somehow knows it has already done the relabel on rbd/gp2, but does not know that on CephFS. I have no idea how that's possible or what the difference is. The most obvious distinction is that CephFS is used for RWX pods and RBD for RWO, so perhaps there's different logic in Kubernetes itself somewhere.

So if this needs improvement beyond the other ongoing SELinux relabel improvements, it will either need another Kubernetes-side fix, or the OpenShift team will need to tell us how CephFS behaves differently in a way that induces this. :)

> 
> There is acknowledgement in kube upstream that relabeling on mount is time
> consuming. Note that their focus is on the mount delay caused, not a ctime
> change. It's also unclear whether the upstream solution [1] would be usable
> by cephfs when it becomes available due to sub-volume mounting.
> 
> Short of figuring out the storage stack difference, this seems to just leave
> disabling selinux for cephfs volumes.
> 
> [1]
> https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/1710-
> selinux-relabeling

Comment 21 Mudit Agarwal 2022-05-24 05:38:14 UTC
Madhu, PTAL

Comment 26 Jan Safranek 2022-05-31 13:36:39 UTC
I can see ctime is changed on AWS EBS volumes too, most probably during SELinux relabeling.

In these cases it heavily depends on who creates the pod and how:

- Unprivileged pods (i.e. with the `restricted` SCC), typically created by "regular users", get the fsGroup + SELinux context from the namespace where they're running. Since these are constant, there are ways to skip the chown / chmod / chcon; see https://access.redhat.com/solutions/6221251.

- Pods with the `privileged` SCC, typically created by kubeadmin or a privileged operator, may not get an fsGroup / SELinux context from the namespace. The fsGroup won't be applied (that's good in this case), but an empty SELinux context means that CRI-O will allocate a new one and recursively chcon all files on all volumes. Since this context is chosen randomly on each Pod startup, no optimization is possible. Use an explicit SELinux context in such pods, e.g. spc_t, or run the pods as privileged (which implies spc_t too), plus the workarounds linked above.
- Pods with `privileged` SCC, typically created by kubeadmin or a privileged operator, may  not get fsGroup / SELinux context from the namespace. fsGroup won't be applied (that's good in this case), but empty SELinux context means that CRI-O will allocate a new one + recursively chcon all files on all volumes. Since this context is chosen randomly for each Pod startup, there is no optimization possible. Use an explicit SELinux context in such pods, e.g. spc_t, or run the pods as privileged (which implies spc_t too) + the workarounds linked above.