+++ This bug was initially created as a clone of Bug #2182943 +++

Description of problem (please be as detailed as possible and provide log snippets):

The subPath volume permission is not correctly set for a CephFS volume. Inside the Pod both directories should have the fsGroup:

```
sh-4.2$ ls -l /etc/healing-controller.d/
total 0
drwxrwsr-x. 2 root 9999 0 Mar 30 01:49 critical-containers-logs
drwxrwsr-x. 2 root root 0 Mar 30 01:49 record
```

Version of all relevant components (if applicable):
OCP 4.12
ODF 4.12

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
The customer started to see this problem with OCP 4.12.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes. In my environment it reproduces about 90% of the time; at the customer's site the rate is about 50%.

Can this issue reproduce from the UI?
No

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. oc adm policy add-scc-to-user privileged -z default
2. Create the Pod and the CephFS CSI PVC

```
$ cat /tmp/test-pv.yaml
apiVersion: v1
kind: Pod
metadata:
  name: rhel7
  labels:
    app: rhel7
spec:
  containers:
  - name: myapp-container
    image: registry.access.redhat.com/ubi7/ubi
    command: ['sh', '-c', 'mkdir /etc/healing-controller.d -p && echo The app is running! && sleep 3600']
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      readOnlyRootFilesystem: true
      runAsNonRoot: true
      seLinuxOptions:
        level: s0
    volumeMounts:
    - mountPath: /etc/healing-controller.d/record
      name: local-disks
      subPath: record
    - mountPath: /etc/healing-controller.d/critical-containers-logs
      name: local-disks
      subPath: critical-containers-logs
  volumes:
  - name: local-disks
    persistentVolumeClaim:
      claimName: local-pvc-name
  securityContext:
    fsGroup: 9999
    runAsGroup: 9999
    runAsUser: 9999
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: local-pvc-name
  namespace: test-pv
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: ocs-storagecluster-cephfs
  volumeMode: Filesystem
```

3. Log in to the Pod and check the /etc/healing-controller.d/* permissions

```
sh-4.2$ ls -l /etc/healing-controller.d/
```

Actual results:

```
sh-4.2$ ls -l /etc/healing-controller.d/
total 0
drwxrwsr-x. 2 root 9999 0 Mar 30 01:49 critical-containers-logs
drwxrwsr-x. 2 root root 0 Mar 30 01:49 record
```

Expected results:

```
sh-4.2$ ls -l /etc/healing-controller.d/
total 0
drwxrwsr-x. 2 root 9999 0 Mar 30 01:47 critical-containers-logs
drwxrwsr-x. 2 root 9999 0 Mar 30 01:47 record
```

Additional info:
This issue cannot be reproduced with other CSI drivers such as gp3-csi.

--- Additional comment from RHEL Program Management on 2023-03-30 01:58:45 UTC ---

This bug had no release flag set previously and has now been set with release flag 'odf-4.13.0' to '?', so it is being proposed to be fixed in the ODF 4.13.0 release. Note that the 3 Acks (pm_ack, devel_ack, qa_ack), if any were previously set while the release flag was missing, have now been reset, since the Acks are to be set against a release flag.

--- Additional comment from Chen on 2023-03-30 12:34:22 UTC ---

Hi team,

In case the message was missed: it seems this issue can only be reproduced with the CephFS CSI driver. I can't reproduce the problem with gp3-csi (AWS). Thanks a lot!

Best Regards,
Chen

--- Additional comment from Rakshith on 2023-03-30 13:13:15 UTC ---

Hey, I was able to reproduce it too.
The directories record and critical-containers-logs do not exist on the PVC when the PVC is first created; they are created when it is first mounted. This may be the reason for the difference in fsGroup.
- Both root and the custom fsGroup provide access to the PVC content.
- On restart of the pod, the directories get the correct fsGroup assigned.

Check the logs below.

```
[rakshith@fedora]$ k apply -f t.yml
pod/rhel7 created
persistentvolumeclaim/local-pvc-name created

bash-4.2$ ls /etc/healing-controller.d/ -l
total 0
drwxrwsr-x. 2 root 9999 0 Mar 30 12:22 critical-containers-logs
drwxrwsr-x. 2 root root 0 Mar 30 12:22 record
bash-4.2$ exit
exit

[rakshith@fedora]$ k delete po rhel7
pod "rhel7" deleted
[rakshith@fedora]$ k apply -f t.yml
pod/rhel7 created
persistentvolumeclaim/local-pvc-name unchanged

bash-4.2$ ls /etc/healing-controller.d/ -l
total 0
drwxrwsr-x. 2 root 9999 0 Mar 30 12:22 critical-containers-logs
drwxrwsr-x. 2 root 9999 0 Mar 30 12:22 record
```

Setting the fsGroup is the kubelet's responsibility. @hekumar, can you please verify this behavior?

--- Additional comment from Chen on 2023-03-30 13:21:42 UTC ---

Hi Rakshith,

Thank you for your reply! Just adding another point: this issue couldn't be reproduced with the RBD CSI driver (on my side). So I am not sure what the difference is between CephFS and RBD when kubelet changes the permissions according to fsGroup...

Best Regards,
Chen

--- Additional comment from Hemant Kumar on 2023-04-03 15:22:25 UTC ---

Yes, that is because kubelet does not apply fsGroup changes for CephFS volumes (because they are shared filesystems), and hence users are expected to run the pod with the same supplemental group as the original volume permissions. For RBD volumes kubelet does apply permissions, because those are block devices and usually only made available to a single node.

--- Additional comment from Chen on 2023-04-04 01:03:57 UTC ---

Hi Hemant,

Thank you for your reply. If kubelet doesn't apply fsGroup at all for CephFS, then perhaps both directories should show root:root, am I correct? The problem now is that the first directory always appears as root:fsGroup, but the second directory sometimes appears as root:root (50% at the customer site and 80~90% in my test environment). My point is: if kubelet never applies fsGroup to CephFS, then I would assume both directories should always appear as root:root?

Thanks again!

Best Regards,
Chen

--- Additional comment from Rakshith on 2023-04-04 05:41:36 UTC ---

(In reply to Hemant Kumar from comment #5)
> Yes, that is because kubelet does not apply fsGroup changes for CephFS volumes
> (because they are shared filesystems), and hence users are expected to run the
> pod with the same supplemental group as the original volume permissions. For RBD
> volumes kubelet does apply permissions, because those are block devices and
> usually only made available to a single node.

Hey,

The CephFS CSIDriver object has fsGroupPolicy set to "File" mode: https://github.com/rook/rook/pull/10503
And CephFS subvolumes are now created with permissions 755 instead of 777, so they require fsGroup to be set for non-root containers: https://github.com/ceph/ceph-csi/pull/3204

According to my tests, fsGroup was applied to the files when the pod was restarted, see https://bugzilla.redhat.com/show_bug.cgi?id=2182943#c3. If multiple pods are using the same CephFS PVC, they are expected to use the same fsGroup in the pod security spec (if my understanding is correct)?

Thanks,
Rakshith

--- Additional comment from on 2023-04-10 13:33:34 UTC ---

Hi, pinging for updates here. Thanks
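The fsGroupPolicy point above can be checked directly on a live cluster. The sketch below is only an illustration under assumptions: the driver name `openshift-storage.cephfs.csi.ceph.com` is the usual ODF internal-mode name and may differ on other deployments; the commands just read the `spec.fsGroupPolicy` field that decides whether the kubelet applies the fsGroup at all.

```bash
# Sketch: verify that the CephFS CSIDriver object asks the kubelet to manage
# fsGroup ("File"), as described in the comment above. The driver name is an
# assumption based on a default ODF internal-mode install.
oc get csidriver openshift-storage.cephfs.csi.ceph.com \
  -o jsonpath='{.spec.fsGroupPolicy}{"\n"}'

# Compare against every registered CSI driver (e.g. the RBD driver):
oc get csidriver \
  -o custom-columns=NAME:.metadata.name,FSGROUPPOLICY:.spec.fsGroupPolicy
```

If the CephFS driver did not report "File" here, the kubelet would skip the ownership change entirely and the behaviour discussed in these comments would not apply.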
--- Additional comment from Colum Gaynor on 2023-04-11 05:48:18 UTC ---

@rar @scorcora ---> Nokia CloudRAN needs a fix for this issue urgently in ODF 4.12.z, as they are locked to OCP 4.12.z for important trials in 2023. From comments earlier in the Bugzilla it is stated that the fix is relatively easy to implement. Could you please provide an update on when a fix could be expected for the Nokia CloudRAN team? We are reviewing a set of critical bugs weekly with them, and this issue is on that list and therefore has a lot of management attention.

Colum Gaynor - Partner Acceleration Lead, Nokia Global Account

--- Additional comment from Rakshith on 2023-04-11 06:15:18 UTC ---

(In reply to Colum Gaynor from comment #9)
> Could you please provide an update on when a fix could be expected for the
> Nokia CloudRAN team?

Hey,

As mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=2182943#c6 and https://bugzilla.redhat.com/show_bug.cgi?id=2182943#c7, the kubelet sets the fsGroup; nothing can be done at the ODF level. A needinfo is set on Hemant for more information regarding the same. Please open a bug on OCP storage for more attention/visibility.

Thanks,
Rakshith

--- Additional comment from Colum Gaynor on 2023-04-12 05:08:34 UTC ---

@cchen ---> I am a bit confused. Is Rakshith suggesting above that a new bug on OCP storage might be needed here? Is this current bug now filed against the wrong area, or am I misunderstanding the situation?

Colum Gaynor - Partner Acceleration Lead, Nokia Global Account

--- Additional comment from Chen on 2023-04-12 05:24:22 UTC ---

Hi Colum,

I believe Rakshith suggested opening a kubelet bug. I raised this one: https://issues.redhat.com/browse/OCPBUGS-11676. Clearing my needinfo and leaving a needinfo with Hemant for comments #6 and #7.

Best Regards,
Chen

--- Additional comment from Hemant Kumar on 2023-04-12 16:46:24 UTC ---

Okay, I did not realize the CephFS CSI driver is explicitly setting "File" as the fsGroupChangePolicy. In this case, does the subPath directory already exist on the volume, or is kubelet creating it? I suspect the behaviour of kubelet can be different based on who creates the subPath (or subdirectory) on the root of the file system. Let me take a deeper look at this behavior and propose a fix.

--- Additional comment from Jordi Claret on 2023-04-12 20:38:21 UTC ---

I tried to replicate the problem by setting the verbosity of kubelet to Environment="KUBELET_LOG_LEVEL=8". Kubelet applied fsGroup 9999 and generated the two subdirectories "critical-containers-logs" and "record"; however, only the "record" directory was created with the appropriate permissions (fsGroup 9999 being applied):

```
Apr 12 19:47:35 ip-10-0-229-165 kubenswrapper[2436]: I0412 19:47:35.984066 2436 csi_mounter.go:303] kubernetes.io/csi: mounter.SetupAt fsGroup [9999] applied successfully to 0001-0011-openshift-storage-0000000000000001-14958c69-0a15-40ab-8d67-371816b25c4c

$ oc rsh rhel7-cephfs ls -l /etc/healing-controller.d
total 0
drwxrwsr-x. 2 root root 0 Apr 12 19:47 critical-containers-logs
drwxrwsr-x. 2 root 9999 0 Apr 12 19:47 record
```
- Creating the "critical-containers-logs" dir:

```
Apr 12 19:47:36 ip-10-0-229-165 kubenswrapper[2436]: I0412 19:47:36.913832 2436 subpath_linux.go:379] Creating directory "/var/lib/kubelet/pods/20ac1d6c-2bf4-4281-a675-6f1a821f2cfd/volumes/kubernetes.io~csi/pvc-642a998b-af3a-4ccf-a091-b7427a440614/mount/critical-containers-logs" within base "/var/lib/kubelet/pods/20ac1d6c-2bf4-4281-a675-6f1a821f2cfd/volumes/kubernetes.io~csi/pvc-642a998b-af3a-4ccf-a091-b7427a440614/mount"
Apr 12 19:47:36 ip-10-0-229-165 kubenswrapper[2436]: I0412 19:47:36.915250 2436 subpath_linux.go:412] "/var/lib/kubelet/pods/20ac1d6c-2bf4-4281-a675-6f1a821f2cfd/volumes/kubernetes.io~csi/pvc-642a998b-af3a-4ccf-a091-b7427a440614/mount" already exists, "critical-containers-logs" to create
Apr 12 19:47:36 ip-10-0-229-165 kubenswrapper[2436]: I0412 19:47:36.916079 2436 subpath_linux.go:580] Opening path /var/lib/kubelet/pods/20ac1d6c-2bf4-4281-a675-6f1a821f2cfd/volumes/kubernetes.io~csi/pvc-642a998b-af3a-4ccf-a091-b7427a440614/mount
Apr 12 19:47:36 ip-10-0-229-165 kubenswrapper[2436]: I0412 19:47:36.916674 2436 subpath_linux.go:436] Creating critical-containers-logs
```

- Creating the "record" dir:

```
Apr 12 19:47:36 ip-10-0-229-165 kubenswrapper[2436]: I0412 19:47:36.928830 2436 subpath_linux.go:379] Creating directory "/var/lib/kubelet/pods/20ac1d6c-2bf4-4281-a675-6f1a821f2cfd/volumes/kubernetes.io~csi/pvc-642a998b-af3a-4ccf-a091-b7427a440614/mount/record" within base "/var/lib/kubelet/pods/20ac1d6c-2bf4-4281-a675-6f1a821f2cfd/volumes/kubernetes.io~csi/pvc-642a998b-af3a-4ccf-a091-b7427a440614/mount"
Apr 12 19:47:36 ip-10-0-229-165 kubenswrapper[2436]: I0412 19:47:36.928961 2436 subpath_linux.go:412] "/var/lib/kubelet/pods/20ac1d6c-2bf4-4281-a675-6f1a821f2cfd/volumes/kubernetes.io~csi/pvc-642a998b-af3a-4ccf-a091-b7427a440614/mount" already exists, "record" to create
Apr 12 19:47:36 ip-10-0-229-165 kubenswrapper[2436]: I0412 19:47:36.928999 2436 subpath_linux.go:580] Opening path /var/lib/kubelet/pods/20ac1d6c-2bf4-4281-a675-6f1a821f2cfd/volumes/kubernetes.io~csi/pvc-642a998b-af3a-4ccf-a091-b7427a440614/mount
Apr 12 19:47:36 ip-10-0-229-165 kubenswrapper[2436]: I0412 19:47:36.929012 2436 subpath_linux.go:436] Creating record
```

--- Additional comment from Hemant Kumar on 2023-04-13 18:09:06 UTC ---

I have been able to reproduce the bug, but the thing is that we rely on the SETGID bit set by kubelet on the parent directory to cause any new directory created inside it to get the group ownership - https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/volume_linux.go#L95

So there is nothing specific we are doing in kubelet to set group permissions for the subPath. But it appears that somehow the setgid bit is not working when a new directory is created on CephFS in some cases. There were some existing bugs about the setgid implementation in CephFS. Do we know of any existing issues?

--- Additional comment from Hemant Kumar on 2023-04-13 19:53:21 UTC ---

Can someone also loop in a person who maintains CephFS in the Linux kernel? I suspect that this could be a kernel bug because of how it handles sgid.

As far as kubelet is concerned, both ceph-rbd and cephfs are processed the same way once you enable fsGroupChangePolicy: File. And since fsGroup works correctly for ceph-rbd every time, I am leaning towards this being a bug in the CephFS stack (but we also need to ensure that the CSI driver is not doing anything funky with perms).
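To make the mechanism Hemant describes concrete: the kubelet only chowns the volume root to the fsGroup and sets the setgid bit, then expects the filesystem to propagate the group to anything created later (including the subPath directories). The following is a minimal sketch of that expectation on a plain local filesystem; the paths and the 2775 mode are illustrative assumptions, not the exact kubelet code path.

```bash
# Sketch of the setgid inheritance the kubelet relies on (run as root).
# /tmp/volroot stands in for the mounted volume root; 9999 is the pod fsGroup.
mkdir -p /tmp/volroot
chgrp 9999 /tmp/volroot   # roughly what the kubelet's fsGroup chown does
chmod 2775 /tmp/volroot   # rwxrwsr-x: group rwx plus the setgid bit

# What the kubelet's subPath handling then effectively does is just a mkdir:
mkdir /tmp/volroot/record

# On a well-behaved filesystem the new directory must inherit gid 9999:
stat -c '%U:%G %A %n' /tmp/volroot/record
```

On local filesystems (and on RBD-backed volumes, per the comments above) the `stat` output shows group 9999; the bug being reported is that on CephFS the inherited group is sometimes root instead.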
--- Additional comment from Madhu Rajanna on 2023-04-14 07:57:27 UTC ---

(In reply to Hemant Kumar from comment #16)
> Can someone also loop in a person who maintains CephFS in the Linux kernel? I
> suspect that this could be a kernel bug because of how it handles sgid.

@Venky/@Xiubo/@kotresh can you please help here?

> (but we also need to ensure that the CSI driver is not doing anything funky
> with perms).

CSI will not change any permissions when/before mounting; CSI creates the subvolume with 755 permissions (the default in Ceph), as this is ODF 4.12.

--- Additional comment from on 2023-04-17 12:07:05 UTC ---

Team, checking in here, do we have someone looking at the case?

--- Additional comment from Venky Shankar on 2023-04-17 13:03:22 UTC ---

(In reply to Hemant Kumar from comment #16)
> Can someone also loop in a person who maintains CephFS in the Linux kernel? I
> suspect that this could be a kernel bug because of how it handles sgid.

Do you have a reproducer with a standalone Ceph file system using the kclient, Hemant?

--- Additional comment from Xiubo Li on 2023-04-18 02:41:22 UTC ---

Hi Hemant,

Please provide a method to reproduce this with a standalone Ceph file system using the kclient, as Venky mentioned.

Thanks

--- Additional comment from Colum Gaynor on 2023-04-18 09:27:41 UTC ---

@hekumar ----> The Bugzilla seems to be on ODF, but the comments suggest it's kernel related - or did I misunderstand? The customer (Nokia) is looking for an estimate of when this issue can be fixed, so I would like to understand which group is responsible for fixing the defect.

Colum Gaynor - PAL Nokia Global Account

--- Additional comment from Madhu Rajanna on 2023-04-18 13:24:10 UTC ---

For better movement, moving this to CephFS for now, as it looks like something is wrong there. Please feel free to move it back if you think otherwise.

--- Additional comment from Venky Shankar on 2023-04-18 13:39:23 UTC ---

Can someone help us understand what calls are made to CephFS (if any) when fsGroupChangePolicy is enabled? What is fsGroupChangePolicy supposed to do? Sorry, for me fsGroupChangePolicy is CSI jargon which I do not understand. I tried reading the BZ comments about it, but I couldn't find a clear description of what it's supposed to achieve.

Also, are there any relevant (CephFS) logs that we can look into? We asked for a reproducer with standalone CephFS, which is the easiest way for us to identify what's going on.

--- Additional comment from Madhu Rajanna on 2023-04-18 13:49:16 UTC ---

> securityContext:
>   fsGroup: 9999
>   runAsGroup: 9999
>   runAsUser: 9999

> Actual results:
> sh-4.2$ ls -l /etc/healing-controller.d/
> total 0
> drwxrwsr-x. 2 root 9999 0 Mar 30 01:49 critical-containers-logs
> drwxrwsr-x. 2 root root 0 Mar 30 01:49 record
>
> Expected results:
> sh-4.2$ ls -l /etc/healing-controller.d/
> total 0
> drwxrwsr-x. 2 root 9999 0 Mar 30 01:47 critical-containers-logs
> drwxrwsr-x. 2 root 9999 0 Mar 30 01:47 record

The above is from comment #1: the group of the record directory is root:root while it is expected to be root:9999. As mentioned, we don't do anything in CSI; we mount the filesystem to a directory. Kubelet is the one that takes care of changing it.
@Hemant mentioned that they don't do anything specific to CephFS; they just rely on the SETGID bit set by kubelet on the parent directory to cause any new directory created inside it to get the group ownership. He thinks there could be a bug in the CephFS kernel client.

> We asked for a reproducer with standalone CephFS, which is the easiest way for
> us to identify what's going on.

I am not sure how exactly it can be reproduced; we need to wait for Hemant's response.

--- Additional comment from Madhu Rajanna on 2023-04-18 14:40:34 UTC ---

If I am understanding correctly, what's happening here is: the CephFS filesystem is mounted to a directory named `/a`, and kubelet does an Lchown to set the group to 9999 and a chmod to set drwxrwsr-x (a bitwise OR of 0660|0110|setgid). After that, it creates two more directories, `/a/b` and `/a/c`. Now the `b` directory doesn't have the right group set (it should be 9999); it is set to `root`, which is not expected, but the `c` directory has the right group (9999) set properly.

@hemant, please correct me if I am wrong. The above is based on reading the code of the function pointed to in comment #15.

--- Additional comment from Hemant Kumar on 2023-04-18 16:16:48 UTC ---

@Madhu that seems correct. The permissions of the subdirectory are correct (because kubelet is doing a chmod on it anyway), but the group ownership is wrong - which is unexpected, because the parent *does* have the SGID bit set, and hence any newly created directory should automatically get the group ownership. I will try to reproduce this without OCP, but if someone beats me to it, I will be very happy. :-)

--- Additional comment from Xiubo Li on 2023-04-19 01:26:00 UTC ---

(In reply to Madhu Rajanna from comment #25)
> If I am understanding correctly, what's happening here is: the CephFS filesystem
> is mounted to a directory named `/a`, and kubelet does an Lchown to set the group
> to 9999 and a chmod to set drwxrwsr-x (a bitwise OR of 0660|0110|setgid). After
> that, it creates two more directories, `/a/b` and `/a/c`. Now the `b` directory
> doesn't have the right group set (it should be 9999); it is set to `root`, which
> is not expected, but the `c` directory has the right group (9999) set properly.

"""
The setgid bit

The setgid bit affects both files as well as directories. When used on a file, it executes with the privileges of the group of the user who owns it instead of executing with those of the group of the user who executed it.

When the bit is set for a directory, the set of files in that directory will have the same group as the group of the parent directory, and not that of the user who created those files. This is used for file sharing, since the files can now be modified by all the users who are part of the group of the parent directory.
"""

So for directory 'b/', the group should be '9999' too, the same as the parent directory 'a/', but it's not.
Locally I have tried this with the kclient:

```
[xiubli@ceph kcephfs]$ ls
volumes
[xiubli@ceph kcephfs]$ sudo mkdir a
[xiubli@ceph kcephfs]$ ll
total 0
drwxr-xr-x 2 root root 0 Apr 19 09:05 a
drwxr-xr-x 3 root root 2 Apr 18 13:10 volumes
[xiubli@ceph kcephfs]$ sudo chgrp xiubli a
[xiubli@ceph kcephfs]$ ll
total 0
drwxr-xr-x 2 root xiubli 0 Apr 19 09:05 a
drwxr-xr-x 3 root root   2 Apr 18 13:10 volumes
[xiubli@ceph kcephfs]$ sudo chmod g+s,a+rwx a
[xiubli@ceph kcephfs]$ ll
total 0
drwxrwsrwx 2 root xiubli 0 Apr 19 09:05 a
drwxr-xr-x 3 root root   2 Apr 18 13:10 volumes
[xiubli@ceph kcephfs]$ mkdir a/b
[xiubli@ceph kcephfs]$ mkdir a/c
[xiubli@ceph kcephfs]$ ll a
total 0
drwxr-sr-x 2 xiubli xiubli 0 Apr 19 09:09 b
drwxr-sr-x 2 xiubli xiubli 0 Apr 19 09:09 c
[xiubli@ceph kcephfs]$ sudo mkdir a/c1
[xiubli@ceph kcephfs]$ sudo mkdir a/b1
[xiubli@ceph kcephfs]$ ll a
total 0
drwxr-sr-x 2 xiubli xiubli 0 Apr 19 09:09 b
drwxr-sr-x 2 root   xiubli 0 Apr 19 09:10 b1
drwxr-sr-x 2 xiubli xiubli 0 Apr 19 09:09 c
drwxr-sr-x 2 root   xiubli 0 Apr 19 09:10 c1
[xiubli@ceph kcephfs]$
```

You can see that I followed the steps you mentioned above, and the group is always correctly set, no matter whether the directory is created by the root user or a non-root user. Please let me know if I am missing something here.

Madhu, Hemant,

Currently I am using upstream Ceph and an upstream kernel, and I have no resources to set up the same versions as the customer. BTW, do you have a setup using the same Ceph and kernel versions as the customer? Could you try the same steps there?

Thanks
- Xiubo

--- Additional comment from Madhu Rajanna on 2023-04-19 07:12:24 UTC ---

Hi Xiubo,

I tried the script below on upstream Rook+Ceph and on downstream OCP+ODF 4.12:

```bash
#!/bin/bash
mon_endpoints=$(grep mon_host /etc/ceph/ceph.conf | awk '{print $3}')
my_secret=$(grep key /etc/ceph/keyring | awk '{print $3}')
for i in 1 2
do
  ceph fs subvolume create ocs-storagecluster-cephfilesystem test$i csi
  path=$(ceph fs subvolume getpath ocs-storagecluster-cephfilesystem test$i csi)
  mkdir -p /tmp/registry$i
  mount -t ceph -o mds_namespace=ocs-storagecluster-cephfilesystem,name=admin,secret=$my_secret $mon_endpoints:/$path /tmp/registry$i
  chgrp 9999 /tmp/registry$i
  chmod g+s,a+rwx /tmp/registry$i
  mkdir -p /tmp/registry$i/a
  mkdir -p /tmp/registry$i/b
  ls -lrt /tmp/registry$i
  umount /tmp/registry$i
  ceph fs subvolume rm ocs-storagecluster-cephfilesystem test$i csi
done
```

----------------------- output from OCP cluster ---------------

```
Client Version: 4.11.7
Kustomize Version: v4.5.4
Server Version: 4.12.0-0.nightly-2023-04-18-151010
Kubernetes Version: v1.25.8+27e744f

sh-4.4# ceph version
ceph version 16.2.10-138.el8cp (a63ae467c8e1f7503ea3855893f1e5ca189a71b9) pacific (stable)
sh-4.4# ceph --version
ceph version 16.2.10-138.el8cp (a63ae467c8e1f7503ea3855893f1e5ca189a71b9) pacific (stable)

sh-4.4# ./test.sh
total 0
drwxr-xr-x. 2 root root 0 Apr 19 07:08 a
drwxr-sr-x. 2 root 9999 0 Apr 19 07:08 b
total 0
drwxr-xr-x. 2 root root 0 Apr 19 07:08 a
drwxr-sr-x. 2 root 9999 0 Apr 19 07:08 b
sh-4.4#
sh-4.4# uname -a
Linux ip-10-0-187-19 4.18.0-372.51.1.el8_6.x86_64 #1 SMP Fri Mar 24 01:34:10 EDT 2023 x86_64 x86_64 x86_64 GNU/Linux
```

------------------------ output from upstream Rook+ceph cluster ---------------

```
sh-4.4# ceph version
ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
sh-4.4# ceph --version
ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)

sh-4.4# ./test.sh
total 0
drwxr-sr-x 2 root 9999 0 Apr 19 07:05 a
drwxr-sr-x 2 root 9999 0 Apr 19 07:05 b
total 0
drwxr-sr-x 2 root 9999 0 Apr 19 07:05 a
drwxr-sr-x 2 root 9999 0 Apr 19 07:05 b
sh-4.4# uname -a
Linux minikube 5.10.57 #1 SMP Mon Apr 3 23:35:10 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
```

If the steps are correct, it looks like it is reproducible on the OCP cluster.

--- Additional comment from Xiubo Li on 2023-04-19 07:22:09 UTC ---

(In reply to Madhu Rajanna from comment #28)
> Hi Xiubo
>
> I tried the script below on upstream Rook+Ceph and on downstream OCP+ODF 4.12
> [...]
> If the steps are correct, it looks like it is reproducible on the OCP cluster.

Hi Madhu,

Thanks very much for your feedback.

BTW, could you try this with a ceph-fuse mount? Let's see where the problem is located.

--- Additional comment from Madhu Rajanna on 2023-04-19 07:56:39 UTC ---

```
sh-4.4# ./test.sh
++ grep mon_host /etc/ceph/ceph.conf
++ awk '{print $3}'
+ mon_endpoints=172.30.227.108:6789,172.30.40.75:6789,172.30.114.250:6789
++ grep key /etc/ceph/keyring
++ awk '{print $3}'
+ my_secret=AQDHjD9kDgR6AhAAEdlX3qO3tb2PZqx/4USf5g==
+ for i in 1 2
+ ceph fs subvolume create ocs-storagecluster-cephfilesystem test1 csi
++ ceph fs subvolume getpath ocs-storagecluster-cephfilesystem test1 csi
+ path=/volumes/csi/test1/70dcae52-f545-4dbc-93ee-04bebfc5915a
+ mkdir -p /tmp/registry1
+ ceph-fuse /tmp/registry1 -m=172.30.227.108:6789 --key=AQDHjD9kDgR6AhAAEdlX3qO3tb2PZqx/4USf5g== -n=client.admin -r /volumes/csi/test1/70dcae52-f545-4dbc-93ee-04bebfc5915a -o nonempty --client_mds_namespace=ocs-storagecluster-cephfilesystem
2023-04-19T07:54:40.094+0000 7f4c6aa31540 -1 init, newargv = 0x55ecc5740800 newargc=17
ceph-fuse[1599]: starting ceph client
ceph-fuse[1599]: starting fuse
+ chgrp 9999 /tmp/registry1
+ chmod g+s,a+rwx /tmp/registry1
+ mkdir -p /tmp/registry1/a
+ mkdir -p /tmp/registry1/b
+ mkdir -p /tmp/registry1/b/x
+ ls -lrt /tmp/registry1/b/
total 1
drwxrwsrwx. 2 root 9999 0 Apr 19 07:54 x
+ mkdir /tmp/registry1/c
+ mkdir /tmp/registry1/d
+ ls -lrt /tmp/registry1
total 2
drwxrwxrwx. 2 root root 0 Apr 19 07:54 a
drwxrwsrwx. 3 root 9999 0 Apr 19 07:54 b
drwxrwsrwx. 2 root 9999 0 Apr 19 07:54 c
drwxrwsrwx. 2 root 9999 0 Apr 19 07:54 d
+ umount /tmp/registry1
+ rm -rf /tmp/registry1
+ ceph fs subvolume rm ocs-storagecluster-cephfilesystem test1 csi
+ for i in 1 2
+ ceph fs subvolume create ocs-storagecluster-cephfilesystem test2 csi
++ ceph fs subvolume getpath ocs-storagecluster-cephfilesystem test2 csi
+ path=/volumes/csi/test2/3fad86af-1bfc-4581-a818-ed2949878c43
+ mkdir -p /tmp/registry2
+ ceph-fuse /tmp/registry2 -m=172.30.227.108:6789 --key=AQDHjD9kDgR6AhAAEdlX3qO3tb2PZqx/4USf5g== -n=client.admin -r /volumes/csi/test2/3fad86af-1bfc-4581-a818-ed2949878c43 -o nonempty --client_mds_namespace=ocs-storagecluster-cephfilesystem
2023-04-19T07:54:41.127+0000 7f061533c540 -1 init, newargv = 0x560e3dfec800 newargc=17
ceph-fuse[1717]: starting ceph client
ceph-fuse[1717]: starting fuse
+ chgrp 9999 /tmp/registry2
+ chmod g+s,a+rwx /tmp/registry2
+ mkdir -p /tmp/registry2/a
+ mkdir -p /tmp/registry2/b
+ mkdir -p /tmp/registry2/b/x
+ ls -lrt /tmp/registry2/b/
total 1
drwxrwsrwx. 2 root 9999 0 Apr 19 07:54 x
+ mkdir /tmp/registry2/c
+ mkdir /tmp/registry2/d
+ ls -lrt /tmp/registry2
total 2
drwxrwxrwx. 2 root root 0 Apr 19 07:54 a
drwxrwsrwx. 3 root 9999 0 Apr 19 07:54 b
drwxrwsrwx. 2 root 9999 0 Apr 19 07:54 c
drwxrwsrwx. 2 root 9999 0 Apr 19 07:54 d
+ umount /tmp/registry2
+ rm -rf /tmp/registry2
+ ceph fs subvolume rm ocs-storagecluster-cephfilesystem test2 csi
sh-4.4# ceph version
ceph version 16.2.10-138.el8cp (a63ae467c8e1f7503ea3855893f1e5ca189a71b9) pacific (stable)
sh-4.4# ceph --version
ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
sh-4.4#
```

The above is the result from the OCP cluster; as the fuse client was not present there, I used the upstream Ceph image to test it out, since it includes the ceph-fuse binary.

--- Additional comment from Madhu Rajanna on 2023-04-19 08:02:03 UTC ---

```
sh-4.4# ./test.sh
++ grep mon_host /etc/ceph/ceph.conf
++ awk '{print $3}'
+ mon_endpoints=172.30.227.108:6789,172.30.40.75:6789,172.30.114.250:6789
++ awk '{print $3}'
++ grep key /etc/ceph/keyring
+ my_secret=AQDHjD9kDgR6AhAAEdlX3qO3tb2PZqx/4USf5g==
+ for i in 1 2
+ ceph fs subvolume create ocs-storagecluster-cephfilesystem test1 csi
++ ceph fs subvolume getpath ocs-storagecluster-cephfilesystem test1 csi
+ path=/volumes/csi/test1/492940a4-ccbc-4c9d-a6b1-43a490efab8e
+ mkdir -p /tmp/registry1
+ ceph-fuse /tmp/registry1 -m=172.30.227.108:6789 --key=AQDHjD9kDgR6AhAAEdlX3qO3tb2PZqx/4USf5g== -n=client.admin -r /volumes/csi/test1/492940a4-ccbc-4c9d-a6b1-43a490efab8e -o nonempty --client_mds_namespace=ocs-storagecluster-cephfilesystem
2023-04-19T07:59:08.086+0000 7f3f63fbd540 -1 init, newargv = 0x56213485a800 newargc=17
ceph-fuse[1970]: starting ceph client
ceph-fuse[1970]: starting fuse
+ chgrp 9999 /tmp/registry1
+ chmod g+s,a+rwx /tmp/registry1
+ sleep 5
+ mkdir -p /tmp/registry1/a
+ mkdir -p /tmp/registry1/b
+ mkdir -p /tmp/registry1/b/x
+ ls -lrt /tmp/registry1/b/
total 1
drwxrwsrwx. 2 root 9999 0 Apr 19 07:59 x
+ mkdir /tmp/registry1/c
+ mkdir /tmp/registry1/d
+ ls -lrt /tmp/registry1
total 2
drwxrwsrwx. 2 root 9999 0 Apr 19 07:59 a
drwxrwsrwx. 3 root 9999 0 Apr 19 07:59 b
drwxrwsrwx. 2 root 9999 0 Apr 19 07:59 c
drwxrwsrwx. 2 root 9999 0 Apr 19 07:59 d
+ umount /tmp/registry1
+ rm -rf /tmp/registry1
+ ceph fs subvolume rm ocs-storagecluster-cephfilesystem test1 csi
+ for i in 1 2
+ ceph fs subvolume create ocs-storagecluster-cephfilesystem test2 csi
++ ceph fs subvolume getpath ocs-storagecluster-cephfilesystem test2 csi
+ path=/volumes/csi/test2/9cbc323e-9993-454c-8746-1b2bc4ace5c3
+ mkdir -p /tmp/registry2
+ ceph-fuse /tmp/registry2 -m=172.30.227.108:6789 --key=AQDHjD9kDgR6AhAAEdlX3qO3tb2PZqx/4USf5g== -n=client.admin -r /volumes/csi/test2/9cbc323e-9993-454c-8746-1b2bc4ace5c3 -o nonempty --client_mds_namespace=ocs-storagecluster-cephfilesystem
2023-04-19T07:59:14.063+0000 7ff6c34c3540 -1 init, newargv = 0x56163def7800 newargc=17
ceph-fuse[2093]: starting ceph client
ceph-fuse[2093]: starting fuse
+ chgrp 9999 /tmp/registry2
+ chmod g+s,a+rwx /tmp/registry2
+ sleep 5
+ mkdir -p /tmp/registry2/a
+ mkdir -p /tmp/registry2/b
+ mkdir -p /tmp/registry2/b/x
+ ls -lrt /tmp/registry2/b/
total 1
drwxrwsrwx. 2 root 9999 0 Apr 19 07:59 x
+ mkdir /tmp/registry2/c
+ mkdir /tmp/registry2/d
+ ls -lrt /tmp/registry2
total 2
drwxrwsrwx. 2 root 9999 0 Apr 19 07:59 a
drwxrwsrwx. 3 root 9999 0 Apr 19 07:59 b
drwxrwsrwx. 2 root 9999 0 Apr 19 07:59 c
drwxrwsrwx. 2 root 9999 0 Apr 19 07:59 d
+ umount /tmp/registry2
+ rm -rf /tmp/registry2
+ ceph fs subvolume rm ocs-storagecluster-cephfilesystem test2 csi
sh-4.4#
sh-4.4#
sh-4.4#
sh-4.4# vi test.sh
.bash_logout  .bash_profile  .bashrc  .cshrc  .tcshrc  anaconda-ks.cfg  anaconda-post.log  original-ks.cfg  test.sh
sh-4.4# vi test.sh
sh-4.4# ./test.sh
++ grep mon_host /etc/ceph/ceph.conf
++ awk '{print $3}'
+ mon_endpoints=172.30.227.108:6789,172.30.40.75:6789,172.30.114.250:6789
++ grep key /etc/ceph/keyring
++ awk '{print $3}'
+ my_secret=AQDHjD9kDgR6AhAAEdlX3qO3tb2PZqx/4USf5g==
+ for i in 1 2
+ ceph fs subvolume create ocs-storagecluster-cephfilesystem test1 csi
++ ceph fs subvolume getpath ocs-storagecluster-cephfilesystem test1 csi
+ path=/volumes/csi/test1/1251b28f-ca6a-4b0b-860f-ee3ebdbad933
+ mkdir -p /tmp/registry1
+ mount -t ceph -o mds_namespace=ocs-storagecluster-cephfilesystem,name=admin,secret=AQDHjD9kDgR6AhAAEdlX3qO3tb2PZqx/4USf5g== 172.30.227.108:6789,172.30.40.75:6789,172.30.114.250:6789://volumes/csi/test1/1251b28f-ca6a-4b0b-860f-ee3ebdbad933 /tmp/registry1
+ chgrp 9999 /tmp/registry1
+ chmod g+s,a+rwx /tmp/registry1
+ sleep 5
+ mkdir -p /tmp/registry1/a
+ mkdir -p /tmp/registry1/b
+ mkdir -p /tmp/registry1/b/x
+ ls -lrt /tmp/registry1/b/
total 0
drwxrwsrwx. 2 root 9999 0 Apr 19 08:00 x
+ mkdir /tmp/registry1/c
+ mkdir /tmp/registry1/d
+ ls -lrt /tmp/registry1
total 0
drwxrwsrwx. 2 root 9999 0 Apr 19 08:00 a
drwxrwsrwx. 3 root 9999 1 Apr 19 08:00 b
drwxrwsrwx. 2 root 9999 0 Apr 19 08:00 c
drwxrwsrwx. 2 root 9999 0 Apr 19 08:00 d
+ umount /tmp/registry1
+ rm -rf /tmp/registry1
+ ceph fs subvolume rm ocs-storagecluster-cephfilesystem test1 csi
+ for i in 1 2
+ ceph fs subvolume create ocs-storagecluster-cephfilesystem test2 csi
++ ceph fs subvolume getpath ocs-storagecluster-cephfilesystem test2 csi
+ path=/volumes/csi/test2/884662eb-86d9-421f-b853-d008334ae93b
+ mkdir -p /tmp/registry2
+ mount -t ceph -o mds_namespace=ocs-storagecluster-cephfilesystem,name=admin,secret=AQDHjD9kDgR6AhAAEdlX3qO3tb2PZqx/4USf5g== 172.30.227.108:6789,172.30.40.75:6789,172.30.114.250:6789://volumes/csi/test2/884662eb-86d9-421f-b853-d008334ae93b /tmp/registry2
+ chgrp 9999 /tmp/registry2
+ chmod g+s,a+rwx /tmp/registry2
+ sleep 5
+ mkdir -p /tmp/registry2/a
+ mkdir -p /tmp/registry2/b
+ mkdir -p /tmp/registry2/b/x
+ ls -lrt /tmp/registry2/b/
total 0
drwxrwsrwx. 2 root 9999 0 Apr 19 08:00 x
+ mkdir /tmp/registry2/c
+ mkdir /tmp/registry2/d
+ ls -lrt /tmp/registry2
total 0
drwxrwsrwx. 2 root 9999 0 Apr 19 08:00 a
drwxrwsrwx. 3 root 9999 1 Apr 19 08:00 b
drwxrwsrwx. 2 root 9999 0 Apr 19 08:00 c
drwxrwsrwx. 2 root 9999 0 Apr 19 08:00 d
+ umount /tmp/registry2
+ rm -rf /tmp/registry2
+ ceph fs subvolume rm ocs-storagecluster-cephfilesystem test2 csi
sh-4.4#
```

--------------------------------------------

Note: if I put a 5-second delay between the chmod and the mkdir of the first directory, it looks like the right permissions are set. Not sure whether it matters, but pasting it here for reference.

--- Additional comment from Xiubo Li on 2023-04-19 09:02:12 UTC ---

Thanks Madhu for your tests.

I think the problem is that after changing registry1's mode, the client caches it if the 'Fx' caps are issued, while the following mkdir operation sends the request to the MDS directly. Because the new mode is still cached on the client side, the MDS misses setting the S_ISGID bit when creating the new inode for the 'a' directory:

```
3350 CInode* Server::prepare_new_inode(MDRequestRef& mdr, CDir *dir, inodeno_t useino, unsigned mode,
3351                                    const file_layout_t *layout)
3352 {
...
3416   if (pip->mode & S_ISGID) {
3417     dout(10) << " dir is sticky" << dendl;
3418     _inode->gid = pip->gid;
3419     if (S_ISDIR(mode)) {
3420       dout(10) << " new dir also sticky" << dendl;
3421       _inode->mode |= S_ISGID;
3422     }
3423   } else {
3424     _inode->gid = mdr->client_request->get_caller_gid();
3425   }
```

For ceph-fuse the dirty caps are flushed periodically every 5 seconds, while the kclient can cache them for up to 60 seconds. So this is why sleeping 5 seconds makes it work, but in theory for the kclient it could still fail. I need to wait for your logs to dig into it further.

Thanks
- Xiubo

--- Additional comment from Madhu Rajanna on 2023-04-19 09:04:37 UTC ---

Created attachment 1958238 [details]
ceph logs with/without timeout

--- Additional comment from Xiubo Li on 2023-04-19 09:36:48 UTC ---

(In reply to Madhu Rajanna from comment #33)
> Created attachment 1958238 [details]
> ceph logs with/without timeout

Thanks again Madhu.

I think I found the root cause; this is a known bug that has already been fixed by https://tracker.ceph.com/issues/56010.

The root cause is that the following command sends a 'setattr' request to the MDS immediately, and the MDS projects it in its cache:
```
$ chmod g+s,a+rwx /tmp/registry2
```

And then when the first mkdir comes, the old Ceph code in the MDS uses the CInode's current metadata ('diri') when preparing the new CInode, instead of the projected metadata ('pip'). Because the previous 'setattr' was only journaled in the local cache and has not been flushed out yet, the projected metadata has not been applied to the CInode metadata:

```
$ mkdir -p /tmp/registry2/a
```

Old Ceph code for prepare_new_inode():

```
3334   CInode *diri = dir->get_inode();
3335
3336   dout(10) << oct << " dir mode 0" << diri->get_inode()->mode << " new mode 0" << mode << dec << dendl;
3337
3338   if (diri->get_inode()->mode & S_ISGID) {
3339     dout(10) << " dir is sticky" << dendl;
3340     _inode->gid = diri->get_inode()->gid;
3341     if (S_ISDIR(mode)) {
3342       dout(10) << " new dir also sticky" << dendl;
3343       _inode->mode |= S_ISGID;
3344     }
3345   } else
3346     _inode->gid = mdr->client_request->get_caller_gid();
```

The new Ceph code for prepare_new_inode() after that fix:

```
3411   CInode *diri = dir->get_inode();
3412   auto pip = diri->get_projected_inode();
3413
3414   dout(10) << oct << " dir mode 0" << pip->mode << " new mode 0" << mode << dec << dendl;
3415
3416   if (pip->mode & S_ISGID) {
3417     dout(10) << " dir is sticky" << dendl;
3418     _inode->gid = pip->gid;
3419     if (S_ISDIR(mode)) {
3420       dout(10) << " new dir also sticky" << dendl;
3421       _inode->mode |= S_ISGID;
3422     }
3423   } else {
3424     _inode->gid = mdr->client_request->get_caller_gid();
3425   }
```

Why does sleeping 5 seconds work? That's because the MDS flushes the journal logs periodically every 5 seconds.

Thanks
- Xiubo

--- Additional comment from Xiubo Li on 2023-04-19 09:44:22 UTC ---

I have downstreamed it to the 5.3 branch: https://gitlab.cee.redhat.com/ceph/ceph/-/merge_requests/288

And the 6.0 branch already includes this fix.

--- Additional comment from Venky Shankar on 2023-04-19 09:55:38 UTC ---

(In reply to Xiubo Li from comment #34)
> I think I found the root cause; this is a known bug that has already been fixed
> by https://tracker.ceph.com/issues/56010.
>
> The root cause is that the following command sends a 'setattr' request to the
> MDS immediately, and the MDS projects it in its cache.

Correct. That's the bug we are running into.

(In reply to Xiubo Li from comment #35)
> I have downstreamed it to the 5.3 branch:
> https://gitlab.cee.redhat.com/ceph/ceph/-/merge_requests/288
>
> And the 6.0 branch already includes this fix.

Did we miss this downstream backport (to RHCS5)?

--- Additional comment from Eran Tamir on 2023-04-19 10:03:07 UTC ---

@xiubli Any workaround we can suggest for the customer?

--- Additional comment from Xiubo Li on 2023-04-19 10:04:30 UTC ---

(In reply to Venky Shankar from comment #36)
> Did we miss this downstream backport (to RHCS5)?

It seems we missed this for all downstream versions and only backported it to the upstream quincy and pacific branches. RHCS 6 includes it only because it was rebased onto the upstream quincy release.

Thanks
- Xiubo

--- Additional comment from Xiubo Li on 2023-04-19 10:08:30 UTC ---

(In reply to Eran Tamir from comment #37)
> @xiubli Any workaround we can suggest for the customer?

As a workaround, you can just wait 5 seconds after changing the setgid mode, or fire a request to flush the MDS journal log after changing the setgid.

Thanks
- Xiubo
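For completeness, here is a sketch of the workaround described in the comment above, as it would look for a manually mounted CephFS path (it cannot be applied from the kubelet path, as the following comments point out). The mount point and the MDS name are illustrative placeholders; the actual MDS name has to be taken from `ceph mds stat` on the affected cluster.

```bash
# Sketch of the suggested workaround on a manually mounted CephFS directory.
# /mnt/cephfs-vol and the MDS name below are hypothetical placeholders.
chgrp 9999 /mnt/cephfs-vol
chmod g+s,a+rwx /mnt/cephfs-vol

# Option 1: give the MDS time to flush its journal (roughly every 5 seconds).
sleep 5

# Option 2: ask the MDS to flush its journal explicitly.
ceph tell mds.ocs-storagecluster-cephfilesystem-a flush journal

# Only then create the subdirectories; they should now inherit gid 9999.
mkdir /mnt/cephfs-vol/record /mnt/cephfs-vol/critical-containers-logs
ls -l /mnt/cephfs-vol
```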
--- Additional comment from Madhu Rajanna on 2023-04-19 10:13:15 UTC ---

(In reply to Xiubo Li from comment #39)
> As a workaround, you can just wait 5 seconds after changing the setgid mode,
> or fire a request to flush the MDS journal log after changing the setgid.

I think this workaround is not possible, because kubelet does all of this work; nothing is done in ceph-csi. Is it possible to make configuration changes at the cluster level to overcome this problem? @Hemant can you please confirm?

--- Additional comment from daniel on 2023-04-19 10:15:08 UTC ---

Can someone help me understand where the problem is? Looking at https://gitlab.cee.redhat.com/ceph/ceph/-/merge_requests/288, I am unsure whether this is in the MDS or in the kernel CephFS client.

--- Additional comment from Chen on 2023-04-19 10:17:49 UTC ---

Hi team,

Currently the customer has stopped using subPath in Kubernetes and is instead using two separate volume mounts to work around the issue. In the customer's own words: "Currently, we have a workaround to change our helm chart, not to mount two folders by `subpath`, but mount one folder and create separate folder in our SW code."

Best Regards,
Chen

--- Additional comment from Venky Shankar on 2023-04-19 10:22:41 UTC ---

(In reply to daniel from comment #41)
> Can someone help me understand where the problem is? Looking at
> https://gitlab.cee.redhat.com/ceph/ceph/-/merge_requests/288, I am unsure
> whether this is in the MDS or in the kernel CephFS client.

It's an MDS bug.

--- Additional comment from Hemant Kumar on 2023-04-19 14:01:00 UTC ---

Yes, we can't arbitrarily put a 5s sleep just for CephFS in kubelet between the time we set the setgid bit and the first call to `mkdir`. The kubelet code is designed to handle all volume types. Also, even if this could theoretically be worked around in kubelet, it would take a while for the fix to be merged and backported to downstream OCP releases (I really doubt we could do it faster than backporting the fix from Ceph upstream).

--- Additional comment from Chen on 2023-04-21 03:02:49 UTC ---

Hi team,

I'm trying to give the customer an update, but before that I'd like to confirm where we are now. Looking at https://gitlab.cee.redhat.com/ceph/ceph/-/merge_requests/288, it seems we don't need approvers to approve the MR even though the pipeline fails [1], right? The main reason the MR can't be merged is that it needs a rebase, so once the MR is rebased the downstream will get the fix, right?

Another thing I'd like to confirm: ODF 4.12 is based on Ceph 5.3.1. So once this fix is backported to Ceph 5.3.1, how can I know which ODF contains the fix?
[1] https://jenkins.ceph.redhat.com/job/ceph-test-build/607/console

Thank you so much!

Best Regards,
Chen

--- Additional comment from Mudit Agarwal on 2023-04-21 08:01:42 UTC ---

Hi Chen,

Yes, ODF 4.12 is based on RHCS 5.3z1. Once the fix is merged in RHCS 5.3, ODF will need another z-stream of RHCS 5.3 (e.g. 5.3z2) to consume, and we will do another minor version of ODF 4.12 with that z-stream. It will not automatically land in 4.12 just because 4.12 is based on 5.3; Ceph has to release another z-stream of 5.3 with this fix.

--- Additional comment from Chen on 2023-04-21 08:18:09 UTC ---

Hi Mudit,

Thank you for your reply!

So ODF 4.12 is bound to Ceph 5.3, and this should be the sequence of events (if my understanding is wrong, please let me know):

1. Make sure this fix is included in Ceph 5.3z2.
2. Once Ceph 5.3z2 is released, ODF will release a new z-stream version (say ODF 4.12.3 or 4.12.4; the current GA version is 4.12.2) that uses Ceph 5.3z2 as its code base.

Best Regards,
Chen

--- Additional comment from Xiubo Li on 2023-04-21 09:10:47 UTC ---

(In reply to Chen from comment #45)
> The main reason the MR can't be merged is that it needs a rebase, so once the
> MR is rebased the downstream will get the fix, right?

I have rebased the MR.

Thanks

--- Additional comment from Venky Shankar on 2023-04-24 08:59:06 UTC ---

(In reply to Chen from comment #47)
> 1. Make sure this fix is included in Ceph 5.3z2.

5.3z2 has already GAed. This should be 5.3z3.

--- Additional comment from Venky Shankar on 2023-04-24 13:44:37 UTC ---

(In reply to Chen from comment #45)
> it seems we don't need approvers to approve the MR even though the pipeline
> fails [1], right?

The pipeline failures happen for all MRs (the downstream GitLab Jenkins tests have been broken for a long time - I _do not_ look at them as a prerequisite to merge an MR). That said, the MR is fine and the fix will be available in the 5.3z3 release.

--- Additional comment from Colum Gaynor on 2023-04-24 16:42:03 UTC ---

@vshankar or @etamir ---> Is there any estimate of when the fix coming in the 5.3z3 release will be available for Nokia CloudRAN using OCP 4.12.z/ODF 4.12.z, and in which minor release?

Colum Gaynor - Senior PAL, Nokia Global Account

--- Additional comment from Dariush Eslimi on 2023-04-25 16:38:22 UTC ---

I am trying to map when we are going to have this fix as well. So far, based on https://pp.engineering.redhat.com/pp/product/ceph/release/ceph-5-3/schedule/tasks, 5.3z3 will GA on May 23rd, assuming it stays on schedule. How do we map this to the release in RHEL, and then the OCP 4.12.z, that this will land in?
cc: @blitton

--- Additional comment from Mudit Agarwal on 2023-04-26 12:46:36 UTC ---

Based on the Ceph 5.3z3 schedule, most probably 4.12.4 will carry this fix, and it would be released in early June.

Venky, can you please create a 5.3z3 Ceph bug to track this fix in Ceph? Also, should we assume that the fix will be part of RHCS 6.1 as well, so that ODF 4.13 doesn't miss it?

--- Additional comment from Venky Shankar on 2023-04-26 13:41:15 UTC ---

(In reply to Mudit Agarwal from comment #53)
> Venky, can you please create a 5.3z3 Ceph bug to track this fix in Ceph?

ACK.

> Also, should we assume that the fix will be part of RHCS 6.1 as well, so that
> ODF 4.13 doesn't miss it?

The fix is present in RHCS 6.
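Once a build with the MDS fix is rolled out (5.3z3 downstream, per the comments above), a quick way to confirm it on a cluster is to check the daemon versions and rerun the reproducer from the earlier comments without any sleep. The commands below are only a sketch and assume access to the Ceph tools pod; the exact fixed build number comes from the errata referenced at the end of this bug.

```bash
# Sketch: confirm the running daemons picked up the fixed build, then rerun
# the setgid reproducer from the earlier comment without the `sleep 5`.
ceph versions     # all mds daemons should report the rebuilt 5.3z3 package
ceph fs status    # sanity check that the MDS is active

# After rerunning the reproducer script (chgrp + chmod g+s followed
# immediately by mkdir), both new directories should show group 9999:
ls -l /tmp/registry1
```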
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat Ceph Storage 5.3 Bug Fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2023:3259