
Bug 2233673

Summary: Observed continuous "rados_cluster_grace_enforcing :CLIENT ID :EVENT :rados_cluster_grace_enforcing: ret=-45" messages on 2 out of 4 nodes running the ganesha service; the server keeps re-entering the 90 sec grace period
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Manisha Saini <msaini>
Component: Cephadm
Assignee: Shweta Bhosale <shbhosal>
Status: MODIFIED
QA Contact: Hemanth Kumar <hyelloji>
Severity: high
Priority: unspecified
Version: 7.0
CC: adking, brgardne, cephqe-warriors, ffilz, kkeithle, ngangadh, saraut, vdas
Target Milestone: ---
Target Release: 9.1
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Type: Bug

Description Manisha Saini 2023-08-22 21:26:21 UTC
Description of problem:
=======================

On the scale lab cluster:

 --> On Node 3 and Node 4, observed multiple "rados_cluster_grace_enforcing: ret=-45" log messages, each followed by the server going into a 90 sec GRACE period

[root@e24-h25-740xd ~]# cat f1 | grep "Server Now IN GRACE, duration 90" | wc -l
67

[root@e24-h25-740xd ~]# cat f1 | grep ":rados_cluster_grace_enforcing" | wc -l
66

[root@e24-h27-740xd ~]# cat f1 | grep ":rados_cluster_grace_enforcing" | wc -l
73

[root@e24-h27-740xd ~]#  cat f1 | grep "Server Now IN GRACE, duration 90" | wc -l
73

--------------------------------------

4-h27-740xd-mfelxg[21201]: 19/08/2023 19:59:20 : epoch 64e11e0d : e24-h27-740xd.alias.bos.scalelab.redhat.com : ganesha.nfsd-2[main] rados_cluster_grace_enforcing :CLIENT ID :EVENT :rados_cluster_grace_enforcing: ret=-45
4-h27-740xd-mfelxg[21201]: 19/08/2023 19:59:24 : epoch 64e11e0d : e24-h27-740xd.alias.bos.scalelab.redhat.com : ganesha.nfsd-2[main] nfs_start_grace :STATE :EVENT :NFS Server Now IN GRACE, duration 90
4-h27-740xd-mfelxg[21201]: 19/08/2023 19:59:24 : epoch 64e11e0d : e24-h27-740xd.alias.bos.scalelab.redhat.com : ganesha.nfsd-2[main] nfs_start_grace :STATE :EVENT :grace reload client info completed from backend
4-h27-740xd-mfelxg[21201]: 19/08/2023 19:59:24 : epoch 64e11e0d : e24-h27-740xd.alias.bos.scalelab.redhat.com : ganesha.nfsd-2[main] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
4-h27-740xd-mfelxg[21201]: 19/08/2023 19:59:24 : epoch 64e11e0d : e24-h27-740xd.alias.bos.scalelab.redhat.com : ganesha.nfsd-2[main] rados_cluster_grace_enforcing :CLIENT ID :EVENT :rados_cluster_grace_enforcing: ret=-45
4-h27-740xd-mfelxg[21201]: 19/08/2023 19:59:28 : epoch 64e11e0d : e24-h27-740xd.alias.bos.scalelab.redhat.com : ganesha.nfsd-2[main] nfs_start_grace :STATE :EVENT :NFS Server Now IN GRACE, duration 90
4-h27-740xd-mfelxg[21201]: 19/08/2023 19:59:28 : epoch 64e11e0d : e24-h27-740xd.alias.bos.scalelab.redhat.com : ganesha.nfsd-2[main] nfs_start_grace :STATE :EVENT :grace reload client info completed from backend
4-h27-740xd-mfelxg[21201]: 19/08/2023 19:59:28 : epoch 64e11e0d : e24-h27-740xd.alias.bos.scalelab.redhat.com : ganesha.nfsd-2[main] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
4-h27-740xd-mfelxg[21201]: 19/08/2023 19:59:28 : epoch 64e11e0d : e24-h27-740xd.alias.bos.scalelab.redhat.com : ganesha.nfsd-2[main] rados_cluster_grace_enforcing :CLIENT ID :EVENT :rados_cluster_grace_enforcing: ret=-45

-----------------------------------
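From the excerpt above, the server re-enters grace roughly every 4 seconds. One quick way to confirm that cadence is to compute the gap between consecutive "IN GRACE" timestamps; a sketch, where two sample lines in the excerpt's format stand in for the real ganesha log file:

```shell
# Sample "IN GRACE" lines (same timestamp format as the real log) stand in
# for the actual ganesha log here.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
19/08/2023 19:59:24 : epoch 64e11e0d : host : ganesha.nfsd-2[main] nfs_start_grace :STATE :EVENT :NFS Server Now IN GRACE, duration 90
19/08/2023 19:59:28 : epoch 64e11e0d : host : ganesha.nfsd-2[main] nfs_start_grace :STATE :EVENT :NFS Server Now IN GRACE, duration 90
EOF

# Convert each HH:MM:SS timestamp (field 2) to seconds and print the
# difference between consecutive grace entries.
OUT=$(grep "IN GRACE" "$LOG" |
      awk '{split($2,t,":"); s=t[1]*3600+t[2]*60+t[3];
            if (p) print s-p " sec between grace entries"; p=s}')
echo "$OUT"
```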


 --> Also, on Node 1, observed multiple "nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(2)" log messages

[root@e24-h21-740xd var]# cat f1 | grep "check grace:reclaim complete(0)" | wc -l
1770

---------------------------------------

4-h21-740xd-jevbro[44672]: 19/08/2023 09:26:20 : epoch 64dff537 : e24-h21-740xd.alias.bos.scalelab.redhat.com : ganesha.nfsd-2[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(2)
4-h21-740xd-jevbro[44672]: 19/08/2023 09:26:30 : epoch 64dff537 : e24-h21-740xd.alias.bos.scalelab.redhat.com : ganesha.nfsd-2[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(2)
4-h21-740xd-jevbro[44672]: 19/08/2023 09:26:40 : epoch 64dff537 : e24-h21-740xd.alias.bos.scalelab.redhat.com : ganesha.nfsd-2[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(2)
4-h21-740xd-jevbro[44672]: 19/08/2023 09:26:50 : epoch 64dff537 : e24-h21-740xd.alias.bos.scalelab.redhat.com : ganesha.nfsd-2[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(2)
4-h21-740xd-jevbro[44672]: 19/08/2023 09:27:00 : epoch 64dff537 : e24-h21-740xd.alias.bos.scalelab.redhat.com : ganesha.nfsd-2[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(2)
4-h21-740xd-jevbro[44672]: 19/08/2023 09:27:10 : epoch 64dff537 : e24-h21-740xd.alias.bos.scalelab.redhat.com : ganesha.nfsd-2[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(2)
4-h21-740xd-jevbro[44672]: 19/08/2023 09:27:20 : epoch 64dff537 : e24-h21-740xd.alias.bos.scalelab.redhat.com : ganesha.nfsd-2[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(2)
4-h21-740xd-jevbro[44672]: 19/08/2023 09:27:30 : epoch 64dff537 : e24-h21-740xd.alias.bos.scalelab.redhat.com : ganesha.nfsd-2[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(2)

--------------------------------------


Version-Release number of selected component (if applicable):
=============================================================

# rpm -qa | grep ganesha
nfs-ganesha-selinux-5.4-1.el9cp.noarch
nfs-ganesha-5.4-1.el9cp.x86_64
nfs-ganesha-rgw-5.4-1.el9cp.x86_64
nfs-ganesha-ceph-5.4-1.el9cp.x86_64
nfs-ganesha-rados-grace-5.4-1.el9cp.x86_64
nfs-ganesha-rados-urls-5.4-1.el9cp.x86_64


How reproducible:
==================

1/1


Steps to Reproduce:
==================
1. Configure ganesha on 4 nodes with HA
2. Create 1 CephFS volume and mount it on 3 clients via a single VIP
3. Run the following workload on the clients

Clients 1, 2 and 3:
------
Create 1G files in a loop using dd, along with fio
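The client workload above can be sketched as follows. This is a hedged sketch: the mount point, file names, and the fio invocation are assumptions, and sizes are scaled down here (the run in this report used 1G files):

```shell
# Stand-in for the NFS mount point; on a real client this would be the
# CephFS export mounted via the VIP.
MNT=$(mktemp -d)

# dd loop: write files with fsync, per "create 1G file in loop using dd".
# bs=1M count=4 keeps this sketch small; the report used 1G (count=1024).
for i in 1 2 3; do
    dd if=/dev/zero of="$MNT/file_$i" bs=1M count=4 conv=fsync 2>/dev/null
done

# In parallel on the same mount the report ran fio, e.g. (not executed here):
#   fio --name=seqwrite --directory="$MNT" --rw=write --bs=1M --size=1G --numjobs=4

ls -l "$MNT"
```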



Actual results:
=============
The Ganesha server repeatedly enters the grace period:

rados_cluster_grace_enforcing :CLIENT ID :EVENT :rados_cluster_grace_enforcing: ret=-45
nfs_start_grace :STATE :EVENT :NFS Server Now IN GRACE, duration 90 

On another server, the following logs were observed -

nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(2)

Expected results:
==============
Continuous messages should not flood ganesha.log.
The Ganesha server should not enter the GRACE period multiple times.

Additional info:

Comment 2 Manisha Saini 2023-09-05 08:44:30 UTC
Any Update on RCA?

Comment 5 Frank Filz 2023-09-06 20:04:01 UTC
Could you share the full config, not just the EXPORT config, but NFS_CORE_PARAM and NFSV4 etc.?
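For reference, a cephadm-deployed ganesha.conf typically carries blocks like the following. This is a hedged sketch of the usual shape, not this cluster's actual config; the UserId/nodeid/namespace values are placeholders:

```
NFS_CORE_PARAM {
        Enable_NLM = false;
        Enable_RQUOTA = false;
        Protocols = 4;
}

NFSv4 {
        Delegations = false;
        RecoveryBackend = 'rados_cluster';
        Minor_Versions = 1, 2;
}

RADOS_KV {
        UserId = "nfs.cephnfs.0";      # placeholder
        nodeid = "cephnfs.0";          # placeholder
        pool = ".nfs";
        namespace = "cephnfs";         # placeholder
}
```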

Comment 11 Frank Filz 2023-09-21 16:40:02 UTC
I really don't have anything more. Someone who has knowledge about the administration of the cluster needs to look into what might be going on.

The Ganesha logs show clear evidence that the Ganesha instances are NOT all starting at the same time, and from the logs, it appears that the messages clear up when ALL the Ganesha instances have started. This strongly suggests the cluster admin infrastructure is not doing the right thing.
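One way to check whether all instances have joined the shared grace database is the ganesha-rados-grace tool from the nfs-ganesha-rados-grace package listed in the version info above. A sketch, run against a live cluster; the pool and namespace are placeholders and must match the RADOS_KV settings in the deployed config:

```
# Dump the grace epochs and the per-node flags from the shared grace DB,
# showing which instances have joined and which are still enforcing grace.
ganesha-rados-grace --pool .nfs --ns cephnfs dump
```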

Comment 12 Frank Filz 2023-09-21 17:41:18 UTC
Redirecting to cephadm as this really seems to be a cluster admin issue, not a Ganesha code issue.