Bug 2216442
| Summary: | [NFS-Ganesha] File locking test from multiple nfs client on same file is failing with ganesha service crash | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Manisha Saini <msaini> |
| Component: | NFS-Ganesha | Assignee: | Frank Filz <ffilz> |
| Status: | ASSIGNED --- | QA Contact: | Manisha Saini <msaini> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 6.1 | CC: | cephqe-warriors, ffilz, kkeithle, vdas |
| Target Milestone: | --- | | |
| Target Release: | 7.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Can we get a tcpdump trace of the two clients? Also, do you have a stack backtrace from the coredump?

Tested this on:

# rpm -qa | grep ganesha
nfs-ganesha-selinux-5.3-1.el9cp.noarch
nfs-ganesha-5.3-1.el9cp.x86_64
nfs-ganesha-ceph-5.3-1.el9cp.x86_64
nfs-ganesha-rados-grace-5.3-1.el9cp.x86_64
nfs-ganesha-rados-urls-5.3-1.el9cp.x86_64
nfs-ganesha-rgw-5.3-1.el9cp.x86_64

Not hitting the crash anymore. Locking works as expected.

Client 1
======
# ./a.out /mnt/ganesha/file1
opening /mnt/ganesha/file1
opened; hit Enter to lock...
locking
locked; hit Enter to write...
Write succeeeded
locked; hit Enter to unlock...
unlocking

# ./a.out /mnt/ganesha/file1
opening /mnt/ganesha/file1
opened; hit Enter to lock...
locking
locked; hit Enter to write...
Write succeeeded
locked; hit Enter to unlock...
unlocking

Client 2:
=======
# ./a.out /mnt/ganesha/file1
opening /mnt/ganesha/file1
opened; hit Enter to lock...
locking
locked; hit Enter to write...
Write succeeeded
locked; hit Enter to unlock...
unlocking

# ./a.out /mnt/ganesha/file1
opening /mnt/ganesha/file1
opened; hit Enter to lock...
locking
locked; hit Enter to write...
Write succeeeded
locked; hit Enter to unlock...
unlocking

Can we close this one then?

Hi Frank, since the issue is fixed in nfs-ganesha 5.3, we can move this BZ to ON_QA for verification once the official RHCS 7.0 builds with nfs-ganesha 5.3 are available. QE will validate this issue with the RHCS 7.0 builds.

Sounds good.
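For reference, the locking test above (and in the description below) drives a small interactive program (a.out) that takes a POSIX advisory lock on the file via fcntl(), writes to it, and then releases the lock. That program is not attached to this BZ; the sketch below is a minimal, hypothetical reconstruction, assuming a whole-file write lock (F_WRLCK) requested non-blockingly with F_SETLK and a made-up wait_enter() helper for the interactive prompts. The actual QE test program may differ.

```c
/* Hypothetical reconstruction of the a.out lock test used in this BZ.
 * Assumption: a whole-file POSIX write lock taken non-blockingly via
 * fcntl(F_SETLK); the real QE program may differ.
 *
 * Usage: ./a.out <file-on-nfs-mount>
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Pause so the same step can be coordinated across two clients. */
static void wait_enter(const char *state)
{
    printf("%s; hit Enter to continue...\n", state);
    getchar();
}

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }

    printf("opening %s\n", argv[1]);
    int fd = open(argv[1], O_RDWR | O_CREAT, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    wait_enter("opened");

    /* Whole-file exclusive lock; l_len == 0 means "to end of file". */
    struct flock fl;
    memset(&fl, 0, sizeof(fl));
    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;

    printf("locking\n");
    /* F_SETLK is non-blocking: it fails (normally EAGAIN or EACCES) if
     * another client already holds a conflicting lock. */
    if (fcntl(fd, F_SETLK, &fl) == -1)
        fprintf(stderr, "fcntl failed (%s)\n", strerror(errno));
    wait_enter("locked");

    if (write(fd, "x", 1) == 1)
        printf("Write succeeded\n");
    else
        printf("Write failed\n");
    wait_enter("wrote");

    printf("unlocking\n");
    fl.l_type = F_UNLCK;
    if (fcntl(fd, F_SETLK, &fl) == -1)
        fprintf(stderr, "fcntl UNLOCK failed (%s)\n", strerror(errno));

    close(fd);
    return 0;
}
```

Run from two clients against the same exported file, this mirrors the reproduction steps below: while client 1 holds the lock, client 2's F_SETLK would normally fail with EAGAIN or EACCES and then succeed after client 1 unlocks; the EBADF ("Bad file descriptor") errors in the description are not the usual lock-conflict errno.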
Description of problem:
===========
Taking a lock on the same file from 2 different NFS clients is failing on RHCS 6.1. A core is generated, and the ganesha process crashes and restarts.

Version-Release number of selected component (if applicable):
========
[ceph: root@ceph-mani-vt0a3y-node1-installer /]# rpm -qa | grep ganesha
nfs-ganesha-selinux-5.1-1.el9cp.noarch
nfs-ganesha-5.1-1.el9cp.x86_64
nfs-ganesha-ceph-5.1-1.el9cp.x86_64
nfs-ganesha-rados-grace-5.1-1.el9cp.x86_64
nfs-ganesha-rados-urls-5.1-1.el9cp.x86_64
nfs-ganesha-rgw-5.1-1.el9cp.x86_64

How reproducible:
============
3/3

Steps to Reproduce:
===============
1. Deploy an RHCS 6.1 cluster.
2. Deploy an NFS-Ganesha cluster on 2 of the nodes.
3. Create a subvolume and export it via ganesha.
4. Mount the volume on 2 clients:
   # sudo mount -t nfs 10.0.210.192:/ceph /mnt/ganesha/
5. Create a file.
6. Take a lock on file 1 from client 1 --> Success
   Take a lock on file 1 from client 2 --> Failed, which is expected
   Release the lock on file 1 from client 1 --> Success
   Try to take a lock on file 1 from client 2 --> It fails again, but ideally client 2 should be able to acquire the lock now that client 1 has released it.

Client 1
---
# ./a.out /mnt/ganesha/5G
opening /mnt/ganesha/5G
opened; hit Enter to lock...
locking
locked; hit Enter to write...
Write succeeeded
locked; hit Enter to unlock...
unlocking

Client 2
----
$ ./a.out /mnt/ganesha/4G
opening /mnt/ganesha/4G
opened; hit Enter to lock...
locking
fcntl failed (Bad file descriptor)
locked; hit Enter to write...
Write failed
locked; hit Enter to unlock...
unlocking
fcntl UNLOCK failed (Bad file descriptor)

-----
Even after the lock is released from client 1, client 2 is unable to take the lock:
-----
$ ./a.out /mnt/ganesha/4G
opening /mnt/ganesha/4G
opened; hit Enter to lock...
locking
fcntl failed (Bad file descriptor)
locked; hit Enter to write...
Write failed
locked; hit Enter to unlock...
unlocking
fcntl UNLOCK failed (Bad file descriptor)

Actual results:
======
[ceph: root@ceph-mani-vt0a3y-node1-installer /]# ceph nfs cluster info nfsganesha
{
    "nfsganesha": {
        "virtual_ip": null,
        "backend": [
            {
                "hostname": "ceph-mani-vt0a3y-node4",
                "ip": "10.0.210.192",
                "port": 2049
            },
            {
                "hostname": "ceph-mani-vt0a3y-node6",
                "ip": "10.0.209.80",
                "port": 2049
            }
        ]
    }
}

---------
]# ceph orch ps --daemon_type=nfs
NAME                                              HOST                    PORTS   STATUS         REFRESHED  AGE  MEM USE  MEM LIM  VERSION  IMAGE ID      CONTAINER ID
nfs.nfsganesha.0.0.ceph-mani-vt0a3y-node4.xwjsss  ceph-mani-vt0a3y-node4  *:2049  running (97s)  94s ago    97s  17.0M    -        5.1      b765ab0f02a7  e47becaf8f35
nfs.nfsganesha.1.0.ceph-mani-vt0a3y-node6.fhhzii  ceph-mani-vt0a3y-node6  *:2049  running (96s)  94s ago    95s  19.0M    -        5.1      b765ab0f02a7  4704f64c9c5e

-------
[ceph: root@ceph-mani-vt0a3y-node1-installer /]# ceph fs subvolume ls cephfs --group_name nfs
[
    {
        "name": "ganesha1"
    },
    {
        "name": "ganesha2"
    }
]

-----
# ceph fs subvolume getpath cephfs ganesha2 --group_name nfs
/volumes/nfs/ganesha2/fd964897-7046-4062-af4b-f3d0eb191de0

# ceph nfs export create cephfs nfsganesha /ceph cephfs --path=/volumes/nfs/ganesha2/fd964897-7046-4062-af4b-f3d0eb191de0
{
    "bind": "/ceph",
    "fs": "cephfs",
    "path": "/volumes/nfs/ganesha2/fd964897-7046-4062-af4b-f3d0eb191de0",
    "cluster": "nfsganesha",
    "mode": "RW"
}

# ceph nfs export get nfsganesha /ceph
{
    "export_id": 1,
    "path": "/volumes/nfs/ganesha2/fd964897-7046-4062-af4b-f3d0eb191de0",
    "cluster_id": "nfsganesha",
    "pseudo": "/ceph",
    "access_type": "RW",
    "squash": "none",
    "security_label": true,
    "protocols": [
        4
    ],
    "transports": [
        "TCP"
    ],
    "fsal": {
        "name": "CEPH",
        "user_id": "nfs.nfsganesha.1",
        "fs_name": "cephfs"
    },
    "clients": []
}

Expected results:
==============
The ganesha service should not crash, and client 2 should be able to take the lock once it has been released by client 1.

Additional info:
==============
ganesha.log
---------
P :EVENT :-------------------------------------------------
Jun 20 06:30:41 ceph-mani-vt0a3y-node4 ceph-20c36492-0f49-11ee-96cd-fa163e5538cc-nfs-nfsganesha-0-0-ceph-mani-vt0a3y-node4-xwjsss[24815]: 20/06/2023 10:30:41 : epoch 64917fce : ceph-mani-vt0a3y-node4 : ganesha.nfsd-2[main] nfs_start :NFS STARTUP :EVENT : NFS SERVER INITIALIZED
Jun 20 06:30:41 ceph-mani-vt0a3y-node4 ceph-20c36492-0f49-11ee-96cd-fa163e5538cc-nfs-nfsganesha-0-0-ceph-mani-vt0a3y-node4-xwjsss[24815]: 20/06/2023 10:30:41 : epoch 64917fce : ceph-mani-vt0a3y-node4 : ganesha.nfsd-2[main] nfs_start :NFS STARTUP :EVENT :-------------------------------------------------
Jun 20 06:30:41 ceph-mani-vt0a3y-node4 ceph-20c36492-0f49-11ee-96cd-fa163e5538cc-nfs-nfsganesha-0-0-ceph-mani-vt0a3y-node4-xwjsss[24815]: 20/06/2023 10:30:41 : epoch 64917fce : ceph-mani-vt0a3y-node4 : ganesha.nfsd-2[reaper] nfs_start_grace :STATE :EVENT :NFS Server Now IN GRACE, duration 90
Jun 20 06:30:41 ceph-mani-vt0a3y-node4 ceph-20c36492-0f49-11ee-96cd-fa163e5538cc-nfs-nfsganesha-0-0-ceph-mani-vt0a3y-node4-xwjsss[24815]: 20/06/2023 10:30:41 : epoch 64917fce : ceph-mani-vt0a3y-node4 : ganesha.nfsd-2[reaper] nfs_start_grace :STATE :EVENT :grace reload client info completed from backend
Jun 20 06:30:41 ceph-mani-vt0a3y-node4 ceph-20c36492-0f49-11ee-96cd-fa163e5538cc-nfs-nfsganesha-0-0-ceph-mani-vt0a3y-node4-xwjsss[24815]: 20/06/2023 10:30:41 : epoch 64917fce : ceph-mani-vt0a3y-node4 : ganesha.nfsd-2[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
Jun 20 06:30:41 ceph-mani-vt0a3y-node4 ceph-20c36492-0f49-11ee-96cd-fa163e5538cc-nfs-nfsganesha-0-0-ceph-mani-vt0a3y-node4-xwjsss[24815]: 20/06/2023 10:30:41 : epoch 64917fce : ceph-mani-vt0a3y-node4 : ganesha.nfsd-2[reaper] nfs_lift_grace_locked :STATE :EVENT :NFS Server Now NOT IN GRACE
Jun 20 07:02:56 ceph-mani-vt0a3y-node4 ceph-20c36492-0f49-11ee-96cd-fa163e5538cc-nfs-nfsganesha-0-0-ceph-mani-vt0a3y-node4-xwjsss[24815]: 20/06/2023 11:02:56 : epoch 64917fce : ceph-mani-vt0a3y-node4 : ganesha.nfsd-2[sigmgr] sigmgr_thread :MAIN :EVENT :SIGHUP_HANDLER: Received SIGHUP.... initiating export list reload
Jun 20 07:02:56 ceph-mani-vt0a3y-node4 ceph-20c36492-0f49-11ee-96cd-fa163e5538cc-nfs-nfsganesha-0-0-ceph-mani-vt0a3y-node4-xwjsss[24815]: 20/06/2023 11:02:56 : epoch 64917fce : ceph-mani-vt0a3y-node4 : ganesha.nfsd-2[sigmgr] reclaim_reset :FSAL :EVENT :start_reclaim failed: -2
Jun 20 07:02:57 ceph-mani-vt0a3y-node4 ceph-20c36492-0f49-11ee-96cd-fa163e5538cc-nfs-nfsganesha-0-0-ceph-mani-vt0a3y-node4-xwjsss[24815]: 20/06/2023 11:02:57 : epoch 64917fce : ceph-mani-vt0a3y-node4 : ganesha.nfsd-2[sigmgr] reread_exports :CONFIG :EVENT :Reread exports complete
Jun 20 14:41:06 ceph-mani-vt0a3y-node4 ceph-20c36492-0f49-11ee-96cd-fa163e5538cc-nfs-nfsganesha-0-0-ceph-mani-vt0a3y-node4-xwjsss[24815]: 20/06/2023 18:41:06 : epoch 64917fce : ceph-mani-vt0a3y-node4 : ganesha.nfsd-2[svc_27] nfs4_op_create_session :CLIENT ID :WARN :Invalid channel attributes for st_create_session.py:testTooSmallMaxReq
Jun 20 14:41:06 ceph-mani-vt0a3y-node4 ceph-20c36492-0f49-11ee-96cd-fa163e5538cc-nfs-nfsganesha-0-0-ceph-mani-vt0a3y-node4-xwjsss[24815]: 20/06/2023 18:41:06 : epoch 64917fce : ceph-mani-vt0a3y-node4 : ganesha.nfsd-2[svc_27] nfs4_op_create_session :CLIENT ID :WARN :Invalid channel attributes for st_create_session.py:testTooSmallMaxRS
Jun 20 14:41:07 ceph-mani-vt0a3y-node4 ceph-20c36492-0f49-11ee-96cd-fa163e5538cc-nfs-nfsganesha-0-0-ceph-mani-vt0a3y-node4-xwjsss[24815]: 20/06/2023 18:41:07 : epoch 64917fce : ceph-mani-vt0a3y-node4 : ganesha.nfsd-2[svc_24] rpc :TIRPC :EVENT :xdr_array_decode:783 ERROR size 2 > max 1
Jun 20 14:41:23 ceph-mani-vt0a3y-node4 ceph-20c36492-0f49-11ee-96cd-fa163e5538cc-nfs-nfsganesha-0-0-ceph-mani-vt0a3y-node4-xwjsss[24815]: 20/06/2023 18:41:23 : epoch 64917fce : ceph-mani-vt0a3y-node4 : ganesha.nfsd-2[svc_24] nfs4_Compound :NFS4 :MAJ :An empty COMPOUND (no operation in it) was received
Jun 20 14:41:23 ceph-mani-vt0a3y-node4 ceph-20c36492-0f49-11ee-96cd-fa163e5538cc-nfs-nfsganesha-0-0-ceph-mani-vt0a3y-node4-xwjsss[24815]: 20/06/2023 18:41:23 : epoch 64917fce : ceph-mani-vt0a3y-node4 : ganesha.nfsd-2[svc_34] rpc :TIRPC :EVENT :xdr_opaque_decode:441 ERROR opaque
Jun 20 14:41:23 ceph-mani-vt0a3y-node4 ceph-20c36492-0f49-11ee-96cd-fa163e5538cc-nfs-nfsganesha-0-0-ceph-mani-vt0a3y-node4-xwjsss[24815]: 20/06/2023 18:41:23 : epoch 64917fce : ceph-mani-vt0a3y-node4 : ganesha.nfsd-2[svc_24] nfs4_Compound :NFS4 :CRIT :Bad Minor Version 50
Jun 20 14:41:23 ceph-mani-vt0a3y-node4 ceph-20c36492-0f49-11ee-96cd-fa163e5538cc-nfs-nfsganesha-0-0-ceph-mani-vt0a3y-node4-xwjsss[24815]: 20/06/2023 18:41:23 : epoch 64917fce : ceph-mani-vt0a3y-node4 : ganesha.nfsd-2[svc_24] nfs4_Compound :NFS4 :CRIT :Bad Minor Version 50
Jun 20 14:41:24 ceph-mani-vt0a3y-node4 ceph-20c36492-0f49-11ee-96cd-fa163e5538cc-nfs-nfsganesha-0-0-ceph-mani-vt0a3y-node4-xwjsss[24815]: 20/06/2023 18:41:24 : epoch 64917fce : ceph-mani-vt0a3y-node4 : ganesha.nfsd-2[svc_24] rpc :TIRPC :EVENT :xdr_array_decode:783 ERROR size 2 > max 1
Jun 20 15:21:36 ceph-mani-vt0a3y-node4 ceph-20c36492-0f49-11ee-96cd-fa163e5538cc-nfs-nfsganesha-0-0-ceph-mani-vt0a3y-node4-xwjsss[24815]: 20/06/2023 19:21:36 : epoch 64917fce : ceph-mani-vt0a3y-node4 : ganesha.nfsd-2[svc_170] destroy_fsal_fd :RW LOCK :CRIT :Error 16, Destroy mutex 0x7efec8084fd0 (&fsal_fd->work_mutex) at /builddir/build/BUILD/nfs-ganesha-5.1/src/include/fsal_types.h:1029
Jun 20 15:21:36 ceph-mani-vt0a3y-node4 systemd-coredump[140045]: Process 24819 (ganesha.nfsd) of user 0 dumped core.
Jun 20 15:21:36 ceph-mani-vt0a3y-node4 podman[140051]: 2023-06-20 15:21:36.768359053 -0400 EDT m=+0.043847334 container died e47becaf8f35726d808d19c6d0eee431e81c034301c4f3ea8e454e50a9dd681d (image=registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:406b0370e6931584eefa37039eedd2255168b789f31e2d21f13bec9936b9d3ae, name=ceph-20c36492-0f49-11ee-96cd-fa163e5538cc-nfs-nfsganesha-0-0-ceph-mani-vt0a3y-node4-xwjsss, io.openshift.tags=rhceph ceph, io.buildah.version=1.27.3, io.openshift.expose-services=, CEPH_POINT_RELEASE=, com.redhat.license_terms=https://www.redhat.com/agreements, distribution-scope=public, name=rhceph, com.redhat.component=rhceph-container, maintainer=Guillaume Abrioux <gabrioux>, GIT_BRANCH=main, release=176, vcs-ref=dec93361f6f7a22d929d690d9002f0df9a8f6805, GIT_COMMIT=0727c855af939c6f3709e73be703026388413744, RELEASE=main, io.k8s.description=Red Hat Ceph Storage 6, vcs-type=git, version=6, build-date=2023-05-30T00:53:52, url=https://access.redhat.com/containers/#/registry.access.redhat.com/rhceph/images/6-176, GIT_CLEAN=True, description=Red Hat Ceph Storage 6, ceph=True, summary=Provides the latest Red Hat Ceph Storage 6 on RHEL 9 in a fully featured and supported base image., architecture=x86_64, io.k8s.display-name=Red Hat Ceph Storage 6 on RHEL 9, GIT_REPO=https://github.com/ceph/ceph-container.git, vendor=Red Hat, Inc.)
Jun 20 15:21:36 ceph-mani-vt0a3y-node4 podman[140051]: 2023-06-20 15:21:36.800127174 -0400 EDT m=+0.075615456 container remove e47becaf8f35726d808d19c6d0eee431e81c034301c4f3ea8e454e50a9dd681d (image=registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:406b0370e6931584eefa37039eedd2255168b789f31e2d21f13bec9936b9d3ae, name=ceph-20c36492-0f49-11ee-96cd-fa163e5538cc-nfs-nfsganesha-0-0-ceph-mani-vt0a3y-node4-xwjsss, io.openshift.tags=rhceph ceph, com.redhat.license_terms=https://www.redhat.com/agreements, maintainer=Guillaume Abrioux <gabrioux>, version=6, url=https://access.redhat.com/containers/#/registry.access.redhat.com/rhceph/images/6-176, architecture=x86_64, release=176, vendor=Red Hat, Inc., GIT_REPO=https://github.com/ceph/ceph-container.git, io.k8s.display-name=Red Hat Ceph Storage 6 on RHEL 9, distribution-scope=public, GIT_COMMIT=0727c855af939c6f3709e73be703026388413744, vcs-ref=dec93361f6f7a22d929d690d9002f0df9a8f6805, ceph=True, CEPH_POINT_RELEASE=, GIT_BRANCH=main, io.openshift.expose-services=, GIT_CLEAN=True, description=Red Hat Ceph Storage 6, io.k8s.description=Red Hat Ceph Storage 6, io.buildah.version=1.27.3, com.redhat.component=rhceph-container, name=rhceph, summary=Provides the latest Red Hat Ceph Storage 6 on RHEL 9 in a fully featured and supported base image., RELEASE=main, vcs-type=git, build-date=2023-05-30T00:53:52)
Jun 20 15:21:36 ceph-mani-vt0a3y-node4 systemd[1]: ceph-20c36492-0f49-11ee-96cd-fa163e5538cc.0.0.ceph-mani-vt0a3y-node4.xwjsss.service: Main process exited, code=exited, status=134/n/a
Jun 20 15:21:37 ceph-mani-vt0a3y-node4 systemd[1]: ceph-20c36492-0f49-11ee-96cd-fa163e5538cc.0.0.ceph-mani-vt0a3y-node4.xwjsss.service: Failed with result 'exit-code'.
Jun 20 15:21:37 ceph-mani-vt0a3y-node4 systemd[1]: ceph-20c36492-0f49-11ee-96cd-fa163e5538cc.0.0.ceph-mani-vt0a3y-node4.xwjsss.service: Consumed 27.818s CPU time.
Jun 20 15:21:47 ceph-mani-vt0a3y-node4 systemd[1]: ceph-20c36492-0f49-11ee-96cd-fa163e5538cc.0.0.ceph-mani-vt0a3y-node4.xwjsss.service: Scheduled restart job, restart counter is at 1.
Jun 20 15:21:47 ceph-mani-vt0a3y-node4 systemd[1]: Stopped Ceph nfs.nfsganesha.0.0.ceph-mani-vt0a3y-node4.xwjsss for 20c36492-0f49-11ee-96cd-fa163e5538cc.
Jun 20 15:21:47 ceph-mani-vt0a3y-node4 systemd[1]: ceph-20c36492-0f49-11ee-96cd-fa163e5538cc.0.0.ceph-mani-vt0a3y-node4.xwjsss.service: Consumed 27.818s CPU time.
Jun 20 15:21:47 ceph-mani-vt0a3y-node4 systemd[1]: Starting Ceph nfs.nfsganesha.0.0.ceph-mani-vt0a3y-node4.xwjsss for 20c36492-0f49-11ee-96cd-fa163e5538cc...
Jun 20 15:21:47 ceph-mani-vt0a3y-node4 podman[140138]: