Bug 2216442 - [NFS-Ganesha] File locking test from multiple nfs client on same file is failing with ganesha service crash
Summary: [NFS-Ganesha] File locking test from multiple nfs client on same file is failing with ganesha service crash
Keywords:
Status: ASSIGNED
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: NFS-Ganesha
Version: 6.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 7.0
Assignee: Frank Filz
QA Contact: Manisha Saini
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-06-21 11:50 UTC by Manisha Saini
Modified: 2023-07-19 23:46 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHCEPH-6899 0 None None None 2023-06-21 11:52:49 UTC

Description Manisha Saini 2023-06-21 11:50:34 UTC
Description of problem:
===========

Taking a lock on the same file from two different NFS clients fails on RHCS 6.1. A core is generated, and the ganesha process crashes and restarts.



Version-Release number of selected component (if applicable):
========

[ceph: root@ceph-mani-vt0a3y-node1-installer /]# rpm -qa | grep ganesha
nfs-ganesha-selinux-5.1-1.el9cp.noarch
nfs-ganesha-5.1-1.el9cp.x86_64
nfs-ganesha-ceph-5.1-1.el9cp.x86_64
nfs-ganesha-rados-grace-5.1-1.el9cp.x86_64
nfs-ganesha-rados-urls-5.1-1.el9cp.x86_64
nfs-ganesha-rgw-5.1-1.el9cp.x86_64

How reproducible:
============
3/3


Steps to Reproduce:
===============
1. Deploy RHCS 6.1 cluster

2. Deploy NFS-Ganesha cluster on 2 of the nodes

3. Create a subvolume and export it via ganesha

4. Mount the volume on 2 clients
# sudo mount -t nfs 10.0.210.192:/ceph /mnt/ganesha/

5. Create a file

6. Take a lock on file 1 from Client 1 --> Success
   Take a lock on file 1 from Client 2 --> Failed, which is expected
   Release the lock on file 1 from Client 1 --> Success
   Try again to take the lock on file 1 from Client 2 --> It failed, but at this point client 2 should be able to acquire the lock

Client 1
---
# ./a.out /mnt/ganesha/5G 
opening /mnt/ganesha/5G
opened; hit Enter to lock... 
locking
locked; hit Enter to write... 
Write succeeeded 
locked; hit Enter to unlock... 
unlocking


Client 2
----
$ ./a.out /mnt/ganesha/4G 
opening /mnt/ganesha/4G
opened; hit Enter to lock... 
locking
fcntl failed (Bad file descriptor)
locked; hit Enter to write... 
Write failed 
locked; hit Enter to unlock... 
unlocking
fcntl UNLOCK failed (Bad file descriptor)

-----
Even after the lock is released from client 1, client 2 is unable to take the lock
-----
$ ./a.out /mnt/ganesha/4G 
opening /mnt/ganesha/4G
opened; hit Enter to lock... 
locking
fcntl failed (Bad file descriptor)
locked; hit Enter to write... 
Write failed 
locked; hit Enter to unlock... 
unlocking
fcntl UNLOCK failed (Bad file descriptor)
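
(For reference: the source of the a.out harness used above is not attached to this bug. The following is a minimal sketch, assuming it takes a plain fcntl() whole-file write lock via F_SETLK and waits for Enter between steps, which would produce the prompts and error strings shown above. The file name and helper function are illustrative.)

/* lock_test.c - hypothetical reconstruction of the a.out harness above;
 * the real source was not attached. Build: gcc -o a.out lock_test.c */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void pause_for(const char *what)
{
    /* prompt and wait for the operator to press Enter */
    printf("hit Enter to %s... \n", what);
    getchar();
}

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }

    printf("opening %s\n", argv[1]);
    int fd = open(argv[1], O_RDWR | O_CREAT, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    printf("opened; ");
    pause_for("lock");

    /* whole-file exclusive (write) lock */
    struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET,
                        .l_start = 0, .l_len = 0 };
    printf("locking\n");
    if (fcntl(fd, F_SETLK, &fl) == -1)
        printf("fcntl failed (%s)\n", strerror(errno));

    printf("locked; ");
    pause_for("write");
    if (write(fd, "x", 1) == 1)
        printf("Write succeeded\n");
    else
        printf("Write failed\n");

    printf("locked; ");
    pause_for("unlock");
    printf("unlocking\n");
    fl.l_type = F_UNLCK;
    if (fcntl(fd, F_SETLK, &fl) == -1)
        printf("fcntl UNLOCK failed (%s)\n", strerror(errno));

    close(fd);
    return 0;
}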





Actual results:
======
[ceph: root@ceph-mani-vt0a3y-node1-installer /]# ceph nfs cluster info nfsganesha
{
    "nfsganesha": {
        "virtual_ip": null,
        "backend": [
            {
                "hostname": "ceph-mani-vt0a3y-node4",
                "ip": "10.0.210.192",
                "port": 2049
            },
            {
                "hostname": "ceph-mani-vt0a3y-node6",
                "ip": "10.0.209.80",
                "port": 2049
            }
        ]
    }
}

---------
]# ceph orch ps --daemon_type=nfs
NAME                                              HOST                    PORTS   STATUS         REFRESHED  AGE  MEM USE  MEM LIM  VERSION  IMAGE ID      CONTAINER ID  
nfs.nfsganesha.0.0.ceph-mani-vt0a3y-node4.xwjsss  ceph-mani-vt0a3y-node4  *:2049  running (97s)    94s ago  97s    17.0M        -  5.1      b765ab0f02a7  e47becaf8f35  
nfs.nfsganesha.1.0.ceph-mani-vt0a3y-node6.fhhzii  ceph-mani-vt0a3y-node6  *:2049  running (96s)    94s ago  95s    19.0M        -  5.1      b765ab0f02a7  4704f64c9c5e 
-------

[ceph: root@ceph-mani-vt0a3y-node1-installer /]# ceph fs subvolume ls cephfs --group_name nfs
[
    {
        "name": "ganesha1"
    },
    {
        "name": "ganesha2"
    }
]

-----
# ceph fs subvolume getpath cephfs ganesha2 --group_name nfs
/volumes/nfs/ganesha2/fd964897-7046-4062-af4b-f3d0eb191de0


# ceph nfs export create cephfs nfsganesha /ceph cephfs --path=/volumes/nfs/ganesha2/fd964897-7046-4062-af4b-f3d0eb191de0
{
    "bind": "/ceph",
    "fs": "cephfs",
    "path": "/volumes/nfs/ganesha2/fd964897-7046-4062-af4b-f3d0eb191de0",
    "cluster": "nfsganesha",
    "mode": "RW"
}



# ceph nfs export get nfsganesha /ceph
{
  "export_id": 1,
  "path": "/volumes/nfs/ganesha2/fd964897-7046-4062-af4b-f3d0eb191de0",
  "cluster_id": "nfsganesha",
  "pseudo": "/ceph",
  "access_type": "RW",
  "squash": "none",
  "security_label": true,
  "protocols": [
    4
  ],
  "transports": [
    "TCP"
  ],
  "fsal": {
    "name": "CEPH",
    "user_id": "nfs.nfsganesha.1",
    "fs_name": "cephfs"
  },
  "clients": []
}



Expected results:
==============
The Ganesha service should not crash, and it should allow client 2 to take the lock once client 1 has released it.


Additional info:
==============
ganesha.log
---------

P :EVENT :-------------------------------------------------
Jun 20 06:30:41 ceph-mani-vt0a3y-node4 ceph-20c36492-0f49-11ee-96cd-fa163e5538cc-nfs-nfsganesha-0-0-ceph-mani-vt0a3y-node4-xwjsss[24815]: 20/06/2023 10:30:41 : epoch 64917fce : ceph-mani-vt0a3y-node4 : ganesha.nfsd-2[main] nfs_start :NFS STARTUP :EVENT :             NFS SERVER INITIALIZED
Jun 20 06:30:41 ceph-mani-vt0a3y-node4 ceph-20c36492-0f49-11ee-96cd-fa163e5538cc-nfs-nfsganesha-0-0-ceph-mani-vt0a3y-node4-xwjsss[24815]: 20/06/2023 10:30:41 : epoch 64917fce : ceph-mani-vt0a3y-node4 : ganesha.nfsd-2[main] nfs_start :NFS STARTUP :EVENT :-------------------------------------------------
Jun 20 06:30:41 ceph-mani-vt0a3y-node4 ceph-20c36492-0f49-11ee-96cd-fa163e5538cc-nfs-nfsganesha-0-0-ceph-mani-vt0a3y-node4-xwjsss[24815]: 20/06/2023 10:30:41 : epoch 64917fce : ceph-mani-vt0a3y-node4 : ganesha.nfsd-2[reaper] nfs_start_grace :STATE :EVENT :NFS Server Now IN GRACE, duration 90
Jun 20 06:30:41 ceph-mani-vt0a3y-node4 ceph-20c36492-0f49-11ee-96cd-fa163e5538cc-nfs-nfsganesha-0-0-ceph-mani-vt0a3y-node4-xwjsss[24815]: 20/06/2023 10:30:41 : epoch 64917fce : ceph-mani-vt0a3y-node4 : ganesha.nfsd-2[reaper] nfs_start_grace :STATE :EVENT :grace reload client info completed from backend
Jun 20 06:30:41 ceph-mani-vt0a3y-node4 ceph-20c36492-0f49-11ee-96cd-fa163e5538cc-nfs-nfsganesha-0-0-ceph-mani-vt0a3y-node4-xwjsss[24815]: 20/06/2023 10:30:41 : epoch 64917fce : ceph-mani-vt0a3y-node4 : ganesha.nfsd-2[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(0) clid count(0)
Jun 20 06:30:41 ceph-mani-vt0a3y-node4 ceph-20c36492-0f49-11ee-96cd-fa163e5538cc-nfs-nfsganesha-0-0-ceph-mani-vt0a3y-node4-xwjsss[24815]: 20/06/2023 10:30:41 : epoch 64917fce : ceph-mani-vt0a3y-node4 : ganesha.nfsd-2[reaper] nfs_lift_grace_locked :STATE :EVENT :NFS Server Now NOT IN GRACE
Jun 20 07:02:56 ceph-mani-vt0a3y-node4 ceph-20c36492-0f49-11ee-96cd-fa163e5538cc-nfs-nfsganesha-0-0-ceph-mani-vt0a3y-node4-xwjsss[24815]: 20/06/2023 11:02:56 : epoch 64917fce : ceph-mani-vt0a3y-node4 : ganesha.nfsd-2[sigmgr] sigmgr_thread :MAIN :EVENT :SIGHUP_HANDLER: Received SIGHUP.... initiating export list reload
Jun 20 07:02:56 ceph-mani-vt0a3y-node4 ceph-20c36492-0f49-11ee-96cd-fa163e5538cc-nfs-nfsganesha-0-0-ceph-mani-vt0a3y-node4-xwjsss[24815]: 20/06/2023 11:02:56 : epoch 64917fce : ceph-mani-vt0a3y-node4 : ganesha.nfsd-2[sigmgr] reclaim_reset :FSAL :EVENT :start_reclaim failed: -2
Jun 20 07:02:57 ceph-mani-vt0a3y-node4 ceph-20c36492-0f49-11ee-96cd-fa163e5538cc-nfs-nfsganesha-0-0-ceph-mani-vt0a3y-node4-xwjsss[24815]: 20/06/2023 11:02:57 : epoch 64917fce : ceph-mani-vt0a3y-node4 : ganesha.nfsd-2[sigmgr] reread_exports :CONFIG :EVENT :Reread exports complete
Jun 20 14:41:06 ceph-mani-vt0a3y-node4 ceph-20c36492-0f49-11ee-96cd-fa163e5538cc-nfs-nfsganesha-0-0-ceph-mani-vt0a3y-node4-xwjsss[24815]: 20/06/2023 18:41:06 : epoch 64917fce : ceph-mani-vt0a3y-node4 : ganesha.nfsd-2[svc_27] nfs4_op_create_session :CLIENT ID :WARN :Invalid channel attributes for st_create_session.py:testTooSmallMaxReq
Jun 20 14:41:06 ceph-mani-vt0a3y-node4 ceph-20c36492-0f49-11ee-96cd-fa163e5538cc-nfs-nfsganesha-0-0-ceph-mani-vt0a3y-node4-xwjsss[24815]: 20/06/2023 18:41:06 : epoch 64917fce : ceph-mani-vt0a3y-node4 : ganesha.nfsd-2[svc_27] nfs4_op_create_session :CLIENT ID :WARN :Invalid channel attributes for st_create_session.py:testTooSmallMaxRS
Jun 20 14:41:07 ceph-mani-vt0a3y-node4 ceph-20c36492-0f49-11ee-96cd-fa163e5538cc-nfs-nfsganesha-0-0-ceph-mani-vt0a3y-node4-xwjsss[24815]: 20/06/2023 18:41:07 : epoch 64917fce : ceph-mani-vt0a3y-node4 : ganesha.nfsd-2[svc_24] rpc :TIRPC :EVENT :xdr_array_decode:783 ERROR size 2 > max 1
Jun 20 14:41:23 ceph-mani-vt0a3y-node4 ceph-20c36492-0f49-11ee-96cd-fa163e5538cc-nfs-nfsganesha-0-0-ceph-mani-vt0a3y-node4-xwjsss[24815]: 20/06/2023 18:41:23 : epoch 64917fce : ceph-mani-vt0a3y-node4 : ganesha.nfsd-2[svc_24] nfs4_Compound :NFS4 :MAJ :An empty COMPOUND (no operation in it) was received
Jun 20 14:41:23 ceph-mani-vt0a3y-node4 ceph-20c36492-0f49-11ee-96cd-fa163e5538cc-nfs-nfsganesha-0-0-ceph-mani-vt0a3y-node4-xwjsss[24815]: 20/06/2023 18:41:23 : epoch 64917fce : ceph-mani-vt0a3y-node4 : ganesha.nfsd-2[svc_34] rpc :TIRPC :EVENT :xdr_opaque_decode:441 ERROR opaque
Jun 20 14:41:23 ceph-mani-vt0a3y-node4 ceph-20c36492-0f49-11ee-96cd-fa163e5538cc-nfs-nfsganesha-0-0-ceph-mani-vt0a3y-node4-xwjsss[24815]: 20/06/2023 18:41:23 : epoch 64917fce : ceph-mani-vt0a3y-node4 : ganesha.nfsd-2[svc_24] nfs4_Compound :NFS4 :CRIT :Bad Minor Version 50
Jun 20 14:41:23 ceph-mani-vt0a3y-node4 ceph-20c36492-0f49-11ee-96cd-fa163e5538cc-nfs-nfsganesha-0-0-ceph-mani-vt0a3y-node4-xwjsss[24815]: 20/06/2023 18:41:23 : epoch 64917fce : ceph-mani-vt0a3y-node4 : ganesha.nfsd-2[svc_24] nfs4_Compound :NFS4 :CRIT :Bad Minor Version 50
Jun 20 14:41:24 ceph-mani-vt0a3y-node4 ceph-20c36492-0f49-11ee-96cd-fa163e5538cc-nfs-nfsganesha-0-0-ceph-mani-vt0a3y-node4-xwjsss[24815]: 20/06/2023 18:41:24 : epoch 64917fce : ceph-mani-vt0a3y-node4 : ganesha.nfsd-2[svc_24] rpc :TIRPC :EVENT :xdr_array_decode:783 ERROR size 2 > max 1
Jun 20 15:21:36 ceph-mani-vt0a3y-node4 ceph-20c36492-0f49-11ee-96cd-fa163e5538cc-nfs-nfsganesha-0-0-ceph-mani-vt0a3y-node4-xwjsss[24815]: 20/06/2023 19:21:36 : epoch 64917fce : ceph-mani-vt0a3y-node4 : ganesha.nfsd-2[svc_170] destroy_fsal_fd :RW LOCK :CRIT :Error 16, Destroy mutex 0x7efec8084fd0 (&fsal_fd->work_mutex) at /builddir/build/BUILD/nfs-ganesha-5.1/src/include/fsal_types.h:1029
Jun 20 15:21:36 ceph-mani-vt0a3y-node4 systemd-coredump[140045]: Process 24819 (ganesha.nfsd) of user 0 dumped core.
Jun 20 15:21:36 ceph-mani-vt0a3y-node4 podman[140051]: 2023-06-20 15:21:36.768359053 -0400 EDT m=+0.043847334 container died e47becaf8f35726d808d19c6d0eee431e81c034301c4f3ea8e454e50a9dd681d (image=registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:406b0370e6931584eefa37039eedd2255168b789f31e2d21f13bec9936b9d3ae, name=ceph-20c36492-0f49-11ee-96cd-fa163e5538cc-nfs-nfsganesha-0-0-ceph-mani-vt0a3y-node4-xwjsss, io.openshift.tags=rhceph ceph, io.buildah.version=1.27.3, io.openshift.expose-services=, CEPH_POINT_RELEASE=, com.redhat.license_terms=https://www.redhat.com/agreements, distribution-scope=public, name=rhceph, com.redhat.component=rhceph-container, maintainer=Guillaume Abrioux <gabrioux>, GIT_BRANCH=main, release=176, vcs-ref=dec93361f6f7a22d929d690d9002f0df9a8f6805, GIT_COMMIT=0727c855af939c6f3709e73be703026388413744, RELEASE=main, io.k8s.description=Red Hat Ceph Storage 6, vcs-type=git, version=6, build-date=2023-05-30T00:53:52, url=https://access.redhat.com/containers/#/registry.access.redhat.com/rhceph/images/6-176, GIT_CLEAN=True, description=Red Hat Ceph Storage 6, ceph=True, summary=Provides the latest Red Hat Ceph Storage 6 on RHEL 9 in a fully featured and supported base image., architecture=x86_64, io.k8s.display-name=Red Hat Ceph Storage 6 on RHEL 9, GIT_REPO=https://github.com/ceph/ceph-container.git, vendor=Red Hat, Inc.)
Jun 20 15:21:36 ceph-mani-vt0a3y-node4 podman[140051]: 2023-06-20 15:21:36.800127174 -0400 EDT m=+0.075615456 container remove e47becaf8f35726d808d19c6d0eee431e81c034301c4f3ea8e454e50a9dd681d (image=registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:406b0370e6931584eefa37039eedd2255168b789f31e2d21f13bec9936b9d3ae, name=ceph-20c36492-0f49-11ee-96cd-fa163e5538cc-nfs-nfsganesha-0-0-ceph-mani-vt0a3y-node4-xwjsss, io.openshift.tags=rhceph ceph, com.redhat.license_terms=https://www.redhat.com/agreements, maintainer=Guillaume Abrioux <gabrioux>, version=6, url=https://access.redhat.com/containers/#/registry.access.redhat.com/rhceph/images/6-176, architecture=x86_64, release=176, vendor=Red Hat, Inc., GIT_REPO=https://github.com/ceph/ceph-container.git, io.k8s.display-name=Red Hat Ceph Storage 6 on RHEL 9, distribution-scope=public, GIT_COMMIT=0727c855af939c6f3709e73be703026388413744, vcs-ref=dec93361f6f7a22d929d690d9002f0df9a8f6805, ceph=True, CEPH_POINT_RELEASE=, GIT_BRANCH=main, io.openshift.expose-services=, GIT_CLEAN=True, description=Red Hat Ceph Storage 6, io.k8s.description=Red Hat Ceph Storage 6, io.buildah.version=1.27.3, com.redhat.component=rhceph-container, name=rhceph, summary=Provides the latest Red Hat Ceph Storage 6 on RHEL 9 in a fully featured and supported base image., RELEASE=main, vcs-type=git, build-date=2023-05-30T00:53:52)
Jun 20 15:21:36 ceph-mani-vt0a3y-node4 systemd[1]: ceph-20c36492-0f49-11ee-96cd-fa163e5538cc.0.0.ceph-mani-vt0a3y-node4.xwjsss.service: Main process exited, code=exited, status=134/n/a
Jun 20 15:21:37 ceph-mani-vt0a3y-node4 systemd[1]: ceph-20c36492-0f49-11ee-96cd-fa163e5538cc.0.0.ceph-mani-vt0a3y-node4.xwjsss.service: Failed with result 'exit-code'.
Jun 20 15:21:37 ceph-mani-vt0a3y-node4 systemd[1]: ceph-20c36492-0f49-11ee-96cd-fa163e5538cc.0.0.ceph-mani-vt0a3y-node4.xwjsss.service: Consumed 27.818s CPU time.
Jun 20 15:21:47 ceph-mani-vt0a3y-node4 systemd[1]: ceph-20c36492-0f49-11ee-96cd-fa163e5538cc.0.0.ceph-mani-vt0a3y-node4.xwjsss.service: Scheduled restart job, restart counter is at 1.
Jun 20 15:21:47 ceph-mani-vt0a3y-node4 systemd[1]: Stopped Ceph nfs.nfsganesha.0.0.ceph-mani-vt0a3y-node4.xwjsss for 20c36492-0f49-11ee-96cd-fa163e5538cc.
Jun 20 15:21:47 ceph-mani-vt0a3y-node4 systemd[1]: ceph-20c36492-0f49-11ee-96cd-fa163e5538cc.0.0.ceph-mani-vt0a3y-node4.xwjsss.service: Consumed 27.818s CPU time.
Jun 20 15:21:47 ceph-mani-vt0a3y-node4 systemd[1]: Starting Ceph nfs.nfsganesha.0.0.ceph-mani-vt0a3y-node4.xwjsss for 20c36492-0f49-11ee-96cd-fa163e5538cc...
Jun 20 15:21:47 ceph-mani-vt0a3y-node4 podman[140138]:

Comment 1 Frank Filz 2023-06-21 14:06:29 UTC
Can we get a tcpdump trace of the two clients?

Also, do you have a stack backtrace from the coredump?
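
(For reference, a capture and backtrace could be gathered along these lines, assuming the default NFS port 2049 and that systemd-coredump caught the core on the ganesha host, as the log above suggests; file names are illustrative.)

# On each client, while reproducing the lock sequence:
tcpdump -i any -s 0 port 2049 -w client-nfs-lock.pcap

# On the ganesha node, after the crash:
coredumpctl list ganesha.nfsd
coredumpctl gdb ganesha.nfsd    # then in gdb: thread apply all bt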

Comment 6 Manisha Saini 2023-07-04 18:48:15 UTC
Tested this on

# rpm -qa | grep ganesha
nfs-ganesha-selinux-5.3-1.el9cp.noarch
nfs-ganesha-5.3-1.el9cp.x86_64
nfs-ganesha-ceph-5.3-1.el9cp.x86_64
nfs-ganesha-rados-grace-5.3-1.el9cp.x86_64
nfs-ganesha-rados-urls-5.3-1.el9cp.x86_64
nfs-ganesha-rgw-5.3-1.el9cp.x86_64

Not hitting the crash anymore. Locking works as expected.

Client 1
======
# ./a.out /mnt/ganesha/file1 
opening /mnt/ganesha/file1
opened; hit Enter to lock... 
locking
locked; hit Enter to write... 
Write succeeeded 
locked; hit Enter to unlock... 
unlocking


# ./a.out /mnt/ganesha/file1 
opening /mnt/ganesha/file1
opened; hit Enter to lock... 
locking
locked; hit Enter to write... 
Write succeeeded 
locked; hit Enter to unlock... 
unlocking


Client 2:
=======

#  ./a.out /mnt/ganesha/file1
opening /mnt/ganesha/file1
opened; hit Enter to lock... 
locking
locked; hit Enter to write... 
Write succeeeded 
locked; hit Enter to unlock... 
unlocking


#  ./a.out /mnt/ganesha/file1
opening /mnt/ganesha/file1
opened; hit Enter to lock... 
locking
locked; hit Enter to write... 
Write succeeeded 
locked; hit Enter to unlock... 
unlocking

Comment 7 Frank Filz 2023-07-18 18:47:37 UTC
Can we close this one then?

Comment 8 Manisha Saini 2023-07-19 22:58:12 UTC
Hi Frank,

Since the issue is fixed in nfs-ganesha 5.3, we can move this BZ to ON_QA for verification once the RHCS 7.0 official builds with nfs-ganesha 5.3 are available.
QE will validate this issue with the RHCS 7.0 builds.

Comment 9 Frank Filz 2023-07-19 23:46:00 UTC
Sounds good.

