Bug 2217540

Summary: [NFS-Ganesha] Ganesha process crashes while writing from client 1 and performing a lookup from client 2
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Manisha Saini <msaini>
Component: NFS-Ganesha
Assignee: Frank Filz <ffilz>
Status: ON_QA
QA Contact: Manisha Saini <msaini>
Severity: urgent
Priority: unspecified
Version: 6.1
CC: cephqe-warriors, ffilz, kkeithle, tserlin, vereddy
Target Milestone: ---
Target Release: 7.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version: ceph-18.2.0-10.el9cp, nfs-ganesha-5.5-1.el9cp
Doc Type: If docs needed, set a value
Type: Bug

Description Manisha Saini 2023-06-26 15:44:50 UTC
Description of problem:

===========

Export 5 subvolumes via CephFS (ganesha1, ganesha2, ganesha3, ganesha4, ganesha5). Delete 1 export (ganesha4).
Mount the export ganesha1 on 2 clients (Client 1 and Client 2) via NFS v4.2.

Copy a tar file to the mount point from Client 1 and, while the copy is in progress, perform a lookup on the mount point from Client 2.



Version-Release number of selected component (if applicable):
==========

[ceph: root@ceph-mani-oo0maz-node1-installer /]# rpm -qa | grep nfs
libnfsidmap-2.5.4-18.el9.x86_64
nfs-utils-2.5.4-18.el9.x86_64
nfs-ganesha-selinux-5.1-1.el9cp.noarch
nfs-ganesha-5.1-1.el9cp.x86_64
nfs-ganesha-ceph-5.1-1.el9cp.x86_64
nfs-ganesha-rados-grace-5.1-1.el9cp.x86_64
nfs-ganesha-rados-urls-5.1-1.el9cp.x86_64
nfs-ganesha-rgw-5.1-1.el9cp.x86_64


How reproducible:
=========
3/3


Steps to Reproduce:
1. Create a Ganesha cluster on 2 RHCS nodes (see the command sketch after the cluster info output below)

# ceph nfs cluster info nfsganesha 
{
    "nfsganesha": {
        "virtual_ip": null,
        "backend": [
            {
                "hostname": "ceph-mani-oo0maz-node5",
                "ip": "10.0.208.192",
                "port": 2049
            },
            {
                "hostname": "ceph-mani-oo0maz-node6",
                "ip": "10.0.210.195",
                "port": 2049
            }
        ]
    }
}
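
For reference, a 2-node cluster like the one above is typically created with something along these lines (a sketch only; the exact command used in this run is not captured here, and the placement spec is taken from the hostnames in the output above):

# ceph nfs cluster create nfsganesha "ceph-mani-oo0maz-node5,ceph-mani-oo0maz-node6"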

2. Create a CephFS filesystem, then create and export 5 subvolumes via Ganesha (see the command sketch after the export listing below)

# ceph fs volume ls
[
    {
        "name": "cephfs"
    }
]


# ceph fs subvolumegroup ls cephfs
[
    {
        "name": "ganesha4"
    },
    {
        "name": "ganesha1"
    },
    {
        "name": "ganesha2"
    },
    {
        "name": "ganesha5"
    },
    {
        "name": "ganesha3"
    }
]

# ceph nfs export ls nfsganesha
[
  "/ganesha1",
  "/ganesha2",
  "/ganesha3",
  "/ganesha4",
  "/ganesha5"
]
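
For reference, each export above is typically created with a sequence along these lines (shown for ganesha1 only, using the RHCS 6 / Ceph Quincy CLI syntax; the subvolume name "subvol1" and the export path are assumptions, since the exact commands used in this run are not captured here):

# ceph fs subvolumegroup create cephfs ganesha1
# ceph fs subvolume create cephfs subvol1 --group_name ganesha1
# ceph fs subvolume getpath cephfs subvol1 --group_name ganesha1
# ceph nfs export create cephfs --cluster-id nfsganesha --pseudo-path /ganesha1 --fsname cephfs --path <path returned by getpath>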


3. Delete the export /ganesha4
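
For example, the export can be removed by its pseudo path via the Ceph NFS module CLI:

# ceph nfs export rm nfsganesha /ganesha4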

4. Mount the export on 2 clients (Client 1 and Client 2) via NFS v4.2
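
A typical mount invocation on each client would look like this (server IP taken from the backend list above, mount point as used below; the exact mount options used in this run are not captured):

# mkdir -p /mnt/ganesha1
# mount -t nfs -o vers=4.2,port=2049 10.0.208.192:/ganesha1 /mnt/ganesha1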

5. Copy a file to the mount point from Client 1 and, at the same time, perform a lookup from Client 2

Client 1:

[root@ceph-mani-oo0maz-node11 ganesha1]# cp /root/linux-6.4.tar.xz /mnt/ganesha1/
[root@ceph-mani-oo0maz-node11 ganesha1]# 

Client 2:

[root@ceph-mani-oo0maz-node10 ganesha1]# ls
f2  linux-6.4.tar.xz

Actual results:
=====
The NFS-Ganesha process crashes and dumps core while the lookup is performed from Client 2

Expected results:
====
NFS-Ganesha process should not crash


Additional info:
=====

[root@ceph-mani-oo0maz-node5 coredump]# lldb -c core.ganesha\\x2enfsd.0.76643ba0b43d472c8c2e29f59a62ae7e.60786.1687792030000000
(lldb) target create --core "core.ganesha\\x2enfsd.0.76643ba0b43d472c8c2e29f59a62ae7e.60786.1687792030000000"
Core file '/var/lib/systemd/coredump/core.ganesha\x2enfsd.0.76643ba0b43d472c8c2e29f59a62ae7e.60786.1687792030000000' (x86_64) was loaded.
(lldb) bt
* thread #1, name = 'ganesha.nfsd', stop reason = signal SIGABRT
  * frame #0: 0x00007fd34677f54c
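
The trace above is unsymbolized. A fuller, symbolized backtrace can typically be obtained by installing matching debug symbols inside the NFS-Ganesha container and re-opening the core (a sketch only; debuginfo repo availability and package names are assumptions based on the RPM list above):

# dnf debuginfo-install -y nfs-ganesha nfs-ganesha-ceph
# lldb /usr/bin/ganesha.nfsd -c core.ganesha\x2enfsd.0.76643ba0b43d472c8c2e29f59a62ae7e.60786.1687792030000000
(lldb) bt all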



------

ganesha.log
-----

P :EVENT :-------------------------------------------------
Jun 26 10:36:35 ceph-mani-oo0maz-node5 ceph-7f3277c8-1419-11ee-96b4-fa163eb1880b-nfs-nfsganesha-0-0-ceph-mani-oo0maz-node5-qwtotm[52149]: 26/06/2023 14:36:35 : epoch 6499a269 : ceph-mani-oo0maz-node5 : ganesha.nfsd-2[main] nfs_start :NFS STARTUP :EVENT :             NFS SERVER INITIALIZED
Jun 26 10:36:35 ceph-mani-oo0maz-node5 ceph-7f3277c8-1419-11ee-96b4-fa163eb1880b-nfs-nfsganesha-0-0-ceph-mani-oo0maz-node5-qwtotm[52149]: 26/06/2023 14:36:35 : epoch 6499a269 : ceph-mani-oo0maz-node5 : ganesha.nfsd-2[main] nfs_start :NFS STARTUP :EVENT :-------------------------------------------------
Jun 26 10:36:45 ceph-mani-oo0maz-node5 ceph-7f3277c8-1419-11ee-96b4-fa163eb1880b-nfs-nfsganesha-0-0-ceph-mani-oo0maz-node5-qwtotm[52149]: 26/06/2023 14:36:45 : epoch 6499a269 : ceph-mani-oo0maz-node5 : ganesha.nfsd-2[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(2) clid count(2)
Jun 26 10:36:45 ceph-mani-oo0maz-node5 ceph-7f3277c8-1419-11ee-96b4-fa163eb1880b-nfs-nfsganesha-0-0-ceph-mani-oo0maz-node5-qwtotm[52149]: 26/06/2023 14:36:45 : epoch 6499a269 : ceph-mani-oo0maz-node5 : ganesha.nfsd-2[reaper] nfs_lift_grace_locked :STATE :EVENT :NFS Server Now NOT IN GRACE
Jun 26 10:58:44 ceph-mani-oo0maz-node5 ceph-7f3277c8-1419-11ee-96b4-fa163eb1880b-nfs-nfsganesha-0-0-ceph-mani-oo0maz-node5-qwtotm[52149]: 26/06/2023 14:58:44 : epoch 6499a269 : ceph-mani-oo0maz-node5 : ganesha.nfsd-2[svc_69] destroy_fsal_fd :RW LOCK :CRIT :Error 16, Destroy mutex 0x7fa78c019f60 (&fsal_fd->work_mutex) at /builddir/build/BUILD/nfs-ganesha-5.1/src/include/fsal_types.h:1029
Jun 26 10:58:44 ceph-mani-oo0maz-node5 systemd-coredump[60575]: Process 52153 (ganesha.nfsd) of user 0 dumped core.
Jun 26 10:58:44 ceph-mani-oo0maz-node5 podman[60580]: 2023-06-26 10:58:44.661759627 -0400 EDT m=+0.042498076 container died 392faf4f13539082c382554bbb70d8764e6ed4144499eb42f85df4fec95c1a00 (image=registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:0eb98763b77938a11cb62e8414119ed35a739f02077f2b6b0489f76d80a63e67, name=ceph-7f3277c8-1419-11ee-96b4-fa163eb1880b-nfs-nfsganesha-0-0-ceph-mani-oo0maz-node5-qwtotm, GIT_REPO=https://github.com/ceph/ceph-container.git, description=Red Hat Ceph Storage 6, ceph=True, io.buildah.version=1.29.0, vcs-ref=dec93361f6f7a22d929d690d9002f0df9a8f6805, RELEASE=main, build-date=2023-06-23T19:14:24, com.redhat.component=rhceph-container, distribution-scope=public, release=179, architecture=x86_64, io.k8s.description=Red Hat Ceph Storage 6, CEPH_POINT_RELEASE=, com.redhat.license_terms=https://www.redhat.com/agreements, maintainer=Guillaume Abrioux <gabrioux>, name=rhceph, vcs-type=git, GIT_COMMIT=0727c855af939c6f3709e73be703026388413744, summary=Provides the latest Red Hat Ceph Storage 6 on RHEL 9 in a fully featured and supported base image., url=https://access.redhat.com/containers/#/registry.access.redhat.com/rhceph/images/6-179, io.openshift.expose-services=, GIT_BRANCH=main, vendor=Red Hat, Inc., GIT_CLEAN=True, io.k8s.display-name=Red Hat Ceph Storage 6 on RHEL 9, io.openshift.tags=rhceph ceph, version=6)
Jun 26 10:58:44 ceph-mani-oo0maz-node5 podman[60580]: 2023-06-26 10:58:44.679685425 -0400 EDT m=+0.060423839 container remove 392faf4f13539082c382554bbb70d8764e6ed4144499eb42f85df4fec95c1a00 (image=registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:0eb98763b77938a11cb62e8414119ed35a739f02077f2b6b0489f76d80a63e67, name=ceph-7f3277c8-1419-11ee-96b4-fa163eb1880b-nfs-nfsganesha-0-0-ceph-mani-oo0maz-node5-qwtotm, GIT_CLEAN=True, io.openshift.tags=rhceph ceph, ceph=True, summary=Provides the latest Red Hat Ceph Storage 6 on RHEL 9 in a fully featured and supported base image., vcs-ref=dec93361f6f7a22d929d690d9002f0df9a8f6805, GIT_BRANCH=main, io.openshift.expose-services=, url=https://access.redhat.com/containers/#/registry.access.redhat.com/rhceph/images/6-179, architecture=x86_64, com.redhat.component=rhceph-container, build-date=2023-06-23T19:14:24, vendor=Red Hat, Inc., distribution-scope=public, release=179, io.k8s.display-name=Red Hat Ceph Storage 6 on RHEL 9, io.k8s.description=Red Hat Ceph Storage 6, com.redhat.license_terms=https://www.redhat.com/agreements, io.buildah.version=1.29.0, RELEASE=main, description=Red Hat Ceph Storage 6, maintainer=Guillaume Abrioux <gabrioux>, name=rhceph, GIT_COMMIT=0727c855af939c6f3709e73be703026388413744, CEPH_POINT_RELEASE=, vcs-type=git, GIT_REPO=https://github.com/ceph/ceph-container.git, version=6)
Jun 26 10:58:44 ceph-mani-oo0maz-node5 systemd[1]: ceph-7f3277c8-1419-11ee-96b4-fa163eb1880b.0.0.ceph-mani-oo0maz-node5.qwtotm.service: Main process exited, code=exited, status=134/n/a
Jun 26 10:58:45 ceph-mani-oo0maz-node5 systemd[1]: ceph-7f3277c8-1419-11ee-96b4-fa163eb1880b.0.0.ceph-mani-oo0maz-node5.qwtotm.service: Failed with result 'exit-code'.
Jun 26 10:58:45 ceph-mani-oo0maz-node5 systemd[1]: ceph-7f3277c8-1419-11ee-96b4-fa163eb1880b.0.0.ceph-mani-oo0maz-node5.qwtotm.service: Consumed 3.488s CPU time.
Jun 26 10:58:55 ceph-mani-oo0maz-node5 systemd[1]: ceph-7f3277c8-1419-11ee-96b4-fa163eb1880b.0.0.ceph-mani-oo0maz-node5.qwtotm.service: Scheduled restart job, restart counter is at 3.
Jun 26 10:58:55 ceph-mani-oo0maz-node5 systemd[1]: Stopped Ceph nfs.nfsganesha.0.0.ceph-mani-oo0maz-node5.qwtotm for 7f3277c8-1419-11ee-96b4-fa163eb1880b.
Jun 26 10:58:55 ceph-mani-oo0maz-node5 systemd[1]: ceph-7f3277c8-1419-11ee-96b4-fa163eb1880b.0.0.ceph-mani-oo0maz-node5.qwtotm.service: Consumed 3.488s CPU time.
Jun 26 10:58:55 ceph-mani-oo0maz-node5 systemd[1]: Starting Ceph nfs.nfsganesha.0.0.ceph-mani-oo0maz-node5.qwtotm for 7f3277c8-1419-11ee-96b4-fa163eb1880b...
Jun 26 10:58:55 ceph-mani-oo0maz-node5 podman[60773]:
Jun 26 10:58:55 ceph-mani-oo0maz-node5 podman[60773]: 2023-06-26 10:58:55.334297003 -0400 EDT m=+0.046869916 container create 901730a39b6c781fe3071ac5148b315d7d5d62ed695c9341990dc3cac0649f61 (image=registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:0eb98763b77938a11cb62e8414119ed35a739f02077f2b6b0489f76d80a63e67, name=ceph-7f3277c8-1419-11ee-96b4-fa163eb1880b-nfs-nfsganesha-0-0-ceph-mani-oo0maz-node5-qwtotm, summary=Provides the latest Red Hat Ceph Storage 6 on RHEL 9 in a fully featured and supported base image., com.redhat.license_terms=https://www.redhat.com/agreements, ceph=True, vendor=Red Hat, Inc., io.buildah.version=1.29.0, architecture=x86_64, io.k8s.description=Red Hat Ceph Storage 6, maintainer=Guillaume Abrioux <gabrioux>, io.k8s.display-name=Red Hat Ceph Storage 6 on RHEL 9, version=6, release=179, name=rhceph, GIT_CLEAN=True, io.openshift.tags=rhceph ceph, RELEASE=main, io.openshift.expose-services=, CEPH_POINT_RELEASE=, description=Red Hat Ceph Storage 6, vcs-type=git, distribution-scope=public, GIT_COMMIT=0727c855af939c6f3709e73be703026388413744, build-date=2023-06-23T19:14:24, GIT_BRANCH=main, com.redhat.component=rhceph-container, url=https://access.redhat.com/containers/#/registry.access.redhat.com/rhceph/images/6-179, vcs-ref=dec93361f6f7a22d929d690d9002f0df9a8f6805, GIT_REPO=https://github.com/ceph/ceph-container.git)

Comment 3 Manisha Saini 2023-09-05 08:42:43 UTC
Not observing the issue with the RHCS 7.0 build. Do we have an RCA for the same? Can we move this to ON_QA?

Comment 4 Frank Filz 2023-09-05 16:01:44 UTC
This is almost certainly the same root cause as Bug 2216442.