Bug 2217540 - [NFS-Ganesha] Ganesha process crashes while writing from client 1 and performing a lookup from client 2
Summary: [NFS-Ganesha] Ganesha process crashes while writing from client 1 and performing a lookup from client 2
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: NFS-Ganesha
Version: 6.1
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 7.0
Assignee: Frank Filz
QA Contact: Manisha Saini
Docs Contact: Rivka Pollack
URL:
Whiteboard:
Depends On:
Blocks: 2237662
 
Reported: 2023-06-26 15:44 UTC by Manisha Saini
Modified: 2023-12-13 15:20 UTC
CC List: 7 users

Fixed In Version: ceph-18.2.0-10.el9cp, nfs-ganesha-5.5-1.el9cp
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-12-13 15:20:28 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHCEPH-6923 0 None None None 2023-06-26 15:45:22 UTC
Red Hat Product Errata RHBA-2023:7780 0 None None None 2023-12-13 15:20:32 UTC

Description Manisha Saini 2023-06-26 15:44:50 UTC
Description of problem:

===========

Export 5 subvolumes via CephFS (ganesha1, ganesha2, ganesha3, ganesha4, ganesha5). Delete 1 export (ganesha4).
Mount the export ganesha1 on 2 clients (client 1 and client 2) via NFS v4.2.

Copy a tar file onto the mount point from client 1 and, while the copy is in progress, perform a lookup on the mount point from client 2.



Version-Release number of selected component (if applicable):
==========

[ceph: root@ceph-mani-oo0maz-node1-installer /]# rpm -qa | grep nfs
libnfsidmap-2.5.4-18.el9.x86_64
nfs-utils-2.5.4-18.el9.x86_64
nfs-ganesha-selinux-5.1-1.el9cp.noarch
nfs-ganesha-5.1-1.el9cp.x86_64
nfs-ganesha-ceph-5.1-1.el9cp.x86_64
nfs-ganesha-rados-grace-5.1-1.el9cp.x86_64
nfs-ganesha-rados-urls-5.1-1.el9cp.x86_64
nfs-ganesha-rgw-5.1-1.el9cp.x86_64


How reproducible:
=========
3/3


Steps to Reproduce:
1. Create a Ganesha cluster on 2 RHCS nodes (an illustrative create command follows the cluster info output below)

# ceph nfs cluster info nfsganesha 
{
    "nfsganesha": {
        "virtual_ip": null,
        "backend": [
            {
                "hostname": "ceph-mani-oo0maz-node5",
                "ip": "10.0.208.192",
                "port": 2049
            },
            {
                "hostname": "ceph-mani-oo0maz-node6",
                "ip": "10.0.210.195",
                "port": 2049
            }
        ]
    }
}
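
For reference, a cluster of this shape is normally created with the ceph nfs cluster create command. The following is an illustrative sketch only (the exact command used for this setup was not captured); the placement names are the two backend hosts reported above.

# ceph nfs cluster create nfsganesha "ceph-mani-oo0maz-node5,ceph-mani-oo0maz-node6"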

2. Create a CephFS filesystem, then create and export 5 subvolumes via Ganesha (illustrative commands follow the export listing below)

# ceph fs volume ls
[
    {
        "name": "cephfs"
    }
]


# ceph fs subvolumegroup ls cephfs
[
    {
        "name": "ganesha4"
    },
    {
        "name": "ganesha1"
    },
    {
        "name": "ganesha2"
    },
    {
        "name": "ganesha5"
    },
    {
        "name": "ganesha3"
    }
]

# ceph nfs export ls nfsganesha
[
  "/ganesha1",
  "/ganesha2",
  "/ganesha3",
  "/ganesha4",
  "/ganesha5"
]
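
For completeness, subvolume groups and exports like the ones listed above are typically created along the following lines. This is a hedged sketch: the export-create syntax differs between releases, and the exact flags used in this run (for example any --path pointing at a subvolume path) were not recorded.

# ceph fs volume create cephfs
# ceph fs subvolumegroup create cephfs ganesha1
# ceph nfs export create cephfs --cluster-id nfsganesha --pseudo-path /ganesha1 --fsname cephfs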


3. Delete the export /ganesha4
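
The removal itself is a single command; this is a sketch of it (ceph nfs export rm and the older ceph nfs export delete are equivalent):

# ceph nfs export rm nfsganesha /ganesha4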

4. Mount the export on 2 clients (client 1 and client 2) via NFS v4.2
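
The client-side mounts would look roughly as follows (illustrative; 10.0.208.192 is one of the backend IPs reported by ceph nfs cluster info above, and /mnt/ganesha1 matches the paths in the client output below):

# mkdir -p /mnt/ganesha1
# mount -t nfs -o vers=4.2 10.0.208.192:/ganesha1 /mnt/ganesha1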

5. Copy the file to the mount point on client 1 and, at the same time, perform a lookup from client 2 (see the note after the client output below)

Client 1:

[root@ceph-mani-oo0maz-node11 ganesha1]# cp /root/linux-6.4.tar.xz /mnt/ganesha1/
[root@ceph-mani-oo0maz-node11 ganesha1]# 

Client 2:

[root@ceph-mani-oo0maz-node10 ganesha1]# ls
f2  linux-6.4.tar.xz
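
To keep hitting the window while the copy on client 1 is still in flight, the lookup on client 2 can simply be repeated in a loop. This is a hedged sketch, not the exact commands used during the reproduction:

Client 2 (run while the cp on client 1 is in progress):

# while true; do ls -l /mnt/ganesha1/ > /dev/null; sleep 1; done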

Actual results:
=====
The NFS-Ganesha process crashed and dumped core while the lookup was performed from client 2

Expected results:
====
The NFS-Ganesha process should not crash


Additional info:
=====

[root@ceph-mani-oo0maz-node5 coredump]# lldb -c core.ganesha\\x2enfsd.0.76643ba0b43d472c8c2e29f59a62ae7e.60786.1687792030000000
(lldb) target create --core "core.ganesha\\x2enfsd.0.76643ba0b43d472c8c2e29f59a62ae7e.60786.1687792030000000"
Core file '/var/lib/systemd/coredump/core.ganesha\x2enfsd.0.76643ba0b43d472c8c2e29f59a62ae7e.60786.1687792030000000' (x86_64) was loaded.
(lldb) bt
* thread #1, name = 'ganesha.nfsd', stop reason = signal SIGABRT
  * frame #0: 0x00007fd34677f54c
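
Only the faulting frame resolved here; with the matching nfs-ganesha debuginfo packages installed, a fuller picture can usually be pulled from the same core with standard lldb commands, for example:

(lldb) thread list
(lldb) thread backtrace all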



------

ganesha.log
-----

Jun 26 10:36:35 ceph-mani-oo0maz-node5 ceph-7f3277c8-1419-11ee-96b4-fa163eb1880b-nfs-nfsganesha-0-0-ceph-mani-oo0maz-node5-qwtotm[52149]: 26/06/2023 14:36:35 : epoch 6499a269 : ceph-mani-oo0maz-node5 : ganesha.nfsd-2[main] nfs_start :NFS STARTUP :EVENT :             NFS SERVER INITIALIZED
Jun 26 10:36:35 ceph-mani-oo0maz-node5 ceph-7f3277c8-1419-11ee-96b4-fa163eb1880b-nfs-nfsganesha-0-0-ceph-mani-oo0maz-node5-qwtotm[52149]: 26/06/2023 14:36:35 : epoch 6499a269 : ceph-mani-oo0maz-node5 : ganesha.nfsd-2[main] nfs_start :NFS STARTUP :EVENT :-------------------------------------------------
Jun 26 10:36:45 ceph-mani-oo0maz-node5 ceph-7f3277c8-1419-11ee-96b4-fa163eb1880b-nfs-nfsganesha-0-0-ceph-mani-oo0maz-node5-qwtotm[52149]: 26/06/2023 14:36:45 : epoch 6499a269 : ceph-mani-oo0maz-node5 : ganesha.nfsd-2[reaper] nfs_try_lift_grace :STATE :EVENT :check grace:reclaim complete(2) clid count(2)
Jun 26 10:36:45 ceph-mani-oo0maz-node5 ceph-7f3277c8-1419-11ee-96b4-fa163eb1880b-nfs-nfsganesha-0-0-ceph-mani-oo0maz-node5-qwtotm[52149]: 26/06/2023 14:36:45 : epoch 6499a269 : ceph-mani-oo0maz-node5 : ganesha.nfsd-2[reaper] nfs_lift_grace_locked :STATE :EVENT :NFS Server Now NOT IN GRACE
Jun 26 10:58:44 ceph-mani-oo0maz-node5 ceph-7f3277c8-1419-11ee-96b4-fa163eb1880b-nfs-nfsganesha-0-0-ceph-mani-oo0maz-node5-qwtotm[52149]: 26/06/2023 14:58:44 : epoch 6499a269 : ceph-mani-oo0maz-node5 : ganesha.nfsd-2[svc_69] destroy_fsal_fd :RW LOCK :CRIT :Error 16, Destroy mutex 0x7fa78c019f60 (&fsal_fd->work_mutex) at /builddir/build/BUILD/nfs-ganesha-5.1/src/include/fsal_types.h:1029
Jun 26 10:58:44 ceph-mani-oo0maz-node5 systemd-coredump[60575]: Process 52153 (ganesha.nfsd) of user 0 dumped core.
Jun 26 10:58:44 ceph-mani-oo0maz-node5 podman[60580]: 2023-06-26 10:58:44.661759627 -0400 EDT m=+0.042498076 container died 392faf4f13539082c382554bbb70d8764e6ed4144499eb42f85df4fec95c1a00 (image=registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:0eb98763b77938a11cb62e8414119ed35a739f02077f2b6b0489f76d80a63e67, name=ceph-7f3277c8-1419-11ee-96b4-fa163eb1880b-nfs-nfsganesha-0-0-ceph-mani-oo0maz-node5-qwtotm, GIT_REPO=https://github.com/ceph/ceph-container.git, description=Red Hat Ceph Storage 6, ceph=True, io.buildah.version=1.29.0, vcs-ref=dec93361f6f7a22d929d690d9002f0df9a8f6805, RELEASE=main, build-date=2023-06-23T19:14:24, com.redhat.component=rhceph-container, distribution-scope=public, release=179, architecture=x86_64, io.k8s.description=Red Hat Ceph Storage 6, CEPH_POINT_RELEASE=, com.redhat.license_terms=https://www.redhat.com/agreements, maintainer=Guillaume Abrioux <gabrioux>, name=rhceph, vcs-type=git, GIT_COMMIT=0727c855af939c6f3709e73be703026388413744, summary=Provides the latest Red Hat Ceph Storage 6 on RHEL 9 in a fully featured and supported base image., url=https://access.redhat.com/containers/#/registry.access.redhat.com/rhceph/images/6-179, io.openshift.expose-services=, GIT_BRANCH=main, vendor=Red Hat, Inc., GIT_CLEAN=True, io.k8s.display-name=Red Hat Ceph Storage 6 on RHEL 9, io.openshift.tags=rhceph ceph, version=6)
Jun 26 10:58:44 ceph-mani-oo0maz-node5 podman[60580]: 2023-06-26 10:58:44.679685425 -0400 EDT m=+0.060423839 container remove 392faf4f13539082c382554bbb70d8764e6ed4144499eb42f85df4fec95c1a00 (image=registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:0eb98763b77938a11cb62e8414119ed35a739f02077f2b6b0489f76d80a63e67, name=ceph-7f3277c8-1419-11ee-96b4-fa163eb1880b-nfs-nfsganesha-0-0-ceph-mani-oo0maz-node5-qwtotm, GIT_CLEAN=True, io.openshift.tags=rhceph ceph, ceph=True, summary=Provides the latest Red Hat Ceph Storage 6 on RHEL 9 in a fully featured and supported base image., vcs-ref=dec93361f6f7a22d929d690d9002f0df9a8f6805, GIT_BRANCH=main, io.openshift.expose-services=, url=https://access.redhat.com/containers/#/registry.access.redhat.com/rhceph/images/6-179, architecture=x86_64, com.redhat.component=rhceph-container, build-date=2023-06-23T19:14:24, vendor=Red Hat, Inc., distribution-scope=public, release=179, io.k8s.display-name=Red Hat Ceph Storage 6 on RHEL 9, io.k8s.description=Red Hat Ceph Storage 6, com.redhat.license_terms=https://www.redhat.com/agreements, io.buildah.version=1.29.0, RELEASE=main, description=Red Hat Ceph Storage 6, maintainer=Guillaume Abrioux <gabrioux>, name=rhceph, GIT_COMMIT=0727c855af939c6f3709e73be703026388413744, CEPH_POINT_RELEASE=, vcs-type=git, GIT_REPO=https://github.com/ceph/ceph-container.git, version=6)
Jun 26 10:58:44 ceph-mani-oo0maz-node5 systemd[1]: ceph-7f3277c8-1419-11ee-96b4-fa163eb1880b.0.0.ceph-mani-oo0maz-node5.qwtotm.service: Main process exited, code=exited, status=134/n/a
Jun 26 10:58:45 ceph-mani-oo0maz-node5 systemd[1]: ceph-7f3277c8-1419-11ee-96b4-fa163eb1880b.0.0.ceph-mani-oo0maz-node5.qwtotm.service: Failed with result 'exit-code'.
Jun 26 10:58:45 ceph-mani-oo0maz-node5 systemd[1]: ceph-7f3277c8-1419-11ee-96b4-fa163eb1880b.0.0.ceph-mani-oo0maz-node5.qwtotm.service: Consumed 3.488s CPU time.
Jun 26 10:58:55 ceph-mani-oo0maz-node5 systemd[1]: ceph-7f3277c8-1419-11ee-96b4-fa163eb1880b.0.0.ceph-mani-oo0maz-node5.qwtotm.service: Scheduled restart job, restart counter is at 3.
Jun 26 10:58:55 ceph-mani-oo0maz-node5 systemd[1]: Stopped Ceph nfs.nfsganesha.0.0.ceph-mani-oo0maz-node5.qwtotm for 7f3277c8-1419-11ee-96b4-fa163eb1880b.
Jun 26 10:58:55 ceph-mani-oo0maz-node5 systemd[1]: ceph-7f3277c8-1419-11ee-96b4-fa163eb1880b.0.0.ceph-mani-oo0maz-node5.qwtotm.service: Consumed 3.488s CPU time.
Jun 26 10:58:55 ceph-mani-oo0maz-node5 systemd[1]: Starting Ceph nfs.nfsganesha.0.0.ceph-mani-oo0maz-node5.qwtotm for 7f3277c8-1419-11ee-96b4-fa163eb1880b...
Jun 26 10:58:55 ceph-mani-oo0maz-node5 podman[60773]:
Jun 26 10:58:55 ceph-mani-oo0maz-node5 podman[60773]: 2023-06-26 10:58:55.334297003 -0400 EDT m=+0.046869916 container create 901730a39b6c781fe3071ac5148b315d7d5d62ed695c9341990dc3cac0649f61 (image=registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:0eb98763b77938a11cb62e8414119ed35a739f02077f2b6b0489f76d80a63e67, name=ceph-7f3277c8-1419-11ee-96b4-fa163eb1880b-nfs-nfsganesha-0-0-ceph-mani-oo0maz-node5-qwtotm, summary=Provides the latest Red Hat Ceph Storage 6 on RHEL 9 in a fully featured and supported base image., com.redhat.license_terms=https://www.redhat.com/agreements, ceph=True, vendor=Red Hat, Inc., io.buildah.version=1.29.0, architecture=x86_64, io.k8s.description=Red Hat Ceph Storage 6, maintainer=Guillaume Abrioux <gabrioux>, io.k8s.display-name=Red Hat Ceph Storage 6 on RHEL 9, version=6, release=179, name=rhceph, GIT_CLEAN=True, io.openshift.tags=rhceph ceph, RELEASE=main, io.openshift.expose-services=, CEPH_POINT_RELEASE=, description=Red Hat Ceph Storage 6, vcs-type=git, distribution-scope=public, GIT_COMMIT=0727c855af939c6f3709e73be703026388413744, build-date=2023-06-23T19:14:24, GIT_BRANCH=main, com.redhat.component=rhceph-container, url=https://access.redhat.com/containers/#/registry.access.redhat.com/rhceph/images/6-179, vcs-ref=dec93361f6f7a22d929d690d9002f0df9a8f6805, GIT_REPO=https://github.com/ceph/ceph-container.git)

Comment 3 Manisha Saini 2023-09-05 08:42:43 UTC
Not observing the issue with the RHCS 7.0 build. Do we have an RCA for the same? Can we move this to ON_QA?

Comment 4 Frank Filz 2023-09-05 16:01:44 UTC
This is almost certainly the same root cause as Bug 2216442.

Comment 10 Manisha Saini 2023-09-20 19:59:38 UTC
Verified this BZ with

[ceph: root@argo016 /]# rpm -qa | grep ganesha
nfs-ganesha-selinux-5.5-1.el9cp.noarch
nfs-ganesha-5.5-1.el9cp.x86_64
nfs-ganesha-rgw-5.5-1.el9cp.x86_64
nfs-ganesha-ceph-5.5-1.el9cp.x86_64
nfs-ganesha-rados-grace-5.5-1.el9cp.x86_64
nfs-ganesha-rados-urls-5.5-1.el9cp.x86_64

Steps performed:
1. Configure containerized Ganesha
2. Export 5 subvolumes via NFS (ganesha1, ganesha2, ganesha3, ganesha4, ganesha5). Delete 1 export (ganesha4).
3. Mount the export ganesha1 on 2 clients (client 1 and client 2).
4. Copy a tar file onto the mount point from client 1 and perform a lookup on the mount point from client 2 while the copy is in progress.

No crashes were observed. Moving this BZ to Verified state.

Comment 12 Frank Filz 2023-10-11 21:05:50 UTC
Since this is a bug introduced by the async/nonblocking work, I don't think it requires doc text. Please advise on how to proceed.

Comment 13 errata-xmlrpc 2023-12-13 15:20:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 7.0 Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:7780

