Bug 2342266

Summary: [NFS-Ganesha] Ganesha crashed in setNode_pc (qos_block=0x0, client_addr=<optimized out>, node=0x7f3b74012ac0) after enabling the QoS feature
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Manisha Saini <msaini>
Component: NFS-Ganesha
Assignee: Naresh <nchillar>
Status: CLOSED ERRATA
QA Contact: Manisha Saini <msaini>
Severity: urgent
Docs Contact: Rivka Pollack <rpollack>
Priority: unspecified
Version: 8.1
CC: cephqe-warriors, deepatil, ffilz, kkeithle, nchillar, rpollack, tserlin, vdas
Target Release: 8.0z3   
Hardware: Unspecified   
OS: Unspecified   
Fixed In Version: nfs-ganesha-6.5-5.el9cp
Doc Type: No Doc Update
Last Closed: 2025-04-07 15:26:06 UTC
Type: Bug

Description Manisha Saini 2025-01-27 15:25:25 UTC
Description of problem:
=================

NFS-Ganesha was deployed with nfs-ganesha-6.5-1.2.el9cp. After adding the QoS-related parameters to ganesha.conf, creating the export, and starting I/O from the mount point, the NFS-Ganesha service crashed and generated a core dump.

---
Core was generated by `/usr/bin/ganesha.nfsd -F -L STDERR -N NIV_EVENT'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  setNode_pc (qos_block=0x0, client_addr=<optimized out>, node=0x7f3b74012ac0) at /usr/src/debug/nfs-ganesha-6.5-1.2.el9cp.x86_64/src/MainNFSD/nfs_qos.c:385
385		node->combined_rw_bw_control = qos_block->combined_rw_bw_control;
[Current thread is 1 (LWP 77)]
(gdb) bt
#0  setNode_pc (qos_block=0x0, client_addr=<optimized out>, node=0x7f3b74012ac0) at /usr/src/debug/nfs-ganesha-6.5-1.2.el9cp.x86_64/src/MainNFSD/nfs_qos.c:385
#1  pspc_allocate_and_init_client (qos_block=0x0, client_addr=<optimized out>) at /usr/src/debug/nfs-ganesha-6.5-1.2.el9cp.x86_64/src/MainNFSD/nfs_qos.c:434
#2  pspc_alloc_init_add_client (qos_block=0x0, client_addr=<optimized out>, head=0x7f3b740089d8) at /usr/src/debug/nfs-ganesha-6.5-1.2.el9cp.x86_64/src/MainNFSD/nfs_qos.c:453
#3  QoS_Process_pspc (op_type=1, data=0x7f3b7400ccd0, caller_data=0x7f3b7400afd0, size=<optimized out>) at /usr/src/debug/nfs-ganesha-6.5-1.2.el9cp.x86_64/src/MainNFSD/nfs_qos.c:831
#4  QoS_Process (size=<optimized out>, caller_data=0x7f3b7400afd0, data=0x7f3b7400ccd0, op_type=1) at /usr/src/debug/nfs-ganesha-6.5-1.2.el9cp.x86_64/src/MainNFSD/nfs_qos.c:858
#5  0x00007f3b931eef94 in nfs4_op_write (op=0x7f3b7400c6b0, data=0x7f3b7400ccd0, resp=0x7f3b74012a10) at /usr/src/debug/nfs-ganesha-6.5-1.2.el9cp.x86_64/src/Protocols/NFS/nfs4_op_write.c:501
#6  0x00007f3b931d0485 in process_one_op (data=data@entry=0x7f3b7400ccd0, status=status@entry=0x7f3b397f864c) at /usr/src/debug/nfs-ganesha-6.5-1.2.el9cp.x86_64/src/Protocols/NFS/nfs4_Compound.c:905
#7  0x00007f3b931d1288 in nfs4_Compound (arg=<optimized out>, req=0x7f3b74009670, res=0x7f3b74012680) at /usr/src/debug/nfs-ganesha-6.5-1.2.el9cp.x86_64/src/Protocols/NFS/nfs4_Compound.c:1386
#8  0x00007f3b93150bc5 in nfs_rpc_process_request (reqdata=<optimized out>, retry=<optimized out>) at /usr/src/debug/nfs-ganesha-6.5-1.2.el9cp.x86_64/src/MainNFSD/nfs_worker_thread.c:1479
#9  0x00007f3b92ea25e7 in svc_request (xprt=0x7f3b40001330, xdrs=<optimized out>) at /usr/src/debug/libntirpc-6.3-1.el9cp.x86_64/src/svc_rqst.c:1229
#10 0x00007f3b92ea6e5a in svc_rqst_xprt_task_recv (wpe=<optimized out>) at /usr/src/debug/libntirpc-6.3-1.el9cp.x86_64/src/svc_rqst.c:1210
#11 0x00007f3b92ea991b in svc_rqst_epoll_loop (wpe=0x564e192190b8) at /usr/src/debug/libntirpc-6.3-1.el9cp.x86_64/src/svc_rqst.c:1585
#12 0x00007f3b92eb2cbc in work_pool_thread (arg=0x7f3b40016460) at /usr/src/debug/libntirpc-6.3-1.el9cp.x86_64/src/work_pool.c:187
#13 0x00007f3b92f52d22 in pthread_detach.5 () from /lib64/libc.so.6
#14 0x0000000000000000 in ?? ()
(gdb)
-----
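Analysis: the fault is a plain NULL-pointer dereference. setNode_pc() is entered with qos_block == NULL and immediately reads qos_block->combined_rw_bw_control at nfs_qos.c:385. Below is a minimal C sketch of the failure mode plus an obvious defensive guard; only the faulting assignment is taken verbatim from the backtrace, while the struct layouts and the guarded variant are assumptions (the client_addr parameter is omitted, and the actual fix shipped in nfs-ganesha-6.5-5.el9cp may differ):

/* Sketch only: struct layouts are assumed, not taken from nfs_qos.c. */
#include <stdbool.h>
#include <stddef.h>

struct qos_block {
	bool combined_rw_bw_control;   /* per-client bandwidth-control flag */
};

struct qos_node {
	bool combined_rw_bw_control;
};

/* Faulting path as seen in the core: qos_block arrives as NULL from
 * pspc_allocate_and_init_client(), so the assignment dereferences NULL
 * and the process takes SIGSEGV. */
static void setNode_pc(struct qos_block *qos_block, struct qos_node *node)
{
	node->combined_rw_bw_control = qos_block->combined_rw_bw_control;
}

/* Hypothetical defensive variant: refuse to initialize the node when
 * no QoS block was resolved for the client. */
static bool setNode_pc_guarded(struct qos_block *qos_block,
			       struct qos_node *node)
{
	if (qos_block == NULL || node == NULL)
		return false;
	node->combined_rw_bw_control = qos_block->combined_rw_bw_control;
	return true;
}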

ganesha.log
----
Jan 27 12:27:42 ceph-manisaini-su4kp8-node2 ceph-2fefc25c-d8bb-11ef-a07d-fa163e4cf23a-nfs-nfsganesha-0-0-ceph-manisaini-su4kp8-node2-wdszsj[329222]: 27/01/2025 12:27:42 : epoch 67977bbd : ceph-manisaini-su4kp8-node2 : ganesha.nfsd-2[main] nfs_start :NFS STARTUP :EVENT :-------------------------------------------------
Jan 27 12:27:42 ceph-manisaini-su4kp8-node2 ceph-2fefc25c-d8bb-11ef-a07d-fa163e4cf23a-nfs-nfsganesha-0-0-ceph-manisaini-su4kp8-node2-wdszsj[329222]: 27/01/2025 12:27:42 : epoch 67977bbd : ceph-manisaini-su4kp8-node2 : ganesha.nfsd-2[main] nfs_start :NFS STARTUP :EVENT :             NFS SERVER INITIALIZED
Jan 27 12:27:42 ceph-manisaini-su4kp8-node2 ceph-2fefc25c-d8bb-11ef-a07d-fa163e4cf23a-nfs-nfsganesha-0-0-ceph-manisaini-su4kp8-node2-wdszsj[329222]: 27/01/2025 12:27:42 : epoch 67977bbd : ceph-manisaini-su4kp8-node2 : ganesha.nfsd-2[main] nfs_start :NFS STARTUP :EVENT :-------------------------------------------------
Jan 27 12:27:46 ceph-manisaini-su4kp8-node2 systemd-coredump[329301]: Process 329226 (ganesha.nfsd) of user 0 dumped core.

                                                                      Stack trace of thread 46:
                                                                      #0  0x00007f82a294ed11 n/a (/usr/lib64/libganesha_nfsd.so.6.5 + 0x5ad11)
                                                                      ELF object binary architecture: AMD x86-64
Jan 27 12:27:46 ceph-manisaini-su4kp8-node2 podman[329306]: 2025-01-27 12:27:46.779033541 +0000 UTC m=+0.035736089 container died 2cc5d0c75c5318042b9492be0a36dbc079f50f0bf0c8e925dba4ca7ec7059d07 (image=registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:5c3a4ff92a3205922f1b4d25b43864013bd145a415cf922ff2e4fb33db5818e7, name=ceph-2fefc25c-d8bb-11ef-a07d-fa163e4cf23a-nfs-nfsganesha-0-0-ceph-manisaini-su4kp8-node2-wdszsj, summary=Provides the latest Red Hat Ceph Storage 8 on RHEL 9 in a fully featured and supported base image., GIT_CLEAN=True, GIT_REPO=https://github.com/ceph/ceph-container.git, CEPH_POINT_RELEASE=, GIT_BRANCH=main, com.redhat.license_terms=https://www.redhat.com/agreements, io.k8s.description=Red Hat Ceph Storage 8, io.openshift.tags=rhceph ceph, vcs-type=git, vcs-ref=6bc17c430374b15a8dee08107281b6a4fa5b8ce9, RELEASE=main, ceph=True, com.redhat.component=rhceph-container, build-date=2025-01-20T13:41:51, io.openshift.expose-services=, vendor=Red Hat, Inc., distribution-scope=public, version=8, name=rhceph, release=228, description=Red Hat Ceph Storage 8, io.buildah.version=1.33.8, maintainer=Guillaume Abrioux <gabrioux>, GIT_COMMIT=55ad0f204a1d654ee565abf874aecad0cc209d0e, architecture=x86_64, io.k8s.display-name=Red Hat Ceph Storage 8 on RHEL 9, url=https://access.redhat.com/containers/#/registry.access.redhat.com/rhceph/images/8-228)
Jan 27 12:27:46 ceph-manisaini-su4kp8-node2 podman[329306]: 2025-01-27 12:27:46.81613975 +0000 UTC m=+0.072842290 container remove 2cc5d0c75c5318042b9492be0a36dbc079f50f0bf0c8e925dba4ca7ec7059d07 (image=registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:5c3a4ff92a3205922f1b4d25b43864013bd145a415cf922ff2e4fb33db5818e7, name=ceph-2fefc25c-d8bb-11ef-a07d-fa163e4cf23a-nfs-nfsganesha-0-0-ceph-manisaini-su4kp8-node2-wdszsj, GIT_REPO=https://github.com/ceph/ceph-container.git, io.k8s.display-name=Red Hat Ceph Storage 8 on RHEL 9, vcs-type=git, summary=Provides the latest Red Hat Ceph Storage 8 on RHEL 9 in a fully featured and supported base image., name=rhceph, description=Red Hat Ceph Storage 8, version=8, com.redhat.license_terms=https://www.redhat.com/agreements, maintainer=Guillaume Abrioux <gabrioux>, GIT_CLEAN=True, vcs-ref=6bc17c430374b15a8dee08107281b6a4fa5b8ce9, distribution-scope=public, GIT_BRANCH=main, CEPH_POINT_RELEASE=, RELEASE=main, io.buildah.version=1.33.8, url=https://access.redhat.com/containers/#/registry.access.redhat.com/rhceph/images/8-228, release=228, io.openshift.expose-services=, com.redhat.component=rhceph-container, ceph=True, io.k8s.description=Red Hat Ceph Storage 8, io.openshift.tags=rhceph ceph, build-date=2025-01-20T13:41:51, GIT_COMMIT=55ad0f204a1d654ee565abf874aecad0cc209d0e, architecture=x86_64, vendor=Red Hat, Inc.)
Jan 27 12:27:46 ceph-manisaini-su4kp8-node2 systemd[1]: ceph-2fefc25c-d8bb-11ef-a07d-fa163e4cf23a.0.0.ceph-manisaini-su4kp8-node2.wdszsj.service: Main process exited, code=exited, status=139/n/a
Jan 27 12:27:47 ceph-manisaini-su4kp8-node2 systemd[1]: ceph-2fefc25c-d8bb-11ef-a07d-fa163e4cf23a.0.0.ceph-manisaini-su4kp8-node2.wdszsj.service: Failed with result 'exit-code'.
Jan 27 12:27:47 ceph-manisaini-su4kp8-node2 systemd[1]: ceph-2fefc25c-d8bb-11ef-a07d-fa163e4cf23a.0.0.ceph-manisaini-su4kp8-node2.wdszsj.service: Consumed 1.225s CPU time.
Jan 27 12:27:57 ceph-manisaini-su4kp8-node2 systemd[1]: ceph-2fefc25c-d8bb-11ef-a07d-fa163e4cf23a.0.0.ceph-manisaini-su4kp8-node2.wdszsj.service: Scheduled restart job, restart counter is at 5.
Jan 27 12:27:57 ceph-manisaini-su4kp8-node2 systemd[1]: Stopped Ceph nfs.nfsganesha.0.0.ceph-manisaini-su4kp8-node2.wdszsj for 2fefc25c-d8bb-11ef-a07d-fa163e4cf23a.
Jan 27 12:27:57 ceph-manisaini-su4kp8-node2 systemd[1]: ceph-2fefc25c-d8bb-11ef-a07d-fa163e4cf23a.0.0.ceph-manisaini-su4kp8-node2.wdszsj.service: Consumed 1.225s CPU time.
Jan 27 12:27:57 ceph-manisaini-su4kp8-node2 systemd[1]: ceph-2fefc25c-d8bb-11ef-a07d-fa163e4cf23a.0.0.ceph-manisaini-su4kp8-node2.wdszsj.service: Start request repeated too quickly.
Jan 27 12:27:57 ceph-manisaini-su4kp8-node2 systemd[1]: ceph-2fefc25c-d8bb-11ef-a07d-fa163e4cf23a.0.0.ceph-manisaini-su4kp8-node2.wdszsj.service: Failed with result 'exit-code'.
Jan 27 12:27:57 ceph-manisaini-su4kp8-node2 systemd[1]: Failed to start Ceph nfs.nfsganesha.0.0.ceph-manisaini-su4kp8-node2.wdszsj for 2fefc25c-d8bb-11ef-a07d-fa163e4cf23a.
----------


Version-Release number of selected component (if applicable):

# ceph --version
ceph version 19.2.0-61.el9cp (1addfd37086eff688a3ec62ee4b6aa98d5982a31) squid (stable)

# rpm -qa | grep nfs
libnfsidmap-2.5.4-27.el9.x86_64
nfs-utils-2.5.4-27.el9.x86_64
nfs-ganesha-selinux-6.5-1.2.el9cp.noarch
nfs-ganesha-6.5-1.2.el9cp.x86_64
nfs-ganesha-rgw-6.5-1.2.el9cp.x86_64
nfs-ganesha-ceph-6.5-1.2.el9cp.x86_64
nfs-ganesha-rados-grace-6.5-1.2.el9cp.x86_64
nfs-ganesha-rados-urls-6.5-1.2.el9cp.x86_64


How reproducible:
=================
1/1


Steps to Reproduce:
==================

1. Create an NFS-Ganesha cluster on Ceph.

2. Enable QoS in ganesha.conf:

# ceph config-key get mgr/cephadm/services/nfs/ganesha.conf
# {{ cephadm_managed }}
NFS_CORE_PARAM {
        Enable_NLM = {{ enable_nlm }};
        Enable_RQUOTA = false;
        Protocols = 3, 4;
        mount_path_pseudo = true;
        Allow_Set_Io_Flusher_Fail = true;
        Enable_UDP = false;
        NFS_Port = {{ port }};
{% if bind_addr %}
        Bind_addr = {{ bind_addr }};
{% endif %}
{% if haproxy_hosts %}
        HAProxy_Hosts = {{ haproxy_hosts|join(", ") }};
{% endif %}
}
QOS_DEFAULT_CONFIG {
        enable_qos = true;
        enable_tokens = false;
        enable_bw_control = true;
        combined_rw_bw_control = true;
        combined_rw_token_control = false;
        qos_type = 3;
        max_export_write_bw = 41943040;
        max_export_read_bw = 83886080;
        max_client_write_bw = 10485760;
        max_client_read_bw = 20971520;
        max_export_read_tokens = 0;
        max_export_write_tokens = 0;
        max_client_read_tokens = 0;
        max_client_write_tokens = 0;
        export_read_tokens_renew_time = 0;
        export_write_tokens_renew_time = 0;
        client_read_tokens_renew_time = 0;
        client_write_tokens_renew_time = 0;
}
NFSv4 {
        Delegations = false;
        RecoveryBackend = 'rados_cluster';
        Minor_Versions = 1, 2;
{% if nfs_idmap_conf %}
        IdmapConf = "{{ nfs_idmap_conf }}";
{% endif %}
}

RADOS_KV {
        UserId = "{{ user }}";
        nodeid = "{{ nodeid }}";
        pool = "{{ pool }}";
        namespace = "{{ namespace }}";
}

RADOS_URLS {
        UserId = "{{ user }}";
        watch_url = "{{ url }}";
}

RGW {
        cluster = "ceph";
        name = "client.{{ rgw_user }}";
}

%url    {{ url }}

3. Mount the export on an NFS client and create a file with dd (example below).
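For example (the mount point, pseudo-path, and sizes are illustrative, not taken from the report; a single large sequential write is enough to drive the NFSv4 WRITE path through QoS_Process()):

# mount -t nfs -o vers=4.1 <ganesha-host>:/<pseudo-path> /mnt/nfs
# dd if=/dev/zero of=/mnt/nfs/testfile bs=1M count=100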

Actual results:
=========
NFS-Ganesha crashed and dumped core.

# ceph orch ps | grep nfs
nfs.nfsganesha.0.0.ceph-manisaini-su4kp8-node2.wdszsj  ceph-manisaini-su4kp8-node2            *:2049            error             3m ago  17h        -        -  <unknown>        <unknown>     <unknown>


Expected results:
=========
NFS-Ganesha should not crash.


Additional info:
===========

# ceph -s
  cluster:
    id:     2fefc25c-d8bb-11ef-a07d-fa163e4cf23a
    health: HEALTH_WARN
            1 failed cephadm daemon(s)

  services:
    mon: 3 daemons, quorum ceph-manisaini-su4kp8-node1-installer,ceph-manisaini-su4kp8-node3,ceph-manisaini-su4kp8-node2 (age 17h)
    mgr: ceph-manisaini-su4kp8-node1-installer.mvcgix(active, since 5h), standbys: ceph-manisaini-su4kp8-node3.muaqsz
    mds: 1/1 daemons up, 1 standby
    osd: 18 osds: 18 up (since 17h), 18 in (since 5d)
    rgw: 2 daemons active (2 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   9 pools, 721 pgs
    objects: 268 objects, 44 MiB
    usage:   2.0 GiB used, 268 GiB / 270 GiB avail
    pgs:     721 active+clean

# ceph health detail
HEALTH_WARN 1 failed cephadm daemon(s)
[WRN] CEPHADM_FAILED_DAEMON: 1 failed cephadm daemon(s)
    daemon nfs.nfsganesha.0.0.ceph-manisaini-su4kp8-node2.wdszsj on ceph-manisaini-su4kp8-node2 is in error state


# ls
'core.ganesha\x2enfsd.0.0c4c1c25a83b44da8097219002c185d3.326625.1737980809000000.zst'
'core.ganesha\x2enfsd.0.0c4c1c25a83b44da8097219002c185d3.328786.1737980822000000.zst'
'core.ganesha\x2enfsd.0.0c4c1c25a83b44da8097219002c185d3.328937.1737980836000000.zst'
'core.ganesha\x2enfsd.0.0c4c1c25a83b44da8097219002c185d3.329088.1737980849000000.zst'
'core.ganesha\x2enfsd.0.0c4c1c25a83b44da8097219002c185d3.329226.1737980865000000.zst'
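The cores are zstd-compressed by systemd-coredump. To inspect one against the matching binary with debuginfo installed (the decompress/load commands below are standard tooling usage, not taken from the report):

# zstd -d 'core.ganesha\x2enfsd.0.0c4c1c25a83b44da8097219002c185d3.329226.1737980865000000.zst'
# gdb /usr/bin/ganesha.nfsd 'core.ganesha\x2enfsd.0.0c4c1c25a83b44da8097219002c185d3.329226.1737980865000000'
(gdb) bt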

Comment 10 errata-xmlrpc 2025-04-07 15:26:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat Ceph Storage 8.0 security, bug fix, and enhancement updates), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2025:3635

Comment 11 Red Hat Bugzilla 2025-08-06 04:25:07 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days