Bug 2338406 - [NFS-Ganesha] Ganesha crashed during a failover operation, with the ingress mode configured as haproxy-protocol.
Summary: [NFS-Ganesha] Ganesha crashed during a failover operation, with the ingress mode configured as haproxy-protocol.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: NFS-Ganesha
Version: 8.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: ---
Target Release: 8.1
Assignee: Sachin Punadikar
QA Contact: Manisha Saini
URL:
Whiteboard:
Duplicates: 2346115
Depends On:
Blocks:
 
Reported: 2025-01-16 07:48 UTC by Manisha Saini
Modified: 2025-06-26 12:21 UTC
CC List: 15 users

Fixed In Version: ceph-19.2.1-179.el9cp, libntirpc-6.3-3, nfs-ganesha-6.5-17.el9cp
Doc Type: Known Issue
Doc Text:
.NFS clients observe outages during active node failures with active-passive HA configurations

Currently, connected NFS clients observe an outage if an active host fails with an active-passive HA configuration. The outage occurs as a result of NFS being started on another host. As a workaround, use the following steps:

1. From the NFS working node, open the `ganesha.conf` file.
----
/var/lib/ceph/${FSID}/nfs.${DAEMON_ID}/etc/ganesha/ganesha.conf
----
2. Add the `virtual_ip` to the `HAProxy_Hosts` list within `NFS_CORE_PARAM`.
3. Restart NFS.
----
ceph orch restart nfs.NFS_CLUSTER_NAME
----
IMPORTANT: Do not redeploy the NFS cluster. Redeploying wipes out the configuration changes made to `ganesha.conf`.

After the restart, NFS functions and the NFS clients work again as expected.
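
For illustration, the edited `NFS_CORE_PARAM` block would look roughly like the following. This is a minimal sketch, not a complete configuration: the virtual IP value (10.8.130.31) is taken from the reproduction below and is deployment-specific, and any parameters already present in the block must be left in place.
----
NFS_CORE_PARAM {
        # ... existing parameters, unchanged ...

        # Accept PROXY-protocol connections originating from the ingress
        # virtual IP (substitute your cluster's virtual_ip here).
        HAProxy_Hosts = 10.8.130.31;
}
----
In this reproduction the cluster is named nfsganesha, so the restart command would be `ceph orch restart nfs.nfsganesha`.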
Clone Of:
Environment:
Last Closed: 2025-06-26 12:21:27 UTC
Embargoed:
rpollack: needinfo+




Links
System ID                                  Last Updated
Red Hat Issue Tracker RHCEPH-10467         2025-01-16 07:48:37 UTC
Red Hat Product Errata RHSA-2025:9775      2025-06-26 12:21:31 UTC

Description Manisha Saini 2025-01-16 07:48:02 UTC
Description of problem:
===================
Ganesha crashed and dumped a core file while performing failover operations with the ingress mode set to "haproxy-protocol" on a stretched cluster deployed with NFS. I/Os failed with Input/output errors on the clients.

[root@cali024 ~]# ls -lart /var/lib/systemd/coredump
total 21720
drwxr-xr-x. 7 root root    4096 Sep 10 12:05  ..
-rw-r-----. 1 root root 4299741 Jan 15 19:21 'core.ganesha\x2enfsd.0.1b34fcef4c464a569a178fc8e49c5980.113640.1736968862000000.zst'
-rw-r-----. 1 root root 2582534 Jan 15 19:21 'core.ganesha\x2enfsd.0.1b34fcef4c464a569a178fc8e49c5980.127457.1736968878000000.zst'
-rw-r-----. 1 root root 2575077 Jan 15 19:21 'core.ganesha\x2enfsd.0.1b34fcef4c464a569a178fc8e49c5980.127721.1736968894000000.zst'
-rw-r-----. 1 root root 4290951 Jan 15 19:28 'core.ganesha\x2enfsd.0.1b34fcef4c464a569a178fc8e49c5980.127975.1736969305000000.zst'
-rw-r-----. 1 root root 4197354 Jan 15 19:32 'core.ganesha\x2enfsd.0.1b34fcef4c464a569a178fc8e49c5980.129280.1736969557000000.zst'
-rw-r-----. 1 root root 4251991 Jan 15 19:34 'core.ganesha\x2enfsd.0.1b34fcef4c464a569a178fc8e49c5980.130664.1736969675000000.zst'
drwxr-xr-x. 2 root root    4096 Jan 15 19:34  .
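
For reference, one way to inspect these cores on the affected node is via systemd-coredump's coredumpctl (a sketch, assuming gdb and the matching debuginfo packages are available; the PID 130664 below is taken from the last dump above):

# coredumpctl list ganesha.nfsd
# coredumpctl info 130664
# coredumpctl debug 130664      (launches gdb on the core; then run "bt full")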


ganesha.log
--------
2[svc_16] nfs_rpc_process_request :DISP :WARN :HAProxy connection from ::ffff:10.8.130.31:42422 rejected
Jan 15 19:34:34 cali024 ceph-2edf8cd4-d28b-11ef-b7c0-ac1f6b0a1842-nfs-nfsganesha-0-1-cali024-ldwpmh[130659]: 15/01/2025 19:34:34 : epoch 67880d60 : cali024 : ganesha.nfsd-2[svc_16] nfs_rpc_process_request :DISP :WARN :HAProxy connection from ::ffff:10.8.130.31:51286 rejected
Jan 15 19:34:34 cali024 ceph-2edf8cd4-d28b-11ef-b7c0-ac1f6b0a1842-nfs-nfsganesha-0-1-cali024-ldwpmh[130659]: 15/01/2025 19:34:34 : epoch 67880d60 : cali024 : ganesha.nfsd-2[svc_16] nfs_rpc_process_request :DISP :WARN :HAProxy connection from ::ffff:10.8.130.31:51298 rejected
Jan 15 19:34:34 cali024 ceph-2edf8cd4-d28b-11ef-b7c0-ac1f6b0a1842-nfs-nfsganesha-0-1-cali024-ldwpmh[130659]: 15/01/2025 19:34:34 : epoch 67880d60 : cali024 : ganesha.nfsd-2[svc_16] rpc :TIRPC :EVENT :handle_haproxy_header: 0x7f6538002ef0 fd 38 proxy header rest len failed header rlen = % (will set dead)
Jan 15 19:34:36 cali024 systemd-coredump[130930]: Process 130664 (ganesha.nfsd) of user 0 dumped core.

                                                  Stack trace of thread 57:
                                                  #0  0x00007f6599de8536 n/a (/usr/lib64/libntirpc.so.6.0.1 + 0x22536)
                                                  #1  0x0000000000000000 n/a (n/a + 0x0)
                                                  #2  0x00007f6599df2c90 n/a (/usr/lib64/libntirpc.so.6.0.1 + 0x2cc90)
                                                  ELF object binary architecture: AMD x86-64
Jan 15 19:34:36 cali024 podman[130992]: 2025-01-15 19:34:36.219059684 +0000 UTC m=+0.021312121 container died 3b3476705cdfbbb41fef3375d4d2b49adb2993e7c9b0e94581374576e6d31de1 (image=registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:11b3c0122e92da541a439a2628837c08de5693a964484c3c3c864da54e040bce, name=ceph-2edf8cd4-d28b-11ef-b7c0-ac1f6b0a1842-nfs-nfsganesha-0-1-cali024-ldwpmh, vendor=Red Hat, Inc., io.k8s.description=Red Hat Ceph Storage 8, io.openshift.tags=rhceph ceph, architecture=x86_64, GIT_COMMIT=55ad0f204a1d654ee565abf874aecad0cc209d0e, io.buildah.version=1.33.8, url=https://access.redhat.com/containers/#/registry.access.redhat.com/rhceph/images/8-215, maintainer=Guillaume Abrioux <gabrioux>, vcs-type=git, GIT_REPO=https://github.com/ceph/ceph-container.git, name=rhceph, description=Red Hat Ceph Storage 8, io.k8s.display-name=Red Hat Ceph Storage 8 on RHEL 9, vcs-ref=6bc17c430374b15a8dee08107281b6a4fa5b8ce9, version=8, release=215, GIT_CLEAN=True, com.redhat.component=rhceph-container, build-date=2025-01-09T21:26:54, CEPH_POINT_RELEASE=, distribution-scope=public, summary=Provides the latest Red Hat Ceph Storage 8 on RHEL 9 in a fully featured and supported base image., GIT_BRANCH=main, io.openshift.expose-services=, ceph=True, com.redhat.license_terms=https://www.redhat.com/agreements, RELEASE=main)
Jan 15 19:34:36 cali024 podman[130992]: 2025-01-15 19:34:36.228607841 +0000 UTC m=+0.030860270 container remove 3b3476705cdfbbb41fef3375d4d2b49adb2993e7c9b0e94581374576e6d31de1 (image=registry-proxy.engineering.redhat.com/rh-osbs/rhceph@sha256:11b3c0122e92da541a439a2628837c08de5693a964484c3c3c864da54e040bce, name=ceph-2edf8cd4-d28b-11ef-b7c0-ac1f6b0a1842-nfs-nfsganesha-0-1-cali024-ldwpmh, GIT_BRANCH=main, url=https://access.redhat.com/containers/#/registry.access.redhat.com/rhceph/images/8-215, GIT_CLEAN=True, GIT_COMMIT=55ad0f204a1d654ee565abf874aecad0cc209d0e, name=rhceph, GIT_REPO=https://github.com/ceph/ceph-container.git, version=8, vcs-type=git, io.openshift.tags=rhceph ceph, vcs-ref=6bc17c430374b15a8dee08107281b6a4fa5b8ce9, architecture=x86_64, io.k8s.display-name=Red Hat Ceph Storage 8 on RHEL 9, io.k8s.description=Red Hat Ceph Storage 8, build-date=2025-01-09T21:26:54, release=215, io.buildah.version=1.33.8, summary=Provides the latest Red Hat Ceph Storage 8 on RHEL 9 in a fully featured and supported base image., com.redhat.component=rhceph-container, io.openshift.expose-services=, RELEASE=main, description=Red Hat Ceph Storage 8, CEPH_POINT_RELEASE=, distribution-scope=public, ceph=True, com.redhat.license_terms=https://www.redhat.com/agreements, vendor=Red Hat, Inc., maintainer=Guillaume Abrioux <gabrioux>)
Jan 15 19:34:36 cali024 systemd[1]: ceph-2edf8cd4-d28b-11ef-b7c0-ac1f6b0a1842.0.1.cali024.ldwpmh.service: Main process exited, code=exited, status=139/n/a
Jan 15 19:34:36 cali024 systemd[1]: ceph-2edf8cd4-d28b-11ef-b7c0-ac1f6b0a1842.0.1.cali024.ldwpmh.service: Failed with result 'exit-code'.
Jan 15 19:34:36 cali024 systemd[1]: ceph-2edf8cd4-d28b-11ef-b7c0-ac1f6b0a1842.0.1.cali024.ldwpmh.service: Consumed 1.511s CPU time.
Jan 15 19:34:46 cali024 systemd[1]: ceph-2edf8cd4-d28b-11ef-b7c0-ac1f6b0a1842.0.1.cali024.ldwpmh.service: Scheduled restart job, restart counter is at 6.
Jan 15 19:34:46 cali024 systemd[1]: Stopped Ceph nfs.nfsganesha.0.1.cali024.ldwpmh for 2edf8cd4-d28b-11ef-b7c0-ac1f6b0a1842.
Jan 15 19:34:46 cali024 systemd[1]: ceph-2edf8cd4-d28b-11ef-b7c0-ac1f6b0a1842.0.1.cali024.ldwpmh.service: Consumed 1.511s CPU time.
Jan 15 19:34:46 cali024 systemd[1]: ceph-2edf8cd4-d28b-11ef-b7c0-ac1f6b0a1842.0.1.cali024.ldwpmh.service: Start request repeated too quickly.
Jan 15 19:34:46 cali024 systemd[1]: ceph-2edf8cd4-d28b-11ef-b7c0-ac1f6b0a1842.0.1.cali024.ldwpmh.service: Failed with result 'exit-code'.
Jan 15 19:34:46 cali024 systemd[1]: Failed to start Ceph nfs.nfsganesha.0.1.cali024.ldwpmh for 2edf8cd4-d28b-11ef-b7c0-ac1f6b0a1842.



[ceph: root@argo022 /]# ceph orch ps | grep nfs
haproxy.nfs.nfsganesha.cali021.yjbcwr     cali021  *:2049,9049       running (12h)     6m ago  12h     106M        -  2.4.22-f8e3218   dcd07853693c  4632c83b7d71
keepalived.nfs.nfsganesha.cali021.cvvyox  cali021                    running (12h)     6m ago  12h    1640k        -  2.2.8            8d8e2a795e75  c3b2c6263bdc
nfs.nfsganesha.0.1.cali024.ldwpmh         cali024  *:12049           error             6m ago  12h        -        -  <unknown>        <unknown>     <unknown>
[ceph: root@argo022 /]#


I/Os failed on the client
==========
tar: linux-6.4/Documentation/devicetree: Cannot change ownership to uid 0, gid 0: Input/output error
tar: linux-6.4/Documentation/devicetree: Cannot change mode to rwxrwxr-x: Input/output error
tar: linux-6.4/Documentation: Cannot utime: Input/output error
tar: linux-6.4/Documentation: Cannot change ownership to uid 0, gid 0: Input/output error
tar: linux-6.4/Documentation: Cannot change mode to rwxrwxr-x: Input/output error
tar: linux-6.4: Cannot utime: Input/output error
tar: linux-6.4: Cannot change ownership to uid 0, gid 0: Input/output error
tar: linux-6.4: Cannot change mode to rwxrwxr-x: Input/output error
tar: Error is not recoverable: exiting now

Version-Release number of selected component (if applicable):
=================

# ceph --version
ceph version 19.2.0-58.el9cp (d4780295dda9ca9810286810bd5254843dbd55e5) squid (stable)

# rpm -qa | grep nfs
libnfsidmap-2.5.4-27.el9.x86_64
nfs-utils-2.5.4-27.el9.x86_64
nfs-ganesha-selinux-6.0-8.el9cp.noarch
nfs-ganesha-6.0-8.el9cp.x86_64
nfs-ganesha-rgw-6.0-8.el9cp.x86_64
nfs-ganesha-ceph-6.0-8.el9cp.x86_64
nfs-ganesha-rados-grace-6.0-8.el9cp.x86_64
nfs-ganesha-rados-urls-6.0-8.el9cp.x86_64

How reproducible:
============
1/1


Steps to Reproduce:
=================
1. Deploy a two-site stretched cluster (DC1 and DC2) with one tie-breaker MON in DC3.
2. Deploy the NFS cluster with HA (cali021 in DC1, cali024 in DC2):

[ceph: root@argo022 /]# ceph nfs cluster create nfsganesha "1 cali021 cali024" --ingress-mode haproxy-protocol  --ingress --virtual-ip 10.8.130.31/22

[ceph: root@argo022 /]# ceph nfs cluster info nfsganesha
{
  "nfsganesha": {
    "backend": [
      {
        "hostname": "cali021",
        "ip": "10.8.130.21",
        "port": 12049
      }
    ],
    "monitor_port": 9049,
    "port": 2049,
    "virtual_ip": "10.8.130.31"
  }
}

[ceph: root@argo022 /]# ceph orch ps | grep nfs
haproxy.nfs.nfsganesha.cali021.yjbcwr     cali021  *:2049,9049       running (13s)     6s ago  13s    37.5M        -  2.4.22-f8e3218   dcd07853693c  78e01f109e66
keepalived.nfs.nfsganesha.cali021.cvvyox  cali021                    running (12s)     6s ago  12s    1631k        -  2.2.8            8d8e2a795e75  e42f68b4c225
nfs.nfsganesha.0.0.cali021.aolabd         cali021  *:12049           running (15s)     6s ago  15s    60.5M        -  6.0              51e7488dd867  3d6f154c39f6

3. Create one subvolume group, two subvolumes, and two NFS exports:

[ceph: root@argo022 /]# ceph nfs export create cephfs nfsganesha /ganesha2 cephfs --path=/volumes/ganeshagroup/vol2/6c075800-d6e1-4bf6-8236-82e11c8f8928
{
  "bind": "/ganesha2",
  "cluster": "nfsganesha",
  "fs": "cephfs",
  "mode": "RW",
  "path": "/volumes/ganeshagroup/vol2/6c075800-d6e1-4bf6-8236-82e11c8f8928"
}
[ceph: root@argo022 /]# ceph nfs export create cephfs nfsganesha /ganesha1 cephfs --path=/volumes/ganeshagroup/vol1/6b2b0883-d1a3-42d2-af7b-638961c3a8ae
{
  "bind": "/ganesha1",
  "cluster": "nfsganesha",
  "fs": "cephfs",
  "mode": "RW",
  "path": "/volumes/ganeshagroup/vol1/6b2b0883-d1a3-42d2-af7b-638961c3a8ae"
}

4. Mount the two exports on two clients and run Linux kernel untars on the NFS v4.2 mounts in parallel (a sketch of the client-side commands follows).
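
On each client, the mount and workload looked roughly like this. A minimal sketch: the mount point and the kernel tarball path are illustrative, and 10.8.130.31 is this cluster's virtual IP (the second client uses /ganesha2 instead).

# mount -t nfs -o vers=4.2 10.8.130.31:/ganesha1 /mnt/ganesha1
# cd /mnt/ganesha1 && tar xf /root/linux-6.4.tar.xz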

5. Perform a failover while I/Os are running on the clients (reboot the active node, cali021).
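
The failover here was a plain reboot of the active host; any equivalent hard stop of the active backend should exercise the same path (commands illustrative):

# ssh root@cali021 reboot
# ceph orch ps | grep nfs      (run on the cluster to watch the backend move to cali024)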


Actual results:
==============
Ganesha crashed on cali024 and I/Os failed with Input/output errors.



Expected results:
================
Ganesha should not crash


Additional info:
================

Comment 56 errata-xmlrpc 2025-06-26 12:21:27 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: Red Hat Ceph Storage 8.1 security, bug fix, and enhancement updates), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2025:9775

