
Bug 2365590

Summary: [NFS-Ganesha][HAProxy-Protocol] VIP failback to original node causes I/O failures with "Remote I/O error" on clients
Product: [Red Hat Storage] Red Hat Ceph Storage
Component: Cephadm
Version: 8.1
Status: CLOSED ERRATA
Severity: high
Priority: high
Target Milestone: ---
Target Release: 8.1z1
Hardware: Unspecified
OS: Unspecified
Fixed In Version: ceph-19.2.1-232.el9cp
Doc Type: If docs needed, set a value
Reporter: Manisha Saini <msaini>
Assignee: Shweta Bhosale <shbhosal>
QA Contact: Manisha Saini <msaini>
Docs Contact: Rivka Pollack <rpollack>
CC: akane, cephqe-warriors, gouthamr, kkeithle, rpollack, sabose, spunadik, tserlin
Clone Of: ---
Clones: 2370197 (view as bug list)
Bug Blocks: 2370197
Type: Bug
Last Closed: 2025-08-18 14:00:39 UTC

Description Manisha Saini 2025-05-12 05:31:18 UTC
Description of problem:
=========
An NFS-Ganesha setup with HAProxy and VIP-based access is configured in active-passive mode. 
Two NFS exports are created and mounted via the VIP on two Linux clients. When the active Ganesha node (holding the VIP) is rebooted, the VIP correctly fails over to the standby node, and I/O continues as expected after the grace period.

However, once the original node (Node1) comes back online and reclaims the VIP, I/O operations on the clients hang temporarily and eventually fail with "Remote I/O error".
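
For reference, one quick way to confirm which node currently holds the VIP at any point during the test (a hedged sketch; interface naming is environment-specific):

# Run on each ingress host (cali016 and cali020); the VIP is plumbed only on the active node
ip -brief addr show | grep 10.8.130.51

# The keepalived daemons that manage the VIP are visible via the orchestrator
ceph orch ps --daemon-type keepalived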

Steps performed:
===================

1. Reboot the active node, Node1 (cali016), to which the VIP is assigned.

Observation: The VIP is now assigned to the standby node, Node2 (cali020), and I/O continues on the mount point once Ganesha exits its grace period.

2. When Node1 comes back online, the VIP (10.8.130.51/22) moves back to Node1.
Observation: I/O on the mount point hangs for some time and eventually fails with "Remote I/O error":

---------------
tar: Unexpected EOF in archive
tar: linux-6.4/arch/powerpc/configs: Cannot utime: Remote I/O error
tar: linux-6.4/arch/powerpc/configs: Cannot change ownership to uid 0, gid 0: Remote I/O error
tar: linux-6.4/arch/powerpc/configs: Cannot change mode to rwxrwxr-x: Remote I/O error
tar: linux-6.4/arch/powerpc: Cannot utime: Remote I/O error
tar: linux-6.4/arch/powerpc: Cannot change ownership to uid 0, gid 0: Remote I/O error
tar: linux-6.4/arch/powerpc: Cannot change mode to rwxrwxr-x: Remote I/O error
tar: linux-6.4/arch/arm64/boot/dts/arm: Cannot stat: Remote I/O error

------------

ls mount point
----
[root@argo022 ganesha]# ls
ls: cannot open directory '.': Remote I/O error

------

[ceph: root@cali013 /]# ceph orch ps | grep nfs
haproxy.nfs.nfsganesha.cali016.uebnvn     cali016  *:2049,9049       running (21m)    54s ago  35m    42.3M        -  2.4.22-f8e3218    6c223bddea69  925fe8677810
keepalived.nfs.nfsganesha.cali016.gqrqtf  cali016                    running (21m)    54s ago  35m    1551k        -  2.2.8             09859a486cb9  2536958c3e0f
nfs.nfsganesha.0.2.cali020.gqxkvj         cali020  *:12049           running (23m)    55s ago  23m     263M        -  6.5               c0866a09d082  131f65723caf


Version-Release number of selected component (if applicable):
==================================
# ceph --version
ceph version 19.2.1-188.el9cp (834ac46f780fbdc2ac4ba4851a36db6df3c1aa6f) squid (stable)

# rpm -qa | grep nfs
libnfsidmap-2.5.4-27.el9_5.1.x86_64
nfs-utils-2.5.4-27.el9_5.1.x86_64
nfs-ganesha-selinux-6.5-13.el9cp.noarch
nfs-ganesha-6.5-13.el9cp.x86_64
nfs-ganesha-ceph-6.5-13.el9cp.x86_64
nfs-ganesha-rados-grace-6.5-13.el9cp.x86_64
nfs-ganesha-rados-urls-6.5-13.el9cp.x86_64
nfs-ganesha-rgw-6.5-13.el9cp.x86_64


How reproducible:
=============
1/1


Steps to Reproduce:
===================

1. Configure NFS-Ganesha with HAProxy-protocol ingress and VIP-based failover (a command sketch follows this list).

2. Create two NFS exports.

3. Mount both exports on two clients using the VIP (e.g. 10.8.130.51).

4. On both clients, run a Linux kernel untar workload.

5. Reboot the active Ganesha node (Node1), which holds the VIP.

6. Observe that the VIP moves to the standby node (Node2) and I/O continues once the grace period ends.

7. Allow Node1 to come back online.

8. Observe that the VIP moves back to Node1.
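
Command sketch for steps 1-4 (hedged; exact "ceph nfs" flags vary by release, and the filesystem name, pseudo-paths, mount point, and tarball location below are illustrative assumptions, not taken from this cluster):

# 1. NFS cluster with haproxy-protocol ingress (matches "ingress_mode": "haproxy-protocol" in the cluster info below)
ceph nfs cluster create nfsganesha "cali016,cali020" \
    --ingress --virtual-ip 10.8.130.51/22 --ingress-mode haproxy-protocol

# 2. Two exports (CephFS-backed in this sketch)
ceph nfs export create cephfs --cluster-id nfsganesha --pseudo-path /export1 --fsname cephfs
ceph nfs export create cephfs --cluster-id nfsganesha --pseudo-path /export2 --fsname cephfs

# 3. Mount via the VIP on each client (ingress frontend port is the default 2049)
mount -t nfs -o vers=4.1 10.8.130.51:/export1 /mnt/ganesha

# 4. I/O workload: untar a kernel source tree onto the mount
cd /mnt/ganesha && tar xf /root/linux-6.4.tar.xz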

Actual results:
===========

Upon Node1's return, the VIP moves back to it.

I/O on the clients temporarily hangs and then fails with:
Remote I/O error


Expected results:
============
VIP failback to Node1 should complete without client I/O hangs or errors.


Additional info:
===============

# ceph nfs cluster info nfsganesha
{
  "nfsganesha": {
    "backend": [
      {
        "hostname": "cali020",
        "ip": "10.8.130.20",
        "port": 12049
      }
    ],
    "ingress_mode": "haproxy-protocol",
    "monitor_port": 9049,
    "port": 2049,
    "virtual_ip": "10.8.130.51"
  }
}

# ceph orch ls | grep nfs
ingress.nfs.nfsganesha     10.8.130.51:2049,9049      2/2  9m ago     44m  cali016;cali020;count:1
nfs.nfsganesha             ?:12049                    1/1  9m ago     44m  cali016;cali020;count:1


ganesha.log
------

May 12 05:27:42 cali020 ceph-49a987b4-2b40-11f0-aadb-b49691cee574-nfs-nfsganesha-0-2-cali020-gqxkvj[2401871]: 12/05/2025 05:27:42 : epoch 68217efc : cali020 : ganesha.nfsd-2[svc_117] rpc :TIRPC :EVENT :handle_haproxy_header: 0x7fe21400c870 fd 86 proxy header rest len failed header rlen = % (will set dead)
May 12 05:27:44 cali020 ceph-49a987b4-2b40-11f0-aadb-b49691cee574-nfs-nfsganesha-0-2-cali020-gqxkvj[2401871]: 12/05/2025 05:27:44 : epoch 68217efc : cali020 : ganesha.nfsd-2[svc_124] rpc :TIRPC :EVENT :handle_haproxy_header: 0x7fe210005820 fd 86 proxy header rest len failed header rlen = % (will set dead)
May 12 05:27:46 cali020 ceph-49a987b4-2b40-11f0-aadb-b49691cee574-nfs-nfsganesha-0-2-cali020-gqxkvj[2401871]: 12/05/2025 05:27:46 : epoch 68217efc : cali020 : ganesha.nfsd-2[svc_68] rpc :TIRPC :EVENT :handle_haproxy_header: 0x7fe208005890 fd 86 proxy header rest len failed header rlen = % (will set dead)
May 12 05:27:48 cali020 ceph-49a987b4-2b40-11f0-aadb-b49691cee574-nfs-nfsganesha-0-2-cali020-gqxkvj[2401871]: 12/05/2025 05:27:48 : epoch 68217efc : cali020 : ganesha.nfsd-2[svc_117] rpc :TIRPC :EVENT :handle_haproxy_header: 0x7fe21400c870 fd 86 proxy header rest len failed header rlen = % (will set dead)
May 12 05:27:50 cali020 ceph-49a987b4-2b40-11f0-aadb-b49691cee574-nfs-nfsganesha-0-2-cali020-gqxkvj[2401871]: 12/05/2025 05:27:50 : epoch 68217efc : cali020 : ganesha.nfsd-2[svc_117] rpc :TIRPC :EVENT :handle_haproxy_header: 0x7fe210005820 fd 86 proxy header rest len failed header rlen = % (will set dead)
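
These TIRPC events come from Ganesha's PROXY-protocol header handling: in haproxy-protocol ingress mode, every connection reaching the backend port (12049) is expected to begin with a PROXY protocol header identifying the original client, and connections whose header cannot be parsed are marked dead. As a hedged illustration of that handshake (the client address and ports are made up, and the haproxy.cfg path assumes the usual cephadm daemon layout):

# A PROXY protocol v1 header is a single text line sent before any application data, e.g.:
#   PROXY TCP4 <client-ip> <vip> <client-port> <dest-port>\r\n
# (illustrative only; in this setup HAProxy, not the client, sends the header)
printf 'PROXY TCP4 10.8.128.100 10.8.130.51 51234 2049\r\n' | nc -w1 10.8.130.20 12049

# The cephadm-generated HAProxy config should carry send-proxy-v2 on the backend server lines
grep -n 'send-proxy' \
    /var/lib/ceph/49a987b4-2b40-11f0-aadb-b49691cee574/haproxy.nfs.nfsganesha.cali016.uebnvn/haproxy/haproxy.cfg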



Cali016 -- ganesha.conf (before reboot)
----
[root@cali016 coredump]# cat /var/lib/ceph/49a987b4-2b40-11f0-aadb-b49691cee574/nfs.nfsganesha.0.0.cali016.lmoeqa/etc/ganesha/ganesha.conf
# This file is generated by cephadm.
NFS_CORE_PARAM {
        Enable_NLM = false;
        Enable_RQUOTA = false;
        Protocols = 3, 4;
        mount_path_pseudo = true;
        Enable_UDP = false;
        NFS_Port = 12049;
        allow_set_io_flusher_fail = true;
        HAProxy_Hosts = 10.8.130.13, 2620:52:0:880:b696:91ff:fece:e574, 10.8.130.15, 2620:52:0:880:b696:91ff:fecd:e0dc, 10.8.130.16, 2620:52:0:880:b696:91ff:fece:e6b8, 10.8.130.20, 2620:52:0:880:b696:91ff:fecd:e0a0, 10.8.130.19, 2620:52:0:880:b696:91ff:fece:e7d4, 10.8.130.51;
        Monitoring_Port = 9587;
}

NFSv4 {
        Delegations = false;
        RecoveryBackend = "rados_cluster";
        Minor_Versions = 1, 2;
        Server_Scope = "49a987b4-2b40-11f0-aadb-b49691cee574-nfsganesha";
        IdmapConf = "/etc/ganesha/idmap.conf";
        Virtual_Server = false;
}

RADOS_KV {
        UserId = "nfs.nfsganesha.0.0.cali016.lmoeqa";
        nodeid = 0;
        pool = ".nfs";
        namespace = "nfsganesha";
}

RADOS_URLS {
        UserId = "nfs.nfsganesha.0.0.cali016.lmoeqa";
        watch_url = "rados://.nfs/nfsganesha/conf-nfs.nfsganesha";
}

RGW {
        cluster = "ceph";
        name = "client.nfs.nfsganesha.0.0.cali016.lmoeqa-rgw";
}

%url    rados://.nfs/nfsganesha/conf-nfs.nfsganesha


------
Cali020 - Ganesha.conf (Post failover)

# cat ganesha.conf
# This file is generated by cephadm.
NFS_CORE_PARAM {
        Enable_NLM = false;
        Enable_RQUOTA = false;
        Protocols = 3, 4;
        mount_path_pseudo = true;
        Enable_UDP = false;
        NFS_Port = 12049;
        allow_set_io_flusher_fail = true;
        HAProxy_Hosts = 10.8.130.13, 2620:52:0:880:b696:91ff:fece:e574, 10.8.130.15, 2620:52:0:880:b696:91ff:fecd:e0dc, 10.8.130.16, 2620:52:0:880:b696:91ff:fece:e6b8, 10.8.130.20, 2620:52:0:880:b696:91ff:fecd:e0a0, 10.8.130.19, 2620:52:0:880:b696:91ff:fece:e7d4, 10.8.130.51;
        Monitoring_Port = 9587;
}

NFSv4 {
        Delegations = false;
        RecoveryBackend = "rados_cluster";
        Minor_Versions = 1, 2;
        Server_Scope = "49a987b4-2b40-11f0-aadb-b49691cee574-nfsganesha";
        IdmapConf = "/etc/ganesha/idmap.conf";
        Virtual_Server = false;
}

RADOS_KV {
        UserId = "nfs.nfsganesha.0.2.cali020.gqxkvj";
        nodeid = 0;
        pool = ".nfs";
        namespace = "nfsganesha";
}

RADOS_URLS {
        UserId = "nfs.nfsganesha.0.2.cali020.gqxkvj";
        watch_url = "rados://.nfs/nfsganesha/conf-nfs.nfsganesha";
}

RGW {
        cluster = "ceph";
        name = "client.nfs.nfsganesha.0.2.cali020.gqxkvj-rgw";
}

Comment 15 errata-xmlrpc 2025-08-18 14:00:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 8.1 security and bug fix updates),
and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2025:14015